Sunday, March 22, 2015
Some Handy Tech and Statistics Jargon
In February I went to the Analyst Institute Persuasion Retreat, which was absolutely amazing. (If you don't know what the Analyst Institute is, you should.) At the same time, even with an (albeit limited) history of graduate-level statistics, I found myself intimidated by the jargon. I figured that if I was, so were others, so I enlisted the help of friends in defining some basic, and not-so-basic, terms that were bandied about. Enjoy!
Cryptography - how we keep important data, like credit card transactions, safe from prying eyes.
Cookies- Small bits of data a website stores in your browser. Used by tech and data geniuses to track where you go on the web.
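P-score/P-value- Say you did a statistical test using a sample population to check a hypothesis. The p-value is the chance that you would see a result at least as extreme as the one you got if there were really no effect at all (the "null hypothesis"). A small p-value (below 0.05 is the usual convention) means your result would be surprising if nothing were going on. It is not quite the chance that your conclusion is wrong, and it is not a margin of error, though people often talk about it that way.
T-test- Statistical test that compares averages (for example, a treatment group against a control group) and is used to determine a p-value.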
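To make the two entries above concrete, here is a minimal Python sketch. It computes a t statistic for two groups and then estimates a p-value by shuffling the group labels many times (a permutation test), which stands in for the t-distribution lookup that statistics software normally does. The scores, the function names and the number of shuffles are all made up for illustration.

```python
import random
import statistics


def t_statistic(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Share of random relabelings whose |t| is at least as large as the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend group membership was assigned by chance
        if abs(t_statistic(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical scores for a treated group and a control group (invented numbers).
treated = [52, 55, 58, 60, 61, 63, 65, 66]
control = [48, 50, 51, 53, 54, 55, 57, 59]

t = t_statistic(treated, control)
p = permutation_p_value(treated, control)
print(f"t = {t:.2f}, estimated p-value = {p:.3f}")
```

A small p-value here means that a gap this big between the two groups rarely shows up when the labels are assigned at random.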
N- Size of your sample in a statistical test
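Neyman sampling- A survey tool used to decide how many people to sample from each subgroup (stratum) so that the sample represents the whole population. An example: Suppose the population is 10% African American; simple random sampling may put 0%, 15% or whatever percentage of African Americans in your sample. With stratified sampling you would instead select (randomly) from your pool of African Americans until they make up 10% of your sample size, then stop. Strictly speaking, matching the population percentages like this is called proportional allocation; Neyman allocation goes a step further and samples more heavily from subgroups that are larger or whose answers vary more.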
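The 10% example above can be sketched in a few lines of Python. This is only an illustration of proportional stratified sampling, not of any particular survey tool: the population, the group labels and the function name are all hypothetical.

```python
import random


def proportional_stratified_sample(population, group_of, sample_size, seed=0):
    """Draw randomly within each group so the sample mirrors the population mix."""
    rng = random.Random(seed)

    # Sort everyone into their group (stratum).
    strata = {}
    for person in population:
        strata.setdefault(group_of(person), []).append(person)

    sample = []
    for members in strata.values():
        # Each group's quota matches its share of the population.
        # Rounding can leave the total slightly off when shares are not tidy.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population of 1,000 people, 10% African American.
population = [("African American", i) for i in range(100)] + \
             [("Other", i) for i in range(900)]

sample = proportional_stratified_sample(population, lambda p: p[0], sample_size=50)
share = sum(1 for p in sample if p[0] == "African American") / len(sample)
print(f"{len(sample)} people sampled, {share:.0%} African American")
```

Unlike simple random sampling, the share in the sample cannot drift to 0% or 15% here, because each group's quota is fixed before anyone is drawn.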
Simpson's paradox- This paradox occurs when a trend that shows up in data lumped together disappears, or even reverses, when you look at the groups that make up that data individually, or vice-versa. The most famous case was researched by Bickel et al, and had to do with gender bias in 1970s grad school admissions at U.C. Berkeley (go Bears). When looking at all of the departments together, women had a much lower overall acceptance rate, but when looking at individual departments (i.e. English, engineering, public policy) women had similar by-department acceptance rates; the paper concluded that the lower overall acceptance rate came from women applying to departments which were more selective (without mentioning why women didn't apply to STEM departments.)
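A toy example shows how this can happen. The numbers below are invented to make the effect obvious; they are not the Berkeley figures, and the department names are placeholders.

```python
# (applicants, admitted) by department and gender. All numbers are invented.
data = {
    "Dept A (less selective)": {"men": (800, 480), "women": (200, 130)},
    "Dept B (more selective)": {"men": (200, 40), "women": (800, 180)},
}

# Acceptance rate within each department.
by_dept = {
    dept: {gender: admitted / applied for gender, (applied, admitted) in groups.items()}
    for dept, groups in data.items()
}

# Acceptance rate with both departments lumped together.
overall = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in data.values())
    admitted = sum(groups[gender][1] for groups in data.values())
    overall[gender] = admitted / applied

for dept, rates in by_dept.items():
    print(f"{dept}: men {rates['men']:.1%}, women {rates['women']:.1%}")
print(f"Overall: men {overall['men']:.1%}, women {overall['women']:.1%}")
```

In this made-up data women are admitted at a slightly higher rate than men in each department, yet at a much lower rate overall, because most of the women applied to the department that rejects most applicants.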
R package- R is a free programming language and software environment used to perform complex statistical tests; an R package is an add-on that gives R extra capabilities.
C code- As in the programming language. Some people do statistical analysis by writing their own custom code each time.
Heterogeneity- Variety. You want heterogeneity within your sample, meaning it contains the same mix of different kinds of people as the population, so that you can use it to make an inference about the population at large. This is why people use Neyman sampling.
Bayestree- It's a statistics thing that helps you determine how to organize your data. For example, "Imagine you're trying to predict life expectancy of animals using characteristics across species. Your variables might include: Number of offspring per birth, isWarmBlooded, weight, isMammal, avgTempOfClimate, isOceanDwelling, and probably others I'm not thinking of. There will be collinearity between these variables (I'm guessing Mammals have higher weights than non-mammals, etc). A Bayestree would help you identify the hierarchy of the variables (for example, isMammal is actually a subset of those who are warm blooded, if I've got my science right). The resulting hierarchy can then aid in variable selection/combination."
Big thank you to Adam Briskin-Limehouse, Mario Ben Bonafacio, and especially Will Matthews, who was the only person who could explain to me what a Bayestree is.