Eric here, sporadic guest poster on Bad Mom, Good Mom. I am a laboratory scientist working in Colorado.
Last week, BMGM aired one of her pet peeves, confusing correlation with causality.
My own statistical pet peeve? The oft-abused concept of p-value. Probably a majority of practicing laboratory scientists routinely misinterpret p-values. I'm not talking mere bit players either: last month the NYT reported on a Phase III human-subjects AIDS vaccine trial, run in Thailand by the US Army and the NIH. Naive thinking about statistics led to the publicized conclusions of the study being almost surely crap.
We'll come back to how we know the conclusions are crap, but first let's do a thought experiment. Imagine taking one two-headed penny and mixing it in with a jar of 999 ordinary pennies. Shake the jar up and pull out one penny. Don't look at it yet! Let's do some scientific studies.
Study #1: In Which We Collect No Data At All
Q1: Before you've looked at the penny you took out, what is the probability that the coin you are holding has two heads?
A1: You got it -- one in a thousand, or 0.1%.
Study #2: In Which We Collect Deterministic Data
Now look at both sides of the penny. Suppose you notice that on both sides, there is a head!
Q2: Now, what is the probability you are holding a two-headed penny?
A2: Yep -- unity, or 100%. OK, we're ready for something more difficult!
Study #3: In Which We Collect Some Odd-Seeming Statistical Data
Throw your two-headed coin back in, shake the jar, and again reach in and grab a single penny. Don't look at it yet! Now suppose you flip the coin four times and get a slightly unusual result: four heads in a row.
Q3: OK, given you flipped four heads in a row, now what would you say is the probability that your penny has two heads?
Well, if we were doing biomedical research, the first thing we do when we encounter statistical data is calculate the p-value, right? Turns out that if you took an ordinary (not two-headed) penny and flip in four times, then the probability you will get heads four times in a row is one in 16, or about 6%. So now we can (correctly!) define p-value by example: four heads in a row is a p-value 0.06 measurement.
Can we turn this idea on its head and say, "If we flip four heads in a row, then there is only a 6% chance the coin is not a two-headed coin"? Many practicing scientists would say "yes", but the correct answer is no, NO, Goddammit, NOOOOOO!
In our Study #3, picking a two-headed coin out of the jar is a very rare thing to do, one in 1000, whereas picking an ordinary coin out and flipping four heads in a row is only a slightly odd thing to do, (999/1000)(1/16), or about one in 16. Thus we get:
A3: the probability you are holding a two-headed coin is very small, (0.001)/(0.001+(999/1000)/16), or about 16 over 1000, only 1.6%. You are 98% likely not to be holding a two-headed coin!
Bottom line: your seemingly significant, p = 0.06 measurement of four heads in a row was not strong evidence of a two-headed coin, and saying otherwise would be, in the technical jargon of the trained laboratory scientist, "bullshit".
Perhaps your research adviser told you the p-value meant "probability your measurement is merely random noise". Is s/he always wrong about that?"
Nah, the old geezer got it right once in a while, if only by accident. To find out about that exception, stay tuned for part two of this post!
**Thanks to Jonathan Golub of Slog for providing the point-of-departure for this two-part post. Always lively, readable and informative, Golub is, along with BMGM herself, one of my favorite bloggers on science and science policy. Like all prolific science writers, he has on rare occasions oversimplified and on very rare occasions, totally screwed up.