Monday, November 16, 2009

What is the P-Value of Bullshit? Part One

Eric here, sporadic guest poster on Bad Mom, Good Mom. I am a laboratory scientist working in Colorado.

Last week, BMGM aired one of her pet peeves, confusing correlation with causality.

My own statistical pet peeve? The oft-abused concept of p-value. Probably a majority of practicing laboratory scientists routinely misinterpret p-values. I'm not talking mere bit players either: last month the NYT reported on a Phase III human-subjects AIDS vaccine trial, run in Thailand by the US Army and the NIH. Naive thinking about statistics led to the publicized conclusions of the study being almost surely crap.

We'll come back to how we know the conclusions are crap, but first let's do a thought experiment. Imagine taking one two-headed penny and mixing it in with a jar of 999 ordinary pennies. Shake the jar up and pull out one penny. Don't look at it yet! Let's do some scientific studies.

Study #1: In Which We Collect No Data At All

Q1: Before you've looked at the penny you took out, what is the probability that the coin you are holding has two heads?
A1: You got it -- one in a thousand, or 0.1%.

Study #2: In Which We Collect Deterministic Data

Now look at both sides of the penny. Suppose you notice that on both sides, there is a head!
Q2: Now, what is the probability you are holding a two-headed penny?
A2: Yep -- unity, or 100%. OK, we're ready for something more difficult!

Study #3: In Which We Collect Some Odd-Seeming Statistical Data

Throw your two-headed coin back in, shake the jar, and again reach in and grab a single penny. Don't look at it yet! Now suppose you flip the coin four times and get a slightly unusual result: four heads in a row.
Q3: OK, given you flipped four heads in a row, now what would you say is the probability that your penny has two heads?

Well, if we were doing biomedical research, the first thing we do when we encounter statistical data is calculate the p-value, right? Turns out that if you take an ordinary (not two-headed) penny and flip it four times, then the probability you will get heads four times in a row is one in 16, or about 6%. So now we can (correctly!) define p-value by example: four heads in a row is a p = 0.06 measurement.
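For anyone who wants to check the arithmetic, here is a minimal Python sketch of that calculation. The function name `prob_all_heads` is made up for illustration, and it assumes the flips are independent and the ordinary penny is perfectly fair.

```python
from fractions import Fraction

def prob_all_heads(n_flips, p_heads=Fraction(1, 2)):
    """Chance that a coin landing heads with probability p_heads
    comes up heads on every one of n_flips independent flips."""
    return p_heads ** n_flips

# Four heads in a row from an ordinary (fair) penny.
p_value = prob_all_heads(4)
print(p_value, float(p_value))  # 1/16 0.0625
```

Note what the number describes: how often an ordinary penny produces this result. It says nothing yet about how likely it is that the penny in your hand is ordinary.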

Can we turn this idea on its head and say, "If we flip four heads in a row, then there is only a 6% chance the coin is not a two-headed coin"? Many practicing scientists would say "yes", but the correct answer is no, NO, Goddammit, NOOOOOO!

In our Study #3, picking a two-headed coin out of the jar is a very rare thing to do, one in 1000, whereas picking an ordinary coin out and flipping four heads in a row is only a slightly odd thing to do, (999/1000)(1/16), or about one in 16. Thus we get:

A3: the probability you are holding a two-headed coin is very small, (0.001)/(0.001+(999/1000)/16), or about 16 in 1000, only 1.6%. You are more than 98% likely not to be holding a two-headed coin!
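The calculation in A3 is Bayes' rule applied to the jar. A short Python sketch, assuming fair ordinary pennies and independent flips (the function name `posterior_two_headed` and its parameters are hypothetical, chosen just for this example):

```python
from fractions import Fraction

def posterior_two_headed(n_heads, n_two_headed=1, n_ordinary=999):
    """Probability the drawn penny is two-headed, given that it
    came up heads on each of n_heads flips (and never tails)."""
    total = n_two_headed + n_ordinary
    prior_two = Fraction(n_two_headed, total)   # 1/1000 for our jar
    prior_ord = Fraction(n_ordinary, total)     # 999/1000 for our jar
    like_two = Fraction(1)                      # two-headed penny always shows heads
    like_ord = Fraction(1, 2) ** n_heads        # fair penny: (1/2)^n
    evidence = prior_two * like_two + prior_ord * like_ord
    return prior_two * like_two / evidence

post = posterior_two_headed(4)
print(post, float(post))  # 16/1015, roughly 0.0158
```

With zero flips the function returns 1/1000, the answer from Study #1. Four heads moves that to about 1.6%, nowhere near the 94% that the "turn it on its head" reading of p = 0.06 would suggest.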

Bottom line: your seemingly significant, p = 0.06 measurement of four heads in a row was not strong evidence of a two-headed coin, and saying otherwise would be, in the technical jargon of the trained laboratory scientist, "bullshit".

Perhaps your research adviser told you the p-value meant "probability your measurement is merely random noise". Is s/he always wrong about that?

Nah, the old geezer got it right once in a while, if only by accident. To find out about that exception, stay tuned for part two of this post!

**Thanks to Jonathan Golub of Slog for providing the point-of-departure for this two-part post. Always lively, readable and informative, Golub is, along with BMGM herself, one of my favorite bloggers on science and science policy. Like all prolific science writers, he has on rare occasions oversimplified and on very rare occasions, totally screwed up.


  1. To be fair to the people who did the AIDS vaccine study, they haven't actually published their conclusions yet. The reporting I've seen in science sources all includes caveats about the analyses being preliminary.

    I agree with badmomgoodmom's original point that a lot of the studies that get reported in the media don't really show what they say they show. In general, I don't believe a gene "causes" something until someone has proof of a mechanism. But again I'll play devil's advocate: even if the genetic study is actually only uncovering a correlation, that may be useful. For instance, the gene might make a good biomarker for the disease, which may make clinical trials easier to run (among other things).

  2. I should add: of course if the statistics are bunk and the researchers haven't even found a correlation, then the result is pretty much useless.

    But I don't consider myself an expert on statistics at all, so I won't try to comment on that aspect!

  3. The study has been published, in NEJM. I include a link in part two, which will appear shortly!
    The big change in the article vis-a-vis the press conference (on which the NYT piece is based) is that the authors bowed to pressure and allowed that the p-value might actually be _even_ _larger_ than 0.04, depending on how they limit the sample. They stand fast on the implicit assertion that p=0.04 is small enough to provide meaningful support for a "two-headed coin" hypothesis.

  4. OK, I hadn't seen the paper yet. I'm way behind on my science reading. The newborn is sleeping well for a newborn, but I'm still too sleep deprived to do much beyond read the news and views section of Science!

    (And since neither AIDS nor vaccines is directly relevant to my current work, this paper will sadly stay near the bottom of my "to read" pile.)


