Tuesday, November 17, 2009

What is the P-Value of Bullshit? Part Two

Another guest post from Eric

Recall that in part one of "What is the P-Value of Bullshit?" we did a thought experiment, called "Study #3," in which we encountered a measurement, four heads in a row, with a p-value as low as 0.06, which was nonetheless almost certainly due to random statistics.

Study #4: In Which the Naive Interpretation of P-value Is Partially Redeemed

Surely there must be at least some scenarios in which we can interpret the p-value as "probability of our result being bullshit"? Well, yes, and here's one: Suppose our jar of 1000 pennies now contains 500 two-headed pennies and 500 ordinary pennies. So now a two-headed penny becomes a mundane hypothesis, no longer a "way out there" sort of thing. Suppose we pick a penny at random, and do our four flips and get four heads, a p=0.06 result.

Q4: What's the probability now that we are not holding a two-headed penny?
       The arithmetic goes [(1/2)(1/16)] / [(1/2) + (1/2)(1/16)] = 1/17 = 0.0588

A4: Almost 6%, about the same number as our p-value!
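The arithmetic for Study #4 is just Bayes' rule, and it is easy to check by machine. Here is a minimal sketch in Python. The function name and its parameters are made up for this illustration, and it assumes the penny is drawn uniformly at random from the jar and that ordinary pennies are fair.

```python
from fractions import Fraction


def prob_ordinary_given_heads(n_two_headed, n_ordinary, n_flips):
    """Probability the penny is ordinary, given n_flips heads in a row.

    Assumes a uniform random draw from the jar and fair ordinary pennies.
    """
    total = n_two_headed + n_ordinary
    prior_two_headed = Fraction(n_two_headed, total)
    prior_ordinary = Fraction(n_ordinary, total)

    # A two-headed penny always shows heads; a fair one does so with (1/2)^n.
    like_two_headed = Fraction(1)
    like_ordinary = Fraction(1, 2) ** n_flips

    evidence = prior_two_headed * like_two_headed + prior_ordinary * like_ordinary
    return prior_ordinary * like_ordinary / evidence


# Study #4: 500 two-headed pennies, 500 ordinary ones, four heads in a row.
study4 = prob_ordinary_given_heads(500, 500, 4)
print(study4, float(study4))  # 1/17, about 0.0588
```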

So in Study #4, in which we test a "mundane" hypothesis, your old thesis adviser's naive interpretation of p-value works pretty well.

As a general rule, in order to believe a measurement is real, you should look for a p-value that is small compared to how "out there" the result you're trying to confirm is. If you're testing a mundane hypothesis that is as likely to be true as not, then p=0.05 is probably good enough for you. But if you are trying to confirm that you have hit some one-in-a-thousand two-headed jackpot, then you'd better wait until you get a p-value of safely less than 1/1000 (e.g., better flip 13 heads in a row, not just four!). Incidentally, the philosophy of this post, and of this approach to hypothesis confirmation, is based on Bayesian statistics.
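To see the sliding criterion at work, here is a rough sketch that compares a mundane prior with a long-shot prior for several run lengths. The name prob_fluke is hypothetical, and the long-shot case assumes "one-in-a-thousand" means one two-headed penny in a jar of 1000.

```python
from fractions import Fraction


def prob_fluke(prior_two_headed, n_heads):
    """Probability that a run of n_heads heads came from an ordinary penny."""
    prior_ordinary = 1 - prior_two_headed
    p_value = Fraction(1, 2) ** n_heads  # chance a fair penny produces the run
    # A two-headed penny produces the run with probability 1.
    return prior_ordinary * p_value / (prior_two_headed + prior_ordinary * p_value)


cases = [
    ("mundane (1 in 2)", Fraction(1, 2)),
    ("long shot (1 in 1000)", Fraction(1, 1000)),  # assumed jar composition
]
for label, prior in cases:
    for n in (4, 10, 13):
        p_value = float(Fraction(1, 2) ** n)
        fluke = float(prob_fluke(prior, n))
        print(f"{label}, {n} heads: p-value {p_value:.5f}, P(fluke) {fluke:.3f}")
```

Under these assumptions, four heads with the long-shot prior leaves the chance of a fluke above 98%, while 13 heads brings it down to roughly 11%. The same p-value means very different things depending on where you started.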

You might complain that a sliding criterion for adequate p-value makes the believability of a statistical measurement a matter of subjective judgment. After all, usually we don't know ahead of time that we are fishing for a precisely one-in-a-thousand payoff, and we can only estimate how far "out there" our original hypothesis is.

My response to your complaint: tough tootsie roll. No one ever said doing science was going to be easy. You can blindly apply p-value analysis, and be a hack, or you can bring some careful thought to a problem, and be a real scientist. And speaking of hacks...

For a real-life example, let's go back to that NYT story, which had to do with two candidate AIDS vaccines. Each vaccine had been previously tested and shown quite decisively to be ineffective. The US Army and the NIH jointly decided to sponsor a placebo-controlled, Phase III human subjects trial on the use of the two vaccines in combination. Has the idea of using two vaccines in combination, when each is shown to be ineffective on its own, ever worked?

Dozens of AIDS scientists protested that this hypothesis was such a long shot that testing it amounted to a huge waste of AIDS-battle resources. Was it a one-in-a-thousand long shot, like our two-headed penny hypothesis? Who knows? But in any case it was surely one-in-25 or worse. The NIH and the Army pushed ahead, lined up 16,000 volunteers and spent $100 million, and in the end published a p = 0.04 result* claiming that the combined vaccine worked a wee little bit, providing immunity to only one in every three who got the full combined dose.

Does p = 0.04 mean that the probability that the published result is due to statistical noise is only 4%? The scientists interviewed in the NYT seemed to think so, but alas the study is most likely 100 million dollars' worth of statistical noise: bullshit.
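For a rough feel of the numbers, here is a back-of-the-envelope sketch that plugs the one-in-25 prior into Bayes' rule. It is a simplification, not an analysis of the actual trial: it treats 0.04 as the chance that an ineffective vaccine passes the significance bar, and "power" (the chance that a working vaccine passes) is a quantity I am guessing at, not something reported in the story.

```python
def prob_noise(prior_true, alpha, power):
    """Probability that a 'significant' result is noise, by Bayes' rule.

    prior_true: probability the hypothesis is true before the study
    alpha:      chance a false hypothesis still yields a significant result
    power:      chance a true hypothesis yields a significant result (a guess here)
    """
    false_alarm = (1 - prior_true) * alpha
    real_hit = prior_true * power
    return false_alarm / (false_alarm + real_hit)


# Prior of one in 25, p = 0.04, and a few guessed values for the power.
for power in (1.0, 0.8, 0.5):
    print(f"power {power:.1f}: P(noise) = {prob_noise(1 / 25, 0.04, power):.2f}")
```

With a one-in-25 prior, the chance of noise comes out near even when the power is perfect and above one half for anything less. A longer-shot prior pushes it higher still.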

At this very moment, somewhere in the world a scientist is testing a long-shot hypothesis: does eating a diet of only artichokes cure breast cancer? Are red-headed children more responsive to acetaminophen? There are thousands of such investigations going on. They are long shots, but every once in a while a seemingly bizarre hypothesis turns out to be true, so what the hell, no harm in checking it out?

Problem is, with many thousands of long-shot studies going on at any one time, by random chance you will get hundreds of "p = .04" results supporting hypotheses that are in fact incorrect. If you're from the naive school of p-value interpretation, you'll celebrate your p = 0.04 result by publishing a paper, or better, holding a press conference!

And if you are a stats-challenged science journalist, you'll write the bullshit up for the New York Times.
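To put rough numbers on the many-long-shots problem, here is a small sketch that counts the expected "significant" results across a pile of studies. Every input is an illustrative guess on my part (5000 studies, one hypothesis in a thousand actually true, 80% power), and the function name is made up for this post.

```python
def expected_significant_results(n_studies, frac_true, alpha, power):
    """Expected counts of 'significant' results among n_studies studies.

    frac_true: fraction of the hypotheses under test that are actually true
    alpha:     chance that a false hypothesis passes the significance bar
    power:     chance that a true hypothesis passes the significance bar
    Returns (expected false positives, expected true positives).
    """
    false_hits = n_studies * (1 - frac_true) * alpha
    true_hits = n_studies * frac_true * power
    return false_hits, true_hits


# All four inputs below are illustrative guesses, not measured figures.
false_hits, true_hits = expected_significant_results(
    n_studies=5000, frac_true=0.001, alpha=0.04, power=0.8
)
share_bogus = false_hits / (false_hits + true_hits)
print(f"bogus: {false_hits:.0f}, real: {true_hits:.0f}, share bogus: {share_bogus:.2f}")
```

Under these guesses, about 200 of the roughly 204 expected "p = 0.04" results support hypotheses that are false.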

*The "p=0.04" number actually comes from a fairly "aggressive" analysis. Playing more strictly by the rules, the study's authors got a still-less-impressive p=0.15.

**Thanks to Jonathan Golub of Slog for providing the point-of-departure for this two-part post. Always lively, readable and informative, Golub is, along with BMGM herself, one of my favorite bloggers on science and science policy. Like all prolific science writers, he has on rare occasions oversimplified and on very rare occasions, totally screwed up.

1 comment:

  1. Statistics in the wrong hands can be really dangerous. I've always looked at health studies with a fair amount of skepticism (which is, unfortunately, not backed up by any deep knowledge of statistics); otherwise I tend to find myself getting whiplash from all the new findings that come out.

    By the way, I came to see you talk at the physics colloquium when I was a grad student at Cornell...and really enjoyed it. Cool stuff!
