Slim by chocolate — a severe case of goofed p-hacking

21 November, 2016 at 17:12 | Posted in Statistics & Econometrics | 2 Comments

Frank randomly assigned the subjects to one of three diet groups. One group followed a low-carbohydrate diet. Another followed the same low-carb diet plus a daily 1.5 oz. bar of dark chocolate. And the rest, a control group, were instructed to make no changes to their current diet. They weighed themselves each morning for 21 days, and the study finished with a final round of questionnaires and blood tests …

Both of the treatment groups lost about 5 pounds over the course of the study, while the control group’s average body weight fluctuated up and down around zero. But the people on the low-carb diet plus chocolate? They lost weight 10 percent faster. Not only was that difference statistically significant, but the chocolate group had better cholesterol readings and higher scores on the well-being survey.

I know what you’re thinking. The study did show accelerated weight loss in the chocolate group—shouldn’t we trust it? Isn’t that how science works?

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

Whenever you hear that phrase, it means that some result has a small p value. The letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data. The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?

P(winning) = 1 – (1 – p)^n

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.

John Bohannon

Statistical inferences depend on both what actually happens and what might have happened. And Bohannon’s (in)famous chocolate con more than anything else underscores the dangers of confusing the model with reality. Or as W.V.O. Quine had it: “Confusion of sign and object is the original sin.”

There are no such things as free-standing probabilities, simply because probabilities are, strictly speaking, only defined relative to chance set-ups: probabilistic nomological machines like coin flips or roulette wheels. And even these machines can be tricky to handle. Although prob(fair coin lands heads | I toss it) = prob(fair coin lands heads & I toss it) / prob(I toss it) may be well-defined, it’s not certain we can use it, since we cannot define the probability that I will toss the coin, given that I am not a nomological machine producing coin tosses.

No nomological machine – no probability.



  1. I am still going to be a chocolate enthusiast!

  2. I read this example differently. The p-value is supposed to be the probability of getting a result that (according to the rules of the statistics game) ‘is’ at least this significant when the null hypothesis is true. If one neglects the fact that the result was the ‘best’ of many then one is using the wrong p-value.

    Just because the result would have been ‘statistically significant’ under a different set of circumstances does not mean that it really is.
