Is p-hacking acceptable for exploratory data analysis? (Part 1)

The problem of multiple comparisons is well established in statistics. The multiple comparisons problem occurs when many(!) hypothesis tests are performed on the same dataset. By doing so, the chance of a false positive result (i.e., declaring a result statistically significant when it is not) increases.

For example, if 100 hypothesis tests are each performed at the 5% level, the family-wise error rate is 1-(1-0.05)^{100} \approx 99.4\%. That is, the significance level for the entire set of tests is not 5% but 99.4%; almost a certainty.
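A quick sketch of that arithmetic (my own illustration, not from the article being discussed), including what a Bonferroni correction does to the family-wise rate, assuming independent tests with all null hypotheses true:

```python
alpha = 0.05      # per-test significance level
m = 100           # number of hypothesis tests

# Probability of at least one false positive across all m tests,
# assuming every null hypothesis is true and the tests are independent.
fwer = 1 - (1 - alpha) ** m
print(f"Family-wise error rate for {m} tests: {fwer:.1%}")   # ~99.4%

# A Bonferroni correction tests each hypothesis at alpha / m,
# which keeps the family-wise error rate near alpha.
fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(f"Family-wise error rate with Bonferroni: {fwer_bonferroni:.1%}")  # ~4.9%
```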

A few months ago I found a journal article where the authors performed more than 100 hypothesis tests without correcting for multiple comparisons. They found around 5 (I can’t recall exactly) significant results using a 5% significance level. This result is consistent with the assumed 5% error rate (i.e., with 100 tests at the 5% level, we would expect about 5 false positives even when every null hypothesis is true). I pointed this out in a letter to the editor, and the authors rebutted my criticism in several ways. Their main argument was that correcting for multiple comparisons is only needed for confirmatory analyses and not exploratory ones, and their work was exploratory. They cited a few papers to support their view.
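To make the "about 5 out of 100" intuition concrete, here is a rough simulation (my own, not the authors' analysis): run 100 two-sample t-tests on pure-noise data and count how many reach p < 0.05. The group sizes and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_per_group = 100, 30
false_positives = 0

for _ in range(n_tests):
    # Both groups are drawn from the same distribution, so every null is true.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"Significant results at the 5% level: {false_positives} of {n_tests}")
# Typically around 5 -- the same pattern described above.
```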

One article they cited is by Bender and Lange. I need to digest this more, but maybe I’ve been too strict in my application of multiple comparison corrections. More to come.
