I used to have a stock response for people who asked me to write an article about the amazing stamina-enhancing properties of eye of newt or toe of frog or whatever. “Send me the results of a peer-reviewed, randomized, double-blind trial,” I'd say, “and I’d be happy to write about it.” But then they started calling my bluff. In the same way that everything in your fridge both causes and prevents cancer, there’s a study out there somewhere showing that everything increases stamina.
A new preprint (a journal article that has yet to be peer-reviewed, ironically) from researchers at Queensland University of Technology in Australia explores why this seems to be the case and what can be done about it. David Borg and his colleagues sift through thousands of articles from 18 sports and exercise medicine journals and uncover eye-opening patterns in what gets published and, perhaps more importantly, what doesn't. To make sense of the studies you see and decide whether the latest hot performance aid is worth experimenting with, you also need to consider the studies you don’t see.
Traditionally, the threshold for success in studies has been a p-value of less than 0.05. This means the results of the experiment look so promising that there is only a 1-in-20 chance they would have occurred if your new miracle supplement had no effect at all. That sounds relatively simple, but the actual interpretation of p-values quickly becomes both complicated and controversial. By one estimate, a study with a p-value just under 0.05 actually has about a one-in-three chance of being a false positive. Worse, it gives the misleading impression that a single study can deliver a definitive yes/no answer.
As a result, scientists have tried to wean themselves off the “reign of the p-value.” Another way to present results is to use a confidence interval. If I tell you, for example, that Hutcho’s Hot Pills cut your ride time by an average of five seconds, that sounds good. But a confidence interval will give you a better idea of how reliable this result is: although the mathematical definition is nuanced, for practical purposes you can think of a confidence interval as the range of the most likely results. If the 95% confidence interval runs from two to eight seconds faster, that’s promising. If it runs from 25 seconds slower to 30 seconds faster, you’d assume there’s no real effect unless other evidence emerges.
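To make that concrete, here is a minimal Python sketch of how such an interval might be computed. The ten time savings are invented for illustration (there is no real data on Hutcho’s Hot Pills), the function name is hypothetical, and the calculation uses a normal approximation; with a sample this small, a t-distribution would give a somewhat wider interval.

```python
import math
import statistics


def mean_confidence_interval(samples, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(samples) / math.sqrt(n)
    # Critical value, about 1.96 for a 95% interval
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem


# Hypothetical time savings in seconds for ten riders (invented data)
savings = [5, 3, 7, 4, 6, 5, 8, 2, 5, 5]
low, high = mean_confidence_interval(savings)
print(f"mean = {statistics.fmean(savings):.1f} s, 95% CI = {low:.1f} to {high:.1f} s")
```

A narrow interval that stays well clear of zero is the encouraging case; an interval that straddles zero is the “no real effect unless other evidence emerges” case.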
The dangers of so-called p-hacking are well known, and the practice is often unintentional. For instance, when sports scientists were presented with sample data and asked what their next steps would be, they were much more likely to say they would recruit more participants if the current data was just outside statistical significance (p=0.06) than just inside it (p=0.04). Those kinds of decisions, where you stop collecting data as soon as your results seem meaningful, distort the whole literature in a predictable way: you end up with a suspicious number of studies with p just below 0.05.
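A small simulation can show why this kind of flexible stopping matters. The sketch below is an illustration under simplifying assumptions, not a reconstruction of any real study: the supplement truly does nothing, each study starts with 20 participants, and in the "flexible" design the researcher recruits 10 more only when the first look lands between p=0.05 and p=0.10. All of those numbers, and the function names, are invented for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def two_sided_p(samples):
    """Two-sided p-value for a z-test of mean 0 with known standard deviation 1."""
    z = fmean(samples) * math.sqrt(len(samples))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=20000, n_start=20, n_extra=10, seed=1):
    """Return (fixed-design, flexible-design) false positive rates under no true effect."""
    rng = random.Random(seed)
    fixed_hits = flexible_hits = 0
    for _ in range(n_studies):
        data = [rng.gauss(0, 1) for _ in range(n_start)]
        p = two_sided_p(data)
        if p < 0.05:
            # Significant at the first look: both designs stop and report it
            fixed_hits += 1
            flexible_hits += 1
        elif p < 0.10:
            # "Almost significant": only the flexible design recruits more people
            data += [rng.gauss(0, 1) for _ in range(n_extra)]
            if two_sided_p(data) < 0.05:
                flexible_hits += 1
    return fixed_hits / n_studies, flexible_hits / n_studies


fixed_rate, flexible_rate = simulate()
print(f"fixed design:    {fixed_rate:.3f}")
print(f"flexible design: {flexible_rate:.3f}")
```

The fixed design should produce false positives close to the nominal 5% of the time, while the flexible design can only add to that count, since every extra "significant" result comes from a study that initially missed the threshold.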
The use of confidence intervals is believed to help alleviate this problem by moving from the yes/no mindset of p-values to a more probabilistic perspective. But does it really change anything? This is the question Borg and his colleagues attempted to answer. They used a text-mining algorithm to extract 1,599 abstracts of studies that used some type of confidence interval to report their results.
They focused on studies whose results are expressed as ratios. For example, if you are testing whether Hutcho’s Hot Pills reduce your risk of stress fractures, an odds ratio of 1 would indicate that runners who took the pills were just as likely to be injured as runners who did not. An odds ratio of 2 would indicate that they were twice as likely to be injured; a ratio of 0.5 would indicate that they were half as likely to be injured. So you might see results like “an odds ratio of 1.3 with a 95% confidence interval between 0.9 and 1.7.” This confidence interval gives you a probabilistic idea of how likely the pills are to have any real effect.
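For readers who want to see where such numbers come from, here is a rough sketch of an odds ratio and its 95% confidence interval computed from a two-by-two table. The injury counts are made up, the function name is hypothetical, and the interval uses the standard large-sample (Wald) approximation on the log scale, which is only one of several methods a real study might use.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(exposed_yes, exposed_no, control_yes, control_no, level=0.95):
    """Odds ratio with a Wald confidence interval computed on the log scale."""
    odds_ratio = (exposed_yes * control_no) / (exposed_no * control_yes)
    # Approximate standard error of log(odds ratio)
    se_log = math.sqrt(
        1 / exposed_yes + 1 / exposed_no + 1 / control_yes + 1 / control_no
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high


# Invented counts: 30 of 100 pill-takers injured vs. 25 of 100 non-takers
ratio, low, high = odds_ratio_ci(30, 70, 25, 75)
print(f"odds ratio = {ratio:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

With these invented counts the interval spans 1, so the data are compatible with the pills having no effect on injury risk.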
But if you want a more black-and-white answer, you can also ask whether the confidence interval includes 1 (which it does in the previous example). If the confidence interval includes 1, which corresponds to “no effect,” this is roughly equivalent to saying that the p-value is greater than 0.05. So you might suspect that the same pressures that lead to p-hacking would also lead to a suspicious number of confidence intervals that barely exclude 1. This is precisely what Borg and his colleagues were looking for: upper confidence interval bounds between 0.9 and 1, and lower bounds between 1 and 1.2.
Sure enough, that’s what they found. In unbiased data, they calculate that you would expect about 15% of the lower bounds to be between 1 and 1.2; instead, they found 25%. Likewise, they found four times as many upper bounds between 0.9 and 1 as expected.
One way to illustrate these results is to plot something called the z-value, which is a statistical measure of the strength of an effect. In theory, if you plot the z-values of thousands of studies, you would expect to see a perfect bell curve. Most outcomes would cluster around zero, and fewer and fewer would show either very strongly positive or very strongly negative effects. Any z-value less than -1.96 or greater than +1.96 corresponds to a statistically significant result with p less than 0.05. A z-value between -1.96 and +1.96 indicates a null result, meaning one that is not statistically significant.
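For a ratio reported with a 95% confidence interval, a z-value can be approximated from the published numbers alone. The sketch below shows one common back-of-the-envelope approach, which assumes the interval is symmetric on the log scale; it is offered as an illustration and is not necessarily the exact method Borg and his colleagues used. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def z_from_ratio_ci(ratio, lower, upper, level=0.95):
    """Approximate z-value and two-sided p-value from a ratio and its confidence interval."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    # Width of the interval on the log scale is 2 * crit * standard error
    se_log = (math.log(upper) - math.log(lower)) / (2 * crit)
    z = math.log(ratio) / se_log
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# An odds ratio of 1.3 with a 95% confidence interval from 0.9 to 1.7
z, p = z_from_ratio_ci(1.3, 0.9, 1.7)
print(f"z = {z:.2f}, p = {p:.2f}")
```

For an odds ratio of 1.3 with an interval from 0.9 to 1.7, the z-value falls inside the range of -1.96 to +1.96, which matches the fact that the interval includes 1.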
In practice, the bell curve won’t be perfect, but you’d still expect a fairly smooth curve. Instead, here’s what you see if you plot the z-values of the 1,599 studies Borg analyzed:
There is a giant piece missing from the middle of the bell curve, where all the studies with non-significant results should be. There are likely many different reasons for this, driven both by decisions made by researchers and, just as importantly, by decisions made by journals about what to publish and what to reject. It’s not an easy problem to solve, because no journal wants to publish (and no reader wants to read) thousands of studies that conclude, over and over, “We’re not sure this works yet.”
One approach that Borg and his co-authors advocate is the wider adoption of registered reports, in which scientists submit their study plan to a peer-reviewed journal before performing the experiment. The plan, including how the results will be analyzed, is peer-reviewed, and the journal then promises to publish the results as long as the researchers stick to their stated plan. In psychology, they note, registered reports produce statistically significant results 44% of the time, compared to 96% for regular studies.
That sounds like a good plan, but it’s not an instant fix: the journal Science and Medicine in Football, for example, introduced registered reports three years ago but has yet to receive a single submission. In the meantime, it’s up to us (journalists, coaches, athletes, interested readers) to apply our own filters a little more diligently when presented with exciting new studies that promise easy wins. It’s a challenge I have wrestled with, and I have often fallen short. But I now keep this basic rule in mind: one study, on its own, means nothing.
For more Sweat Science, join me on Twitter and Facebook, sign up for the email newsletter, and check out my book Endure: Mind, Body, and the Curiously Elastic Limits of Human Performance.