# We’re Using a Common Statistical Test All Wrong. Statisticians Want to Fix That.

P-values, or probability values, are “commonly used to test (and dismiss) a ‘null hypothesis’, which generally states that there is no difference between two groups, or that there is no correlation between a pair of characteristics. The smaller the *P* value, the less likely an observed set of values would occur by chance — assuming that the null hypothesis is true. A *P* value of 0.05 or less is generally taken to mean that a finding is statistically significant and warrants publication.” (1)
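That definition can be made concrete with a hypothetical coin-flip example (not drawn from the article): the one-sided p-value for an observed count of heads is the probability, assuming the coin is fair (the null hypothesis), of seeing a result at least that extreme.

```python
from math import comb

# Hedged illustration: the null hypothesis is that the coin is fair
# (p_null = 0.5). The p-value is the probability of a result at least
# as extreme as the one observed, computed under that assumption.
def binomial_p_value(heads, flips, p_null=0.5):
    """P(at least `heads` successes in `flips` trials) under the null."""
    return sum(comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
               for k in range(heads, flips + 1))

p = binomial_p_value(16, 20)  # ≈ 0.0059 — below 0.05, so "significant"
```

Note what the calculation does and does not say: it gives the chance of the data given the null hypothesis, not the chance that the null hypothesis is true.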

Moved by growing concerns about the reproducibility of research, the American Statistical Association (ASA) issued a statement in March 2016 to address the widespread misuse of p-values. (1)

In this article, Retraction Watch supplies a set of six principles pulled from the statement and shares its interview with Ron Wasserstein, the ASA’s executive director.

ASA Statement on Statistical Significance and P-values, on behalf of the American Statistical Association Board of Directors

Perhaps chief among the six principles is the third: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”

Beyond the misunderstandings, Wasserstein was troubled by devious applications of the statistic.

“What concerns us even more are the misuses, particularly the misuse of statistical significance as an arbiter of scientific validity. Such misuse contributes to poor decision making and lack of reproducibility and ultimately erodes not only the advance of science but also public confidence in science.”

### References

- Baker M. Statisticians issue warning over misuse of P values: Policy statement aims to halt missteps in the quest for certainty. *Nature* 531(7593): 151, 2016.
- Wasserstein RL and Lazar NA. The ASA’s statement on p-values: context, process, and purpose. *The American Statistician*, 2016.

### Comments on the P-Value Crisis


This is great information to know. When reading a study, we can now treat the p-value as one correlate of quality rather than a direct gold standard of the study's value.

Ron Wasserstein's statement of the ASA's second principle:

"P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone."

This illustrates the misuse of the p-value in simple terms that even I can understand. Such misuse is prevalent in medical research. It's time to stop committing a Vizzini blunder!
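One way to see why a p-value is not the probability that the null hypothesis is true is a back-of-envelope Bayes calculation. This is a hedged toy sketch with made-up numbers — the prior, the statistical power, and the alpha level are all assumptions, not figures from the article:

```python
# Hypothetical numbers throughout: even when p < 0.05, the chance that the
# null hypothesis is true depends on the prior odds and the study's power,
# neither of which the p-value alone provides.
def prob_null_given_significant(prior_true, alpha=0.05, power=0.8):
    """P(null is true | result was significant), via Bayes' rule."""
    p_significant = prior_true * power + (1 - prior_true) * alpha
    return (1 - prior_true) * alpha / p_significant

# If only 10% of the hypotheses a field tests are actually true:
prob_null_given_significant(0.10)  # ≈ 0.36, far above the 5% many assume
```

Under these assumptions, more than a third of "significant" findings would be false positives — which is exactly the gap between "p < 0.05" and "probably true" that the second principle warns about.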

"A p-value, or statistical significance, does not measure the size of an effect or the importance of a result."
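A quick sketch of how significance and effect size come apart (all numbers here are hypothetical, not from the article): the same tiny effect is "not significant" in a small sample and highly "significant" in a large one, even though its practical importance is unchanged.

```python
from statistics import NormalDist

# Hedged sketch with made-up numbers: a one-sample two-sided z-test of a
# mean difference against zero. Only the sample size changes below.
def z_test_p_value(mean_diff, sd, n):
    """Two-sided p-value for testing mean_diff against 0."""
    z = mean_diff / (sd / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small effect (0.02 standard deviations), two sample sizes:
small_n = z_test_p_value(mean_diff=0.02, sd=1.0, n=100)      # ≈ 0.84
large_n = z_test_p_value(mean_diff=0.02, sd=1.0, n=100_000)  # ≈ 2.5e-10
```

The effect is identical in both cases; only the sample size pushed the p-value across the 0.05 threshold. That is why a p-value cannot stand in for a measure of effect size.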

One of the beautiful things about the CrossFit approach to quantifying fitness/work capacity is that it obviates the need for the 'post modern' scientific approach, as Coach Glassman put it on a recent comment thread. If you cut your time on a benchmark WOD in half, you've doubled your work capacity. No need to complicate it any further than that with P-values, effect sizes, regression analyses etc. Newtonian kinematics, it turns out, is more than sufficient as an analytic tool!
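The commenter's arithmetic can be sketched directly (the workout figures below are made up for illustration): if the work done in a benchmark workout is held fixed, average power — one way to quantify work capacity — is work divided by time, so halving the time doubles it.

```python
# Hedged sketch with hypothetical numbers: work capacity as average power.
def average_power(work_joules, time_seconds):
    """Average power output: work done divided by time taken."""
    return work_joules / time_seconds

before = average_power(200_000, 600)  # same benchmark in 10 minutes
after = average_power(200_000, 300)   # same work done in 5 minutes
ratio = after / before                # 2.0: half the time, double the power
```

No p-value is involved because nothing is being estimated from a sample; the quantity is measured directly, which is the commenter's point.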

It's also one of the beautiful things about determining success in terms of pure, functional metrics - do you perform better, feel better, etc. Science is simpler, more compelling, and probably less likely to err when we define our outcomes this simply. We can clearly file interventions/approaches/etc. into meaningful buckets - does it work, does it not. And then either way we can move forward. We might make a mistake on occasion, but we can iterate and adapt so freely the cost of those mistakes is minimized.

Obviously, big chunks of science aren't this clear... though that may be a critique of how research is done, not a fact of nature. Researchers try to precisely quantify the impact of very specific inputs on very specific outputs, and that often REQUIRES statistics because obvious results simply don't exist.

But I'd hope the output of this statistical discussion among the scientific community is not a new way of looking at statistics, but a step back to a scientific process where (as Wasserstein puts it) "thoughtful statistical and scientific reasoning" determine relevance - and that research that only looks significant because it passed a p-value threshold is minimized, not magnified. I think this could lead to better-designed studies, to boot.