Were we discussing “inconsistent and unreliable” psychology studies? Statistician says he might know one reason why.
Nature tells us that “One-quarter of studies that meet commonly used statistical cutoff may be false”:
The plague of non-reproducibility in science may be mostly due to scientists’ use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University in College Station.
Johnson compared the strength of two types of tests: frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study. The strength of the results given by these two types of tests had not been compared before, because they ask slightly different types of questions.
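To see concretely how the two kinds of test can disagree, here is a minimal sketch (all numbers hypothetical) of a one-sided z-test with known variance, comparing a point null against a single assumed alternative. The frequentist p-value asks how surprising the data are under the null; the Bayes factor asks how much better the alternative explains the same data:

```python
import math

# Hypothetical data: n observations, known sigma, testing
# H0: mu = 0 against a simple (assumed) alternative H1: mu = mu1.
n, sigma = 25, 1.0
xbar = 0.45          # assumed sample mean
mu0, mu1 = 0.0, 0.5  # null mean and hypothetical alternative mean

# Frequentist: how unlikely is a result this extreme under H0?
z = (xbar - mu0) * math.sqrt(n) / sigma
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided P(Z >= z)

# Bayesian: how much more likely are the data under H1 than H0?
def loglik(mu):
    # Log-likelihood of the sample mean under a normal model
    return -n * (xbar - mu) ** 2 / (2 * sigma ** 2)

bayes_factor = math.exp(loglik(mu1) - loglik(mu0))

print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
print(f"Bayes factor (H1 vs H0) = {bayes_factor:.1f}")
# p comes out near 0.01 (comfortably "significant" at 0.05), yet the
# evidence for H1 over H0 is only about 12:1.
```

The point of the toy numbers: a result can clear the conventional 0.05 bar while the corresponding evidence ratio stays well below the 25:1 threshold Johnson argues for.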
By Bayesian standards, 17–25% of social science findings “are probably false,” Johnson thinks, and interestingly, he sees this as a bigger problem than biases and scientific misconduct.
That, of course, makes sense. People are likelier to rely on methods favourable to their beliefs (even if they have the guts to wonder, at times, why the world conforms so easily) than to engage in what they know to be misconduct. Human nature and all.
Revised standards for statistical evidence
Valen E. Johnson
Department of Statistics, Texas A&M University, College Station, TX 77843-3143
Edited by Adrian E. Raftery, University of Washington, Seattle, WA, and approved October 9, 2013 (received for review July 18, 2013)

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
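A rough way to see where those "25–50:1" and "100–200:1" figures come from is the one-sided z-test case, where (under the uniformly most powerful Bayesian test construction the abstract refers to) the Bayes factor attained at the classical critical value z is approximately exp(z²/2). The sketch below assumes that mapping, uses only the standard library (the inverse normal tail is found by bisection rather than scipy), and should be read as an illustration, not a reproduction of the paper's calculations:

```python
import math

def z_from_p(p):
    """One-sided normal critical value: find z with P(Z >= z) = p,
    by bisection on the tail probability (stdlib only)."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(mid / math.sqrt(2)) > p:
            lo = mid   # tail still too heavy: z must be larger
        else:
            hi = mid
    return (lo + hi) / 2

# Assumed mapping for one-sided z-tests: Bayes factor ~ exp(z**2 / 2)
# at the classical critical value.
for p in (0.05, 0.005, 0.001):
    z = z_from_p(p)
    print(f"p = {p}: z = {z:.3f}, evidence ~ {math.exp(z * z / 2):.0f}:1")
```

Under this mapping the conventional 0.05 level corresponds to evidence of only about 4:1, while 0.005 lands near 28:1 and 0.001 near 119:1, which is consistent with the 25–50:1 and 100–200:1 ranges the abstract recommends.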