Further to “Everyone seems to know now that there’s a problem in science research today” and “At a British Medical Journal blog, a former editor says medical research is still a scandal,” Ronald Fisher’s p-value measure, a staple of research, is coming under serious scrutiny.
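For readers who want the idea in concrete terms: a p-value is the probability, assuming chance alone is at work, of getting a result at least as extreme as the one observed. Here is a minimal sketch in Python of one classic way to compute it, a permutation test; the data and numbers are invented purely for illustration:

```python
# Minimal sketch of what a p-value measures, via a two-sample
# permutation test. All data here are invented for illustration.
import random

random.seed(1)
group_a = [5.1, 4.8, 5.4, 5.0, 5.2]  # hypothetical control measurements
group_b = [5.6, 5.9, 5.3, 5.8, 5.7]  # hypothetical treatment measurements

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_b) - mean(group_a)

# Under the null hypothesis the group labels are meaningless, so shuffle
# the pooled data and count how often chance alone yields a difference
# at least as large as the one observed.
pooled = group_a + group_b
trials = 10_000
hits = 0
for _ in range(trials):
    random.shuffle(pooled)
    if abs(mean(pooled[5:]) - mean(pooled[:5])) >= abs(observed):
        hits += 1

print(f"observed difference: {observed:.2f}, p = {hits / trials:.3f}")
```

A small p (conventionally below .05) is read as “chance alone would rarely produce this,” which is exactly the reading that the practices described below abuse.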
Many will remember Ronald Fisher (1890–1962) as the early twentieth-century Darwinian who reconciled Darwinism with Mendelian genetics, hailed by Richard Dawkins as the greatest biologist since Darwin. His original idea of p-values (a measure of whether an observed result can be attributed to chance) was reasonable enough, but over time the dead hand got hold of it:
Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”
Such practices have the effect of turning discoveries from exploratory studies — which should be treated with scepticism — into what look like sound confirmations but vanish on replication. Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%. P-hacking is especially likely, he says, in today’s environment of studies that chase small effects hidden in noisy data. It is tough to pin down how widespread the problem is, but Simonsohn has the sense that it is serious. In an analysis, he found evidence that many published psychology papers report P values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant P values until they found one.
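The inflation Simonsohn describes is easy to reproduce. Below is a small, self-contained simulation of one tactic named in the quote, monitoring data while it is being collected (optional stopping); the sample sizes and peek schedule here are arbitrary illustrative choices, not his. Both groups are drawn from the same distribution, so every “significant” result is a false positive:

```python
# Rough simulation of optional stopping ("monitoring data while it is
# being collected"). There is NO real effect: both groups come from the
# same distribution. Peek schedule and sizes are arbitrary choices.
import math
import random

random.seed(2)

def p_value(a, b):
    """Two-sided z-test for equal means, unit variance assumed known."""
    n = len(a)
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2.0 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_study(peek_every=10, max_n=100):
    a, b = [], []
    while len(a) < max_n:
        for _ in range(peek_every):
            a.append(random.gauss(0, 1))
            b.append(random.gauss(0, 1))
        if p_value(a, b) < 0.05:  # peek, and stop as soon as it "works"
            return True
    return False

studies = 5_000
hacked = sum(one_study() for _ in range(studies)) / studies
print(f"false-positive rate with optional stopping: {hacked:.1%}")
```

Even though each individual peek uses the nominal .05 threshold, stopping at the first “significant” look drives the overall false-positive rate to several times the advertised 5% in this toy setup.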
It all ended in scandals. In some cases, you might just as well have interviewed the researchers about their private opinions of certain types of people as have bothered to read their studies. They might as well have been tweeting for popular media about cereal ads. Oops. No, wait: even the media can get too big a dose of that kind of thing.
Some want to introduce Bayesianism (plausibility as the measure; see the sketch after the next excerpt) or a combination of Bayesian methods and p-values. But here’s the big hurdle:
Any reform would need to sweep through an entrenched culture. It would have to change how statistics is taught, how data analysis is done and how results are reported and interpreted. But at least researchers are admitting that they have a problem, says Goodman. “The wake-up call is that so many of our published findings are not true. We just don’t yet have all the fixes.” More.
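As for the Bayesian option mentioned above, here is a minimal sketch of the contrast, using invented coin-flip data (61 heads in 100 tosses): the p-value clears the conventional .05 bar, while a simple Bayes factor comparing “fair coin” against “unknown bias with a uniform prior” says the evidence is weak. Both the prior and the data are illustrative choices, not a prescription:

```python
# Contrast a p-value with a Bayes factor on invented data: k heads in
# n coin flips. Assumptions: H0 is a fair coin (theta = 0.5); H1 puts
# a uniform prior on the bias theta.
from math import comb

n, k = 100, 61

# Two-sided binomial p-value: probability, under H0, of a count at
# least as far from n/2 as the observed one.
extreme = abs(k - n / 2)
p_value = sum(comb(n, i) * 0.5**n
              for i in range(n + 1)
              if abs(i - n / 2) >= extreme)

# Bayes factor BF01 = P(data | H0) / P(data | H1). With a uniform
# prior, the marginal likelihood of any count k is 1/(n+1), so:
bf01 = comb(n, k) * 0.5**n * (n + 1)

print(f"p-value = {p_value:.3f}")  # about 0.035: "significant" at .05
print(f"BF01    = {bf01:.2f}")     # about 0.7: H1 favored only weakly
```

The same data that are “significant” by Fisher’s yardstick move the plausibility needle only slightly, which is one reason combining the two measures is less straightforward than it sounds.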
Hey, for now, look on the bright side. It still matters to a lot of people that so many of the published findings are not true! So there is hope.
See also: If peer review is still working, why all the retractions?
Hat tip: Stephanie West Allen at Brains on Purpose
Follow UD News at Twitter!