From Dalmeet Singh Chawla at Nature:

Researchers are at odds over when to dub a discovery ‘significant’. In July, 72 researchers took aim at the P value, calling for a lower threshold for the popular but much-maligned statistic. In a response published on 18 September, a group of 88 researchers counters that a better solution would be to make academics justify their use of specific P values, rather than adopt another arbitrary threshold.

P values have been used as measures of significance for decades, but academics have become increasingly aware of their shortcomings and the potential for abuse. In 2015, one psychology journal banned P values entirely. More.

The P-value turns on the question of whether there is any relationship between the phenomena being measured: it is the probability, assuming no real relationship, of obtaining data at least as extreme as what was observed. A smaller number is stronger evidence against that assumption. The fact that reforms of the system are controversial suggests that, where statistics are concerned, in matters of doubt: Doubt.
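For readers new to the statistic, a toy illustration (the coin example is ours, not Nature’s): the p-value is the probability, under the null hypothesis (here, a fair coin), of a result at least as extreme as the one observed. A minimal sketch in Python:

```python
from math import comb

def binomial_p_value(heads, tosses, p=0.5):
    """Two-sided exact p-value: probability, under the null hypothesis
    of a fair coin, of a count at least as far from tosses/2 as the
    observed number of heads."""
    expected = tosses * p
    deviation = abs(heads - expected)
    total = 0.0
    for k in range(tosses + 1):
        if abs(k - expected) >= deviation:
            total += comb(tosses, k) * p**k * (1 - p)**(tosses - k)
    return total

# 60 heads in 100 tosses: suggestive, but not "significant" at 0.05
print(round(binomial_p_value(60, 100), 4))
```

Note that the p-value says nothing about how big or important the coin’s bias would be; it only measures surprise under the null.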

*See also:* The war over P-values is now a quagmire, but a fix is suggested

Deep problem created by Darwinian Ron Fisher’s p-values highlighted again

Early Darwinian Ronald Fisher’s p-value measure is coming under serious scrutiny

Misuse of p-values and design in life?

Rob Sheldon explains p-value vs. R2 value in research, and why it matters

If even scientists can’t easily explain p-values… ?

and

Nature: Banning P values not enough to rid science of shoddy statistics

One bit of constructive criticism: It might be best not to link to the Rob Sheldon piece on p-values; it has some issues, which wd400 points out in the comments. There’s no use in adding to the confusion regarding this matter.

Rob Sheldon replies to DaveS at 1:

Someone who likes p-values would, I imagine, really dislike my stereotyped slandering of the field. On the other hand, someone who has been burned by p-values would probably feel I didn’t say enough bad things about them.

And therein lies the problem. Math is not, as is widely advertised, a value-neutral enterprise that is best done by Mr. Spock. Rather, math is as emotional and subjective a field as, say, betting at the racetrack. Everybody starts with the same data, but no one agrees on the best strategy to win.

If DaveS doesn’t like my characterization, he is free to write his own. The Wikipedia article tries to draw a fine distinction between “improbability” and the statistics of the “null hypothesis.” Various complications can occur when the probabilities are asymmetric or non-Gaussian. One can put in a lot of fancy Bayesian math or build a case that Fisher’s reasoning was flawed, and yet still reach the same conclusion: no one likes what p-values have done to the published literature. Some blame the editors, some the authors, and some the p-value itself. If I had my druthers, I’d widen the blame to all of society.

Philosophically, the 19th century was the era of Gaussian “random” statistics–diffusion, evolution, population genetics–and the 20th century was the era of randomness dogma–Darwinism being the most visible, but also Materialism and Communism. The true innovations of the 20th century, however, were non-Gaussian–quantum mechanics, lasers, computer chips. It is only in the 21st century that these non-Gaussian, non-random, non-local discoveries are penetrating to the 20th century dogmas. This explains the dislocations, the cognitive dissonance generated by non-Darwinian genetics, or by QM computers, or by neural-net machine learning. And it is in this modern world that p-values are being seen for what they are–useless measures of an ideal 19th century materialistic world.

To Dr Sheldon,

Oh, don’t get me wrong. I understand there are very serious problems with p-values. I’m just saying that this particular post has some issues, even with basic nomenclature (for example, calling a “p-value” a “correlation statistic”).

I do think “useless” is taking it too far; I also don’t see what they have to do with materialism (or Communism?!) specifically.

Wow. Rob Sheldon should have done a bit of research before responding…

Probabilities are always non-Gaussian – just look at the range that they are defined over!

I can only guess that Sheldon means that the test statistics are non-Gaussian, but that’s only a problem if the tail probability is calculated assuming a Gaussian distribution. But even if the p-value is calculated properly (e.g. if the null distribution of the test statistic is derived through randomisation), the criticisms are still valid. After all, we’ve been using chi-squared and F distributions for a long, long time.
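The point about deriving the null distribution through randomisation can be made concrete. A minimal permutation test (the data values are invented for illustration): the statistic is a difference in group means, and its null distribution comes from relabelling the observations rather than from any Gaussian assumption.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=1):
    """Two-sided p-value for a difference in means, with the null
    distribution of the statistic derived by randomly relabelling
    the pooled observations (no Gaussian assumption)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a)/len(group_a) - sum(group_b)/len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a)/len(a) - sum(b)/len(b)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)   # add-one correction avoids p = 0

treated = [4.1, 5.0, 6.2, 5.8, 4.9]
control = [3.2, 3.9, 4.0, 3.5, 4.4]
print(permutation_p_value(treated, control))
```

Even computed this way, the p-value still invites all the familiar misuses (threshold-chasing, ignoring effect size), which is Bob’s point: the criticisms don’t hinge on Gaussianity.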

Fisher’s or Neyman & Pearson’s? FWIW, I think most professional statisticians are OK with p-values,

if they are used properly: they have their place, but it’s not the place a lot of scientists put them.

Huh? Population genetics wasn’t a 19th century phenomenon. I’m not sure what this “randomness dogma” is, or how it relates to the 19th century ideas of communism.

I’m not sure I’d say that “the 19th century was the era of Gaussian “random” statistics”: many of the notions of randomness were being developed during that century, but weren’t fully developed until the 20th century. Certainly statistics, as a mathematical subject of study, was in its infancy at the turn of the 20th century.

P-values are a very 20th century phenomenon – they were developed by Fisher, Neyman and Egon Pearson, all of whom were still a few years from university when the 20th century began.

I’ve read quite a bit of the p-value literature over the years, and have discussed the issues with a range of scientists and statisticians, and I’m afraid I don’t recognise Rob Sheldon’s characterisation of the issues at all.

Dean @ 4 –

Are you suggesting that everyone who mis-uses p-values is a Marxist, or that non-Marxists don’t mis-use p-values? Or that the mis-use of p-values is only a problem when Marxists do it?

Dean – I’ve got some good news for you. Marxists aren’t hiding under every bed. I’m not even aware of any Marxists in the p-values debate – if you know of any, please name names. I’m also not sure what you mean by “pushing for the relaxation of the p-value”. Do you mean that the threshold for “significance” should be relaxed (which would mean raised – something I haven’t seen suggested anywhere)?

Dean – read Oreskes’ piece again. It’s a US government agency that she’s suggesting wanted to relax the burden of proof. Also, is she actually a Marxist? And what do watermelons have to do with p-values?

Bob O’H (and anyone else interested),

Do you have any ideas for “solving” the problems around p-values that would be at least tolerable to those involved?

I don’t know enough about stats to have strong opinions on the matter, but going back to the OP, I have doubts that replacing current arbitrary thresholds with new arbitrary thresholds would accomplish much.

daveS – p-values certainly shouldn’t be banned, but a lot of people (myself included) want to see more focus on looking at parameters, or statistics that measure how important an effect is. The problem is that it would require (a) a large change in the culture of how science is taught and done, and (b) a deeper understanding of the models that are used in statistical inference.

Some people have suggested using AIC or Bayesian methods, but I don’t think they solve the problem, just change it (if a Bayesian approach is to solve the problem, it will do it by creating an emphasis on parameters, but that’s not a Bayesian issue per se).

Of course, Marxist revolution is another solution….

Bob O’H,

🤔

Yeah, maybe we should just go with that.

We will have to decide which Marx to follow, though.

Dean – I’m afraid it’s not obvious what you mean about watermelons, Delingpole, and Marxism. Can you be more specific? I’d rather not have to trawl through several pages of Google searches to find out what you mean.

I understood what Oreskes was trying to get at, and her argument is closer to what a lot of people argue should be done: look at the costs & benefits of different actions (this was formalised as decision analysis decades ago).

I’m guessing that a “watermelon” is a person who is green on the outside and red on the inside. Similar to the metaphors you see using oreos, bananas, apples, etc.

Dean – that blogpost doesn’t mention Oreskes. So please, show me some actual evidence that she is a Marxist. It’s really not enough, I’m afraid, to show that communists support the same cause (if you can’t see why that’s a problem, ask people at the Discovery Institute).

Bob O’H,

Regarding your post #13, would you mind elaborating a bit on your suggestion that more time be spent looking at parameters? I have only a very elementary understanding of stats, and IIRC, almost everything I learned in my one (!) stats class had to do with inferring parameters from statistics & hypothesis testing, so I don’t have any idea what “else” is involved.

And on the matter of the importance of effects, does this mean that a goal is to not only report p-values, but also establish whether an effect is consequential in some sense? For example, one researcher might find evidence that some drug causes a particular side effect, with a very low p-value, but it turns out that this side effect is “clinically” insignificant. Another researcher might find that the same drug causes a very serious side effect, but her/his study has a higher p-value. Therefore the second researcher’s study might be more worthy of publication. Is that the underlying issue that you are addressing?

daveS – the way a lot of people use stats is to use the p-value as a measure of importance, which it isn’t (it’s a measure of the adequacy of a model), so yes we want to look at whether an effect is consequential, and you’re right that a serious side effect with a non-significant p-value may be more important than a mild side effect. So, context matters.

Bob O’H:

I fully agree with you about p-values, especially your last remarks at #23.

P-value is a very important tool, but that means that it must be used correctly. Being a medical doctor, I certainly agree that p-value is misused and often abused in medical literature and in medical culture. Indeed, the whole statistical approach to data is often misused in medicine. That is a serious problem.

But the problem does not lie in thresholds, IMO. It lies in considering thresholds really important.

Most people in medicine are not really aware of the relationship between observed effect and statistical significance of the result, least of all of the fact that p-value is only a blend of effect size and sample size.
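The claim that the p-value is only a blend of effect size and sample size is easy to demonstrate. A minimal sketch (a two-sample z-test with known standard deviation, a simplification we choose for clarity; the numbers are illustrative): the same modest effect goes from “non-significant” to overwhelmingly “significant” purely by increasing n.

```python
from math import erf, sqrt

def z_test_p(diff, sd, n):
    """Two-sided p-value for an observed mean difference `diff` between
    two groups of size n each, with known standard deviation sd."""
    z = abs(diff) / (sd * sqrt(2 / n))
    phi = 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z
    return 2 * (1 - phi)

# The same modest effect (0.2 standard deviations) at three sample sizes:
for n in (20, 200, 2000):
    print(n, round(z_test_p(diff=0.2, sd=1.0, n=n), 8))
```

The observed effect never changes; only n does. That is exactly why a p-value, on its own, says nothing about whether an effect is clinically important.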

The problem is the lack of methodological awareness, rather than statistical culture. In most cases, the misuse is methodological, and implies not giving true attention to the epistemological context.

It is certainly true that no threshold of p-value should be considered really important. The relevance is different in each case, and thresholds are only conventional references. Debating whether to lower the threshold only means to stick to the idea that a threshold is just a gate to publication, transforming a scientific tool into a political issue.

It is certainly true that in medicine we often observe strong effects that do not reach “significance” because of the very small sample size. While they certainly need better confirmation, it would be utter folly to ignore them, especially when they are of potential interest. The only reasonable thing to do, in that case, is to increase the sample size, whenever possible.

The current tendency to look for “absolute truths” in medicine is, IMO, not a good thing. A lot of important knowledge in medicine starts with epidemiological observations that are far from being “absolutely reliable”, but that can be confirmed as very reasonable by further research.

There are other misuses of statistics and methodology in medical papers, of course, not only the obstinate faith in p-value thresholds. That is not an easy problem.

But I do believe in the power of statistical analysis, when performed objectively, and with a sincere desire to understand what is true, in the measure that it can be understood. I have seen very low p-values, even with small samples, even of the order of 10^-10 or less. In those cases, I believe that there can be no reasonable doubt that we are observing something.

Understanding what it is, of course, is quite another matter.

Dean_from_Ohio:

Your interesting argument from avionics is further proof that context is the most important thing in applied science: we must know what we want to know, and the degree of “safety” that we want to achieve in our understanding. As I believe that there is never any absolute truth in empirical sciences, the “degree of certainty” depends critically on the context and the aims we want to achieve.

Medicine is a strange science, where you often have to act even when you don’t know how to act. That’s why many times even some gross indication can be better than no indication at all.

The problem of beta error and statistical power is certainly bigger, in medicine, than the problem of alpha error and p-value thresholds. That’s what I was pointing at when I wrote that there is a serious danger of accepting the null hypothesis when the statistical power of the available data is not acceptable at all.
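The beta-error concern can be illustrated by simulation (the numbers are invented, not from any study): with a real half-standard-deviation effect but only 10 subjects per arm, most experiments never reach p < 0.05, so “accepting the null” on that basis would be wrong far more often than not.

```python
import random
from math import erf, sqrt

def z_p(diff, sd, n):
    """Two-sided p-value for a mean difference between two groups of
    size n each, assuming a known standard deviation sd."""
    z = abs(diff) / (sd * sqrt(2 / n))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def estimated_power(true_diff, sd, n, alpha=0.05, sims=5000, seed=2):
    """Fraction of simulated experiments, with a real effect present,
    in which the test reaches p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(true_diff, sd) for _ in range(n)]
        diff = sum(b) / n - sum(a) / n
        if z_p(diff, sd, n) < alpha:
            hits += 1
    return hits / sims

# A real half-standard-deviation effect, but only 10 patients per arm:
power = estimated_power(true_diff=0.5, sd=1.0, n=10)
print(round(power, 3))   # low power: most such studies miss the real effect
```

With power this low, a “non-significant” result is mostly a statement about the sample size, not about the absence of the effect, which is gpuccio’s point about increasing n rather than discarding the observation.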

I don’t know exactly what you are referring to when you speak of bad theories supported by a wrong use of statistics, but I would say that neo-darwinism is certainly not an example of that: neo-darwinism is not supported by any statistics; indeed, neo-darwinists seem to hate statistics and probabilities and to avoid them as much as possible.

ID, instead, is fully supported by statistics and correct hypothesis testing: as you certainly know, its p-values, in almost all contexts, are of the order of 10^-100 or less! 🙂