
Rob Sheldon explains p-value vs. R2 value in research, and why it matters

Rob Sheldon

Further to: Ecology explains less and less? (Researchers were “dismayed” to discover that “R2—a more informative statistical indicator—has been on the decline.”), physicist Rob Sheldon here:

The “p-value” is a Fisher correlation statistic that asks the question “If I have a plot with n-points on it, what is the probability that I would get this distribution by pulling n-points out of a hat?” If the “random probability” is less than 0.05, then “classical statistics” people say “Wow, that is significant!”. Now as Brian rightly points out, most people are not trained in classical statistics, but they own computers. So when they have a data set, they try this formula, that formula, they try dropping a few points, until suddenly their computer flashes the magic “p<0.05” and then they write a paper. More and more papers are being published with “just barely below 0.05” values of p. Medicine is particularly notorious for this.

But what does this mean? Is it “significant”?

Numerous statisticians have pointed out that not only is it not significant, but it is actually erroneous. The first problem is that Fisher assumed you had one bag, and you are doing this random thing once. But if you have M bags, then the probability you will randomly find a p<0.05 for one of those bags goes up by roughly a factor of M (as long as M is small). So an “honest” statistician needs to factor in all the formulae he tried, all the data sets he looked at before he assigns “significance”. But they don’t. They don’t even realize that they are biasing the statistic this way.
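The "M bags" arithmetic can be made concrete. The sketch below is only an illustration: it assumes the M tests are independent and that every null hypothesis is true, and the function names are made up for this example. The per-test correction shown is the standard Bonferroni adjustment, which the text above does not name.

```python
# Chance of at least one "significant" result among m independent tests
# when there is no real effect anywhere (assumption: independent tests).
ALPHA = 0.05

def familywise_false_positive_rate(m, alpha=ALPHA):
    """Probability that at least one of m null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(m, alpha=ALPHA):
    """Per-test cutoff that keeps the overall false-positive rate at or below alpha."""
    return alpha / m

for m in (1, 5, 20, 100):
    rate = familywise_false_positive_rate(m)
    print(f"M = {m:3d}: chance of a spurious p < 0.05 = {rate:.3f}, "
          f"honest per-test cutoff = {bonferroni_threshold(m):.5f}")
```

For small M the rate is close to 0.05 times M; by twenty tries it is already about 64 percent.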

For example, look at the way papers are published. Most experiments get negative results, and they never get published. Only those that get positive results are published. But that causes a bias. There are thousands of substances that caused cancer in somebody’s lab rat, but hundreds of thousands that didn’t. Guess which ones get published? So suppose 100 experiments fed rats cyclamate, and only one experiment caused cancer. Does cyclamate cause cancer? It does if the only published paper shows a positive link.
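The publication filter described above can be sketched as a toy simulation. It assumes the substance has no effect at all, so each experiment's p-value is drawn uniformly between 0 and 1 (true for a continuous test statistic under the null), and that only results below 0.05 are written up. The function name and the seed are arbitrary choices for this illustration.

```python
import random

def simulate_publication_filter(n_experiments=100, alpha=0.05, seed=0):
    """Run n null experiments and return the p-values that would be 'published'."""
    rng = random.Random(seed)
    # Under a true null hypothesis the p-value is uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_experiments)]
    # Only the "positive" results make it into print.
    return [p for p in p_values if p < alpha]

published = simulate_publication_filter()
print(f"{len(published)} of 100 no-effect experiments reached p < 0.05")
print("A reader of the journals sees only those, and none of the failures.")
```

On average about five of the hundred experiments pass the filter by luck alone, and the published record then contains nothing but positive links.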

But that error in p-value would be due to ignorance. What about malice? If we take our data set, and start squinting at it long enough, we can usually find a trend. If we leave out a few data points–outliers really–then the trend gets better, as does the R2 (also known as the correlation coefficient where a perfect trend is R2=1.0 and zero trend is R2=0.0). This is known as cherry-picking, and unlike the p-value above, most experimentalists are told that cherry picking is a punishable crime. My physical chemistry class even had us use special notebooks with numbered pages so we couldn’t tear out a page of data and alter the results of an experiment. (At the time, I felt mildly insulted. Now I would insist everybody do this.)

What is the difference between R2 and p-value? Well, if you plot x vs y, and all your data lie on a straight line and you have a theory y=mx+b, then your p-value is < 0.05 and your R2=1.0. On the other hand, if your data look like a cumulus cloud and your theory is still y=mx+b, then your R2 drops to 0.0 and your p-value rises. But wait, that cloud looks just like your Aunt Gertrude! The R2 doesn’t get better, but now your p-value drops below 0.05 because what is the likelihood of that–it’s gotta mean something! On the other hand, if you want your theory to predict y from x, then your bizarre data is still useless; it has no predictive ability. And that is why p-values don’t really mean very much, because “significance” is in the eye of the beholder.
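The straight-line and cumulus-cloud cases can be compared numerically. The following is a minimal sketch with several assumptions: the data are synthetic, the p-value is the one for the straight-line fit, and it is estimated by shuffling the y values (a permutation test) rather than from a formula. For a one-variable linear fit with an intercept, R2 equals the square of Pearson's r. All names here are illustrative.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Fraction of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real x-y link
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

rng = random.Random(42)
line_x = [float(i) for i in range(30)]
line_y = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in line_x]  # y = mx + b plus noise
cloud_x = [rng.gauss(0.0, 1.0) for _ in range(30)]              # shapeless cloud
cloud_y = [rng.gauss(0.0, 1.0) for _ in range(30)]

for name, xs, ys in (("line", line_x, line_y), ("cloud", cloud_x, cloud_y)):
    r2 = pearson_r(xs, ys) ** 2
    p = permutation_p_value(xs, ys)
    print(f"{name}: R2 = {r2:.3f}, permutation p = {p:.4f}")
```

The sketch only covers the linear fit. Whether a picture-shaped cloud yields a small p-value depends on which null hypothesis is being tested, which this example does not try to settle.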


This is where Brian starts to mutter about Bayesian priors and the need to dump classical statistics. The late Edwin Jaynes had the original rant on this topic.

So if we don’t cherry pick, then we are stuck with low R2 values when the data shows little trend. That’s when we pull out the magic p-value. So the p-value is a statistic of last resort. It can be ignorantly and discreetly manipulated to achieve significance without punishment. No wonder papers keep citing more and more p-values while R2 is dropping! It’s a sign of desperation–for publication, that is.


16 Replies to “Rob Sheldon explains p-value vs. R2 value in research, and why it matters”

  1. 1
    Jehu says:

    I disagree with your P value description. In medicine, to get a drug approved by the FDA, you need a statistically significant result as defined by a P value, which requires a large number of M. One experiment on one lab rat will not give a P value. 100 rats, 50 of which are controls, will give a P value if the medicine works.

    On the other hand, how climate scientists with the one earth find a P value for their statistics is a mystery to me. Maybe that is why they cannot predict anything.

  2. 2
    Mark Frank says:

    I am glad someone here has highlighted the well-known but ignored-in-practice problems with p-values and Fisherian hypothesis testing. Now if someone would just explain it to William Dembski.

  3. 3
    wd400 says:

    The list of things Rob Sheldon doesn’t know very much about grows…

    A p-value is not a correlation statistic; it’s actually the probability of observing data as extreme as, or more extreme than, the observed data if the null hypothesis were true. So, in the example he gives, the null hypothesis would be that y is not affected by x, and the shape of Aunt Gertrude makes no difference whatsoever.

    There’s lots wrong with the way people use p-values, but this sort of confusion doesn’t help.

  4. 4
    wd400 says:

    (well, I guess it could matter if your Aunt Gertrude is not very symmetrical, and her y actually does depend on x)

  5. 5
    CuriousCat says:

    The definition that Dr. Sheldon gives may not be the most comprehensive definition of the p-value, as pointed out by wd400, but I do not think it is incorrect. In terms of correlation between two variables (x and y), the null hypothesis is “x and y are not correlated”. Hence the p-value is obtained by assuming two independent variables distributed identically to x and y, and then asking what the probability is that the experimental distribution between x and y would be obtained from these two independent random variables. This is, I suppose, similar to what has been stated by Dr. Sheldon.

    On the other hand, in the example Dr. Sheldon gives, Aunt Gertrude on the graph DOES matter, IMO. Any nonrandomness on x&y graph means that x and y are nonlinearly correlated (which may be determined by mutual information). The predictive power of x on y may be low (low R2), but this possibly means that y depends on a, b, c, …z, all of which constantly are changing and none of which we are measuring.

    Overall, I totally agree with the misuse of statistics, particularly p-value, in many branches of science.

  6. 6
    bornagain77 says:

    correction to Mark Frank’s post,,,

    I am glad someone here has highlighted the well-known but ignored-in-practice problems with p-values and Fisherian hypothesis testing. Now if someone would just explain it to William Dembski Douglas Theobald & Nick Matzke.

    there all better,,,,

    Does Natural Selection Leave “Detectable Statistical Evidence in the Genome”? More Problems with Matzke’s Critique of Darwin’s Doubt – Casey Luskin August 7, 2013
    Excerpt: A critical review of these statistical methods has shown that their theoretical foundation is not well established and they often give false-positive and false-negative results.

    The following site gives an overview of the many problems of the statistical method that Theobald used to try to establish ‘statistical significance’ for common ancestry:

    Scientific method: Statistical errors – P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. – Regina Nuzzo – 12 February 2014
    Excerpt: “P values are not doing their job, because they can’t,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.,,,
    “Change your statistical philosophy and all of a sudden different things become important,” says Steven Goodman, a physician and statistician at Stanford. “Then ‘laws’ handed down from God are no longer handed down from God. They’re actually handed down to us by ourselves, through the methodology we adopt.”,,
    One researcher suggested rechristening the methodology “statistical hypothesis inference testing”, presumably for the acronym it would yield.,,
    The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce.,,,
    Neyman called some of Fisher’s work mathematically “worse than useless”,,,
    “The P value was never meant to be used the way it’s used today,” says Goodman.,,,
    The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.,,,
    “It is almost impossible to drag authors away from their p-values, and the more zeroes after the decimal point, the harder people cling to them”,,

  7. 7
    Acartia_bogart says:

    “the R2 (also known as the correlation coefficient..”

    Assuming he is referring to the r squared, then it is not also known as the correlation coefficient. Both the p value and the r squared are valid tools, but only when used appropriately and when all of the assumptions are met.

    For example, if I try to use an r squared as a measure of significance when I have ten points randomly but tightly clustered around the low end of a plot, and a single point several orders of magnitude higher on the plot, I am going to have a high r squared, but the real significance of the regression will be extremely low because it is being biased by the single point. If the r squared of just the lower points were estimated, it might be very low.
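The single-outlier scenario in comment 7 is easy to reproduce. This is a small sketch with made-up numbers: ten hand-picked points near (1, 1) and one point three orders of magnitude higher. The data and the helper name are assumptions for illustration only.

```python
def r_squared(xs, ys):
    """Square of Pearson's r, which equals R2 for a one-variable linear fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Ten points clustered tightly at the low end of the plot (hypothetical data).
cluster_x = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.8]
cluster_y = [1.1, 0.8, 1.0, 1.3, 0.7, 0.9, 1.2, 0.8, 1.0, 1.2]

# The same points plus one far-away observation.
with_outlier_x = cluster_x + [1000.0]
with_outlier_y = cluster_y + [1000.0]

r2_cluster = r_squared(cluster_x, cluster_y)
r2_with_outlier = r_squared(with_outlier_x, with_outlier_y)
print(f"r squared, cluster only: {r2_cluster:.3f}")
print(f"r squared, with outlier: {r2_with_outlier:.6f}")
```

The lone distant point fixes the fitted line almost by itself, so the r squared is near 1 even though the cluster shows almost no trend.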

  8. 8
    anthropic says:

    Jehu 1, FDA approval isn’t predicated merely on a drug’s effectiveness. Instead, it is predicated on a drug’s effectiveness compared with a placebo.

    Since placebos are getting more effective over time, this means that it becomes more and more difficult to get a drug approved. Thus, if a drug reduces pain 43 percent while a placebo reduces it 35 percent, the drug will not be approved because it didn’t beat the placebo decisively enough.

    That not only means fewer effective drugs, and more expensive drugs (to pay for all those that were disapproved), it also means that in some cases people get zero relief rather than 43 percent.

  9. 9
    wd400 says:


    The appearance of Aunt Gertrude on a graph may “matter” in some sense, but it won’t create a significant p-value in a linear regression (unless she is asymmetrical or perhaps leaning on a jaunty angle….)

  10. 10
    CuriousCat says:


    That’s the reason I said “nonlinear correlation” may exist between x and y. Conventional example: a unit circle on x-y plane gives a linear correlation of zero, but x and y are definitely correlated. You may use x to predict y, to some extent: if x = 0.5, then y is either sqrt(0.75) or -sqrt(0.75).

    And yes, it may create a significant p-value in a mutual information test. I am not talking about linear regression, here. Linear regression or correlation (and P-values associated with those methods) only check for linear relations.

    Say that you have a die and three buttons. Every time a 1 or 6 comes up, you press the middle button; when it is 2 or 5, you press the left one; when it is 3 or 4, you press the one on the right. If you plot these data on a graph, linear regression will not tell you anything. So one option is to use the mutual information.

    Finding p-values in nonlinear cases is not well-defined like those in linear cases. The best (easiest) method is to use bootstrap methods (resample your data again and again).

    So, appearance of Aunt Gertrude not only matters, but it is quantifiable and has some (more or less) predictive power.
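The unit-circle example in comment 10 can be checked directly. The sketch below assumes points spaced evenly around the circle (an illustrative choice) and shows that the linear correlation is zero to rounding error even though each x pins y down to one of two values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 360 points spaced evenly around the unit circle.
angles = [2.0 * math.pi * k / 360 for k in range(360)]
circle_x = [math.cos(t) for t in angles]
circle_y = [math.sin(t) for t in angles]

linear_r = pearson_r(circle_x, circle_y)
print(f"linear correlation on the circle: {linear_r:.2e}")

# Yet x constrains y exactly: y is +sqrt(1 - x^2) or -sqrt(1 - x^2).
x = 0.5
print(f"x = {x}: y is {math.sqrt(1 - x * x):.4f} or {-math.sqrt(1 - x * x):.4f}")
```

A linear fit sees nothing here, which is why a measure of nonlinear dependence such as mutual information is needed to detect the relationship.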

  11. 11
    wd400 says:

    OK, so we agree Sheldon was wrong in his example about the dangers of p-values, since he was talking about a linear model.

    (and, I think, that p-values have plenty of real dangers too)

  12. 12
    CuriousCat says:


    I cannot judge whether Sheldon was right or wrong. What he said was not for a statistical textbook, as I mentioned above, but he hits the main point, I guess. Returning to the topic, once a nonrandom looking cluster (including the outlier points mentioned by Acartia_bogart) is seen, one should be cautious with P-values and R2, since they are linear methods. Unfortunately, scientists in biology are not very cautious about these concepts, especially compared to physicists and engineers, since their main aim seems to be to publish fast and be highly cited. So submitting “positive results” to journals seems to be the best solution to this end.

  13. 13
    Mung says:

    On the bright side, Dr. Sheldon was not splaining population genetics.

  14. 14
    bornagain77 says:

    semi related:

    Everything we know is wrong, 26 August 2014
    Excerpt: “Every day the newspapers carry stories of new scientific findings. There are 15 million scientists worldwide all trying to get their research published. But a disturbing fact appears if you look closely: as time goes by, many scientific findings seem to become less true than we thought. It’s called the “decline effect” – and some findings even dwindle away to zero.
    A highly influential paper by Dr John Ioannidis at Stanford University called “Why most published research findings are false” argues that fewer than half of scientific papers can be believed, and that the hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. He even showed that of the 49 most highly cited medical papers, only 34 had been retested and of them 41 per cent had been convincingly shown to be wrong. And yet they were still being cited.
    Again and again, researchers are finding the same things, whether it’s with observational studies, or even the “gold standard” Randomised Controlled Studies, whether it’s medicine or economics. Nobody bothers to try to replicate most studies, and when they do try, the majority of findings don’t stack up. The awkward truth is that, taken as a whole, the scientific literature is full of falsehoods.”

  15. 15
    Dr JDD says:

    Semi-on-topic too:

    Just look at the comments, sickeningly ignorant and purely religious in “science” as their esteemed almighty god. Suggestions of giving people who retract papers awards for doing so, being “noble”, “heroes”, and “humble”.

    Really? It took someone independent to critique the paper before they retracted? Surely if they were heroic, humble and honest and noble scientists they would have performed and checked their own statistics in the first place?

    The Scientist first learned of possible problems with this analysis when the paper was under embargo prior to publication. At that time, The Scientist contacted Paul Pavlidis, a professor of psychiatry at the University of British Columbia who was not connected to the work, for comment on the paper. He pointed out a potential methodological flaw that could invalidate its conclusions. After considering the authors’ analyses, Pavlidis reached out to Meyer-Lindenberg’s team to discuss the statistical issues he perceived.

    They got caught with their pants down. It is hardly humble, honest and heroic to retract based on flawed analysis made public – it is the only correct thing to do (and legally should be given misinformation implied). Further, if they HADN’T retracted, they would be seen in a negative light given the open criticism. They had to – not out of heroism but to be seen as worthy of still being called scientists.

    The complete rubbish comments posted about this article just show the type of mindset that claims religious are the fanatics – hypocrites.

  16. 16
    johnnyb says:

    XKCD summed this up in a classic comic:
