The “p-value” is a significance statistic, going back to R. A. Fisher, that asks the question: “If I have a plot with n points on it, what is the probability that I would get this distribution by pulling n points out of a hat?” If that random probability is less than 0.05, then “classical statistics” people say, “Wow, that is significant!” Now, as Brian rightly points out, most people are not trained in classical statistics, but they do own computers. So when they have a data set, they try this formula, then that formula, then dropping a few points, until suddenly the computer flashes the magic “p < 0.05” and they write a paper. More and more papers are being published with p-values just barely below 0.05. Medicine is particularly notorious for this.
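The “pulling points out of a hat” idea can be made concrete as a permutation test: shuffle the y-values, re-pair them with the x-values, and ask how often random pairings look as correlated as the real data. A minimal sketch (the data here are made up for illustration):

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Fraction of random re-pairings whose |r| matches or beats the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(ys)  # pull the points "out of a hat"
        if abs(pearson_r(xs, ys)) >= observed:
            hits += 1
    return hits / n_shuffles

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # a clear linear trend
print(permutation_p_value(xs, ys))  # well below 0.05 for data this clean
```

Run once, honestly, this is a perfectly reasonable calculation; the trouble described below starts when it is run many times.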
But what does this mean? Is it “significant”?
Numerous statisticians have pointed out that not only is it not significant, it is actually erroneous. The first problem is that Fisher assumed you had one bag and were doing this random draw once. But if you have M bags, the probability that at least one of them shows p < 0.05 by pure chance grows roughly by a factor of M. So an honest statistician needs to factor in all the formulae he tried and all the data sets he looked at before he assigns “significance.” But they don’t. They don’t even realize that they are biasing the statistic this way.
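The multiple-bags effect is easy to quantify. If each of M independent tests has a 5% false-positive rate, the chance that at least one comes up “significant” by luck alone is 1 − (1 − 0.05)^M, and the standard honest fix (the Bonferroni correction) is to demand p < 0.05/M instead:

```python
# Chance that at least one of M independent tests gives p < 0.05
# purely by chance, and the Bonferroni-corrected threshold for each test.
alpha = 0.05
for m in (1, 5, 10, 20, 50):
    false_positive = 1 - (1 - alpha) ** m
    print(f"M={m:3d}: P(some p<0.05 by luck) = {false_positive:.3f}, "
          f"honest per-test threshold = {alpha / m:.4f}")
```

By M = 20 tries, a spurious “significant” result is more likely than not.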
For example, look at the way papers are published. Most experiments get negative results, and those never get published. Only the ones with positive results are published. But that creates a bias. There are thousands of substances that caused cancer in somebody’s lab rat, but hundreds of thousands that didn’t. Guess which ones get published? So suppose 100 experiments fed rats cyclamate, and only one of them produced cancer. Does cyclamate cause cancer? It does if the only published paper is the one showing a positive link.
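A toy simulation makes the point. Assume (as in the cyclamate example) that the substance has no real effect, so every study is a draw from the null distribution with a 5% false-positive rate. A handful of studies will still come up “significant,” and those are the ones that reach print:

```python
import random

rng = random.Random(42)
trials = 100

# Each null experiment has a 5% chance of a false positive (p < 0.05)
# even though the substance does nothing.
results = [rng.random() < 0.05 for _ in range(trials)]
positives = sum(results)

print(f"{positives} of {trials} null experiments came out 'significant'")
print(f"expected by chance alone: {0.05 * trials:.0f}")
# If only the positives are published, the literature contains nothing else.
```

The published record then shows a 100% positive rate for an effect that does not exist.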
But that error in p-value would be due to ignorance. What about malice? If we take our data set and squint at it long enough, we can usually find a trend. If we leave out a few data points (outliers, really), then the trend gets better, as does the R2 (the coefficient of determination, where a perfect fit is R2 = 1.0 and no fit at all is R2 = 0.0). This is known as cherry-picking, and unlike the p-value games above, most experimentalists are told that cherry-picking is a punishable crime. My physical chemistry class even had us use special notebooks with numbered pages so we couldn’t tear out a page of data and alter the results of an experiment. (At the time, I felt mildly insulted. Now I would insist everybody do this.)
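How much can quietly dropping a couple of “outliers” buy you? Quite a lot. A sketch with made-up data, using the fact that for a straight-line fit R2 is the square of the Pearson correlation:

```python
import statistics

def r_squared(xs, ys):
    """R^2 of a straight-line fit: the squared Pearson correlation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return (num / den) ** 2

# Made-up data: a rough trend with two inconvenient points.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 2.1, 9.0, 3.9, 5.2, 0.5, 7.1, 8.0]

full = r_squared(xs, ys)

# "Cherry-pick": quietly drop the two points that spoil the fit.
keep = [i for i in range(len(xs)) if i not in (2, 5)]
picked = r_squared([xs[i] for i in keep], [ys[i] for i in keep])

print(f"all points:    R^2 = {full:.3f}")
print(f"cherry-picked: R^2 = {picked:.3f}")
```

Two deleted points turn an unconvincing scatter into a near-perfect line, which is exactly why the numbered-notebook rule exists.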
What is the difference between R2 and p-value? Well, if you plot x vs y, all your data lie on a straight line, and you have a theory y = mx + b, then your p-value is < 0.05 and your R2 = 1.0. On the other hand, if your data look like a cumulus cloud and your theory is still y = mx + b, then your R2 drops toward 0.0 and your p-value rises. But wait, that cloud looks just like your Aunt Gertrude! The R2 doesn’t get any better, but now your p-value drops below 0.05, because what is the likelihood of that? It’s gotta mean something! Yet if you want your theory to predict y from x, your bizarre data is still useless; it has no predictive ability. And that is why p-values don’t really mean very much, because “significance” is in the eye of the beholder.
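The divorce between the two numbers shows up even without Aunt Gertrude: given enough points, a barely-there trend yields a tiny p-value while R2 stays near zero. A sketch on simulated data, using a large-sample normal approximation for the slope’s p-value (an assumption, in place of the exact t-distribution):

```python
import math
import random

def slope_test(xs, ys):
    """Least-squares slope, R^2, and a normal-approximation p-value for slope = 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    sse = syy - slope * sxy                 # residual sum of squares
    se = math.sqrt(sse / (n - 2) / sxx)     # standard error of the slope
    z = slope / se
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided, normal approximation
    return slope, r2, p

rng = random.Random(1)
n = 10_000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.08 * x + rng.gauss(0, 1) for x in xs]  # a real but tiny trend in heavy noise

slope, r2, p = slope_test(xs, ys)
print(f"R^2 = {r2:.4f}, p = {p:.2e}")  # "significant", yet nearly useless for prediction
```

The fit is statistically “significant” by any threshold, yet it explains well under 1% of the variance in y.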
This is where Brian starts to mutter about Bayesian priors and the need to dump classical statistics. The late Edwin Jaynes had the original rant on this topic.
So if we don’t cherry-pick, then we are stuck with low R2 values whenever the data show little trend. That’s when we pull out the magic p-value. So the p-value is a statistic of last resort. It can be ignorantly and discreetly manipulated to achieve “significance” without punishment. No wonder papers keep citing more and more p-values while R2 keeps dropping! It’s a sign of desperation, for publication, that is.
Follow UD News at Twitter!