The COVID-19 crisis offers a useful illustration:
Two groups of researchers recently reported a link between vitamin D and COVID-19 mortality—more vitamin D meant lower mortality. A Northwestern University researcher who reported this association advised that “it is clear that vitamin D deficiency is harmful, and it can be easily addressed with appropriate supplementation.”
Gary Smith, “Vitamin D and Covid-19: Is it Data or Noise?” at Mind Matters News
Cue the media frenzy. The trouble is, as statistics analyst Gary Smith of Pomona College points out:
One very big problem with such studies is the inevitability of chance patterns and correlations in large data bases. Even if COVID-19 deaths are randomly distributed among the population (and they surely aren’t), data mining will, more likely than not, discover a geographic cluster of victims…
Gary Smith, “Vitamin D and Covid-19: Is it Data or Noise?” at Mind Matters News
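Smith's claim about chance clusters is easy to check with a quick simulation. The sketch below is not from his article; it assumes a hypothetical city of 100 equal-sized blocks and 200 deaths, each assigned to a block purely at random, and counts how often some block ends up with at least three times the average. The function names, the block and death counts, and the "three times the average" threshold are all illustrative assumptions.

```python
import random

def simulate_city(n_blocks=100, n_deaths=200, rng=None):
    """Drop each death on a uniformly random block; return deaths per block."""
    if rng is None:
        rng = random.Random()
    counts = [0] * n_blocks
    for _ in range(n_deaths):
        counts[rng.randrange(n_blocks)] += 1
    return counts

def share_of_cities_with_cluster(n_trials=500, factor=3, seed=0):
    """Fraction of simulated cities in which at least one block has
    `factor` times the average number of deaths, by chance alone."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        counts = simulate_city(rng=rng)
        mean = sum(counts) / len(counts)
        if max(counts) >= factor * mean:
            hits += 1
    return hits / n_trials

share = share_of_cities_with_cluster()
print(f"Simulated cities with an apparent hot spot: {share:.0%}")
```

Under these assumed numbers, an apparent "hot spot" turns up in most of the simulated cities, even though no block is any riskier than another. That is the sense in which a data miner will, more likely than not, find a cluster.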
He created a fictitious city to illustrate the point:
The same argument scales up to cities within a country or countries within the world. Some cities will inevitably have higher COVID-19 rates than others; so will some countries. So what? Because random data contain geographic clusters, the identification of clusters is not necessarily meaningful. The association, after the fact, of these clusters with some characteristics of the area or the people living in the area is not convincing scientific evidence of anything.
Gary Smith, “Vitamin D and Covid-19: Is it Data or Noise?” at Mind Matters News
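The after-the-fact step can be sketched the same way. The example below is an illustration under stated assumptions, not Smith's own analysis: it invents random "death rates" for 20 hypothetical regions, then generates 100 equally random regional characteristics and reports the one that correlates best with the death rates. The region and trait counts, and the function names, are made up for the demonstration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_spurious_correlation(n_regions=20, n_traits=100, seed=0):
    """Search random, unrelated traits for the one that best 'explains'
    random death rates. Returns (trait index, correlation)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    death_rates = [rng.gauss(0, 1) for _ in range(n_regions)]
    best_trait, best_r = None, 0.0
    for trait in range(n_traits):
        # Each trait is pure noise, generated independently of death_rates.
        values = [rng.gauss(0, 1) for _ in range(n_regions)]
        r = pearson(death_rates, values)
        if abs(r) > abs(best_r):
            best_trait, best_r = trait, r
    return best_trait, best_r

trait, r = best_spurious_correlation()
print(f"Best of 100 meaningless traits: #{trait}, correlation {r:+.2f}")
```

Every trait here is noise by construction, yet the best of the hundred will usually show a correlation strong enough to look like a finding if it were reported on its own. Searching first and explaining afterward is what makes the result unconvincing.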
See also: Data mining: A plague, not a cure. It is tempting to believe that patterns are unusual and their discovery meaningful; in large data sets, patterns are inevitable and generally meaningless. (Gary Smith)