The most common analytical method within population genetics is deeply flawed, according to a new study from Lund University in Sweden. This may have led to incorrect results and misconceptions about ethnicity and genetic relationships. The method has been used in hundreds of thousands of studies, affecting results within medical genetics and even commercial ancestry tests. The study is published in Scientific Reports.
The rate at which scientific data can be collected is rising exponentially, leading to massive and highly complex datasets, dubbed the “Big Data revolution.” To make these data more manageable, researchers use statistical methods that aim to compact and simplify the data while still retaining most of the key information. Perhaps the most widely used method is called PCA (principal component analysis). By analogy, think of PCA as an oven with flour, sugar and eggs as the data input. The oven may always do the same thing, but the outcome, a cake, critically depends on the ingredients’ ratios and how they are combined.
“It is expected that this method will give correct results because it is so frequently used. But it is neither a guarantee of reliability nor produces statistically robust conclusions,” says Dr. Eran Elhaik, Associate Professor in molecular cell biology at Lund University.
According to Elhaik, the method helped create old perceptions about race and ethnicity. It plays a role in manufacturing historical tales of who and where people come from, not only by the scientific community but also by commercial ancestry companies. A famous example is when a prominent American politician took an ancestry test before the 2020 presidential campaign to support their ancestral claims. Another example is the misconception of Ashkenazic Jews as a race or an isolated group driven by PCA results.
“This study demonstrates that those results were unreliable,” says Eran Elhaik.
PCA is used across many scientific fields, but Elhaik’s study focuses on its usage in population genetics, where the explosion in dataset sizes, driven by the reduced costs of DNA sequencing, is particularly acute.
The field of paleogenomics, where we want to learn about ancient peoples and individuals such as Copper Age Europeans, relies heavily on PCA. PCA is used to create a genetic map that positions the unknown sample alongside known reference samples. Thus far, the unknown samples have been assumed to be related to whichever reference population they overlap or lie closest to on the map.
However, Elhaik discovered that the unknown sample could be made to lie close to virtually any reference population just by changing the numbers and types of the reference samples. This generates practically endless historical versions, all mathematically “correct,” but only one may be biologically correct.
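To make the sample-size effect concrete, here is a minimal toy sketch. It is not Elhaik's analysis and uses no real genotypes. It assumes six hypothetical reference populations sitting at ±1 on three orthogonal axes of an invented 3-D "genetic" space, and a test sample that is projected onto principal axes computed from the references alone. The population labels, the coordinates, the sample sizes, and the helper name `nearest_reference` are all made up for illustration.

```python
import numpy as np

# Hypothetical population centroids in a toy 3-D space (NOT real genotype data).
CENTROIDS = {
    "X+": (1, 0, 0), "X-": (-1, 0, 0),
    "Y+": (0, 1, 0), "Y-": (0, -1, 0),
    "Z+": (0, 0, 1), "Z-": (0, 0, -1),
}


def nearest_reference(sizes, test_sample, n_components=2):
    """Fit PCA on the reference panel only, project the centroids and the
    test sample onto the top components, and return the label of the
    reference centroid closest to the test sample in that projection."""
    labels = list(CENTROIDS)
    # Build the panel: each centroid is repeated according to its sample size.
    panel = np.vstack(
        [np.tile(CENTROIDS[k], (sizes[k], 1)) for k in labels]
    ).astype(float)
    mean = panel.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(panel - mean, full_matrices=False)
    axes = vt[:n_components]

    def project(v):
        return (np.asarray(v, dtype=float) - mean) @ axes.T

    t = project(test_sample)
    dists = {k: np.linalg.norm(project(CENTROIDS[k]) - t) for k in labels}
    return min(dists, key=dists.get)


# Same populations, same test sample; only the reference sample sizes differ.
test_sample = (0.6, 0.0, 0.7)  # in the full 3-D space it is closest to "Z+"
sizes_a = {"X+": 50, "X-": 50, "Y+": 10, "Y-": 10, "Z+": 40, "Z-": 40}
sizes_b = {"X+": 50, "X-": 50, "Y+": 40, "Y-": 40, "Z+": 10, "Z-": 10}

print(nearest_reference(sizes_a, test_sample))  # Z+
print(nearest_reference(sizes_b, test_sample))  # X+
print(nearest_reference(sizes_b, test_sample, n_components=3))  # Z+
```

In this toy setup the two leading components are simply the two most heavily sampled axes. Undersampling the Z populations drops the Z axis from the two-dimensional map, so the test sample appears closest to a different population, even though keeping all three components recovers the original answer. Real genotype data are far noisier and higher-dimensional, so this only illustrates the mechanism and does not reproduce the study's results.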
In the study, Elhaik has examined the twelve most common population genetic applications of PCA. He has used both simulated and real genetic data to show just how flexible PCA results can be. According to Elhaik, this flexibility means that conclusions based on PCA cannot be trusted since any change to the reference or test samples will produce different results.
Between 32,000 and 216,000 scientific articles in genetics alone have employed PCA for exploring and visualizing similarities and differences between individuals and populations and based their conclusions on these results.
“I believe these results must be re-evaluated,” says Elhaik.
“Techniques that offer such flexibility encourage bad science and are particularly dangerous in a world where there is intense pressure to publish. If a researcher runs PCA several times, the temptation will always be to select the output that makes the best story,” adds Prof. William Amos, from the University of Cambridge, who was not involved in the study.
Phys.org
How much does the “intense pressure to publish” research skew scientific articles away from objectivity towards attempts to show confirmation of acceptable, popular results?
On a semi-related note, the main mathematical model used by Darwinists in population genetics, i.e. Fisher’s theorem, is now known to have little, if any, correspondence to biological reality.
Fisher’s fundamental theorem of natural selection is considered by Darwinists to be one of the basic laws of population genetics. ...
In his theorem, Fisher “assumed that new mutations arose with a nearly normal distribution – with an equal proportion of good and bad mutations (so mutations would have a net fitness effect of zero). (Yet) We now know that the vast majority of mutations in the functional genome are harmful, and that beneficial mutations are vanishingly rare.” ... And when realistic ratios of detrimental to beneficial mutations are taken into consideration, Fisher’s assumption within his mathematical model is falsified, i.e. his assumption that fitness must always increase:
On a related note, regarding Darwinists not having a mathematical model for their theory that corresponds to biological reality,
Nobody has proven that mutations are random. The majority of mutations are just part of a calibration apparatus.
Again, the word “mutation” (like “evolution”) is a Darwinian term that implies a transformation or metamorphosis and hijacks the mind into thinking within the Darwinian paradigm.
They say the majority of mutations are “neutral”? “Mutation” and “neutral” are two opposite words. This is nonsense. If they are neutral, why call them mutations, since they mutate nothing? Expressions like “wet dryness” and “neutral mutations” are nonsense.
Bad mutations cannot be called mutations because they are just errors. Only in the Darwinian world can such nonsense as “bad errors” and “good errors” exist.
As to Dr. Marks’s statement that “there exists no (mathematical) model successfully describing undirected Darwinian evolution. According to our current understanding, there never will be,” it finds fairly strong mathematical support via Gödel’s incompleteness theorem for mathematics.
Specifically, Darwin’s theory is based upon reductive materialism. Yet Gödel’s incompleteness theorem for mathematics has now been extended into quantum physics itself, and, in that extension, it is now proven that “even a perfect and complete description of the microscopic properties of a material is not enough to predict its macroscopic behaviour. ...” and “challenge the reductionists’ point of view, as the insurmountable difficulty lies precisely in the derivation of macroscopic properties from a microscopic description.”
Ah, another fancy statistical technique seduces researchers! Makes one wonder just how many invalid or difficult-to-use-properly statistical techniques are out there, and how many scientific conclusions rely on them.
(If you want more examples, see the references at the end of this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5410776/ You may remember this controversy from a few years back…)
You might be interested in
https://www.alibris.com/Everything-You-Believe-Is-Wrong-William-M-Briggs/book/51005501?matches=4
I think I’ve read some similar books, not by that author, though; they seem to be popular these days. It is unfortunate in a way that we human beings can hold so many beliefs that are not correct, that we nonetheless don’t suffer much for, at least not right after forming the beliefs. If the feedback cycle were shorter, we wouldn’t get away with believing so much garbage.
This paper is troubling. PCA isn’t just a biology technique; it’s used in math, computer modeling, physics, economics, etc. If it were true that it is an unreliable technique, we would have known it by now. What the authors seem to be saying is that one can manipulate statistics by cherry-picking the data set. This has been known for centuries and is not new. Recently, psychology journals have implemented protocols to prevent cherry-picking, and are hoping that this stops the avalanche of retracted papers. So all that this paper is really saying is that biologists and genomicists must also implement protocols to keep their data sets pristine. Intrinsically, there is nothing wrong with PCA if the data is not tampered with. The fear is that publishing pressures will cause pop gen authors to manipulate their data and cover their tracks with a PCA treatment.