A few years ago, Intelligent Design researcher Professor Michael Behe wrote a thought-provoking book entitled The Edge of Evolution, which argued that design was much more pervasive in Nature than commonly thought. Professor Behe argued that each and every class of living things, and quite probably each and every family, had been intentionally designed. Now, a recent paper by Dr. Branko Kozulic, a biochemist who serves on the editorial board of the Intelligent Design journal Bio-Complexity, argues that each and every species of living things was intelligently designed, and that the biological concept of a species can best be defined in terms of the unique proteins and genes that characterize it. In a nutshell, Dr. Kozulic’s argument is that there are literally hundreds of chemically unique proteins in each and every species of living organism. These “singleton” proteins have no close chemical relatives, making their origin a baffling mystery. Dr. Kozulic contends that the presence of not one but hundreds of chemically unique proteins in each species is an event beyond the reach of chance, and that each species must therefore be the result of intelligent planning.
If Dr. Kozulic’s conclusion is correct, then it would have interesting implications for the creation-evolution controversy. On a material level, living things might still be biologically related, insofar as they sprang from a common ancestral stock: in other words, common descent could still be true. However, on a formal level, each and every species of living thing would be the product of Intelligent Design and could be viewed as a separate creation, as the unique genes and proteins that endow it with its defining characteristics were essentially built from scratch. In other words: living things might share a common ancestry, but their constituent proteins certainly do not. They were created.
That conclusion would mean that even animals as similar as rats and mice (which diverged between 12 and 24 million years ago) were designed separately. The common Norwegian rat is popularly imagined to be just a scaled-up version of a mouse. However, scientists have identified no fewer than 75 unique genes (69 mouse genes and 6 rat genes) for which there is good evidence of de novo origin since the divergence of mouse and rat. Each of these genes is found in only one of the two lineages, mouse or rat. If Dr. Kozulic is correct, that means that rats and mice have to be viewed as separate designs. Ditto for humans and chimps, both of which have chemically unique proteins and genes.
Before I go on, I’d like to introduce Dr. Kozulic to those readers who may not have heard of him. Dr. Branko Kozulic received a Ph.D. in biochemistry from the University of Zagreb, Croatia, in 1979. From 1983 to 1988, he worked at the Institute of Biotechnology, ETH-Zurich, in Switzerland. For about fifteen years, he worked for a private Swiss biotech company, of which he was a co-founder. He currently works for Gentius Ltd., a company based in Zadar, Croatia. In addition, he teaches at the Faculty of Food Technology and Biotechnology in Zagreb. His professional interests center mainly on methods used for the analysis, detection, characterization and purification of biological macromolecules. Dr. Kozulic has published about 30 scientific papers to date, and he is also the inventor or co-inventor of numerous patents, 18 of which were issued in the USA. Dr. Kozulic formerly served on the Editorial Board of Analytical Biochemistry, and he is currently a board member of Food Technology and Biotechnology.
Dr. Kozulic’s recent paper, which I’d like to discuss today, is titled, Proteins and Genes, Singletons and Species. The paper was submitted to VIXRA, an alternative archive of science and mathematics-related e-prints serving the entire scientific community, on 16 May 2011. In this post, I’m going to be quoting extensively from portions of Dr. Kozulic’s paper, in order to walk readers through his argument. Here’s the abstract:
Recent experimental data from proteomics and genomics are interpreted here in ways that challenge the predominant viewpoint in biology according to which the four evolutionary processes, including mutation, recombination, natural selection and genetic drift, are sufficient to explain the origination of species. The predominant viewpoint appears incompatible with the finding that the sequenced genome of each species contains hundreds, or even thousands, of unique genes – the genes that are not shared with any other species. These unique genes and proteins, singletons, define the very character of every species. Moreover, the distribution of protein families from the sequenced genomes indicates that the complexity of genomes grows in a manner different from that of self-organizing networks: the dominance of singletons leads to the conclusion that in living organisms a most unlikely phenomenon can be the most common one. In order to provide proper rationale for these conclusions related to the singletons, the paper first treats the frequency of functional proteins among random sequences, followed by a discussion on the protein structure space, and it ends by questioning the idea that protein domains represent conserved units of evolution.
And now, without further ado, here are the highlights of Dr. Kozulic’s paper.
How big is protein sequence space? Much bigger than some evolutionists (Dryden et al.) would like it to be
One strategy for defusing the problem associated with the finding of functional proteins by random search through the enormous protein sequence space has been to arbitrarily reduce the size of that space. Because the space size is related to protein length (L) as 20^L, where 20 denotes the number of different amino acids of which proteins are made, the number of unique protein sequences will rapidly decrease if one assumes that the number of different amino acids can be less than 20. The same is true if one takes small L values. Dryden et al. used this strategy to illustrate the feasibility of searching through the whole protein sequence space on Earth, estimating that the maximal number of different proteins that could have been formed on planet Earth in geological time was 4 x 10^43. In [the] laboratory, researchers have designed functional proteins with fewer than 20 amino acids [10, 11], but in nature all living organisms studied thus far, from bacteria to man, use all 20 amino acids to build their proteins. Therefore, the conclusions based on the calculations that rely on fewer than 20 amino acids are irrelevant in biology. Concerning protein length, the reported median lengths of bacterial and eukaryotic proteins are 267 and 361 amino acids, respectively. Furthermore, about 30% of proteins in eukaryotes have more than 500 amino acids, while about 7% of them have more than 1,000 amino acids. The largest known protein, titin, is built of more than 30,000 amino acids. Only such experimentally found values for L are meaningful for calculating the real size of the protein sequence space, which thus corresponds to a median figure of 10^347 (20^267) for bacterial, and 10^470 (20^361) for eukaryotic proteins. (pp. 2-3)
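The conversion from 20^L to a power of ten in the quote above can be checked with a few lines of Python. This is only a minimal sketch: the function name `log10_sequence_space` is my own, and the only inputs are the median lengths (267 and 361) and the 20-letter amino acid alphabet quoted from the paper.

```python
import math

def log10_sequence_space(length, alphabet_size=20):
    """Return log10 of alphabet_size**length, the number of possible sequences."""
    return length * math.log10(alphabet_size)

# Median protein lengths quoted from the paper.
bacterial = log10_sequence_space(267)
eukaryotic = log10_sequence_space(361)

print(f"20^267 is about 10^{bacterial:.1f}")
print(f"20^361 is about 10^{eukaryotic:.1f}")
```

Rounded to whole exponents, the two results agree with the 10^347 and 10^470 figures given in the paper.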
What proportion of amino acid chains are capable of functioning as proteins?
While scientists generally agree that only a minority of all possible protein sequences has the property to fold and create a stable 3D structure, the figure adequate to quantify that minority has been a subject of much debate. (p. 6)
In 1976, Hubert Yockey estimated the probability of about 10^-65 [that’s 1 in 100,000 million million million million million million million million million million – VJT] for finding one cytochrome c sequence among random protein sequences. For bacteriophage λ[lambda] repressor, Reidhaar-Olson and Sauer estimated that the probability was about 10^-63. Based on β[beta]-lactamase mutation data, Douglas Axe estimated the prevalence of functional folds to be in the range of 10^-77 to 10^-53. A comparison of these estimates with those concerning the total number of protein molecules synthesized during Earth’s history – about 10^40 [9, 51, 52] – leads to the conclusion that random assembling of amino acids could not have produced a single enzyme during 4.5 billion years [48, 53]. On the other hand, Taylor et al. estimated that a random protein library of about 10^24 members would be sufficient for finding one chorismate mutase molecule. Moreover, from an actual library of 6×10^12 proteins each containing 80 contiguous random amino acids, Keefe and Szostak isolated four ATP binding proteins and concluded that the frequency of functional proteins in the sequence space may be as high as 1 in 10^11 [1 in 100,000,000,000 – VJT], allowing for their discovery by entirely stochastic means. However, subsequent in vivo studies with this man-made ATP binding protein showed that it disrupted the normal energetic balance of the cell, acting essentially as an antibiotic. One can conclude, therefore: had this protein been formed by random mutations, the cell with it would have left no descendants. Furthermore, the probability of its formation in a cell would have been lower than 10^-11, because random DNA mutations introduce stop codons and frameshifts whereas Keefe and Szostak avoided stop codons and frameshift mutations by experimental design.
The importance of distinguishing the results of in vitro from in vivo studies is highlighted by the finding that only a tiny fraction, one in about 10^10, of the active mutants of triosephosphate isomerase functioned properly in vivo. (pp. 6-7)
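To see how far apart these estimates are, one can compare each of them with the quoted figure of about 10^40 protein molecules. The sketch below works in log10 and rests on two assumptions of mine that are not in the paper: each molecule is treated as an independent random draw, and the Taylor et al. library size of 10^24 is read as a per-sequence probability of 10^-24. The names `estimates` and `log10_expected_hits` are illustrative only.

```python
LOG10_TRIALS = 40  # quoted estimate of protein molecules made in Earth's history

# log10 of the per-sequence probability quoted for each study.
# The Taylor et al. entry is an assumption: one hit per 10^24-member library.
estimates = {
    "Yockey (cytochrome c)": -65,
    "Reidhaar-Olson and Sauer (lambda repressor)": -63,
    "Axe (low end)": -77,
    "Axe (high end)": -53,
    "Taylor et al. (chorismate mutase)": -24,
    "Keefe and Szostak (ATP binding)": -11,
}

def log10_expected_hits(log10_probability, log10_trials=LOG10_TRIALS):
    """log10 of (trials * probability), assuming independent random draws."""
    return log10_trials + log10_probability

expected = {name: log10_expected_hits(p) for name, p in estimates.items()}

for name, value in expected.items():
    print(f"{name}: about 10^{value} expected functional sequences")
```

A negative exponent means that fewer than one functional sequence would be expected in 10^40 draws, while a positive exponent means many would be. This is what separates the lower estimates from those of Taylor et al. and of Keefe and Szostak.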
The importance of maintaining the correct order of amino acids
In general, there are two aspects of biological function of every protein, and both depend on correct 3D structure. Each protein specifically recognizes its cellular or extracellular counterpart: for example an enzyme its substrate, hormone its receptor, lectin sugar, repressor DNA, etc. In addition, proteins interact continuously or transiently with other proteins, forming an interactive network. This second aspect is no less important, as illustrated in many studies of protein-protein interactions [59, 60]. Exquisite structural requirements must often be fulfilled for proper functioning of a protein. For example, in enzymes spatial misplacement of catalytic residues by even a few tenths of an angstrom can mean the difference between full activity and none at all. And in the words of Francis Crick, “To produce this miracle of molecular construction all the cell need do is to string together the amino acids (which make up the polypeptide chain) in the correct order” [61, italics in original]. (pp. 7-8)
Dr. Kozulic’s very generous estimate of the odds of building a protein by random trials: 1 in 1,000,000,000,000,000,000,000
Explanatory note: the term in vitro (Latin: within the glass) refers to the technique of performing a given experiment in a controlled environment outside of a living organism; for example in a test tube. In vivo (Latin: within the living) means that which takes place inside an organism. In science, in vivo refers to experimentation done in or on the living tissue of a whole, living organism as opposed to a partial or dead one or a controlled environment.
Let us assess the highest probability for finding this correct order by random trials and call it, to stay in line with Crick’s term, a “macromolecular miracle”. The experimental data of Keefe and Szostak indicate – if one disregards the above described reservations – that one from a set of 10^11 randomly assembled polypeptides can be functional in vitro, whereas the data of Silverman et al. show that of the 10^10 in vitro functional proteins just one may function properly in vivo. The combination of these two figures then defines a “macromolecular miracle” as a probability of one against 10^21. For simplicity, let us round this figure to one against 10^20. (p. 8)
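The combination of the two figures is a simple product, which can be written out with exact fractions. This is a sketch under one assumption that should be stated plainly: the in vitro and in vivo fractions are multiplied as if they were independent, which is how the quoted passage combines them. The variable names are my own.

```python
from fractions import Fraction

# Fractions taken from the quoted passage.
functional_in_vitro = Fraction(1, 10**11)    # Keefe and Szostak
functional_in_vivo = Fraction(1, 10**10)     # Silverman et al., among in vitro actives

# Assumption: the two fractions are independent, so they multiply.
combined = functional_in_vitro * functional_in_vivo

# The paper then rounds one against 10^21 to one against 10^20.
rounded = Fraction(1, 10**20)

print(combined == Fraction(1, 10**21))
print(rounded / combined)  # how much more generous the rounded figure is
```

The rounding makes the figure ten times more generous than the straight product of the two fractions.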
It is important to recognize that the one in 10^20 represents the upper limit, and as such this figure is in agreement with all previous lower probability estimates. Moreover, there are two components that contribute to this figure: first, there is a component related to the particular activity of a protein – for example enzymatic activity that can be assayed in vitro or in vivo – and second, there is a component related to proper functioning of that protein in the cellular context: in a biochemical pathway, cycle or complex. Taking into account both contributions is an essential requirement because a synthetic protein nicely active in the test tube can be lethal in the cellular context, as shown by Stomel et al. for the ATP-binding protein of Keefe and Szostak [55, 56]. (p. 8)
To put the 10^20 figure in the context of observable objects, about 10^20 squares each measuring 1 mm^2 would cover the whole surface of planet Earth (5.1 x 10^14 m^2). Searching through such squares to find a single one with the correct number, at a rate of 1000 per second, would take 10^17 seconds, or 3.2 billion years. Yet, based on the above discussed experimental data, one in 10^20 is the highest probability that a blind search has for finding among random sequences an in vivo functional protein. This figure denotes the minimal height of the brick wall. (p. 9)
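The figures in this illustration can be checked directly. The sketch below uses only the numbers quoted above (Earth's surface area, 1 mm^2 squares, 1000 squares per second). The year length of 365.25 days is my own assumption.

```python
EARTH_SURFACE_M2 = 5.1e14          # quoted surface area of the Earth
MM2_PER_M2 = 1_000_000             # 1 m^2 = 1000 mm x 1000 mm

# Number of 1 mm^2 squares needed to cover the Earth.
squares_on_earth = EARTH_SURFACE_M2 * MM2_PER_M2

# Time to inspect 10^20 squares at 1000 squares per second.
search_seconds = 1e20 / 1000
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
search_years = search_seconds / SECONDS_PER_YEAR

print(f"squares covering the Earth: {squares_on_earth:.1e}")
print(f"search time: {search_seconds:.0e} s, about {search_years / 1e9:.1f} billion years")
```

The surface works out to about 5 x 10^20 squares, the same order of magnitude as the 10^20 in the quote. The search time comes to roughly 3.2 billion years.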
Proteins are distributed according to a power law
A power law distribution. Popularity rankings (e.g. ratings of actors) often follow this kind of distribution. To the right is the long tail, and to the left are the few individuals that dominate (also known as the 80–20 rule). Image courtesy of Wikipedia.
What have we learned from these tens of millions of protein sequences originating from the genomes of more than one thousand species? When proteins of similar sequences are grouped into families, their distribution follows a power-law [65-72], prompting some authors to suggest that the protein sequence space can be viewed as a network similar to the World Wide Web, electrical power grid or collaboration network of movie actors, due to the similarity of respective distribution graphs. There are thus small numbers of families with thousands of member proteins having similar sequences, while, at the other extreme, there are thousands of families with just a few members. The most numerous are “families” with only one member; these lone proteins are usually called singletons. (pp. 9-10)
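For readers unfamiliar with the bookkeeping behind such a distribution, here is a minimal sketch of how family sizes and singletons are counted once proteins have been grouped by sequence similarity. The protein and family names are hypothetical toy data, not taken from any genome. Nine proteins cannot show a power law, so the example only illustrates the counting step.

```python
from collections import Counter

# Hypothetical toy assignment of proteins to sequence families (not real data).
protein_to_family = {
    "p1": "famA", "p2": "famA", "p3": "famA", "p4": "famA",
    "p5": "famB", "p6": "famB",
    "p7": "famC",
    "p8": "famD",
    "p9": "famE",
}

# Number of member proteins in each family.
family_sizes = Counter(protein_to_family.values())

# Number of families of each size: the quantity plotted in a family-size distribution.
size_histogram = Counter(family_sizes.values())

# Singletons are the "families" with exactly one member.
singletons = sorted(f for f, size in family_sizes.items() if size == 1)

print(dict(size_histogram))
print(singletons)
```

In a real analysis the grouping step itself (deciding which sequences are similar enough to share a family) is the hard part, and it is not modelled here.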
By plotting, on a log-log scale, the number of citations per paper against the total number of citations one obtains the graph shown in Figure 2a, characterized by a disperse tail and a dense head. At the tail, there are groups of small numbers of papers (1, 2, 3 and 4, approximately) achieving citations thousands of times. Only a few individual papers from this dataset approach the 10,000 citations mark. On the other hand, many papers are cited 100 times, even more of them 10 times, while the most numerous are the papers cited just once (apart from those never cited). An analogous plot of earthquake distribution shows many earthquakes of low magnitudes, and an ever decreasing number of stronger earthquakes (Fig. 2b). Moreover, based on common appearance of actors in the same movie, actors’ collaboration network also shows a power-law distribution (Fig 2c). At the tail there are a few superstars who collaborated with thousands of other actors, while newcomers at the head collaborated with just a few. (pp. 10-11)
Distribution of protein families in sequenced genomes is illustrated by a similar graph (Fig. 2d). Comparable distributions have been observed with protein datasets from individual sequenced genomes [65, 80], as well as with the datasets that encompassed all sequenced genomes at various time points [66-72]. Here, at the tail of the distribution there are a few large families each consisting of thousands of proteins having similar sequences, while at the head there are many singletons. The evident similarity of this distribution curve with those of Figure 2a-c has been interpreted as evidence for self-organizing nature of protein networks in living organisms. It was thus inferred that the complexity of genomes grows in the same way as the complexity of WWW, or actors’ network. These interpretations, however, are in error because they have failed to take account of a fundamental difference, as described below. (p. 11)
The first condition that the networks of Figure 2 must fulfill is a continuous addition of new members. Thus, continuously new actors appear in movies, new earthquakes happen and new scientific papers get published. Roughly one person in 10^5 [or 100,000 – VJT] acts in a movie, earthquakes make one of less than 10^5 geological phenomena, and the fraction of scientific papers among all publications is higher than one in 10^5. So, to enter the respective network – to become the first point at the head of the distribution – the newcomers must overcome a barrier not higher than one against 10^5. After the entry, to become prominent the newcomers have a chance of about one in 10^5 again. Evidently, the two barriers, of entering and of becoming prominent, are comparable, give or take a few orders of magnitude. What would happen if the entry barrier were one thousand trillion (10^15) times higher? Obviously, if just one in 10^20 persons could become an actor, we would know of no actors: there would be no records of them, and analogously, there would be no records of scientific papers and earthquakes. (p. 11)
The frequency of functional proteins among random sequences is at most one in 10^20 (see above). The proteins of unrelated sequences are as different as the proteins of random sequences [22, 81, 82] – and singletons per definition are exactly such unrelated proteins. (p. 11)
Thus, to enter the distribution graph as a newcomer (Fig. 2d), each new protein (singleton) must overcome the entry barrier of one against at least 10^20. After the entry, singleton’s chance of becoming prominent, that is to grow into one of the largest protein families, is about one in 10^5 (Fig. 2d). Thus, it is much more difficult for a protein to become biologically functional than to become, in many variations, widespread: the entry barrier is at least fifteen orders of magnitude higher than the prominence barrier. This huge difference between the entry and prominence barriers is what makes the protein family distribution graph unique. In spite of this high entry barrier, in the sequenced genomes the protein newcomers (singletons) always represent the largest, most common, group: if it were otherwise, the distribution graph would break down… [I]n living organisms the most unlikely phenomenon can be the most common one. This feature clearly distinguishes the complexity of living organisms from the complexity of self-organizing networks. (p. 12)
Protein domains follow the same power law distribution
The distribution of protein folds and domains also follows a power-law [21, 66, 67, 70, 72, 80, 83, 87], as predicted by Coulson and Moult. That prediction was considered shocking. Thus, in the sequenced genomes some domains are represented by thousands of different, non-homologous sequences, whereas other domains are represented by a few or by a single, unique sequence [21, 66, 67, 70, 72, 79, 83, 87, 95, 96]. For example, in a set of about 250,000 protein sequences Grant et al. found about 170,000 domains that remained as singletons. These unique domains, called also orphan domains, represent the largest group among all domain groups that make the distributions. This is a feature in common with singletons from the distribution graph of protein sequence families. (p. 13)
Orphan genes do, too
In addition to the term singleton, other terms, with a similar if not synonymous meaning, have been used to denote proteins and genes having no relatives. Thus, Siew and Fischer define genomic ORFans as orphan open reading frames (ORF) with no significant sequence similarity to other ORFs [103, 104]. Wilson et al. suggest that orphans should be named “taxonomically restricted genes” (TRGs) [105, 106], and state that the abundance of orphan genes is amongst the greatest surprises uncovered by the sequencing of eukaryotic and bacterial genomes. Earlier, Russell Doolittle affirmed that there are large numbers of unidentified genes in a variety of organisms, with the origin and function of these unique sequences remaining “baffling mysteries”. (p. 15)
Why the discovery of “singleton” proteins and genes came as a great shock to evolutionists
In order to understand why the finding of singletons (ORF-ans, or TRG-s) represented such a great surprise, let us look at the contemporary expectations. They were possibly best outlined by Chothia et al. in 2003: “all but a small proportion of the protein repertoire is formed by members of families that go back to the origin of eukaryotes or the origin of the different kingdoms.” And further: “The earliest evolution of the protein repertoire must have involved the ab initio invention of new proteins. At a very low level, this may still take place. But it is clear that the dominant mechanisms for expansion of the protein repertoire, in biology as we know it, are gene duplication, divergence and recombination.” Consequently: “we will be able to trace much of the evolution of complexity by examining the duplication and recombination of these families in different genomes.” About 1000 evolutionary independent protein families were expected to encompass all protein diversity. In line with the above, there was an additional expectation of forthcoming grand unification of biology. However, the power-law distribution of protein families and the sheer abundance of singletons have exposed utopian nature of these expectations and, at the same time, opened several important issues. (p. 15)
Siew and Fischer succinctly described the issues at stake: “If proteins in different organisms have descended from common ancestral proteins by duplication and adaptive variation, why is that so many today show no similarity to each other?” And further: “Do these rapidly evolving ORFans correspond to nonessential proteins or to species determinants?” (p. 15)
Each species of living things has hundreds of unique proteins, each of which is like no other
A recent study, based on 573 sequenced bacterial genomes, has concluded that the entire pool of bacterial genes – the bacterial pan-genome – looks as though of infinite size, because every additional bacterial genome sequenced has added over 200 new singletons. In agreement with this conclusion are the results of the Global Ocean Sampling project reported by Yooseph et al., who found a linear increase in the number of singletons with the number of new protein sequences, even when the number of the new sequences ran into millions. The trend towards higher numbers of singletons per genome seems to coincide with a higher proportion of the eukaryotic genomes sequenced. In other words, eukaryotes generally contain a larger number of singletons than eubacteria and archaea. [Eukaryotes are organisms whose cells have a nucleus, unlike bacteria – VJT.] (p. 16)
When a relative to a singleton is found, together the two proteins create a family. In the absence of biochemical data, nothing can be said about biological function of that protein family as long as no established domain or structural motif is discernible from the amino acid sequences. Such proteins of obscure function, or POFs, make about 25% of the proteins found in each genome [113, 114]. POFs tend to be shorter than the proteins of defined function. (p. 16)
Today, almost ten years since the announcement of the first draft of the human genome sequence, no structural assignment is available for about 38% of human proteins: at present we thus lack basic information about a large fraction of the proteins of human proteome. (p. 16)
Each species of living things has hundreds of unique genes, too
Based on the data from 120 sequenced genomes, in 2004 Grant et al. reported on the presence of 112,000 singletons within 600,000 sequences. This corresponds to 933 singletons per genome. In 2005, Orengo and Thornton reported on the presence of about 150,000 singletons in 150 sequenced genomes. In 2006, within 203 sequenced genomes and 633,546 nonidentical sequences Marsden et al. identified 158,798 singletons; thus the singletons made 24% of all sequences and there were on average 782 singletons in each genome. In 2008, Yeats et al. found around 600,000 singletons in 527 species – 50 eukaryotes, 437 eubacteria and 39 archaea – corresponding to 1,139 singletons per species. No information about the number of singletons is available in the most recent summary of the data from over 1100 sequenced genomes encompassing nearly 10 million sequences. In spite of the missing recent data on singletons, the results of the above calculations are sufficient for an unambiguous conclusion: each species possesses hundreds, or even thousands, of unique genes – the genes that are not shared with any other species. This conclusion is in full agreement with the power-law distribution of protein families discussed above. (p. 17)
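The per-genome averages in this passage are simple divisions, and readers can reproduce them with the short sketch below. The counts are those quoted above, and the names `datasets`, `per_genome` and `marsden_share_percent` are my own. By this arithmetic the singleton share in the Marsden et al. dataset comes out nearer 25% than the 24% quoted. The small difference may reflect rounding or a slightly different denominator in the paper.

```python
# (study, singletons, genomes or species) as quoted in the passage above.
datasets = [
    ("Grant et al., 2004", 112_000, 120),
    ("Orengo and Thornton, 2005", 150_000, 150),
    ("Marsden et al., 2006", 158_798, 203),
    ("Yeats et al., 2008", 600_000, 527),
]

# Average number of singletons per sequenced genome, rounded to a whole number.
per_genome = {study: round(singletons / genomes)
              for study, singletons, genomes in datasets}

# Share of singletons among the 633,546 nonidentical sequences of Marsden et al.
marsden_share_percent = 100 * 158_798 / 633_546

for study, average in per_genome.items():
    print(f"{study}: about {average} singletons per genome")
print(f"Marsden et al. singleton share: {marsden_share_percent:.1f}%")
```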
The genes and proteins that are unique to a given species can be used to define that species
Figure 3 shows how the number of unique genes (singletons), expressed as an average per each sequenced genome, was changing with the total number of the genomes sequenced. Evidently, the number of singletons tends to increase, from several hundreds to more than one thousand. The presence of a large number of unique genes in each species represents a new biological reality. Moreover, the singletons as a group appear to be the most distinctive constituent of all individuals of one species, because that group of singletons is lacking in all individuals of all other species. The conclusion that the singletons are the determinants of biological phenomenon of species then follows logically. In System of Logic, John Stuart Mill outlined his Second Canon or Method of Difference: “If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon.” (p. 18)
The discovery of hundreds of unique proteins in each species is at odds with the Darwinian theory of evolution
The idea that protein domains represent conserved units of evolution [72, 108, 151-155] hinges upon the presumed capability of evolutionary processes – consisting of random mutations, recombination, genetic drift and natural selection – to maintain the 3D structure of a protein while changing its amino acid sequence. These blind processes – which do not know what kind of protein 3D structure they start with, how they change it and in which direction in the structure space they go – thus supposedly possess certain capabilities that are by far superior to those of tens of thousands of computers, or superior to those of tens of thousands people using the computers. (p. 20)
That hypothesis – that evolution strives to preserve a protein domain once it stumbles upon it – contradicts the power-law distribution of domains. The distribution graphs clearly show that unique domains are the most abundant of all domain groups [21, 66, 67, 70, 72, 79, 82, 86, 94, 95], contrary to their expected rarity. Here I predict that the idea of protein domains as the basic units of evolution will be refuted directly by finding in the genome of one species two singletons having identical domain structure. Such a finding will represent the unambiguous and definitive refutation. That finding requires structural characterization of numerous singletons, and it depends on an objective, mathematical rather than a curator’s, delineation of the protein structural elements and 3D identity. (p. 20)
Each unique gene, and accordingly each novel functional protein encoded by that gene, however, represents a major problem for evolutionary theory because unique proteins are as unrelated as the proteins of random sequences – and among random sequences functional proteins are exceedingly rare. Experimental data reviewed here suggest that at most one functional protein can be found among 10^20 proteins of random sequences. Hence every discovery of a novel functional protein (singleton) represents a testimony for successful overcoming of the probability barrier of one against at least 10^20, the probability defined here as a “macromolecular miracle”. More than one million of such “macromolecular miracles” are present in the genomes of about two thousand species sequenced thus far. Assuming that this correlation will hold with the rest of about 10 million different species that live on Earth, the total number of “macromolecular miracles” in all genomes could reach 10 billion. These 10^10 unique proteins would still represent a tiny fraction of the 10^470 possible proteins of the median eukaryotic size. (p. 21)
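The extrapolation to all species can be followed step by step. The sketch below takes the quoted round numbers at face value (one million singletons in two thousand species, ten million species on Earth). The variable names are mine. Because the paper says "more than one million", the result is a lower figure of the same order of magnitude as the 10 billion mentioned.

```python
# Round numbers quoted in the passage above.
singletons_found = 1_000_000       # "more than one million"
species_sequenced = 2_000          # "about two thousand species"
species_on_earth = 10_000_000      # "about 10 million different species"

# Average singletons per sequenced species.
singletons_per_species = singletons_found // species_sequenced

# Extrapolation, assuming the same average holds for every species.
projected_singletons = singletons_per_species * species_on_earth

print(singletons_per_species)
print(f"{projected_singletons:.0e}")
```

With these inputs the projection is about 5 billion unique proteins. This is consistent with the statement that the total "could reach 10 billion" if the true singleton count is somewhat above one million.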
The appearance of hundreds of unique proteins and genes that characterize each species is an event beyond the reach of chance
If just 200 unique proteins are present in each species, the probability of their simultaneous appearance is one against at least 10^4,000. [The] Probabilistic resources of our universe are much, much smaller; they allow for a maximum of 10^149 events and thus could account for a one-time simultaneous appearance of at most 7 unique proteins. The alternative, a sequential appearance of singletons, would require that the descendants of one family live through hundreds of “macromolecular miracles” to become a new species – again a scenario of exceedingly low probability. Therefore, now one can say that each species is a result of a Biological Big Bang; to reserve that term just for the first living organism is not justified anymore. This view about species differs sharply from the predominant one according to which speciation is caused by reproductive isolation of two populations [159, 160] mediated by difficult to find speciation genes [161-163]. (p. 21)
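The two numbers in this passage follow from working with exponents. The sketch below assumes, as the quoted argument does, that each unique protein is an independent event with probability 10^-20, so that the exponents add. The variable names are my own.

```python
LOG10_BARRIER = 20      # one against 10^20 per unique protein (from the paper)
LOG10_RESOURCES = 149   # quoted maximum of 10^149 events in the universe

# Simultaneous appearance of 200 independent unique proteins:
# (10^-20)^200 = 10^-(20 * 200).
unique_proteins = 200
log10_joint_barrier = LOG10_BARRIER * unique_proteins

# Largest number of unique proteins whose joint barrier fits within 10^149 events.
max_simultaneous = LOG10_RESOURCES // LOG10_BARRIER

print(f"joint barrier: one against 10^{log10_joint_barrier}")
print(f"at most {max_simultaneous} unique proteins within 10^{LOG10_RESOURCES} events")
```

Seven proteins correspond to a barrier of 10^140, which is within 10^149, whereas eight would require 10^160.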
Evolutionary biologists of earlier generations have not anticipated [164, 165] the challenge that singletons pose to contemporary biologists. By discovering millions of unique genes biologists have run into brick walls similar to those hit by physicists with the discovery of quantum phenomena. The predominant viewpoint in biology has become untenable: we are witnessing a scientific revolution of unprecedented proportions. (p. 21)
What do readers think of Dr. Kozulic’s paper?