The amazing level of engineering in the transition to the vertebrate proteome: a global analysis

Giuseppe Puccio

March 22, 2017

Intelligent Design


As a follow-up to my previous post:

The highly engineered transition to vertebrates: an example of functional information analysis

I am presenting here some results obtained by a general application, expanded to the whole human proteome, of the procedure already introduced in that post.

Main assumptions.

The aim of the procedure is to measure a well defined equivalent of functional information in proteins: the information that is conserved throughout long evolutionary times, in a well specified evolutionary line.

The simple assumption is that such information, which is not modified by neutral variation over a time span of hundreds of millions of years, is certainly highly functionally constrained, and is therefore a very good empirical approximation of the value of functional information in a protein.

In particular, I will use the proteins in the human proteome as “probes” to measure the information that is conserved from different evolutionary timepoints.

The assumption here is very simple. Let’s say that the line that includes humans (let’s call it A) splits from some different line (let’s call it B) at some evolutionary timepoint T. Then, the homology that we observe in a protein when we compare organisms derived from B and humans (derived from A) must have survived neutral variation throughout the timespan from T to now. If the timespan is long enough, we can very safely assume that the measured homology is a measure of some specific functional information conserved from the time of the split to now.

Procedure.

I downloaded a list of the basic human proteome (in FASTA form). In particular, I downloaded it from UniProt, selecting all human reviewed sequences, for a total of 20171 sequences. That is a good approximation of the basic human proteome as known at present.

I used NCBI’s BLAST tool in local form to blast the whole human proteome against known protein sequences from specific groups of organisms, using the nr (non redundant) NCBI database of protein sequences, and selecting, for each human protein, the alignment with the highest homology bitscore from that group of organisms.
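The selection step can be sketched as follows. This is only an illustration, not the code actually used for the analysis: it assumes the BLAST results were saved in the standard 12-column tabular format (BLAST+ `-outfmt 6` with default fields), and the identifiers in `EXAMPLE` are hypothetical. It also takes the bitscore reported on each row, whereas a "total" bitscore may aggregate several aligned segments.

```python
# Assumed column layout: BLAST+ tabular output (-outfmt 6) with default fields.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)}, keeping the highest bitscore per query."""
    best = {}
    for line in tabular_text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        rec = dict(zip(COLUMNS, line.split("\t")))
        score = float(rec["bitscore"])
        query = rec["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (rec["sseqid"], score)
    return best

# Hypothetical rows: two hits for protein P1, one hit for protein P2.
EXAMPLE = "\n".join([
    "P1\tS1\t45.0\t200\t100\t5\t1\t200\t1\t198\t1e-50\t180.0",
    "P1\tS2\t60.0\t210\t80\t2\t1\t210\t1\t210\t1e-90\t376.0",
    "P2\tS3\t30.0\t90\t60\t3\t5\t95\t10\t100\t1e-05\t21.6",
])

hits = best_hits(EXAMPLE)
```

Human proteins with no significant alignment simply do not appear in the result, which matches the way such proteins are left out of each comparison.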

Homology values:

I have used two different measures of homology for each protein alignment:

The total bitscore from the BLAST alignment (from now on: “bits”)
The ratio of the total bitscore to the length in aminoacids of the human protein, which I have called “bits per aminoacid” (from now on, “baa”). This is a measure of the mean “density” of functional information in that protein, which corrects for the protein length.
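The second measure is a plain ratio. A minimal sketch, with a hypothetical function name and hypothetical example values (a 400-aminoacid protein with a best-hit bitscore of 376):

```python
def bits_per_aminoacid(bitscore, protein_length):
    """Bitscore of the best alignment divided by the human protein length."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return bitscore / protein_length

# Hypothetical example: 376 bits over 400 aminoacids.
example_baa = bits_per_aminoacid(376.0, 400)
```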

The values of homology in bits have a very wide range of variation in each specific comparison with a group of organisms. For example, in the comparison between human proteins and the proteins in cartilaginous fish, the range of bit homology per protein is 21.6 – 34368, with a mean of 541.4 and a median of 376 bits.

The values of homology in baa, instead, are necessarily confined between 0 and about 2.2. Indeed, 2.2 is (approximately) the highest homology bitscore per aminoacid that we get when we blast a protein against itself (total identity). I use the BLAST bitscore because it is a widely used and accepted way to measure homology and to derive probabilities from it (the E values).

So, for example, in the same human – cartilaginous fish comparison, the range of the baa values is: 0.012 – 2.126, with a mean of 0.95 and a median of 0.97 baas.

For each comparison, a small number of proteins (usually about 1-2%) did not result in any significant alignment, and were not included in the specific analysis for that comparison.

Organism categories and split times:

The analysis includes the following groups of organisms:

Cnidaria
Cephalopoda (as a representative sample of Mollusca, and more in general of Protostomia: Cephalopoda, and more generally Mollusca, are, among Protostomia, the group with the highest homology to Deuterostomia, and therefore can be a good sample to evaluate conservation from the Protostomia – Deuterostomia split).
Deuterostomia (excluding vertebrates): this includes echinoderms, hemichordates and chordates (excluding vertebrates).
Cartilaginous fish
Bony fish
Amphibians
Crocodylia, including crocodiles and alligators (as a representative sample of reptiles, excluding birds. Here again, Crocodylia usually have the highest homology with human proteins among reptiles, together maybe with turtles).
Marsupials (an infraclass of mammals representing Metatheria, a clade which split early enough from the human lineage)
Afrotheria, including elephants and other groups (representing a group of mammals relatively distant from the human lineage, in the Eutheria clade)

There are reasons for these choices, but I will not discuss them in detail for the moment. The main purpose is always to detect the functional information (in form of homology) that was present at specific split times, and has been therefore conserved in both lines after the split. In a couple of cases (Protostomia, Reptiles), I have used a smaller group (Cephalopoda, Crocodylia) which could reasonably represent the wider group, because using very big groups of sequences (like all protostomia, for example) was too time consuming for my resources.

So what are the split times we are considering? This is a very difficult question, because split times are not well known, and very often you can get very different values for them from different sources. Moreover, I am not at all an expert in these matters.

So, the best I can do is to give here some reasonable proposals, from what I have found, but I am completely open to any suggestions to improve my judgements. In each split, humans derive from the second line:

Cnidaria – Bilateria. Let’s say at least 555 My ago.
Protostomia – deuterostomia. Let’s say about 530 My ago.
Pre-vertebrate Deuterostomia (including chordates like Cephalochordata and Tunicata) – Vertebrates (Cartilaginous fish). Let’s say 440 My ago.
Cartilaginous fish – Bony fish. Let’s say about 410 My ago.
Bony fish – Tetrapods (Amphibians). Let’s say 370 My ago, more or less.
Amphibians – Amniota (Sauropsida, Crocodylia): about 340 My ago
Sauropsida (Crocodylia) – Synapsida (Metatheria, Marsupialia): about 310 My ago
Metatheria – Eutheria (Afrotheria): about 150 My ago
Atlantogenata (Afrotheria) – Boreoeutheria: probably about 100 My ago.

The simple rule is: for each split, the second member of each split is the line to humans, and the human conserved information present in the first member of each couple must have been conserved in both lines at least from the time of the split to present day.

So, for example, the human-conserved information in Cnidaria has been conserved for at least 555 My, the human-conserved information in Crocodylia has been conserved for at least 310 My, and so on.
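The rule can be written down as a simple lookup. The sketch below only restates the approximate split times proposed above; the dictionary and function names are hypothetical.

```python
# Approximate minimum conservation time (My) for human-conserved information
# found in each group, taken from the split times proposed in the text.
MIN_CONSERVATION_MY = {
    "Cnidaria": 555,
    "Cephalopoda": 530,
    "Deuterostomia (non vertebrates)": 440,
    "Cartilaginous fish": 410,
    "Bony fish": 370,
    "Amphibians": 340,
    "Crocodylia": 310,
    "Marsupialia": 150,
    "Afrotheria": 100,
}

def conserved_for_at_least(group):
    """Return the minimum time (My) that homology with this group has survived."""
    return MIN_CONSERVATION_MY[group]
```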

The problem of redundancy (repeated information).

However, there is an important problem that requires attention. Not all the information in the human proteome is unique, in the sense of “present only once”. Many sequences, especially domains, are repeated many times, in a more or less similar way, in many different proteins. Let’s call this “the problem of redundancy”.

So, all the results that we obtain about homologies of the human proteome to some other organism or group of organisms should be corrected for that factor, if we want to draw conclusions about the real amount of new functional information in a transition. Of course, repeated information will inflate the apparent amount of new functional information.

Therefore, I computed a “coefficient of correction for redundancy” for each protein in the human proteome. For the moment, for the sake of simplicity, I will not go into the details of that computation, but I am ready to discuss it in depth if anyone is interested.

The interesting result is that the mean coefficient of correction is, according to my computations, 0.497. In other words, we can say that about half of the potential information present in the human proteome can be considered unique, while about half can be considered as repeated information. This correction takes into account, for each protein in the human proteome, the number of proteins in the human proteome that have significant homologies to that protein and their mean homology.

So, when I give the results “corrected for redundancy”, what I mean is that the homology values for each protein have been corrected by multiplying them by the coefficient of that specific protein. Of course, in general, the results will be approximately halved.
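Applying the correction is a per-protein multiplication. The sketch below shows only that step; how each coefficient is computed is not detailed in this post, so the coefficients and bit values used here are hypothetical placeholders.

```python
def correct_for_redundancy(bits_by_protein, coefficient_by_protein):
    """Multiply each protein's homology bits by that protein's own coefficient."""
    return {
        protein: bits * coefficient_by_protein[protein]
        for protein, bits in bits_by_protein.items()
    }

# Hypothetical values: P1 is unique (coefficient 1.0), P2 is highly repeated.
bits = {"P1": 400.0, "P2": 600.0}
coefficients = {"P1": 1.0, "P2": 0.25}

corrected = correct_for_redundancy(bits, coefficients)
total_corrected = sum(corrected.values())
```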

Results

Table 1 shows the means of the values of total homology (bitscore) with human proteins in bits and in bits per aminoacid for the various groups of organisms.

Group of organisms	Homology bitscore (mean)	Total homology bitscore	Bits per aminoacid (mean)

Cnidaria	276.9	5465491	0.543
Cephalopoda	275.6	5324040	0.530
Deuterostomia (non vertebrates)	357.6	7041769	0.671
Cartilaginous fish	541.4	10773387	0.949
Bony fish	601.5	11853443	1.064
Amphibians	630.4	12479403	1.107
Crocodylia	706.2	13910052	1.217
Marsupialia	777.5	15515530	1.354
Afrotheria	936.2	18751656	1.629

Maximum possible value (for identity)		24905793	2.2

Figure 1 shows a plot of the mean bits-per-aminoacid score in the various groups of organisms, according to the mentioned approximate times of split.

Figure 2 shows a plot of the density distribution of human-conserved functional information in the various groups of organisms.

The jump to vertebrates.

Now, let’s see how big the informational jumps are for each split, always in relation to human-conserved information.

The following table sums up the size of each jump:

Split	Homology bitscore jump (mean)	Total homology bitscore jump	Bits per aminoacid (mean)

Homology bits in Cnidaria		5465491	0.54

Cnidaria – Bilateria (cephalopoda)	-6.3	-121252	-0.02
Protostomia (Cephalopoda)- Deuterostomia	87.9	1685550	0.15
Deuterostomia (non vert.) – Vertebrates (Cartilaginous fish)	189.6	3708977	0.29
Cartilaginous fish-Bony fish	54.9	1073964	0.11
Bony fish-Tetrapoda (Amphibians)	31.9	624344	0.05
Amphibians-Amniota (Crocodylia)	73.3	1430963	0.11
Sauropsida (Crocodylia)-Synapsida (Marsupialia)	80.8	1585361	0.15
Metatheria (Marsupialia) – Eutheria (Afrotheria)	162.2	3226932	0.28

Total bits of homology in Afrotheria		18751656	1.63
Total bits of maximum information in humans		24905793	2.20
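As a rough cross-check, the successive differences of the totals in Table 1 can be computed as in the sketch below (the variable and function names are hypothetical). These differences are close to, but not identical with, the totals listed per split above (for example 3,731,618 versus 3,708,977 bits for the vertebrate split). One possible reason, assumed here rather than stated in the text, is that each comparison covers a slightly different set of aligned proteins.

```python
# Total homology bitscore per group, copied from Table 1.
TABLE1_TOTAL_BITS = [
    ("Cnidaria", 5465491),
    ("Cephalopoda", 5324040),
    ("Deuterostomia (non vertebrates)", 7041769),
    ("Cartilaginous fish", 10773387),
    ("Bony fish", 11853443),
    ("Amphibians", 12479403),
    ("Crocodylia", 13910052),
    ("Marsupialia", 15515530),
    ("Afrotheria", 18751656),
]

def successive_differences(rows):
    """Return (group, total minus previous group's total) for each group after the first."""
    return [
        (rows[i][0], rows[i][1] - rows[i - 1][1])
        for i in range(1, len(rows))
    ]

differences = dict(successive_differences(TABLE1_TOTAL_BITS))
```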

The same jumps are shown graphically in Figure 3:

As everyone can see, each of these splits, except the first one (Cnidaria-Bilateria), is characterized by a very relevant informational jump in terms of human-conserved information. The jump is in general of the order of 0.5 – 1.5 million bits.

However, two splits are characterized by a much bigger jump: the prevertebrate-vertebrate split reaches 3.7 million bits, while the Metatheria-Eutheria split is very near, with 3.2 million bits.

For the moment I will discuss only the prevertebrate-vertebrate jump.

This is where a great part of the functional information present in humans seems to have been generated: 3.7 million bits, and about 0.29 bits per aminoacid of new functional information.

Let’s see that jump also in terms of information density, looking again at Figure 2, but only with the first 4 groups of organisms:

Where is the jump here?

We can see that the density distribution is almost identical for Cnidaria and Cephalopoda. Deuterostomia (non vertebrates) show a definite gain in human-conserved information: as we know, it is about 1.68 million bits, and it corresponds to the grey area (and, obviously, to the lower peak of low-homology proteins).

But the real big jump is in vertebrates (cartilaginous fish). The pink area and the lower peak in the low homology zone correspond to the amazing acquisition of about 3.7 million bits of human-conserved functional information.

That means that a significant percentage of proteins in cartilaginous fish had a high homology, higher than 1 bit per aminoacid, with the corresponding human protein. Indeed, that is true for 9574 proteins out of 19898, 48.12% of the proteome. For comparison, these high homology proteins are “only” 4459 out of 19689, 22.65% of the proteome in pre-vertebrates.
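The percentages can be reproduced from the counts given above. A minimal sketch (the function name is hypothetical, and the threshold of 1 baa is the one used in the text):

```python
def percent_high_homology(high_count, total_count):
    """Percentage of aligned proteins with more than 1 bit per aminoacid."""
    return 100.0 * high_count / total_count

# Counts reported in the text.
cartilaginous_fish = percent_high_homology(9574, 19898)
pre_vertebrates = percent_high_homology(4459, 19689)

# Ratio of the two counts: how many times more high-homology proteins.
count_ratio = 9574 / 4459
```

With these counts, the number of high-homology proteins grows by a factor slightly above 2.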

So, in the transition from pre-vertebrates to vertebrates, the following amazing events took place:

About 3.7 million bits of human-conserved functional information were generated
A mean increase of about 190 bits per protein of that information took place
The number of high human homology proteins more than doubled

Correcting for redundancy

However, we must still correct for redundancy if we want to know how much really new functional information was generated in the transition to vertebrates. As I have explained, we should expect that about half of the total information can be considered unique information.

Making the correction for each single protein, the final result is that the total number of new unique functional bits that appear for the first time in the transition to vertebrates, and are then conserved up to humans, is:

1,764,427 bits

In other words, more than 1.7 million bits of unique new human-conserved functional information are generated in the proteome with the transition to vertebrates.

But what does 1.7 million bits really mean?

I would like to remind readers that we are dealing with exponential values here. A functional complexity of 1.7 million bits means a probability (in a random search) of:

1:2^1.7 million

A quite amazing number indeed!

Just remember that Dembski’s Universal Probability Bound is 500 bits, a complexity of 2^500. Our number (2^1764427) is so much bigger that the UPB seems almost a joke, in comparison.
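To get a feeling for these numbers in base 10, the exponent can be converted with a logarithm. This is a small sketch of the arithmetic only (the function name is hypothetical): it turns a complexity in bits into the base-10 exponent of the corresponding probability.

```python
import math

def log10_probability(bits):
    """Base-10 logarithm of the probability 1 / 2**bits."""
    return -bits * math.log10(2)

# 500 bits (the UPB) and the 1,764,427 bits computed above.
upb_exponent = log10_probability(500)
vertebrate_exponent = log10_probability(1764427)
```

So 500 bits correspond to a probability of roughly 1 in 10^150, while 1,764,427 bits correspond to roughly 1 in 10^531,145.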

Moreover, this huge modification in the proteome seems to be strongly constrained and definitely necessary for the new vertebrate bodily system, so much so that it is conserved for hundreds of millions of years after its appearance.

Well, that is enough for the moment. The analysis tools I have presented here can be used for many other interesting purposes, for example to compare the evolutionary history of proteins or groups of proteins. But that will probably be the object of further posts.

Comments

Missing technical discussion threads like this.
Dionisio, June 14, 2017, 05:06 AM PDT

GP I have been on radio silence for some time. Looks like I missed an interesting OP and discussion. Catching up at my pace. Very interesting stuff. Thanks for this. In case you missed my OP: https://uncommondescent.com/intelligent-design/selensky-shallit-koza-vs-artificial-life-simulations/
EugeneS, May 13, 2017, 11:46 AM PDT

We look forward to reading gpuccio's next article.
Dionisio, April 29, 2017, 03:34 AM PDT

Did the discussion here pause for lack of politely dissenting interlocutors? Where did they all go? Where are they? Don't want to publicly admit lacking arguments? :)
Dionisio, April 29, 2017, 03:29 AM PDT

gpuccio: Here's another interesting article: http://www.pnas.org/content/early/2017/04/14/1614896114.abstract
Dionisio, April 24, 2017, 04:06 AM PDT

gpuccio: Here's another interesting paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5270588/pdf/f1000research-6-10907.pdf
Dionisio, April 23, 2017, 06:39 PM PDT

Dionisio: No. I did not know that paper. It seems extremely interesting and I will read it carefully. Thank you again for your constant attention and work. :)
gpuccio, April 22, 2017, 01:05 AM PDT

gpuccio: A little old paper, maybe you already saw this: http://www.mdpi.com/1422-0067/16/3/6571/htm BTW, @233 in the first line of the last paragraph the word "our" should be "out" instead. My mistake. Fat fingers. :)
Dionisio, April 20, 2017, 04:51 PM PDT

gpuccio: The subject of your research is very interesting indeed. When I saw that paper your research topic came to mind right away. Also I shared it with UB because it seems somehow related to his OP on biosemiotics. Also it is interesting to read "design for repurposing" in the title of that paper. Now the problem for some folks our there is to show how they would do all that protein "reshaping": by trial and error? flipping coins? In any case they will have the advantage of knowing a priori the final shape, but still shouldn't be too easy to figure it out.
Dionisio, April 19, 2017, 10:19 AM PDT

Dionisio: Thank you indeed! That's exactly what I am trying to say. Proteins are reshaped, even when they conserve their main function throughout evolution. The reshaping can have many different meanings, but it is always a highly engineered result. Not only whole proteins, but even conserved domains are obviously reshaped in different species, and much of that reshaping is then conserved, and therefore functional. The 1.7 million bits of reshaping that took place to realize the proteome of the first vertebrates are traceable in the proteome in different forms: some proteins are almost completely new, and appear in cartilaginous fish practically for the first time, but many others are more or less drastically reshaped from existing homologues in pre-vertebrates. Conserved domains, which are often the least "reshaped" part of proteins, are also different in different species and contexts. Too much emphasis has always been given to functional similarities and to non functional differences, but the simple truth is that a very important part of the scenario has been repeatedly underestimated: the huge treasure chest of functional differences in similar things.
gpuccio, April 19, 2017, 07:52 AM PDT

gpuccio: Here's a link to the paper referenced @229 & @230 above: https://www.researchgate.net/profile/Litao_Sun/publication/311089510_Two_crystal_structures_reveal_design_for_repurposing_the_C-Ala_domain_of_human_AlaRS/links/583f0da708ae2d217557da29/Two-crystal-structures-reveal-design-for-repurposing-the-C-Ala-domain-of-human-AlaRS.pdf
Dionisio, April 19, 2017, 06:46 AM PDT

gpuccio: here's more on the same paper:
The 20 aminoacyl tRNA synthetases (aaRSs) couple each amino acid to their cognate tRNAs. [...] 19 aaRSs expanded by acquiring novel noncatalytic appended domains, which are absent from bacteria and many lower eukaryotes but confer extracellular and nuclear functions in higher organisms. AlaRS is the single exception, with an appended C-terminal domain (C-Ala) that is conserved from prokaryotes to humans but with a wide sequence divergence. In human cells, C-Ala is also a splice variant of AlaRS. Crystal structures of two forms of human C-Ala, and small-angle X-ray scattering of AlaRS, showed that the large sequence divergence of human C-Ala reshaped C-Ala in a way that changed the global architecture of AlaRS. This reshaping removes the role of C-Ala in prokaryotes for docking tRNA and instead repurposes it to form a dimer interface presenting a DNA-binding groove. This groove cannot form with the bacterial ortholog. Direct DNA binding by human C-Ala, but not by bacterial C-Ala, was demonstrated. Thus, instead of acquiring a novel appended domain like other human aaRSs, which engendered novel functions, a new AlaRS architecture was created by diversifying a preexisting appended domain.
Two crystal structures reveal design for repurposing the C-Ala domain of human AlaRS Litao Sun, Youngzee Song, David Blocquel, Xiang-Lei Yang and Paul Schimmel PNAS vol. 113 no. 50 14300–14305 doi: 10.1073/pnas.1617316113

Dionisio, April 19, 2017, 01:51 AM PDT

gpuccio: I thought you may want to take a look at this:
Here we present an exception that supports the rule that the 20 human tRNA synthetases acquired new architectures to expand their functions during evolution. The new features are associated with novel, appended domains that are absent in prokaryotes and retained by their many splice variants. Alanyl-tRNA synthetase (AlaRS) is the single example that has a prototypical appended domain—C-Ala—even in prokaryotes, which is spliced out in humans. X-ray structural, small-angle X-ray scattering, and functional analysis showed that human C-Ala lost its prokaryotic tRNA functional role and instead was reshaped into a nuclear DNA-binding protein. Thus, we report another paradigm for tRNA synthetase acquisition of a novel function, namely, repurposing a preexisting domain rather than addition of a new one.
Two crystal structures reveal design for repurposing the C-Ala domain of human AlaRS Litao Sun, Youngzee Song, David Blocquel, Xiang-Lei Yang and Paul Schimmel PNAS vol. 113 no. 50 14300–14305 doi: 10.1073/pnas.1617316113

Dionisio, April 19, 2017, 01:50 AM PDT

#3 & #4 in the top 5 Popular Posts (Last 30 Days) FFT*: Charles unmasks the anti-ID trollish tactic of… (3,220) The Woeful State of Modern Debate (2,419) The amazing level of engineering in the transition to the… (1,815) GP on the Origin of Body Plans [OoBP] challenge (1,348) A world-famous chemist tells the truth: there’s no… (1,309)
Dionisio, April 15, 2017, 01:31 AM PDT

gpuccio: You've got two* in the hit parade? :) Popular Posts (Last 30 Days)
FFT*: Charles unmasks the anti-ID trollish tactic of… (3,077) The Woeful State of Modern Debate (2,419) The amazing level of engineering in the transition to the… (1,809) A world-famous chemist tells the truth: there’s no… (1,363) GP on the Origin of Body Plans [OoBP] challenge (1,339)
(*) one graciously provided by KF
Dionisio, April 14, 2017, 05:03 AM PDT

gpuccio: Glad to hear that, but please take your time. No rush. Thanks. BTW, here's another update. Up one position again. Popular Posts (Last 30 Days)
FFT*: Charles unmasks the anti-ID trollish tactic of… (2,770) The Woeful State of Modern Debate (2,416) The amazing level of engineering in the transition to the… (1,795) The problem of agit prop street theatre (U/D: UC Berkeley… (1,585) Physics and the contemplation of nothing (1,571)
Dionisio, April 12, 2017, 10:34 AM PDT

Dionisio: I am working at it! :)
gpuccio, April 11, 2017, 02:10 PM PDT

Still kicking? Up one position. Stats update: Popular Posts (Last 30 Days)
The Woeful State of Modern Debate (2,410) FFT*: Charles unmasks the anti-ID trollish tactic of… (2,383) Physics and the contemplation of nothing (1,886) The amazing level of engineering in the transition to the… (1,783) The problem of agit prop street theatre (U/D: UC Berkeley… (1,699)
Logically this thread may fade away from the ranking soon, leaving room for your next OP. :)
Dionisio, April 11, 2017, 02:05 AM PDT

Dionisio: Yes, I am happy of some more visibility, because after all I really believe that the things presented and discussed here are worthwhile. Thank you for your constant support! :)
gpuccio, April 6, 2017, 08:14 AM PDT

gpuccio: More stats:
Popular Posts (Last 30 Days) Physics and the contemplation of nothing (2,546) The Woeful State of Modern Debate (2,374) The problem of agit prop street theatre (U/D: UC Berkeley… (2,089) A world-famous chemist tells the truth: there’s no… (1,941) The amazing level of engineering in the transition to the… (1,752)
:)
Dionisio, April 6, 2017, 06:14 AM PDT

Dionisio: Yes, those are satisfying numbers indeed. To them we could safely add: Not one single criticism about the specific content of the OP, neither here nor in KF's follow-up. Should I be happy or sad? :)
gpuccio, April 5, 2017, 05:41 AM PDT

GP, considering that this OP + following discussion is quite technical -hence less attractive to the general public than a Hollywood celebrity scandal (welcome to this world!)- it's encouraging to see the below stats: As of now: 1,709 visits vs. 219 posted comments. 1,490 more visits than posted comments (probably including some anonymous onlookers/lurkers). Almost 7 times more visits without "footprints" than posted comments.
Dionisio, April 5, 2017, 12:24 AM PDT

GP @216:
So, where is all the shredding?
They ran out of power to start their toothless shredder.
Dionisio, April 5, 2017, 12:06 AM PDT

GP @216:
“goodwill” of rvb8
That's a funny "2 nice" way to refer to "trolling" :)
Dionisio, April 5, 2017, 12:03 AM PDT

GP, is this somehow -at least slightly- related to the topic of your latest OPs? https://uncommondescent.com/evolution/protein-families-are-still-improbably-astonishing-retraction-of-matlock-and-swamidass-paper-in-order Thanks.
Dionisio, April 4, 2017, 11:59 PM PDT

KF: So, where is all the shredding? :) Here everything is calm. On your thread, a very interesting discussion is going on, with a lot of visualizations and many very good contributions. From "our side". From the other side, just the "goodwill" of rvb8 and some brief and cautious appearance of Bob O'H. If I am not wrong.
gpuccio, April 4, 2017, 05:10 AM PDT

KF,
Of course, if they could, it would be done here too.
Exactly. They simply lack serious arguments.
Dionisio, April 4, 2017, 05:07 AM PDT

GP, don't forget: we can shred that stuff elsewhere. Of course, if they could, it would be done here too. KF
kairosfocus, April 4, 2017, 02:58 AM PDT

Dionisio: It seems that the main objection that I receive, both here and in KF's new thread, is: "Why do you publish your things here?" Strange indeed.
gpuccio, April 3, 2017, 03:52 AM PDT

KF: Thank you indeed! I am promptly joining the discussion there! :)
gpuccio, April 2, 2017, 03:33 AM PDT
