Uncommon Descent Serving The Intelligent Design Community

Bioinformatics tools used in my OPs: some basic information.

EugeneS made this simple request in the thread about Random Variation:

I also have a couple of very concrete and probably very simple questions regarding the bioinformatics algorithms and software you are using. Could you write a post on the bioinformatics basics, the metrics and a little more detail about how you produced those graphs, for the benefit of the general audience?

That’s a very reasonable request, and so I am trying here to address it. So, this OP is mainly intended as a reference, and not necessarily for discussion. However, I will be happy, of course, to answer any further requests for clarifications or details, or any criticism or debate.

My first clarification is that I work on protein sequences, and I essentially use two important tools that are freely available on the web.

The first basic site is Uniprot.

The mission of the site is clearly stated in the home page:

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

I would say: how beautiful to work with a site which, in its own mission, incorporates the concept of functional information! And believe me, it’s not an ID site! 🙂

Uniprot is a database of proteins. Here is a screenshot of the search page.

 

 

Here I searched for “ATP synthase beta”, and I easily found the human form of the beta chain:

Now, while the “Entry name”, “ATPB_HUMAN”, is a brief identifier of the protein in Uniprot, the really important ID is in the column “Entry”: “P06576”. Indeed, this is the ID that can be used as an accession number in the BLAST software, which we will discuss later.

The “Reviewed” icon in the third column is important too, because in general it’s better to use only reviewed sequences.

By clicking on the ID in the “Entry” column, we can open the page dedicated to that protein.

 

 

Here, we can find a lot of important information, first of all the “Function” section, which sums up what is known (or not known) about the protein function.

Another important section is the “Family and Domains” section, which gives information about domains in the protein. In this case, it just states:

Belongs to the ATPase alpha/beta chains family.

Then, the “Sequence” section gives the reference sequence for the protein:

It is often useful to have the sequence in FASTA format, which is probably the most commonly used format for sequences. To do that, we can simply click on the FASTA button (above the Sequence section). This is the result:

This sequence is made of two parts: a comment line, which is a summary description of the sequence, and then the sequence itself. A sequence in this form can easily be pasted into BLAST, or other bioinformatics tools, either including the comment line, or just using the mere sequence.
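
For readers who want to handle FASTA text in a script, here is a minimal Python sketch of the two-part structure described above (comment line, then sequence). The function name `parse_fasta` and the example record are my own illustration: the header and residues below are invented placeholders, not the real P06576 entry.

```python
def parse_fasta(text):
    """Split FASTA text into (comment_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines are simply concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented placeholder record, for illustration only.
example = """>sp|X00000|DEMO_HUMAN Hypothetical demo protein
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
```

The sequence part of each pair is the "mere sequence" that can be pasted into BLAST without the comment line.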

Now, let’s go to the second important site: BLAST (Basic Local Alignment Search Tool).  It’s a service of NCBI (National Center for Biotechnology Information).

We want to go to the Protein Blast page.

Now, let’s see how we can verify my repeated statement that the beta chain of ATP synthase is extremely conserved, from bacteria to humans. So, we paste the ID from Uniprot (P06576) in the field “Accession number”, and we select Escherichia coli in the field “Organism” (important: the organism name must be selected from the drop-down menu, and must include the taxid number). IOWs, we are blasting the human protein against all E. coli proteins. Here’s how it looks:

 

Now, we can click on the “BLAST” blue button (bottom left), and the query starts. It takes a little time. Here is the result:

 

In the upper part, we can see a line representing the 529 AAs which make the protein sequence, and the recognized domains in the sequence (only one in this case).

The red lines are the hits (red, because each of them is higher than 200 bits). When you see red lines, something is there.

Going down, we see a summary of the first 100 hits, in order of decreasing homology. We can see that the first hit is with a protein named “F0F1 ATP synthase subunit beta [Escherichia coli]”, and has a bitscore of 663 bits. However, there are more than 50 hits with a bitscore above 600 bits, all of them in E. coli (that was how our query was defined), and all of them with variants of the same protein. Such a redundancy is common, especially with bacteria, and especially with E. coli, because there are a lot of sequences available, often of practically the same protein.

Now, if we click on the first hit, or just go down, we can find the corresponding alignment:

 

The “query” here is the human protein. You can see that the alignment involves AAs 59 – 523, corresponding to AAs 2 – 460 of the “Subject”, that is the E. coli protein.

The middle line represents the identities (aminoacid letter) and positives (+). The title reminds us that the hit is with a protein of E. coli whose name is “F0F1 ATP synthase subunit beta”, which is 460 AAs long (rather shorter than the human protein). It also gives us an ID/accession number for the protein, which is a different ID from Uniprot’s IDs, but can be used just the same for BLAST queries.

The important components of the result are:

  1. Score: this is the bitscore, the number I use to measure functional information, provided that the homology is conserved for a long evolutionary time (usually, at least 200 – 400 million years). The bitscore is already adjusted for the properties of the scoring system.
  2. The Expect number: it is simply the number of such homologies that we would expect to find for unrelated sequences (IOWs random homologies) in a similar search. This is not exactly a p value, but, as stated in the BLAST reference page: when E < 0.01, P-values and E-value are nearly identical.
  3. Identities: just the number and percent of identical AAs in the two sequences, in the alignment. The percent is relative to the aligned part of the sequence, not to the total of its length.
  4. Positives: the number and percent of identical + positive AAs in the alignment. Here is a clear explanation of what “positives” are:

     Similarity (aka Positives)
     When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won’t compromise the translated protein.
     Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.
     (From: binf.snipcademy.com)

     IOWs, we could consider “positives” as “half identities”.
  5. Gaps: this is the number of gaps used in the alignment. Our alignments are gapped, IOWs spaces are introduced to improve the alignment. The lower the number of gaps, the better the alignment. However, the bitscore already takes gaps into consideration, so we can usually not worry too much about them.
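
To make the relation between these numbers a little more concrete, here is a small Python sketch of the standard BLAST (Karlin-Altschul) statistics: the bitscore is the raw score normalized by the scoring-system parameters lambda and K, the expect value follows from the bitscore and the size of the search space, and the P-value follows from the expect value. The function names are my own, and the plain lengths used here are a simplification: BLAST itself works with "effective" lengths, so this will not reproduce the website's E-values exactly.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(expect):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-expect)


# For small E the two measures almost coincide, as the BLAST page states.
small_e = 0.01
small_p = p_value(small_e)
```

With E = 0.01 the P-value differs from E by less than one part in a hundred, which is why the two can be treated as nearly identical below that threshold.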

A very useful tool is the “Taxonomy report” (at the top of the page), which shows the hits in the various groups of organisms.

While in our example we looked only at E. coli, usually our search will include a wider range of organisms. If no organism is specified, BLAST will look for homologies in the whole protein database.

It is often useful to make queries in more groups of organisms, if necessary using the “exclude” option. For example, if I am interested in the transition to vertebrates for the SATB2 protein (ID = Q9UPW6, a protein that I have discussed in a previous OP), I can make a search in the whole metazoa group, excluding only vertebrates, as follows:

 

As you can see, there is very low homology before vertebrates:

 

And this is the taxonomy report:

The best hit is 158 bits in a spider.

Then, to see the difference in the first vertebrates, we can run a query of the same human protein on cartilaginous fish. Here is the result:

 

As you can see, now the best hit is 1197 bits. Quite a difference from the 158 bits best hit in pre-vertebrates.

Well, that’s what I call an information jump!

Now, my further step has been to gather the results of similar BLAST queries made for all human proteins. It is practically impossible to do that online, so I downloaded the BLAST executables and databases. That can be done from the BLAST site, and allows one to make queries locally on one’s own computer. The use of the BLAST executables is a little more complex, because it is done through command line instructions, but it is not extremely difficult.
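
Just to give an idea of what such a command line looks like, here is a hedged Python sketch that assembles a `blastp` call. This is not the exact command I used: the file names (`human_reviewed.fasta`, `results.tsv`) and the database name are hypothetical placeholders, and the restriction to a specific group of organisms is not shown. The flags themselves (`-query`, `-db`, `-out`, `-outfmt`, `-evalue`, `-num_threads`) are standard options of the BLAST+ `blastp` program.

```python
def blastp_command(query_fasta, database, out_path, threads=4):
    """Build the argument list for a local blastp run with tabular output."""
    return [
        "blastp",
        "-query", query_fasta,        # FASTA file with the query proteins
        "-db", database,              # name of a local BLAST protein database
        "-out", out_path,             # where the results are written
        "-outfmt", "6",               # tab-separated, one line per alignment
        "-evalue", "10",              # default expect threshold, made explicit
        "-num_threads", str(threads),
    ]


# Hypothetical file and database names, for illustration only.
cmd = blastp_command("human_reviewed.fasta", "nr", "results.tsv")
# To actually run it (requires the BLAST executables and a local database):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with spaces safe when it is passed to `subprocess.run`.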

To perform my queries, I downloaded from Uniprot a list of all reviewed human proteins: at the time I did that, the total number was 20171. Today, it is 20239. The number varies slightly because the database is constantly modified.

So, using the local BLAST executables and the BLAST databases, I performed multiple queries of all the human proteome against different groups of organisms, as detailed in my OP here:

This kind of query takes some time, from a few hours to a few days.

I have then imported the results into Excel, generating a dataset where for each human protein (20171 rows) I have the value of protein ID, name and length, and the best hit for each group of organisms, including protein and organism name, protein length, bitscore value, expect value, number and percent of identities and positives, and gaps. IOWs, all the results that we have when we perform a single query on the website, limited to the best hit.
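
The "limited to the best hit" step can also be done in a script instead of Excel. Here is a minimal Python sketch, assuming the results are in the default tabular BLAST format (12 tab-separated columns, with query ID first, subject ID second and bitscore last). The function name and the three example rows are invented for illustration; they are not real results.

```python
def best_hits(tabular_lines):
    """Return {query_id: (subject_id, bitscore)}, keeping the highest bitscore."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        bitscore = float(fields[11])  # 12th column of the default tabular format
        if query_id not in best or bitscore > best[query_id][1]:
            best[query_id] = (subject_id, bitscore)
    return best


# Invented rows: qseqid sseqid pident length mismatch gapopen
#                qstart qend sstart send evalue bitscore
rows = [
    "Q1\tS1\t40.0\t200\t120\t2\t1\t200\t5\t204\t1e-20\t50.0",
    "Q1\tS2\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-90\t300.0",
    "Q2\tS3\t55.0\t150\t67\t1\t3\t152\t9\t158\t1e-30\t75.5",
]
hits = best_hits(rows)
```

Running this once per group of organisms gives one best-hit column per group, which is the shape of the dataset described above.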

In the Excel dataset I have then computed some derived variables:

  • the bitscore per AA site (baa: total bitscore / human protein length)
  • the information jump in bits for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (bitscore for cartilaginous fish – bitscore for non-vertebrate deuterostomes)
  • the information jump in bits per aminoacid site for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (baa for cartilaginous fish – baa for non-vertebrate deuterostomes)
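
As a worked example of these derived variables, here is a short Python sketch using the SATB2 numbers quoted earlier in this post (length 733 AAs, best hit of 1197 bits in cartilaginous fish, best hit of 158 bits before vertebrates). Please note one assumption: the 158-bit value comes from the search in all non-vertebrate metazoa, not from non-vertebrate deuterostomes only, so it is used here just to illustrate the arithmetic. The function names are my own.

```python
def bits_per_site(bitscore, human_length):
    """baa: total bitscore divided by the length of the human protein."""
    return bitscore / human_length


def information_jump(later_group_value, earlier_group_value):
    """Difference between the later group and the earlier group."""
    return later_group_value - earlier_group_value


satb2_length = 733   # AAs, human SATB2
fish_bits = 1197     # best hit in cartilaginous fish
prevert_bits = 158   # best hit before vertebrates (all non-vertebrate metazoa)

jump_bits = information_jump(fish_bits, prevert_bits)
jump_baa = information_jump(
    bits_per_site(fish_bits, satb2_length),
    bits_per_site(prevert_bits, satb2_length),
)
```

Here the jump is 1039 bits, that is about 1.42 bits per aminoacid site.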

The Excel data are then imported into R. R is a wonderful open source statistical software and programming language that is constantly developed and expanded by statisticians all over the world. It also allows one to create very good graphs, like the following:

This is a kind of graph for which I have written the code and which, using the above-mentioned dataset, can easily plot the evolutionary history, from cnidaria to mammals, of any human protein, or group of proteins, using their IDs. This graph uses the bits per aminoacid values, and therefore sequences of different lengths can easily be compared. A reference line is always plotted, with the mean baa value in each group of organisms for all human proteins. That already allows us to visualize how the pre-vertebrate-vertebrate transition exhibits the greatest informational jump, in terms of human conserved information.
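
My own code for the graph is in R and is not reproduced here. As a rough idea of how the reference line is obtained, here is a minimal Python sketch that computes the mean baa in each group of organisms over a set of proteins. The group names, protein labels and baa values are invented placeholders, not values from my dataset.

```python
from statistics import mean


def reference_line(baa_by_protein, groups):
    """Mean baa over all proteins, for each group in evolutionary order."""
    return [
        mean(values[group] for values in baa_by_protein.values())
        for group in groups
    ]


# Invented placeholder values, for illustration only.
groups = ["cnidaria", "cartilaginous fish"]
baa_by_protein = {
    "protein_A": {"cnidaria": 0.5, "cartilaginous fish": 1.0},
    "protein_B": {"cnidaria": 1.5, "cartilaginous fish": 2.0},
}
line = reference_line(baa_by_protein, groups)
```

The history of an individual protein is just its own row of baa values in the same group order, plotted against this line.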

However, the plots regarding individual proteins are much more interesting, and they reveal huge differences in their individual histories. In the above graph, for example, I have plotted the histories of two completely different proteins:

  1. The green line is protein Cdc42, “a small GTPase of the Rho family, which regulates signaling pathways that control diverse cellular functions including cell morphology, cell migration, endocytosis and cell cycle progression” (Wikipedia). It is 191 AAs long, and, as can be seen in the graph, it is extremely conserved in all metazoa, presenting almost maximal homology with the human form already in Cnidaria.
  2. The brown line is our well known SATB2 (see above), a 733 AAs protein which is, among other things, “required for the initiation of the upper-layer neurons (UL1) specific genetic program and for the inactivation of deep-layer neurons (DL) and UL2 specific genes, probably by modulating BCL11B expression. Repressor of Ctip2 and regulatory determinant of corticocortical connections in the developing cerebral cortex.” (Uniprot) In the graph we can admire its really astounding information jump in vertebrates.

This kind of graph is very good to visualize the behaviour of proteins and of groups of proteins. For example, the following graph is of proteins involved in neuronal adhesion:

It shows, as expected, a big information jump, mainly in cartilaginous fish.

The following, instead, is of odorant receptors:

Here, for example, the jump is later, in bony fish, and it goes on in amphibians and reptiles, and up to mammals.

Well, I think that I have given at least a general idea of the main issues and procedures. If anyone has specific requests, I am ready to answer.

 

Comments
GPuccio @31
GPuccio: Yes, it is extraordinarily conserved in all Metazoa. But also in single celled eukaryotes and plant (3900 – 4200 bits). But it is really, really absent in prokaryotes. ... Nothing at all in Archaea (3 “hits” of 37 bits, expect value 5 – 6.9). It’s exciting, isn’t it? If we are looking for a really new protein in eukaryotes, this is one of the best candidates I have ever seen! :)
Let's see. What is the evolutionary explanation of 3900 bits of functional information? Yes, indeed, “functional”, because it is conserved for many hundreds of millions of years. Dembski’s Universal Probability Bound informs us that the probabilistic resource of our whole universe, from the Big Bang to now, is 10^150, or 2^500, or 500 bits. 3900 bits correspond to a search space of 2^3900. Just let that sink in for a moment … breathe calmly ... and then you just know:
unguided evolution is a lie!
Origenes
November 18, 2017, 05:50 AM PDT
Origenes: Let's say it together: You are a quick learner and unguided evolution is a lie! (Dionisio, you can join us if you like, but please don't shout too loud! :) )
gpuccio
November 18, 2017, 05:15 AM PDT
Origenes: Yes, it is extraordinarily conserved in all Metazoa. But also in single celled eukaryotes and plant (3900 - 4200 bits). But it is really, really absent in prokaryotes. We have the usual wrong hit in Chlamydia trachomatis (2958 bits!), and two extremely suspicious "hits" with two different "proteins" in E. coli (one 72 AAs long, the other 58 AAs long) which show complete and almost complete identity with two different parts of our protein. Definitely, some contamination here too. And absolutely nothing in all the rest of the bacterial world. Nothing at all in Archaea (3 "hits" of 37 bits, expect value 5 - 6.9). It's exciting, isn't it? If we are looking for a really new protein in eukaryotes, this is one of the best candidates I have ever seen! :)
gpuccio
November 18, 2017, 05:13 AM PDT
Dionisio: "Or it could have been “co-opted” by RV+NS+HGT+T+… and so on… right?" Of course, how could I forget? :)
gpuccio
November 18, 2017, 04:53 AM PDT
Origenes @26: You're ahead of the class in this technical course taught by gpuccio. Well done! I'm still behind and can use your example too. Thanks.
Dionisio
November 18, 2017, 04:40 AM PDT
gpuccio @25: Thanks for the explanation.
Dionisio
November 18, 2017, 04:36 AM PDT
gpuccio @20: "After all, we know that apparently almost 50% of protein domain superfamilies were already present in LUCA!" Where is this LUCA positioned relative to bacteria, prokaryotes, etc.? Before, in between, after? Thanks. PS. Sorry for the dumb questions
Dionisio
November 18, 2017, 04:33 AM PDT
GPuccio @19
GPuccio: If you really like an extreme experience, I would suggest that you give a look to the main partner of SNRNP200 in the U5 construct of the spliceosome: PRP8 (Q6P2Q9), length 2335 AAs.
Uniprot informs us: Protein: Pre-mRNA-processing-splicing factor 8; Gene: PRPF8; Organism: Homo sapiens (Human); Status: Reviewed; Function: functions as a scaffold that mediates the ordered assembly of spliceosomal proteins and snRNAs. Required for the assembly of the U4/U6-U5 tri-snRNP complex. Functions as scaffold that positions spliceosomal U2, U5 and U6 snRNAs at splice sites on pre-mRNA substrates, so that splicing can occur. Interacts with both the 5' and the 3' splice site; Length: 2335. Okay, let's blast this protein! First a general blast; not excluding anybody. The ‘taxonomy report’ tells us that the protein is extremely well-preserved in vertebrates. Primates, rodents, whales, bats, birds, snakes, crocodiles, frogs and bony fishes all score above 97% homology. Even Pundamilia nyererei, a colorful Victorian cichlid fish, scores 4780, which means 97% homology. So, let’s blast while excluding vertebrates. Here the taxonomy report tells us that the protein is also well preserved in ‘pre-vertebrates.’ Starfish, termites, lice, horseshoe crabs, bees, ants, beetles, mosquitos and flies all score in the range of 4596 to 4441, which means between 94% and 92% homology. Now what? Where does this huge protein come from? In e.g. E. coli it seems to be non-existent.
Origenes
November 18, 2017, 04:32 AM PDT
Dionisio: The complex you refer to is formed by two identical rings, one above the other, each made of eight different proteins, each of them about 550 AAs long. The eight monomers are related, but not identical: they share about 220-300 bits of homology and 33% identity. Usually, this complex is said to exist in archaea but not in bacteria, but I have found significant homologies in archaea, and in some bacteria too, especially clostridia, of the order of 150 - 400 bits and 30 - 40% identities. Of course, in single celled eukaryotes we find much higher homology, about 800 bits in fungi, 750 in cellular slime molds. The sequences are, of course, highly conserved in metazoa. I don't know if these sequences require chaperones or chaperonins to fold. Many proteins do not need that.
gpuccio
November 18, 2017, 04:30 AM PDT
gpuccio @19:
There are proteins which have no detectable homology before a specific transition. In that case, we can assume that the protein arose apparently “from scratch” in that transition. Of course, it could still have been engineered from some existing protein, or from some non coding DNA, by full “rewriting”.
Or it could have been "co-opted" by RV+NS+HGT+T+... and so on... right? :)
Dionisio
November 18, 2017, 04:27 AM PDT
Origenes @16: No, what happens is that both gpuccio and you don't understand evolution. :) You may want to learn Biology 101 first. :)
Dionisio
November 18, 2017, 04:20 AM PDT
gpuccio @15: "You are my best adversary!" Well, I was a little disappointed by the conspicuous absence of real opponents in this thread where you've publicly revealed the 'secrets' of how you analyze the proteins in order to write your OPs and follow-up commentaries. I recall a couple of years ago somebody encouraged a distinguished biochemistry professor to respond to a challenge I have posted --imitating professor Tour's challenge-- and thus teach me (and the rest of us) a lesson or two. We know the rest of the story. Perhaps looking back that distinguished biochemistry professor regrets having kneejerk reacted to that commenter who encouraged him to engage in that discussion which turned so embarrassing. Too late now. I thought maybe I could encourage some polite dissenters to jump into this discussion? But apparently I lack the persuasive skills that 'encouraging' commenter had back then? :)
Dionisio
November 18, 2017, 04:11 AM PDT
gpuccio, Is the eukaryotic chaperonin TRiC (TCP-1 Ring Complex, also called CCT for chaperonin containing TCP-1) an interesting candidate for the kind of "functional information" analysis you have demonstrated here? BTW, does the eukaryotic chaperonin require 3D folding too? Does it use a chaperone? IOW, could this be another "chicken-egg" conundrum case? Just curious. Thanks.
Dionisio
November 18, 2017, 03:37 AM PDT
Mung: However, an important concept, IMO, is that, from a strict ID point of view, it is not so important if a protein appears apparently from scratch, or if it shares part of its functional information with other pre-existing proteins: what really matters is the amount of new functional information added in the transition. IOWs, a protein which arises de novo, but has only, say, 300 bits of new functional information, is less amazing, from an engineering point of view, than a protein which shares 500 bits of functional information with other already existing proteins, but to which 1000 bits of new functional information are added in the transition. Rewiring of old proteins, or the generation of new proteins which share modules with other existing proteins, is as important as the generation of a completely new protein, from a design point of view. New functional information is new functional information, however it is distributed. After all, we know that apparently almost 50% of protein domain superfamilies were already present in LUCA!
gpuccio
November 18, 2017, 12:54 AM PDT
Mung (and Origenes): "How do we tell which proteins evolved from some other protein and which proteins did not evolve from some other protein? Can this database answer those questions?" Yes, it can. There are proteins which have no detectable homology before a specific transition. In that case, we can assume that the protein arose apparently "from scratch" in that transition. Of course, it could still have been engineered from some existing protein, or from some non coding DNA, by full "rewriting". Now, in the vertebrate transition we don't expect to have many of those "completely de novo" proteins, because at that point most of the protein domains have already appeared. However, I have made a quick search in my database, and there are definitely some good examples. I will just mention the first I found: Activity-dependent neuroprotector homeobox protein (Q9H2P0). It seems to appear in cartilaginous fish, with a best hit of only 52.4 bits (expect 0.007) in bees, and even lower ones in some other insects, limited to a very short part of the sequence (about 100 AAs). The protein is 1102 AAs long, and the human - cartilaginous fish homology is 1267 bits (1204 + 62.8 in two non overlapping alignments). The function: "Potential transcription factor. May mediate some of the neuroprotective peptide VIP-associated effects involving normal growth and cancer proliferation." Of course, neo-darwinists will certainly find potential precursors, if really motivated, maybe using extra-sensitive approaches. But the simple truth is that this protein seems to arise in vertebrates, as a real novelty. And, of course, there are others like it. Not many, in the vertebrate transition. And each case must be manually verified, to avoid false positives. Of course, we can find a lot more in the eukaryote transition. But I cannot do that directly from my database, because it is only about metazoa.
I could not include full queries on all known bacteria and archaea and single celled eukaryotes, because that would have needed months of computations, with my resources, because of the huge number of sequences in those groups. Moreover, as pointed out to Origenes, there are many false data in those groups, and each case should be reviewed manually. However, Origenes has already pointed to one very promising molecule, SNRNP200. In doing so, he has stimulated my interest in the spliceosome, a specific eukaryotic construct, and an incredibly complex molecular machine! If you really like an extreme experience, I would suggest that you give a look to the main partner of SNRNP200 in the U5 construct of the spliceosome: PRP8 (Q6P2Q9), length 2335 AAs. Origenes, that's homework for you! :)
gpuccio
November 17, 2017, 10:23 PM PDT
Origenes: "The information-jump during the pre-vertebrate vertebrate transition is: 1241 bits. Yes! Easily beating GPuccio’s SATB2 protein (see OP), which means that I am a quick learner and that unguided evolution is a lie." Yes and yes! :) :)
gpuccio
November 17, 2017, 09:16 PM PDT
Hi gpuccio, Thank you for this OP. But every existent protein evolved from some other protein, unless it did not. So how does this database help us tell the difference? How do we tell which proteins evolved from some other protein and which proteins did not evolve from some other protein? Can this database answer those questions?
Mung
November 17, 2017, 08:17 PM PDT
Protein: Histone acetyltransferase p300; Gene: EP300; Organism: Homo sapiens (Human); Status: Reviewed; Length: 2414; Function: Functions as histone acetyltransferase and regulates transcription via chromatin remodeling.
- - - - -
Cartilaginous fish: Rhincodon typus (whale shark), score: 2798
Pre-vertebrates: Tribolium castaneum (red flour beetle), score: 1557
E. coli: “No significant similarity found”.
- - - -
The information-jump during the pre-vertebrate vertebrate transition is: 1241 bits. Yes! Easily beating GPuccio’s SATB2 protein (see OP), which means that I am a quick learner and that unguided evolution is a lie.
Origenes
November 17, 2017, 04:06 PM PDT
Dionisio: You are my best adversary! :) :)
gpuccio
November 17, 2017, 03:07 AM PDT
Here are strong arguments against your presentation and your so called 'challenges':
Posted @56 here: https://uncommondescent.com/evolution/rethinking-biology-what-role-does-physical-structure-play-in-the-development-of-cells/#comment-643666
Posted @57 here: https://uncommondescent.com/evolution/rethinking-biology-what-role-does-physical-structure-play-in-the-development-of-cells/#comment-643668
:)
Dionisio
November 17, 2017, 02:38 AM PDT
There isn't much complexity in the functionality described @9, right? It all looks so simple and straightforward. You guys seem to get too excited about anything. :)
Dionisio
November 17, 2017, 02:14 AM PDT
"The only real urge of neo-darwinism is to flatten biological reality, and to deny as much as possible of the wonders in it, just in order to survive." That seems like a valid motive, doesn't it? :)
Dionisio
November 17, 2017, 02:10 AM PDT
gpuccio @8:
[...] the transition from prokaryotes to eukaryotes, although fascinating and probably the one with the greatest information jump after OOL (or maybe even more than OOL), is extremely difficult to analyze. The main reason is that we know too little and understand even less.
Hmm... Then on that 'ignorance' background, doesn't every new discovery shedding more light on the previously unknown, seem to reveal undeniable designed systems? Therefore, isn't the known what points to intelligent design? In this case isn't the amount of functional information that gpuccio's OPs and follow-up comments show associated with certain protein families in some biological systems what points to intelligent design? Couldn't all that point to unguided processes instead? Why not? How about 'co-option' for example? :)
Dionisio
November 17, 2017, 01:55 AM PDT
Origenes: When I say "more than 2000 bits" I am referring, tentatively, to the difference between the best hit in prokaryotes (311 bits in archaea) and the best hit in the group of single celled eukaryotes that shows the lowest homology (2363 in cellular slime molds). That's probably the best we can do, in the absence of some clear understanding of eukaryotes' evolutionary history.
gpuccio
November 17, 2017, 01:30 AM PDT
Origenes: An interesting consideration about your protein is its main function. From Uniprot: "RNA helicase that plays an essential role in pre-mRNA splicing as component of the U5 snRNP and U4/U6-U5 tri-snRNP complexes. Involved in spliceosome assembly, activation and disassembly. Mediates changes in the dynamic network of RNA-RNA interactions in the spliceosome. Catalyzes the ATP-dependent unwinding of U4/U6 RNA duplices, an essential step in the assembly of a catalytically active spliceosome." And the spliceosome is "a large and complex molecular machine found primarily within the splicing speckles of the cell nucleus of eukaryotic cells" (Wikipedia). That easily explains, from a functional point of view, why the protein is specific to eukaryotes.
gpuccio
November 17, 2017, 01:25 AM PDT
Origenes: You are perfectly right! The protein is essentially engineered in eukaryotes, even if important rewirings are certainly achieved at the usual stages, after that (Metazoa, vertebrates). And it is absolutely true that the transition from prokaryotes to eukaryotes, although fascinating and probably the one with the greatest information jump after OOL (or maybe even more than OOL), is extremely difficult to analyze. The main reason is that we know too little and understand even less. We don't have any idea of when it happened. We don't have any idea of how it happened (indeed, we have too many contrasting and unsatisfying ideas!). And I am not referring here to the big problem (design or non design), but just to the basics of the evolutionary history! Indeed, there is no real consensus, even gross consensus, about the evolutionary chain of eukaryotes, even less about their immediate precursors in prokaryotes: the various debates about bacteria, archaea, and their different possible symbioses, are good evidence for that. That's why I stick to the vertebrate transition, for the moment, even if I have been tempted many times to analyze the eukaryote transition! The vertebrate transition has the remarkable advantage of being well localized in natural history, and vertebrates certainly have a much more understood evolutionary chain. However, the engineering of eukaryotes is certainly a major and amazing event: eukaryotes are really another thing, even if they certainly reuse much information from prokaryotes. Even in terms of new proteins and new superfamilies the transition is amazing. Your protein is a good example. It is also a good example of how neo-darwinists would reason, in their habit of "flattening" biological reality. They would simply be happy that there are homologues of it in prokaryotes, and firmly believe that this explains everything! :) This is the only thing they are interested in: finding distant homologues, and deciding that there is nothing else to investigate. Why should they acknowledge that more than 2000 bits of original functional sequence information are necessary in eukaryotes to engineer the protein? Why should they ask questions about that? They have their 300 bits homologue in archaea, after all: RV + NS could certainly optimize that initial treasure trove by a couple of thousand simple naturally selectable steps! And certainly there is no reason at all to verify if that can be true! :) OK, this is just what I imagine they would say. But I think that I am imagining very realistically! :) The only real urge of neo-darwinism is to flatten biological reality, and to deny as much as possible of the wonders in it, just in order to survive. A real neo-darwinist motive, if I ever saw one!
gpuccio
November 17, 2017, 01:18 AM PDT
GPuccio, thank you very much. Again! You are the best. I have to study your comments very carefully and I will. It seems to me that there is at least one huge information jump between eukaryotes and prokaryotes. However I am not sure which organisms should be chosen. It seems logical to choose the prokaryote with the highest score (185 bits), but which eukaryote is next in the 'evolutionary chain'? And why?

Origenes
November 16, 2017, 05:44 PM PDT
Origenes: The protein is extremely conserved in metazoa. The value in Callorhinchus milii is absolutely correct: 4061 bits; 91% identities; 96% positives.

The apparently strange result for Rhincodon typus is in a sense an error, and it allows me to make an important clarification. If you look at the section "Sequences producing significant alignments", the second entry is Rhincodon typus. On the right, you can see two different columns: Max score and Total score. The Max score is the number you report: 2277 bits. And the Total score is higher than the Callorhinchus milii score: 4548 bits.

Which is the right one? Neither. What happens here is an event rather common in Blast alignments (let's say that it happens in a minority of cases, but not too rarely). As Blast is a local alignment tool, sometimes it makes multiple alignments for the same pair of proteins. That's what happens here: the alignment between the human protein and the Rhincodon typus protein is in reality 4 different alignments. So, the Max score column gives you the best hit among the 4, while the Total score column gives you the sum of the 4 scores.

Now, the Max score is obviously not what we are looking for, because it is related to a part of the two molecules only. So, the 2277 bits are for the first alignment, which is between the following segments:

986 - 2136 of the human protein
908 - 2058 of the shark protein

The percent of identities is very high (92%, more or less as in Callorhinchus milii), because it is computed on the aligned sequence only (in this case, 1064/1151).

The second alignment is between the following segments:

1 - 906 and 1 - 906

This is a very compact alignment, there are no gaps, and it is responsible for 1663 further bits, and 92% identities, again.

The third and fourth alignments are as follows:

Third: 452 - 1286 and 1214 - 2046; 337 bits; 29% identities
Fourth: 1292 - 2123 and 452 - 1207; 270 bits; 26% identities

The Total score column sums the 4 results:

2277 + 1663 + 337 + 270 = 4547

which is the number given in the Total score (with a difference of 1 bit, probably due to approximations).

Now, this too is not a useful number, because, as you can see, the third and fourth alignments overlap with the first two. Indeed, the first two alignments cover rather well the whole two molecules, and without any overlapping. So, the correct value, in this case, would be:

2277 + 1663 = 3940

which is very similar to the value in Callorhinchus milii (4061). This is not surprising, because the values in these two sharks are usually rather similar (although there can be exceptions). I hope this clarifies the problem.

Now, my database tells me that this protein starts very "high" already in cnidaria (more than 1.5 baa, that means about 3400 bits).

We can also look at single celled eukaryotes. Try:

eukaryotes + Metazoa exclude + Plants exclude

The best hit is 2773 bits in Sphaeroforma arctica (a member of the Ichthyosporea clade). If you look at the Taxonomy report, you will see that values are rather high in all forms of eukaryotes:

2717 in fungi
2691 in choanoflagellates
2672 in chytrids
2441 in ascomycetes
2408 in basidiomycetes
2363 in cellular slime molds

What about prokaryotes? Let's first blast against Bacteria. Now, here we have another interesting situation which is not rare when we blast against all bacteria. We have a first hit, completely isolated, of 982 bits in Chlamydia trachomatis (an important human pathogen). All the other hits are far lower, with a maximum of 185 bits and a range, in the first 100 hits, of 122 - 185. The Total scores are a little bit higher (maximum 355 bits), but they are often overlapping. Most of these hits are identified as "DEAD/DEAH box helicase", which is a different protein, with helicase activity too.

In E. coli the best hit is (in my results) 89 bits, with a protein labeled only as "hypothetical protein". But there is a 74.3 bits result with a DEAD/DEAH box helicase.

Interestingly, in Archaea the homology is slightly higher, with a best hit of 311 bits. Again, most hits are identified as DEAD/DEAH box helicase.

So, the only really anomalous result here is the 982 bits result in Chlamydia trachomatis, which is definitely out of range with what happens in all other bacteria. The probable, almost certain, explanation is that this is simply a false result, an error in the data. The probable cause could be contamination with human material (C. trachomatis is an important human pathogen).

So, what conclusions can we draw from the above data?

1) snRNP is a eukaryotic protein.
2) In single celled eukaryotes it already shows some high homology with the human form (about 1.1 - 1.3 baa).
3) In metazoa, the homology is already higher than 1.6 baa in cnidaria, and very near to complete identity in the first vertebrates (about 1.9 baa).
4) In bacteria, the protein seems not to exist as such, but it has some detectable, low level homology with some bacterial proteins, especially DEAD/DEAH box helicase. This homology is in the order of 0.08 baa maximum. In Archaea, it is slightly higher (about 0.14 baa), always with DEAD/DEAH box helicase.

gpuccio
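The arithmetic in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the graphs: the `Hsp` class and the function names are hypothetical, the greedy rule (keep the highest scoring alignments that do not overlap on either protein) is one simple way to formalize "sum only the non-overlapping alignments", and baa is assumed to be the bitscore divided by the length of the human protein (2136 AAs), which agrees with the figures quoted in the comment.

```python
from dataclasses import dataclass

QUERY_LENGTH = 2136  # length of the human protein in AAs, as given in the thread


@dataclass(frozen=True)
class Hsp:
    """One local alignment (hypothetical container, not a BLAST API type)."""
    q_start: int  # start on the human (query) protein
    q_end: int    # end on the human (query) protein
    s_start: int  # start on the shark (subject) protein
    s_end: int    # end on the shark (subject) protein
    bits: float   # bitscore of this alignment


def overlaps(a: Hsp, b: Hsp) -> bool:
    """True if the two alignments share positions on either protein."""
    on_query = a.q_start <= b.q_end and b.q_start <= a.q_end
    on_subject = a.s_start <= b.s_end and b.s_start <= a.s_end
    return on_query or on_subject


def non_overlapping_bits(hsps):
    """Greedy sum (an assumption): best alignments first, skip any overlap."""
    kept = []
    for hsp in sorted(hsps, key=lambda h: h.bits, reverse=True):
        if not any(overlaps(hsp, k) for k in kept):
            kept.append(hsp)
    return sum(h.bits for h in kept)


def baa(bits, length=QUERY_LENGTH):
    """Bits per aminoacid site, assumed to be bitscore / query length."""
    return bits / length


# The four human vs Rhincodon typus alignments listed in the comment.
# For the third and fourth, the first range is assumed to be the human one.
rhincodon = [
    Hsp(986, 2136, 908, 2058, 2277),
    Hsp(1, 906, 1, 906, 1663),
    Hsp(452, 1286, 1214, 2046, 337),
    Hsp(1292, 2123, 452, 1207, 270),
]

total_score = sum(h.bits for h in rhincodon)   # what the Total score column adds up
corrected = non_overlapping_bits(rhincodon)    # first two alignments only
print(total_score, corrected, round(baa(corrected), 2))
```

With the four alignments above, the plain sum is 4547 bits and the non-overlapping sum is 3940 bits, about 1.84 baa, against about 1.9 baa for the 4061 bits of Callorhinchus milii.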
November 16, 2017, 04:25 PM PDT
GPuccio, I have been looking at snRNP.

Protein: ‘U5 small nuclear ribonucleoprotein 200 kDa helicase’; Gene: SNRNP200; Organism: Homo sapiens (Human); Status: Reviewed; Taxonomic identifier: 9606 [NCBI]

The length is considerable: 2136 AAs

Function: RNA helicase that plays an essential role in pre-mRNA splicing as component of the U5 snRNP and U4/U6-U5 tri-snRNP complexes. Involved in spliceosome assembly, activation and disassembly. Mediates changes in the dynamic network of RNA-RNA interactions in the spliceosome. Catalyzes the ATP-dependent unwinding of U4/U6 RNA duplices, an essential step in the assembly of a catalytically active spliceosome.

I don’t know what to make of the following blast data. It is highly probable that I have made some mistakes.

E. coli score: 87
Stegodyphus mimosarum (spider) score: 3564
Callorhinchus milii (elephant shark) score: 4061
Rhincodon typus score: 2277

Origenes
November 16, 2017, 02:21 PM PDT
Thank you, GP.

Truth Will Set You Free
November 15, 2017, 12:44 AM PDT