Uncommon Descent Serving The Intelligent Design Community

Bioinformatics tools used in my OPs: some basic information.


EugeneS made this simple request in the thread about Random Variation:

I also have a couple of very concrete and probably very simple questions regarding the bioinformatics algorithms and software you are using. Could you write a post on the bioinformatics basics, the metrics and a little more detail about how you produced those graphs, for the benefit of the general audience?

That’s a very reasonable request, and so I am trying here to address it. So, this OP is mainly intended as a reference, and not necessarily for discussion. However, I will be happy, of course, to answer any further requests for clarifications or details, or any criticism or debate.

My first clarification is that I work on protein sequences. And I use essentially two important tools, freely available on the web to all.

The first basic site is Uniprot.

The mission of the site is clearly stated on the home page:

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

I would say: how beautiful to work with a site which, in its own mission, incorporates the concept of functional information! And believe me, it’s not an ID site! 🙂

Uniprot is a database of proteins. Here is a screenshot of the search page.

 

 

Here I searched for “ATP synthase beta”, and I easily found the human form of the beta chain:

Now, while the “Entry name”, “ATPB_HUMAN”, is a brief identifier of the protein in Uniprot, the really important ID is in the column “Entry”: “P06576”. Indeed, this is the ID that can be used as the accession number in the BLAST software, which we will discuss later.

The “Reviewed” icon in the third column is important too, because in general it’s better to use only reviewed sequences.

By clicking on the ID in the “Entry” column, we can open the page dedicated to that protein.

 

 

Here, we can find a lot of important information, first of all the “Function” section, which sums up what is known (or not known) about the protein function.

Another important section is the “Family and Domains” section, which gives information about domains in the protein. In this case, it just states:

Belongs to the ATPase alpha/beta chains family.

Then, the “Sequence” section gives the reference sequence for the protein:

It is often useful to have the sequence in FASTA format, which is probably the most commonly used format for sequences. To do that, we can simply click on the FASTA button (above the Sequence section). This is the result:

This sequence is made of two parts: a comment line, which is a summary description of the sequence, and then the sequence itself. A sequence in this form can easily be pasted into BLAST, or other bioinformatics tools, either including the comment line or using the bare sequence alone.
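For readers who want to handle FASTA records programmatically, here is a minimal Python sketch of the two-part structure just described (the record below is a toy example, not a real Uniprot entry):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (header, sequence) pairs.

    The header is the comment line starting with '>' (returned without
    the '>'); the sequence is the concatenation of the following lines.
    """
    records = []
    header, chunks = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy record, purely illustrative:
fasta = """>example_protein toy record (not a real Uniprot entry)
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(fasta)
header, seq = records[0]
print(header)    # the comment line, without '>'
print(len(seq))  # 20 residues
```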

Now, let’s go to the second important site: BLAST (Basic Local Alignment Search Tool).  It’s a service of NCBI (National Center for Biotechnology Information).

We want to go to the Protein Blast page.

Now, let’s see how we can verify my repeated statement that the beta chain of ATP synthase is extremely conserved, from bacteria to humans. So, we paste the ID from Uniprot (P06576) into the field “Accession number”, and we select Escherichia coli in the field “Organism” (important: the organism name must be selected from the drop-down menu, and must include the taxid number). IOWs, we are blasting the human protein against all E. coli proteins. Here’s how it looks:

 

Now, we can click on the “BLAST” blue button (bottom left), and the query starts. It takes a little time. Here is the result:

 

In the upper part, we can see a line representing the 529 AAs which make up the protein sequence, and the recognized domains in the sequence (only one in this case).

The red lines are the hits (red, because each of them scores higher than 200 bits). When you see red lines, something is there.

Going down, we see a summary of the first 100 hits, in order of decreasing homology. We can see that the first hit is with a protein named “F0F1 ATP synthase subunit beta [Escherichia coli]“, and has a bitscore of 663 bits. However, there are more than 50 hits with a bitscore above 600 bits, all of them in E. coli (that was how our query was defined), and all of them with variants of the same protein. Such redundancy is common, especially with bacteria, and especially with E. coli, because there are a lot of sequences available, often of practically the same protein.

Now, if we click on the first hit, or just go down, we can find the corresponding alignment:

 

The “query” here is the human protein. You can see that the alignment involves AAs 59 – 523, corresponding to AAs 2 – 460 of the “Subject”, that is the E. coli protein.

The middle line represents the identities (amino acid letters) and positives (+). The title reminds us that the hit is with a protein of E. coli whose name is “F0F1 ATP synthase subunit beta”, which is 460 AAs long (rather shorter than the human protein). It also gives us an ID/accession number for the protein, which is a different ID from Uniprot’s IDs, but can be used just the same for BLAST queries.

The important components of the result are:

  1. Score: this is the bitscore, the number I use to measure functional information, provided that the homology is conserved for a long evolutionary time (usually, at least 200 – 400 million years). The bitscore is already adjusted for the properties of the scoring system.
  2. The Expect number: it is simply the number of such homologies that we would expect to find for unrelated sequences (IOWs random homologies) in a similar search. This is not exactly a p value, but, as stated in the BLAST reference page: when E < 0.01, P-values and E-value are nearly identical.
  3. Identities: just the number and percent of identical AAs in the two sequences, in the alignment. The percent is relative to the aligned part of the sequence, not to its total length.
  4. Positives: the number and percent of identical + positive AAs in the alignment. Here is a clear explanation of what “positives” are: “When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won’t compromise the translated protein. Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how similar two amino acid residues are to each other.” (From: binf.snipcademy.com) IOWs, we could consider “positives” as “half identities”.
  5. Gaps. This is the number of gaps used in the alignment. Our alignments are gapped, IOWs spaces are introduced to improve the alignment. The lower the number of gaps, the better the alignment. However, the bitscore already takes gaps into consideration, so we usually need not worry too much about them.
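The relation between the bitscore and the Expect number can be made concrete with a small sketch, using the standard BLAST formula E = m·n·2^(−S′), where m is the query length, n the total database length, and S′ the bitscore. The database size below is an illustrative placeholder, not the actual size of the NCBI database:

```python
import math

def expect_from_bitscore(bitscore, query_len, db_len):
    """Standard BLAST relation: E = m * n * 2**(-S').
    m = query length, n = total database length, S' = bitscore."""
    return query_len * db_len * 2.0 ** (-bitscore)

# A 663-bit hit for our 529 AA query, against a hypothetical
# database of 10^10 letters: the Expect value is astronomically small.
E = expect_from_bitscore(663, 529, 1e10)

# For a weak hit (here 60 bits, E well below 0.01), E and the P-value
# nearly coincide, since P = 1 - exp(-E) ~ E for small E,
# as the BLAST reference page states.
P = 1 - math.exp(-expect_from_bitscore(60, 529, 1e10))
print(E, P)
```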

A very useful tool is the “Taxonomy report” (at the top of the page), which shows the hits in the various groups of organisms.

While in our example we looked only at E. coli, usually our search will include a wider range of organisms. If no organism is specified, BLAST will look for homologies in the whole protein database.

It is often useful to make queries in several groups of organisms, if necessary using the “exclude” option. For example, if I am interested in the transition to vertebrates for the SATB2 protein (ID = Q9UPW6, a protein that I have discussed in a previous OP), I can make a search in the whole metazoa group, excluding only vertebrates, as follows:

 

As you can see, there is very low homology before vertebrates:

 

And this is the taxonomy report:

The best hit is 158 bits in a spider.

Then, to see the difference in the first vertebrates, we can run a query of the same human protein on cartilaginous fish. Here is the result:

 

As you can see, now the best hit is 1197 bits. Quite a difference from the 158-bit best hit in pre-vertebrates.

Well, that’s what I call an information jump!

Now, my further step has been to gather the results of similar BLAST queries made for all human proteins. It is practically impossible to do that online, so I downloaded the BLAST executables and databases. That can be done from the BLAST site, and allows one to make queries locally on one’s own computer. The use of the BLAST executables is a little more complex, because they are driven by command-line instructions, but it is not extremely difficult.
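As a sketch of what such a local query looks like, the following Python fragment assembles a blastp command line. The file names and database name are hypothetical placeholders; -query, -db, -out, -outfmt and -taxids are standard BLAST+ options (though -taxids requires a fairly recent BLAST+ version):

```python
import subprocess

def build_blastp_cmd(query_fasta, db, out_tsv, taxids=None):
    """Assemble a blastp invocation for a local BLAST+ installation.

    Paths and database names are hypothetical placeholders; the flags
    are standard NCBI BLAST+ options.
    """
    cmd = ["blastp",
           "-query", query_fasta,
           "-db", db,
           "-out", out_tsv,
           "-outfmt", "6"]      # tabular output, easy to import later
    if taxids:
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    return cmd

cmd = build_blastp_cmd("human_proteins.fasta", "nr", "hits.tsv",
                       taxids=[562])   # 562 = Escherichia coli taxid
# subprocess.run(cmd, check=True)     # uncomment on a machine with BLAST+ installed
print(" ".join(cmd))
```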

To perform my queries, I downloaded from Uniprot a list of all reviewed human proteins: at the time I did that, the total number was 20171. Today, it is 20239. The number varies slightly because the database is constantly modified.
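Such a list can also be fetched programmatically. A minimal sketch follows; the endpoint and query syntax are those of Uniprot’s current REST API (an assumption worth checking against the docs, and different from what the site exposed in 2017). No request is actually sent here, only the URL is built:

```python
from urllib.parse import urlencode

def uniprot_query_url(taxon_id=9606, fmt="list"):
    """Build a Uniprot REST URL listing all reviewed (Swiss-Prot)
    proteins of one organism (9606 = human). Endpoint and parameter
    names follow the current rest.uniprot.org API."""
    params = {
        "query": f"reviewed:true AND organism_id:{taxon_id}",
        "format": fmt,
    }
    return "https://rest.uniprot.org/uniprotkb/search?" + urlencode(params)

url = uniprot_query_url()
print(url)
# e.g. urllib.request.urlopen(url) would then return one accession per line
```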

So, using the local BLAST executables and the BLAST databases, I performed multiple queries of all the human proteome against different groups of organisms, as detailed in my OP here:

This kind of query takes some time, from a few hours to a few days.

I have then imported the results into Excel, generating a dataset where for each human protein (20171 rows) I have the protein ID, name and length, and the best hit for each group of organisms, including protein and organism name, protein length, bitscore value, expect value, number and percent of identities and positives, and gaps. IOWs, all the results that we have when we perform a single query on the website, limited to the best hit.
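The step from raw BLAST output to a one-row-per-protein dataset can be sketched as follows, assuming BLAST’s tabular output format (-outfmt 6), whose twelve default columns are listed in the code; the two hit lines are made-up examples:

```python
import csv

# Default columns of BLAST tabular output (-outfmt 6):
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(tsv_lines):
    """Keep, for every query protein, only the hit with the highest bitscore."""
    best = {}
    for row in csv.DictReader(tsv_lines, fieldnames=FIELDS, delimiter="\t"):
        row["bitscore"] = float(row["bitscore"])
        q = row["qseqid"]
        if q not in best or row["bitscore"] > best[q]["bitscore"]:
            best[q] = row
    return best

# Two hypothetical hits for the same query; only the 663-bit one survives.
lines = ["P06576\tWP_0001\t72.0\t465\t130\t2\t59\t523\t2\t460\t0.0\t663",
         "P06576\tWP_0002\t70.0\t465\t139\t2\t59\t523\t2\t460\t0.0\t640"]
hits = best_hits(lines)
print(hits["P06576"]["sseqid"])  # WP_0001
```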

In the Excel dataset I have then computed some derived variables:

  • the bitscore per AA site (baa: total bitscore / human protein length)
  • the information jump in bits for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (bitscore for cartilaginous fish – bitscore for non-vertebrate deuterostomia)
  • the information jump in bits per amino acid site for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (baa for cartilaginous fish – baa for non-vertebrate deuterostomia)
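These derived variables are simple arithmetic; here is a minimal sketch, using the SATB2 numbers quoted earlier in this OP (best pre-vertebrate hit 158 bits, best cartilaginous fish hit 1197 bits, 733 AAs):

```python
def baa(bitscore, protein_len):
    """Bitscore per amino acid site: total bitscore / human protein length."""
    return bitscore / protein_len

def information_jump(bits_cartilaginous, bits_prevertebrate, protein_len=None):
    """Information jump between cartilaginous fish and pre-vertebrates,
    in bits, or in bits per AA site when a protein length is given."""
    jump = bits_cartilaginous - bits_prevertebrate
    return jump / protein_len if protein_len else jump

# SATB2: 733 AAs, best pre-vertebrate hit 158 bits, cartilaginous fish 1197 bits.
print(information_jump(1197, 158))                 # 1039 bits
print(round(information_jump(1197, 158, 733), 2))  # 1.42 bits per site
```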

The Excel data are then imported into R. R is a wonderful open-source statistical software and programming language that is constantly developed and expanded by statisticians all over the world. It also allows one to create very good graphs, like the following:

This is a kind of graph for which I have written the code and which, using the above-mentioned dataset, can easily plot the evolutionary history, from cnidaria to mammals, of any human protein, or group of proteins, using their IDs. This graph uses the bits per amino acid values, and therefore sequences of different length can easily be compared. A reference line is always plotted, with the mean baa value in each group of organisms for all human proteins. That already allows one to visualize how the pre-vertebrate to vertebrate transition exhibits the greatest informational jump, in terms of human conserved information.

However, the plots for individual proteins are much more interesting, and they reveal huge differences in their individual histories. In the above graph, for example, I have plotted the histories of two completely different proteins:

  1. The green line is protein Cdc42, “a small GTPase of the Rho family, which regulates signaling pathways that control diverse cellular functions including cell morphology, cell migration, endocytosis and cell cycle progression” (Wikipedia). It is 191 AAs long, and, as can be seen in the graph, it is extremely conserved in all metazoa, presenting almost maximal homology with the human form already in Cnidaria.
  2. The brown line is our well known SATB2 (see above), a 733 AAs protein which is, among other things, “required for the initiation of the upper-layer neurons (UL1) specific genetic program and for the inactivation of deep-layer neurons (DL) and UL2 specific genes, probably by modulating BCL11B expression. Repressor of Ctip2 and regulatory determinant of corticocortical connections in the developing cerebral cortex.” (Uniprot) In the graph we can admire its really astounding information jump in vertebrates.

This kind of graph is very good for visualizing the behaviour of proteins and of groups of proteins. For example, the following graph is of proteins involved in neuronal adhesion:

It shows, as expected, a big information jump, mainly in cartilaginous fish.

The following, instead, is of odorant receptors:

Here, for example, the jump comes later, in bony fish, and it goes on in amphibians and reptiles, and up to mammals.

Well, I think that I have given at least a general idea of the main issues and procedures. If anyone has specific requests, I am ready to answer.

 

Comments
Looks like gpuccio managed to survive!
Mung
December 20, 2017, 06:54 AM PDT
After the beating gpuccio took in this thread I hope he is ok.
Mung
December 15, 2017, 06:57 PM PDT
Looks like gpuccio finally figured out that he doesn't know what he's talking about and decided to shut up. Maybe miracles can happen after all.
Mung
December 5, 2017, 01:12 PM PDT
Interesting discussion between EugeneS and gpuccio. Thanks.
Dionisio
November 27, 2017, 05:59 PM PDT
GPuccio: "On the concepts, we absolutely agree." Excellent!
EugeneS
November 27, 2017, 07:48 AM PDT
Mung: I can easily tell you what is the shortest protein in my human proteome database: Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4 (OST4, P0C6T2), 37 AAs. "Given that the random mutation mechanism is constantly producing and testing novel polypeptides, must not the cell be just chock full of them?" That's what I often wondered about! According to most neo-darwinian scenarios, that should be the case. But that's nowhere to be seen! :) "What is preventing the constant production of new useless amino acid sequences that the cell then needs to get rid of?" Not enough imagination and faith on the part of neo-darwinists?
gpuccio
November 27, 2017, 07:45 AM PDT
Here's a tough one for you, gpuccio: What is the shortest known protein? And something else to think about. Given that the random mutation mechanism is constantly producing and testing novel polypeptides, must not the cell be just chock full of them? What is preventing the constant production of new useless amino acid sequences that the cell then needs to get rid of?
Mung
November 27, 2017, 07:02 AM PDT
EugeneS: I agree with you. I only thought that the word "evolution", as it is commonly used, can have a broader meaning, and be applied also to design that re-uses some features, and also generates completely new features. Only a problem of how to use words. On the concepts, we absolutely agree.
gpuccio
November 27, 2017, 05:42 AM PDT
GPuccio, I beg to differ. I interpret the word 'evolution' as something that unfolds what already exists in the hidden form of potentiality. In this way, OS Windows does not evolve but, in the strictest possible sense, is being designed. Even if some aspects of OS Windows were engineered by designed variation (which I don't think is the case), it would be bona fide design, not evolution. 'Designed evolution' is not evolution in the above sense unless one subscribes only to the weak form of ID (fine-tuning at the start), which both you and I do not :) We both go a lot further. The first example where this analogy between technology and evolution goes over the top, that I know of, is Stanislaw Lem "Summa Technologiae". The error is exactly in failure to see that technology is completely dominated and is driven in every aspect by intelligent design. In contrast, evolution by definition merely unfolds what is already in there in potential and is therefore just a curious result of "frozen accidents" of interactions between initial/boundary conditions and the laws of nature. IMO, it is a category error.
EugeneS
November 27, 2017, 05:02 AM PDT
EugeneS: Well, designed evolution is a way of speaking that can be used, IMO. We can speak, for example, of the guided evolution of the Windows operating system. The design of objects can be said to evolve. But in the end, it's only a question of words. The only important thing with words is not so much which words we use, but rather that we are clear and explicit about what they mean. That said, I would add that there is more than one way to implement design in the biological world. You mention artificial selection, and that is certainly a powerful tool. But there is also the important possibility of designed variation. For example, transposon activity could be guided by the designer's consciousness to realize exactly those variations which can lead to the desired result. Of course, these two modalities, designed variation and artificial selection, are not exclusive, and they can well act together to implement the desired information in biological beings.
gpuccio
November 27, 2017, 04:41 AM PDT
Origines "unguided evolution is a lie!" I totally agree. However, I must add that, strictly speaking, guided evolution is non-existent. It is an oxymoron. As soon as there is intelligent guidance, evolution 'evaporates' as a concept... I must stress that I view it as something greater than just a matter of terminology. I strongly believe that in this context we should stop using the word "evolution" at all. What we deal with here is a completely different concept, i.e. artificial selection. The authors of the 'glorious' wikipedia are prone to the same error when they discuss the problems of the OOL and evolution. They quote results of "artificial evolution" as something that, in their opinion, supports Darwinist claims about evolution. It does not! It is the same as quoting the work of the Institute of Protein Design as a counter-argument against Intelligent Design.
EugeneS
November 27, 2017, 03:11 AM PDT
EugeneS: Thank you to you! :)
gpuccio
November 26, 2017, 08:08 AM PDT
GPuccio, Thank you very much for taking the time and pains to explain the basic reasoning and the mathematical details behind your work. I am sure it is widely appreciated by the readers :)
EugeneS
November 26, 2017, 07:46 AM PDT
EugeneS: The absolute definition of functional information requires a knowledge of both the search space (which is easy) and the target space (which, instead, is very difficult to achieve). That's why all practical ways to measure functional information, at least for long sequences, must rely on indirect ways to measure the target space, and the target space/search space ratio. For example, in my OP about English language, I have used an indirect method to have a higher threshold of the target space estimate, with good results, I suppose (nobody has found any real flaw in my procedure). The same is true for the protein target space. As we are not able to measure directly the target space for long sequences (it is practically impossible, unless we develop a perfect and detailed understanding of the biochemical nature of protein folding and function), we need an indirect approach. The idea of using conservation as a measure for functionality is not mine. It is indeed inherent in all the biological thought in the last decades. However, the first to use it in an ID context has been, as far as I know, Durston, in his famous paper that has certainly inspired all my further reasonings: "Measuring the functional sequence complexity of proteins" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2217542/ However, the method used by Durston is slightly different from my approach, but the basic principle is the same. The idea is simply that conservation through long evolutionary periods is proportional to functional specificity. Durston uses an alignment between many instances of the same protein, and gives a score based on the total information potential of 4.3 bits. That method is valid, but is rather complex to implement as a standard measure in general cases. I have simply used the already existing bitscore, which can be easily used for any protein and any set of sequences. 
Now, the original purpose of the bitscore is to allow an evaluation of homologies, to decide if they can be due to random effects. That's why the gross score (which is essentially computed by specific matrices of AA transformation) is adjusted to a normalized score. That's where the "halving" of the potential information takes place. The main reason is that the raw score is adjusted according to two constants, K and lambda, which are derived from the "extreme value distribution" of random variables:
Just as the sum of a large number of independent identically distributed (i.i.d) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution
To simplify the concept, as we are interested here in deciding if some observed homology can be due to chance, and we have multiple random variables (if we compare one sequence to many other sequences), a probability distribution is used that gives the probability of having a result as the best result among all variables. The tail of that probability distribution is used to assess E (the expect value). In the end, the idea is that the bitscore, and the derived E value, tell us how likely it is to have that level of homology, or more, by chance, in that context. That's the connection with functional information. If we assume that the observed homology is necessary to retain function, as it has been conserved for hundreds of millions of years by negative selection, then of course the probability of observing that level of homology by chance is very similar to the probability of some random state expressing that level of homology, and therefore of functional information. IOWs, it can be considered as an indirect measure of getting that level of functional information by chance in one attempt (one tested state). We could work directly with the E value, but unfortunately it is flattened to 0 when the probability becomes too small, so using the bitscore is more feasible. Of course, in my reasoning it's not the homology itself that is important, but the homology that is conserved through long evolutionary times. The longer the time, the more likely it is that the observed homology corresponds well to the real functional information. Of course, the only objection that neo-darwinists can make is that the level of function that we observe in the proteins is an optimized function, and that such an optimized wildtype function can be reached through a gradual ladder of simple transitions, each of them naturally selectable vs the previous one.
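A minimal sketch of that adjustment (the default lambda and K below are approximately the published Karlin-Altschul parameters for gapped BLOSUM62 searches; the exact values depend on the scoring matrix and gap costs):

```python
import math

def normalized_bitscore(raw_score, lam=0.267, K=0.041):
    """Normalized bitscore: S' = (lambda * S - ln K) / ln 2.
    lam and K are the Karlin-Altschul parameters of the scoring
    system; the defaults are roughly those of gapped BLOSUM62."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# With these parameters, a raw alignment score of 100
# corresponds to roughly 43 bits:
print(round(normalized_bitscore(100), 1))
```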
That's why I consider so important the challenge that I have made, some three threads, 6000+ visualizations and 700+ comments ago, and that nobody has even tried to answer. I paste it again here (last time was at post #36 in this thread):
Will anyone on the other side answer the following two simple questions? 1) Is there any conceptual reason why we should believe that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases? 2) Is there any evidence from facts that supports the hypothesis that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases?
gpuccio
November 25, 2017, 02:35 PM PDT
GP We have basically two things: 1. definition of functional information as per your earlier OP 2. Your statistical homology analysis relying on bitscore. Could you explain the reasons why the bitscore measure is an approximation of 1 in a bit more detail? What I can see is that 1 and 2 are kind of similar concepts. Correct me if I am wrong. The bitscore is just a measure of primary structure similarity between two proteins A and B. The problem is that without knowing the nitty-gritties of the bitscore matrix and how it is produced, it is hard to be definite. Thanks.
EugeneS
November 25, 2017, 11:29 AM PDT
gilthill: Thank you for raising this issue. I did not know Behe's paper, but I was familiar with the same argument as expressed by Cornelius Hunter in his blog. You ask: "So what do you think of Behe’s contention that the idea that conservation indicates functional constraints not only lacks experimental evidence but is also contradicted by them?" The answer is easy: I simply disagree. Completely. I have already said that in the past, about Cornelius Hunter's version of the argument (which is essentially similar to Behe's). Knowing that Behe has said similar things in the past does not change my mind. A few important reasons why I disagree: a) While we can certainly accept some variance in the molecular clock, like in all biological phenomena, we have a rather consistent amount of data coming from synonymous mutations in protein coding genes, at least for time windows of a few hundred million years (after that, as I have explained, saturation of variation is achieved). You can look at this article: https://www.ncbi.nlm.nih.gov/books/NBK21946/ and in particular Fig. 26-17, where you can see that, with some variance of course, about 0.7 substitutions per synonymous site are fixed in 100 million years. I don't think that this simple fact can be denied. If you look at the alignment of the nucleotides in highly homologous proteins, but distant in evolutionary time (for example, the same conserved protein in humans and cartilaginous fish), you see immediately that most of the genetic variation is in the third nucleotide, and it is synonymous. You don't find at all that amount of synonymous variation if you compare the same protein in humans and chimps. Why? Because the time separation is too short. Neutral variation happens. It cannot be denied. b) If neutral variation happens, and we can see it everywhere, why does it happen so much less at non-synonymous sites, and so differently in different proteins? Again, look at the cited article, but this time Fig. 26-18.
You can see that hemoglobin, at 400 million years, shows about 70% variation, but Cytochrome C only 20%. However, both show less variation than synonymous sites, which in 400 million years are beyond any detectable homology. Why is that? Of course it is because of functional constraints, and in particular negative selection, which tend to preserve functional sites. And functional specificity is different in different proteins, hemoglobin being for example a simple globular protein, where the function-structure relationship is more flexible. IOWs, as neutral variation happens at synonymous sites, why does it happen less at non-synonymous sites? I can see no other explanation than the one that everybody accepts. c) That said, why do experiments in AA substitution apparently show a much greater tolerance to substitutions than apparently expected, in highly conserved proteins like histones? The answer, again, is rather simple. There are two reasons, strictly connected: 1) Those experiments are made with single substitutions (if I remember well). Now, while a single substitution can be apparently tolerated, it generates important changes in how other residues behave (epistasis). That can, for example, make the protein much less tolerant to future changes, IOWs much less robust. 2) The same experiments measure fitness in a very gross way, usually as immediate survival in the lab (if I remember well). Evolutionary history measures fitness in the wild, over long times of observation. In that context, many deleterious effects of a mutation can be relevant, while we could never see them in a short observation in the lab, evaluating only gross immediate survival. That's also the reason why a few polymorphisms are sometimes apparently tolerated, in a few individuals, in a population, even in conserved sites. But, if they are really slightly deleterious, they will probably never be fixed.
Moreover, sometimes one single slightly deleterious mutation can be "stabilized" by another appropriate mutation (epistasis, again), but that happens rarely, and has a much lower probability (two, or more, coordinated mutations). IOWs, conservation is a much more sensitive evaluation of function than gross human experiments in single substitutions. So, to sum up: I fully disagree with Behe and Hunter on this point: conservation through long evolutionary periods is due to function, and can be explained only by function.
gpuccio
November 25, 2017, 10:36 AM PDT
gpuccio @90 you said "In that sense, the longer the evolutionary time that the protein has been exposed to variation, the more we can assume that conservation is a good measure of functional information. Of course, that is absolutely sure for very long evolutionary conservation, for example what we see in the alpha and beta chains of ATP synthase, which can flaunt billions of years of conservation." But in 1990, Behe wrote a commentary in TIBS entitled « Histone deletion mutants challenge the molecular clock hypothesis » in which he casts doubt on the idea that conservation indicates functional constraints (https://www.ncbi.nlm.nih.gov/pubmed/2251727). Below is a piece I found here http://www.arn.org/docs/reviews/rev009.htm that summarizes Behe’s commentary: "Early in the development of the molecular clock hypothesis, it was discovered that not all proteins "ticked" at the same rate. When compared across a range of species, the fibrinopeptides, for instance, were much "faster clocks" (i.e., having a higher rate of amino acid substitution) than the very conservative, "slowly ticking" histones. These differences, writes Michael Behe (Chemistry, Lehigh University), required a modification to the clock hypothesis: the postulate of functional constraints. Thus, for example, histone H4 would diverge less rapidly than fibrinopeptides if a larger percentage of H4 amino acid residues were critical for the function of the molecule. (p. 374) The problem with the notion of functional constraint, Behe argues, is an absence of experimental support: Although plausible, it has long been realized that no direct experimental evidence has been obtained 'showing rigorously that histone function is especially sensitive to amino acid substitution or that fibrinopeptide function is especially insensitive to amino acid substitution.' (p. 374) "Recent experiments," writes Behe, "now indicate that the key assumption of functional constraints may not be valid."
Since the histones are so highly conserved -- "the H4 sequence of the green pea differs from that of mammals by only two conservative substitutions in 102 residues" -- one might expect that "few, if any, substitutions could be tolerated in the H4 sequence" (p. 374). However, experiments (reported in detail by Behe) have shown that large parts of the histone molecule may be deleted without significantly affecting the viability of the organism (in this instance, yeast) -- results which, Behe argues, should trouble defenders of the molecular clock hypothesis: [The experimental] results pose a profound dilemma for the molecular clock hypothesis: although the theory needs the postulate of functional constraints to explain the different degrees of divergence in different protein classes, how can one speak of 'functional constraints' in histones when large portions of H2A, H2B and H4 are dispensable for yeast viability? And if functional constraints do not govern the accumulation of mutations in histones, how can they be invoked with any confi-dence for other proteins? (p. 375) The resolution of the dilemma, Behe contends, must "as far as possible be grounded in quantitative, reproducible experiments, rather than in simple correlations with time that are its current basis" (p. 375). Otherwise, he concludes: [T]he time-sequence correlation may end up as a curiosity, like the tracking of stock market prices with hemline heights, where correlation does not imply a causal relationship." So what do you think of Behe’s contention that the idea that conservation indicates functional constraints not only lack experimental evidences but is also contradicted by them ? I realizes that he developed his argument a long time ago (1990) and that since that time, some new experimental results may have been produced that may weaken Behe’s contention on this issue. But is it the case ?gilthill
November 25, 2017 at 07:32 AM PDT
Mung: "Do protein superfamilies arise after the LUCA or are they already present in the LUCA? Does this make sense?"

Of course they do arise after LUCA! Only some of the current protein superfamilies were present in LUCA, whether 150 or 800. A great number of new superfamilies arise in the course of natural history, almost up to humans.

How do they arise? Neo-darwinists believe that they arise by RV + NS. Which, of course, is impossible. But guided descent is a perfectly feasible explanation. Guided descent simply means that:

a) All that can remain the same remains the same, or is just slightly tweaked to be adapted to the new design.

b) All that is needed as a novelty is engineered, using some already existing physical material (non coding sequences, or duplicated and inactivated genes, for example), provided by descent, and reshaping it completely according to a designed plan. IOWs, a lot of new functional information is added to build the new features, be they new proteins, or protein regulations, or networks of any type.

How is that accomplished? As well known, my favourite scenario is transposon driven re-shaping of existing stuff. But other types of implementation, of course, are possible.

For example, the N terminal domain in SATB1 is a new superfamily of its own, the SATB1_N superfamily. That superfamily does not exist before the appearance of SATB1 which, as we have seen, is a vertebrate protein. So, how did SATB1 and its specific domain arise in vertebrates? Of course, by engineering. But does that mean that vertebrates arose from scratch? Of course not. Vertebrates share a lot of protein sequences with pre-vertebrates, in particular with non vertebrate chordates, and more in general with all deuterostomes.

So, descent + reuse of what can be reused + engineering of all novelty is the only feasible scenario, according to what we empirically observe.

"Asking gpuccio about protein superfamilies is like asking gpuccio about Santa Claus. We have the same reason for believing in both."

What's your problem with Santa Claus? :)

"If not the fault is all yours, I'm sure."

Of course! :)gpuccio
November 25, 2017 at 12:49 AM PDT
Asking gpuccio about protein superfamilies is like asking gpuccio about Santa Claus. We have the same reason for believing in both.Mung
November 24, 2017 at 06:48 PM PDT
gpuccio, So let's assume 150 protein superfamilies already present in the LUCA. Does that point to 150 proteins present in the LUCA from which modern proteins descended, or does that point to more than 150 proteins from which modern proteins descended? IOW, were there already protein superfamilies present in the LUCA, or were there 150 proteins from which all extant proteins later evolved, after the LUCA? Do protein superfamilies arise after the LUCA or are they already present in the LUCA? Does this make sense? If not the fault is all yours, I'm sure. ;)Mung
November 24, 2017 at 06:00 PM PDT
Mung: There are different estimates of how many protein superfamilies were already present in LUCA. I have found numbers ranging from a minimum of about 150 to a maximum of almost 1000. However, almost everyone agrees that LUCA was already a very complex organism, or set of organisms. Of course, many things about LUCA are still very tentative. But it is a very interesting subject, which can be addressed with some empirical consistency. Love you too, of course! :)gpuccio
November 24, 2017 at 05:14 PM PDT
love you man, you know it!Mung
November 24, 2017 at 03:52 PM PDT
gpuccio, For every species we ought to be able to construct a phylogenetic tree showing the relationship of that species to one or more other species. Do we agree so far? If we go back far enough, the phylogenetic tree should be rooted in the LUCA. This is standard evolutionist thinking, and I don't think you disagree. Are we on the same page so far? How many protein families were present in the LUCA? A wild guess will do; I won't hold you to it. :)Mung
November 24, 2017 at 03:52 PM PDT
Mung: Here is my silly answer! :)

I am not sure that I understand your question well. Let's say that we consider the human species, or any other. Of course we can blast each human protein against the whole human proteome. It's very easy. For example, if we take the famous SATB1, Q01826, 763 AAs, and blast it against organism "homo sapiens", what we get is:

a) One hit with itself, 100% identity, 1577 bits

b) 4 hits with isoforms of SATB1, ranging from 1558 to 1085 bits

c) 5 hits with the sister protein SATB2 and its isoforms, ranging from 854 to 514 bits

d) A few more hits, 208 - 108 bits, with crystal structures of some domain of SATB1

Nothing else. IOWs, SATB1 is homologous only to itself and, at a lower level, to SATB2.

I am not sure what you mean when you say: "What I mean is, every protein existing in humans must have evolved from some other protein within the human lineage." No, proteins are often isolated islands at sequence, structure and function level. That's why we have 2000 protein superfamilies. Of course, that's not always the case. Many proteins are part of vast families, and so when you blast the protein you will find many homologies, of different levels, with other proteins which are members of the same family.

But, if you want to understand the evolutionary history of one protein, you have to look at older taxa, not at the same species. For example, for SATB1 and SATB2, you can see that both proteins practically start their existence in vertebrates, and that they remain mainly similar to themselves in all the successive evolutionary history. See my OP here: https://uncommondescent.com/intelligent-design/interesting-proteins-dna-binding-proteins-satb1-and-satb2/ in particular Fig. 1 and Fig. 6

You say: "So we ought to be able to create a tree of protein phylogeny within the human lineage (or any other) alone, without looking at any other taxon." I don't think that is possible. Phylogenies are created by following the same protein through evolution (orthologs). In the same species, you can find paralogs: genes which share some homology, but are different proteins. Or just different isoforms of the same protein. A proteome is made of different proteins, most of them unrelated one to the other. Of course, domains are often shared between a few, or even many, proteins. But remember, we have 2000 different protein superfamilies. And even in the same superfamily, many proteins can be apparently unrelated at sequence level, while sharing some structure similarity.

You say: "Doesn't the same common ancestry argument apply to both?" No. A new protein superfamily, or just a new protein, often appears at certain definite points of evolutionary history, like SATB1 and SATB2 in vertebrates. When it appears, it has no antecedents. It is a novelty. In many cases, as I have tried to show, new proteins, appearing for example in vertebrates, share some basic information, maybe one or two domains, with other, different proteins which existed before. But also in that case, the bulk of the information in the new protein is a complete novelty, a huge jump in functional information, functional sequence information that did not exist before.

The reason is simple: proteins are engineered, they don't simply descend from other proteins. As you know, I believe in common descent of proteins; indeed, all my reasonings are based on that assumption. But that means only that the new engineering, in some way, happens starting from something that already exists, be it some other protein or, as I believe, some non coding sequence. IOWs, new proteins are engineered, they don't simply "descend". But the proteins which remain the same, or are only changed a little, do descend from existing proteins, with or without some minor engineering. And they also bear the mark of neutral variation, especially in their synonymous sites, as unequivocal evidence of their descent.

(I hope I have not mucked up my answer too much! :) )gpuccio
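As a side note, the bit scores quoted in the comment above can be turned into a bits-per-aligned-AA ratio, the kind of normalized figure used in the OPs. A minimal sketch, using only the numbers quoted in the comment (the scores are transcribed, not fetched live from BLAST):

```python
# Bits per aligned AA for the SATB1 BLAST hits quoted above.
# The scores are transcribed from the comment, not fetched live.
satb1_length = 763  # AAs in Q01826 (human SATB1)

hits = {
    "SATB1 (self-hit)": 1577,   # bits
    "best SATB1 isoform": 1558,
    "best SATB2 hit": 854,
}

for name, bits in hits.items():
    print(f"{name}: {bits / satb1_length:.2f} bits per AA")
```

The self-hit works out to roughly 2.07 bits per AA, close to the ~2.2 ceiling for identity discussed elsewhere in the thread, while the SATB2 homology comes out around 1.1 bits per AA.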
November 24, 2017 at 02:59 PM PDT
hi gpuccio, Most of the discussion seems to be revolving around comparison of proteins across taxa. Are you aware of any protein database that focuses on protein homology within species?

What I mean is, every protein existing in humans must have evolved from some other protein within the human lineage. So we ought to be able to create a tree of protein phylogeny within the human lineage (or any other) alone, without looking at any other taxon.

I know our unguided evolutionist friends like to concentrate on relatedness of species, but what about relatedness of proteins within species? Doesn't the same common ancestry argument apply to both?

Try not to muck up your answer. You already look silly enough. Thanks. ;)Mung
November 24, 2017 at 09:55 AM PDT
EugeneS: Yes, the bitscore usually assigns a maximum of about 2.2 bits per AA for identity. The way the bitscore is computed is rather complex. It is detailed here: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

The main purpose of the bitscore is to detect homologues. In comparison to the whole informational potential of about 4 bits per AA, it very much underestimates identities. However, the bitscore is normalized with respect to the scoring system. My idea is that it represents a valid objective reference.

In the end, we can never know how precisely any score measures functional information, unless we have precise top-down methods of computing the functional information in proteins. Which, at present, we do not have. Sequence conservation is an indirect way of measuring functional information. And, IMO, a very good one. There is no doubt that sequence conservation, with the cautions that I have highlighted in the OP and in the discussion, measures functional information. There are very good correspondences with what we know of protein function: for example, the simple fact that many proteins with a great information jump in vertebrates are involved in the regulation of neuronal differentiation is in itself amazing.

At present we cannot safely assess whether the bitscore underestimates the absolute functional information (which is my opinion), or overestimates it, or simply gives a reliable estimate of it. To know that, we should have a direct measure of functional information and use it as a gold standard. That will come, in time. But we are still far away from such a result. In the meantime, the bitscore is the best tool we have. And it is a very powerful and useful tool.gpuccio
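The normalization mentioned above can be sketched numerically. This is only an illustration of the formula from the linked Altschul tutorial, S_bits = (λ·S_raw − ln K) / ln 2; the λ and K values below are commonly cited gapped BLOSUM62 parameters, assumed here for the example rather than taken from the comment:

```python
import math

# Altschul's normalized (bit) score, as described in the NCBI
# tutorial linked above: S_bits = (lambda * S_raw - ln K) / ln 2.
# LAMBDA and K are commonly cited gapped BLOSUM62 parameters;
# treat them as illustrative assumptions.
LAMBDA = 0.267
K = 0.041

def bitscore(raw_score):
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)

# A perfect self-match scores the BLOSUM62 diagonal at every
# position; the unweighted mean diagonal is about 5.8 raw units.
mean_diagonal = 5.8
length = 763  # e.g. SATB1

bits = bitscore(mean_diagonal * length)
print(f"{bits:.0f} bits total, {bits / length:.2f} bits per AA")
```

With these parameters an exact self-match comes out near 2.2 bits per identical AA rather than log2(20) ≈ 4.32, because the bit score is scaled by the scoring system's λ, not by the raw size of the amino acid alphabet.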
November 24, 2017 at 06:04 AM PDT
GP, Why is the max bits per AA about 2.2, not about log2(20) ≈ 4.3? I seem to recall you mentioning that the bitscore was an underestimate. Thanks.EugeneS
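For reference, the theoretical ceiling in this question comes from treating each position as one of 20 equiprobable amino acids; a quick check:

```python
import math

# Maximum information per position for a 20-letter alphabet
# with equiprobable residues:
print(math.log2(20))  # ≈ 4.32 bits
```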
November 24, 2017 at 05:14 AM PDT
@94 follow-up On the other hand, the source of complex functionally specified information has been empirically known for years.Dionisio
November 23, 2017 at 10:56 PM PDT
Error correction Sorry, I meant "...discover new evidences..."Dionisio
November 23, 2017 at 10:53 PM PDT
"a big information jump points to design, either if it rests on already existing homology in previous existing proteins, or if it is “from scratch”. There is no difference, from an ID point of view: it is simply the amount of functional complexity linked to the transition that counts." Can "the Neo-Darwinism of the gaps" claim that they don't know how those information jumps appeared, but eventually science will advance and researchers will discover NW evidences that will help them figure out how that happened? Or maybe the 'third way of evolution' folks hope to reach such a breakthrough moment?Dionisio
November 23, 2017 at 07:45 PM PDT