Intelligent Design

Bioinformatics tools used in my OPs: some basic information.


EugeneS made this simple request in the thread about Random Variation:

I also have a couple of very concrete and probably very simple questions regarding the bioinformatics algorithms and software you are using. Could you write a post on the bioinformatics basics, the metrics and a little more detail about how you produced those graphs, for the benefit of the general audience?

That’s a very reasonable request, and so I am trying here to address it. So, this OP is mainly intended as a reference, and not necessarily for discussion. However, I will be happy, of course, to answer any further requests for clarifications or details, or any criticism or debate.

My first clarification is that I work on protein sequences. And I use essentially two important tools, freely available on the web to all.

The first basic site is Uniprot.

The mission of the site is clearly stated in the home page:

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

I would say: how beautiful to work with a site which, in its own mission, incorporates the concept of functional information! And believe me, it’s not an ID site! 🙂

Uniprot is a database of proteins. Here is a screenshot of the search page.

 

 

Here I searched for “ATP synthase beta”, and I found easily the human form of the beta chain:

Now, while the “Entry name”, “ATPB_human”, is a brief identifier of the protein in Uniprot, the really important ID is in the column “Entry”:  “P06576”. Indeed, this is the ID that can be used as accession number in the BLAST software, that we will discuss later.

The “Reviewed” icon in the third column is important too, because in general it’s better to use only reviewed sequences.

By clicking on the ID in the “Entry” column, we can open the page dedicated to that protein.

 

 

Here, we can find a lot of important information, first of all the “Function” section, which sums up what is known (or not known) about the protein function.

Another important section is the “Family and Domains” section, which gives information about domains in the protein. In this case, it just states:

Belongs to the ATPase alpha/beta chains family.

Then, the “Sequence” section gives the reference sequence for the protein:

It is often useful to have the sequence in FASTA format, which is probably the most commonly used format for sequences. To do that, we can simply click on the FASTA button (above the Sequence section). This is the result:

This sequence is made of two parts: a comment line, which is a summary description of the sequence, and then the sequence itself. A sequence in this form can easily be pasted into BLAST, or other bioinformatics tools, either including the comment line, or just using the mere sequence.
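To make the two-part structure concrete, here is a minimal Python sketch that splits FASTA text into comment line and sequence, and pulls the accession out of a UniProt-style comment line. The function names are hypothetical, and the residues in the toy record are an invented placeholder, not the real sequence of P06576.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are simply concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_accession(header):
    """Return the accession from a header laid out as 'db|ACCESSION|ENTRY_NAME ...'."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else None


# Toy record: the header mimics the UniProt layout, but the residues are
# an invented placeholder, NOT the real ATPB_HUMAN sequence.
example = """>sp|P06576|ATPB_HUMAN placeholder description
MLGAAAAA
CCDDEE
"""

records = parse_fasta(example)
header, sequence = records[0]
print(uniprot_accession(header), len(sequence))
```

Either the whole record or just the `sequence` string could then be pasted into a tool that accepts FASTA input.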

Now, let’s go to the second important site: BLAST (Basic Local Alignment Search Tool).  It’s a service of NCBI (National Center for Biotechnology Information).

We want to go to the Protein Blast page.

Now, let’s see how we can verify my repeated statement that the beta chain of ATP synthase is extremely conserved, from bacteria to humans. So, we paste the ID from Uniprot (P06576) in the field “Accession number”, and we select Escherichia coli in the field “Organism” (important: the organism name must be selected from the drop-down menu, and must include the taxid number). IOWs, we are blasting the human protein against all E. coli proteins. Here’s how it looks:

 

Now, we can click on the “BLAST” blue button (bottom left), and the query starts. It takes a little time. Here is the result:

 

In the upper part, we can see a line representing the 529 AAs which make the protein sequence, and the recognized domains in the sequence (only one in this case).

The red lines are the hits (red, because each of them is higher than 200 bits). When you see red lines, something is there.
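For readers who want the colour coding spelled out, the following Python sketch maps a bitscore to the colour of the classic BLAST graphic key (below 40, 40 to 50, 50 to 80, 80 to 200, 200 and above). The function name is hypothetical, and the exact handling of boundary values is an assumption.

```python
def blast_color(bitscore):
    """Map a bitscore to the colour bands of the classic BLAST graphic key."""
    if bitscore < 40:
        return "black"
    if bitscore < 50:
        return "blue"
    if bitscore < 80:
        return "green"
    if bitscore < 200:
        return "magenta"
    return "red"


# Bitscores quoted in this post: the E. coli ATP synthase hit and the
# best pre-vertebrate hit for SATB2.
for score in (663, 158):
    print(score, blast_color(score))
```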

Going down, we see a summary of the first 100 hits, in order of decreasing homology. We can see that the first hit is with a protein named “F0F1 ATP synthase subunit beta [Escherichia coli]“, and has a bitscore of 663 bits. However, there are more than 50 hits with a bitscore above 600 bits, all of them in E. coli (that was how our query was defined), and all of them with variants of the same proteins. Such a redundancy is common, especially with bacteria, and especially with E. coli, because there are a lot of sequences available, often of practically the same protein.

Now, if we click on the first hit, or just go down, we can find the corresponding alignment:

 

The “query” here is the human protein. You can see that the alignment involves AAs 59 – 523, corresponding to AAs 2 – 460 of the “Subject”, that is the E. coli protein.

The middle line represents the identities (aminoacid letter) and positives (+). The title reminds us that the hit is with a protein of E. coli whose name is “F0F1 ATP synthase subunit beta”, which is 460 AAs long (rather shorter than the human protein). It also gives us an ID/accession number for the protein, which is a different ID from Uniprot’s IDs, but can be used just the same for BLAST queries.
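As an illustration of how the middle line is read, here is a small Python sketch that counts identities, positives and gaps from the three rows of an alignment. It assumes the usual BLAST text layout (a letter in the middle line for an identity, “+” for a conservative substitution, “-” in query or subject for a gap). The function name and the toy alignment are invented for the example and are not taken from the ATP synthase result.

```python
def summarize_alignment(query, midline, subject):
    """Count identities, positives and gaps in one aligned block."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("the three alignment rows must have equal length")
    length = len(query)
    # A letter in the middle line marks an identical residue pair.
    identities = sum(1 for ch in midline if ch.isalpha())
    # BLAST's "Positives" figure counts identities plus '+' positions.
    positives = identities + midline.count("+")
    gaps = query.count("-") + subject.count("-")
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "pct_identities": 100.0 * identities / length,
        "pct_positives": 100.0 * positives / length,
    }


# Toy alignment (invented): 9 columns, one gap in the query.
stats = summarize_alignment(
    query="MKTAY-RLE",
    midline="MK+AY +LE",
    subject="MKSAYGKLE",
)
print(stats)
```

Note that the percentages are relative to the length of the aligned block, not to the full length of either protein.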

The important components of the result are:

  1. Score: this is the bitscore, the number I use to measure functional information, provided that the homology is conserved for a long evolutionary time (usually, at least 200 – 400 million years). The bitscore is already adjusted for the properties of the scoring system.
  2. The Expect number: it is simply the number of such homologies that we would expect to find for unrelated sequences (IOWs random homologies) in a similar search. This is not exactly a p value, but, as stated in the BLAST reference page: when E < 0.01, P-values and E-value are nearly identical.
  3. Identities: just the number and percent of identical AAs in the two sequences, in the alignment. The percent is relative to the aligned part of the sequence, not to the total of its length.
  4. Positives: the number and percent of identical + positive AAs in the alignment. Here is a clear explanation of what “positives” are:

     Similarity (aka Positives)

     When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won’t compromise the translated protein.

     Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.

     (From: binf.snipcademy.com)

     IOWs, we could consider “positives” as “half identities”.
  5. Gaps. This is the number of gaps used in the alignments. Our alignments are gapped, IOWs spaces are introduced to improve the alignment. The lower the number of gaps, the better the alignment. However, the bitscore already takes gaps into consideration, so we can usually not worry too much about them.
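The relation between bitscore, Expect value and P-value can be sketched numerically. The NCBI BLAST statistics tutorial gives E = m·n·2^(−S′) for a bitscore S′ and a search space m·n, and P = 1 − e^(−E). The Python below is only an illustration of those two formulas: the function names are hypothetical, and the real program uses effective (corrected) lengths, so the numbers it reports will not match a naive calculation.

```python
import math


def expected_hits(bitscore, query_len, db_len):
    """E = m * n * 2**(-S') for bitscore S' and search space m * n."""
    return query_len * db_len * 2.0 ** (-bitscore)


def p_value_from_e(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit this good."""
    return -math.expm1(-e_value)


# For small E the two numbers nearly coincide; for large E they diverge.
for e in (5.0, 0.5, 0.01, 0.007):
    print(f"E = {e:<6} P = {p_value_from_e(e):.5f}")
```

This shows why, below about E = 0.01, the Expect value can be read almost as if it were a P-value.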

A very useful tool is the “Taxonomy report” (at the top of the page), which shows the hits in the various groups of organisms.

While in our example we looked only at E. coli, usually our search will include a wider range of organisms. If no organism is specified, BLAST will look for homologies in the whole protein database.

It is often useful to make queries in more groups of organisms, if necessary using the “exclude” option. For example, if I am interested in the transition to vertebrates for the SATB2 protein (ID = Q9UPW6, a protein that I have discussed in a previous OP), I can make a search in the whole metazoa group, excluding only vertebrates, as follows:

 

As you can see, there is very low homology before vertebrates:

 

And this is the taxonomy report:

The best hit is 158 bits in a spider.

Then, to see the difference in the first vertebrates, we can run a query of the same human protein on cartilaginous fish. Here is the result:

 

As you can see, now the best hit is 1197 bits. Quite a difference from the 158 bits best hit in pre-vertebrates.

Well, that’s what I call an information jump!

Now, my further step has been to gather the results of similar BLAST queries made for all human proteins. It is practically impossible to do that online, so I downloaded the BLAST executables and databases. That can be done from the BLAST site, and allows one to make queries locally on one’s own computer. The use of the BLAST executables is a little more complex, because it is done through command-line instructions, but it is not extremely difficult.

To perform my queries, I downloaded from Uniprot a list of all reviewed human proteins: at the time I did that, the total number was 20171. Today, it is 20239. The number varies slightly because the database is constantly modified.

So, using the local BLAST executables and the BLAST databases, I performed multiple queries of all the human proteome against different groups of organisms, as detailed in my OP here:

The amazing level of engineering in the transition to the vertebrate proteome: a global analysis

This kind of query takes some time, from a few hours to a few days.
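The exact commands used for these runs are not given in this post, so what follows is only a hedged Python sketch of how such a local run could be organised with the BLAST+ `blastp` program and its tabular output (`-outfmt 6`). The file names, function names and toy rows are hypothetical. Restricting the search to one group of organisms (for example with a separate database per group) is left out.

```python
# Columns requested from blastp; all are standard -outfmt 6 specifiers.
FIELDS = ["qseqid", "sseqid", "bitscore", "evalue",
          "nident", "positive", "gaps", "length"]


def blastp_command(query_fasta, database, out_path):
    """Build (but do not run) a blastp command line with tabular output."""
    return ["blastp",
            "-query", query_fasta,
            "-db", database,
            "-out", out_path,
            "-outfmt", "6 " + " ".join(FIELDS)]


def best_hits(tabular_text):
    """Keep, for each query ID, the row with the highest bitscore."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Hypothetical file and database names, for illustration only.
print(" ".join(blastp_command("human_reviewed.fasta", "group_db", "hits.tsv")))

# Toy output rows (invented IDs and values), tab-separated like -outfmt 6.
toy_output = "\n".join([
    "queryA\tsubject1\t50.1\t1e-05\t30\t45\t2\t100",
    "queryA\tsubject2\t120.0\t1e-30\t70\t85\t0\t110",
    "queryB\tsubject3\t33.5\t0.5\t15\t25\t4\t60",
])
best = best_hits(toy_output)
print(best["queryA"]["sseqid"], best["queryA"]["bitscore"])
```

The command list could be handed to `subprocess.run` on a machine where BLAST+ and a suitable database are installed. The result is one best-hit row per human protein, which is the shape of table described next.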

I have then imported the results into Excel, generating a dataset where for each human protein (20171 rows) I have the value of protein ID, name and length, and best hit for each group of organisms, including protein and organism name, protein length, bitscore value, expect value, number and percent of identities and positives, gaps. IOWs, all the results that we have when we perform a single query on the website, limited to the best hit.

In the Excel dataset I have then computed some derived variables:

  • the bitscore per AA site (baa: total bitscore / human protein length)
  • the information jump in bits for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (bitscore for cartilaginous fish – bitscore for non vertebrate deuterostomia)
  • the information jump in bits per aminoacid site for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (baa for cartilaginous fish – baa for non vertebrate deuterostomia)
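These derived variables are simple arithmetic, and can be sketched in Python as follows. The worked numbers are the SATB2 values quoted in this post (733 AAs, 1197 bits in cartilaginous fish, 158 bits as best pre-vertebrate hit). Using the 158-bit hit as the pre-vertebrate reference is an assumption made for illustration, since that hit came from a search in all non-vertebrate metazoa. The function names are hypothetical.

```python
def baa(bitscore, human_length):
    """Bitscore per aminoacid site: total bitscore / human protein length."""
    return bitscore / human_length


def information_jump(bits_later, bits_earlier, human_length):
    """Return the jump in bits and in bits per aminoacid site (baa)."""
    jump_bits = bits_later - bits_earlier
    jump_baa = baa(bits_later, human_length) - baa(bits_earlier, human_length)
    return jump_bits, jump_baa


# SATB2 (Q9UPW6): values quoted in the post.
length = 733
jump_bits, jump_baa = information_jump(1197, 158, length)
print(f"baa in cartilaginous fish: {baa(1197, length):.2f}")
print(f"jump: {jump_bits} bits, {jump_baa:.2f} baa")
```

Dividing by the human protein length is what makes proteins of very different sizes comparable on the same scale.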

The Excel data are then imported into R. R is a wonderful open source statistical software and programming language that is constantly developed and expanded by statisticians all over the world. It also allows one to create very good graphs, like the following:

This is a kind of graph for which I have written the code and which, using the above mentioned dataset, can easily plot the evolutionary history, from cnidaria to mammals, of any human protein, or group of proteins, using their IDs. This graph uses the bit per aminoacid values, and therefore sequences of different length can easily be compared. A reference line is always plotted, with the mean baa value in each group of organisms for all human proteins. That already allows one to visualize how the pre-vertebrate-vertebrate transition exhibits the greatest informational jump, in terms of human conserved information.

However, the plots regarding individual proteins are much more interesting, and they reveal huge differences in their individual histories. In the above graph, for example, I have plotted the histories of two completely different proteins:

  1. The green line is protein Cdc42, “a small GTPase of the Rho family, which regulates signaling pathways that control diverse cellular functions including cell morphology, cell migration, endocytosis and cell cycle progression” (Wikipedia). It is 191 AAs long, and, as can be seen in the graph, it is extremely conserved in all metazoa, presenting almost maximal homology with the human form already in Cnidaria.
  2. The brown line is our well known SATB2 (see above), a 733 AAs protein which is, among other things, “required for the initiation of the upper-layer neurons (UL1) specific genetic program and for the inactivation of deep-layer neurons (DL) and UL2 specific genes, probably by modulating BCL11B expression. Repressor of Ctip2 and regulatory determinant of corticocortical connections in the developing cerebral cortex.” (Uniprot) In the graph we can admire its really astounding information jump in vertebrates.

This kind of graph is very good to visualize the behaviour of proteins and of groups of proteins. For example, the following graph is of proteins involved in neuronal adhesion:

It shows, as expected, a big information jump, mainly in cartilaginous fish.

The following, instead, is of odorant receptors:

Here, for example, the jump is later, in bony fish, and it goes on in amphibians and reptiles, and up to mammals.

Well, I think that I have given at least a general idea of the main issues and procedures. If anyone has specific requests, I am ready to answer.

 

123 Replies to “Bioinformatics tools used in my OPs: some basic information.”

  1. 1
    Dionisio says:

    It is very nice of gpuccio to present this detailed technical information, which shows the thorough work he has done on this important research topic.

  2. 2

    Thank you GP for doing this!

    (…and thanks ES for asking)

  3. 3
    gpuccio says:

    Dionisio, UB:

    Thank you for being interested! 🙂

  4. 4
  5. 5
    Origenes says:

    GPuccio,

    I have been looking at snRNP. Protein: ‘U5 small nuclear ribonucleoprotein 200 kDa helicase’; Gene: SNRNP200; Organism: Homo sapiens (Human); Status: Reviewed; Taxonomic identifier: 9606 [NCBI]

    The length is considerable: 2136 AAs

    Function: RNA helicase that plays an essential role in pre-mRNA splicing as component of the U5 snRNP and U4/U6-U5 tri-snRNP complexes. Involved in spliceosome assembly, activation and disassembly. Mediates changes in the dynamic network of RNA-RNA interactions in the spliceosome. Catalyzes the ATP-dependent unwinding of U4/U6 RNA duplices, an essential step in the assembly of a catalytically active spliceosome.

    I don’t know what to make of the following blast data. It is highly probable that I have made some mistakes.

    E-coli score: 87
    Stegodyphus mimosarum (spider): 3564
    Callorhinchus milii (elephant shark): 4061
    Rhincodon typus score: 2277

  6. 6
    gpuccio says:

    Origenes:

    The protein is extremely conserved in metazoa.

    The value in Callorhinchus milii is absolutely correct.

    4061 bits; 91% identities; 96% positives

    The apparently strange result for Rhincodon typus is in a sense an error, and it allows me to make an important clarification.

    If you look at the section:

    “Sequences producing significant alignments”, the second one is the Rhincodon typus.

    On the right, you can see two different columns: Max score and Total score.

    The Max score is the number you report:

    2277 bits

    And the Total score is higher than the Callorhinchus milii score:

    4548 bits

    What is the right one?

    Neither. What happens here is an event rather common in Blast alignments (let’s say that it happens in a minority of cases, but not too rarely).

    As Blast is a local alignment tool, sometimes it makes multiple alignments for the same couple of proteins.

    That’s what happens here: the alignment between the human protein and the Rhincodon typus protein is in reality 4 different alignments.

    So, the Max score column gives you the best hit among the 4, while the Total score column gives you the sum of the 4 scores.

    Now, the Max score is obviously not what we are looking for, because it is related to a part of the two molecules only.

    So, the 2277 bits are for the first alignment, which is between the following segments:

    986 – 2136 of the human protein

    908 – 2058 of the shark protein

    The percent of identities is very high (92%, more or less as in Callorhinchus milii), because it is computed on the aligned sequence only (in this case, 1064/1151).

    The second alignment is between the following segments:

    1 – 906 and
    1 – 906

    This is a very compact alignment, there are no gaps, and it is responsible for 1663 further bits, and 92% identities, again.

    The third and fourth alignments are as follows:

    Third:

    452 – 1286
    1214 – 2046

    337 bits; 29% identities

    Fourth:

    1292 – 2123
    452 – 1207

    270 bits; 26% identities

    The column Total score sums the 4 results:

    2277 + 1663 + 337 + 270 = 4547

    which is the number given in the Total score (with a difference of 1 bit, probably due to approximations)

    Now, this too is not a useful number, because, as you can see, the third and fourth alignments overlap with the first two.

    Indeed, the first two alignments cover rather well the whole two molecules, and without any overlapping.

    So, the correct value, in this case, would be:

    2277 + 1663 = 3940

    which is very similar to the value in Callorhinchus milii (4061)

    This is not surprising, because the values in these two sharks are usually rather similar (although there can be exceptions).

    I hope this clarifies the problem.
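The manual reasoning above (keep the alignments that cover the two molecules without overlapping, discard the rest) can be approximated by a simple greedy rule. The Python sketch below is one possible heuristic and not what BLAST itself does: it takes the alignments in order of decreasing bitscore and keeps each one only if it overlaps no already accepted alignment, on either the human or the shark sequence. The function names are hypothetical. The four alignments are those listed in this comment.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def combine_alignments(alignments):
    """Greedily sum the bitscores of mutually non-overlapping alignments."""
    accepted = []
    for aln in sorted(alignments, key=lambda a: a["bits"], reverse=True):
        clash = any(
            overlaps(aln["query"], kept["query"])
            or overlaps(aln["subject"], kept["subject"])
            for kept in accepted
        )
        if not clash:
            accepted.append(aln)
    return sum(a["bits"] for a in accepted), accepted


# Human SNRNP200 vs Rhincodon typus, as reported above.
alignments = [
    {"query": (986, 2136), "subject": (908, 2058), "bits": 2277},
    {"query": (1, 906), "subject": (1, 906), "bits": 1663},
    {"query": (452, 1286), "subject": (1214, 2046), "bits": 337},
    {"query": (1292, 2123), "subject": (452, 1207), "bits": 270},
]

total, kept = combine_alignments(alignments)
print(total, [a["bits"] for a in kept])
```

A greedy pass like this is not guaranteed to find the best possible combination in every case, so results of this kind still deserve a manual check.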

    Now, my database tells me that this protein starts very “high” already in cnidaria (more than 1.5 baa, that means about 3400 bits).

    We can also look at single celled eukaryotes. Try:

    eukaryotes +
    Metazoa exclude +
    Plants exclude

    The best hit is 2773 in Sphaeroforma arctica (a member of the Ichthyosporea clade).

    If you look at the Taxonomy report, you will see that values are rather high in all forms of eukaryotes:

    2717 in fungi

    2691 in choanoflagellates

    2672 in chytrids

    2441 in ascomycetes

    2408 in basidiomycetes

    2363 in cellular slime molds

    What about prokaryotes?

    Let’s first blast against Bacteria.

    Now, here we have another interesting situation which is not rare when we blast against all bacteria.

    We have a first hit, completely isolated, of 982 bits in Chlamydia trachomatis (an important human pathogen).

    All the other hits are much lower, with a maximum of 185 bits and a range, in the first 100 hits, of 122 – 185. The Total scores are a little bit higher (maximum 355 bits), but they are often overlapping.

    Most of these hits are identified as “DEAD/DEAH box helicase” which is a different protein, with helicase activity too.

    In E. coli the best hit is (in my results) 89 bits, with a protein labeled only as “hypothetical protein”. But there is a 74.3 bits result with a DEAD/DEAH box helicase.

    Interestingly, in Archaea the homology is slightly higher, with a best hit of 311 bits. Again, most hits are identified as DEAD/DEAH box helicase.

    So, the only really anomalous result here is the 982 bits result in Chlamydia trachomatis, which is definitely out of range with what happens in all other bacteria.

    The probable, almost certain, explanation is that this is simply a false result, an error in the data. The probable cause could be contamination with human material (C. trachomatis is an important human pathogen).

    So, what conclusions can we draw from the above data?

    1) snRNP is an eukaryotic protein.

    2) In single celled eukaryotes it already shows some high homology with the human form (about 1.1 – 1.3 baa).

    3) In metazoa, the homology is already higher than 1.6 baa in cnidaria, and very near to complete identity in the first vertebrates (about 1.9 baa)

    4) In bacteria, the protein seems not to exist as such, but it has some detectable, low level homology with some bacterial proteins, especially DEAD/DEAH box helicase. This homology is in the order of 0.08 baa maximum. In Archaea, it is slightly higher (about 0.14 baa), always with DEAD/DEAH box helicase.

  7. 7
    Origenes says:

    GPuccio,

    thank you very much. Again! You are the best. I have to study your comments very carefully and I will.

    It seems to me that there is at least one huge information jump between eukaryotes and prokaryotes. However I am not sure which organisms should be chosen. It seems logical to choose the prokaryote with the highest score (185 bits), but which eukaryote is next in the ‘evolutionary chain’? And why?

  8. 8
    gpuccio says:

    Origenes:

    You are perfectly right! The protein is essentially engineered in eukaryotes, even if important rewirings are certainly achieved at the usual stages, after that (Metazoa, vertebrates).

    And it is absolutely true that the transition from prokaryotes to eukaryotes, although fascinating and probably the one with the greatest information jump after OOL (or maybe even more than OOL), is extremely difficult to analyze.

    The main reason is that we know too little and understand even less.

    We don’t have any idea of when it happened.

    We don’t have any idea of how it happened (indeed, we have too many contrasting and unsatisfying ideas!). And I am not referring here to the big problem (design or non design), but just to the basics of the evolutionary history!

    Indeed, there is no real consensus, even gross consensus, about the evolutionary chain of eukaryotes, even less about their immediate precursors in prokaryotes: the various debates about bacteria, archaea, and their different possible symbioses are good evidence for that.

    That’s why I stick to the vertebrate transition, for the moment, even if I have been tempted many times to analyze the eukaryote transition!

    The vertebrate transition has the remarkable advantage of being well localized in natural history, and vertebrates certainly have a much more understood evolutionary chain.

    However, the engineering of eukaryotes is certainly a major and amazing event: eukaryotes are really another thing, even if they certainly reuse much information from prokaryotes. Even in terms of new proteins and new superfamilies the transition is amazing.

    Your protein is a good example.

    It is also a good example of how neo-darwinists would reason, in their habit of “flattening” biological reality.

    They would simply be happy that there are homologues of it in prokaryotes, and firmly believe that this explains everything! 🙂

    This is the only thing they are interested in: finding distant homologues, and deciding that there is nothing else to investigate.

    Why should they acknowledge that more than 2000 bits of original functional sequence information are necessary in eukaryotes to engineer the protein? Why should they ask questions about that?

    They have their 300 bits homologue in archaea, after all: RV + NS could certainly optimize that initial treasure trove by a couple of thousand simple naturally selectable steps! And certainly there is no reason at all to verify if that can be true! 🙂

    OK, this is just what I imagine they would say. But I think that I am imagining very realistically! 🙂

    The only real urge of neo-darwinism is to flatten biological reality, and to deny as much as possible of the wonders in it, just in order to survive.

    A real neo-darwinist motive, if I ever saw one!

  9. 9
    gpuccio says:

    Origenes:

    An interesting consideration about your protein is its main function. From Uniprot:

    “RNA helicase that plays an essential role in pre-mRNA splicing as component of the U5 snRNP and U4/U6-U5 tri-snRNP complexes. Involved in spliceosome assembly, activation and disassembly. Mediates changes in the dynamic network of RNA-RNA interactions in the spliceosome. Catalyzes the ATP-dependent unwinding of U4/U6 RNA duplices, an essential step in the assembly of a catalytically active spliceosome.”

    And the spliceosome is “a large and complex molecular machine found primarily within the splicing speckles of the cell nucleus of eukaryotic cells” (Wikipedia)

    That easily explains, from a functional point of view, why the protein is specific of eukaryotes.

  10. 10
    gpuccio says:

    Origenes:

    When I say “more than 2000 bits” I am referring, tentatively, to the difference between the best hit in prokaryotes (311 bits in archaea) and the best hit in the group of single celled eukaryotes that shows the lowest homology (2363 in cellular slime molds). That’s probably the best we can do, in the absence of some clear understanding of eukaryotes’ evolutionary history.

  11. 11
    Dionisio says:

    gpuccio @8:

    […] the transition from prokaryotes to eukaryotes, although fascinating and probably the one with the greatest information jump after OOL (or maybe even more than OOL), is extremely difficult to analyze.

    The main reason is that we know too little and understand even less.

    Hmm…

    Then on that ‘ignorance’ background, doesn’t every new discovery shedding more light on the previously unknown, seem to reveal undeniable designed systems?

    Therefore, isn’t the known what points to intelligent design?

    In this case isn’t the amount of functional information that gpuccio’s OPs and follow-up comments show associated with certain protein families in some biological systems what points to intelligent design?

    Couldn’t all that point to unguided processes instead?

    Why not?

    How about ‘co-option’ for example? 🙂

  12. 12
    Dionisio says:

    “The only real urge of neo-darwinism is to flatten biological reality, and to deny as much as possible of the wonders in it, just in order to survive.”

    That seems like a valid motive, doesn’t it? 🙂

  13. 13
    Dionisio says:

    There isn’t much complexity in the functionality described @9, right?
    It all looks so simple and straightforward.
    You guys seem to get too excited about anything.
    🙂

  14. 14
  15. 15
    gpuccio says:

    Dionisio:

    You are my best adversary! 🙂 🙂

  16. 16
    Origenes says:

    Protein: Histone acetyltransferase p300; Gene: EP300; Organism: Homo sapiens (Human); Status: Reviewed; Length: 2414; Function: Functions as histone acetyltransferase and regulates transcription via chromatin remodeling.

    – – – – –
    Cartilaginous fish: Rhincodon typus (whale shark), score 2798
    Pre-vertebrates: Tribolium castaneum (red flour beetle), score: 1557
    E-coli “No significant similarity found”.
    – – – –
    The information-jump during the pre-vertebrate vertebrate transition is: 1241 bits.
    Yes! Easily beating GPuccio’s SATB2 protein (see OP), which means that I am a quick learner and that unguided evolution is a lie.

  17. 17
    Mung says:

    Hi gpuccio,

    Thank you for this OP.

    But every existent protein evolved from some other protein, unless it did not.

    So how does this database help us tell the difference?

    How do we tell which proteins evolved from some other protein and which proteins did not evolve from some other protein?

    Can this database answer those questions?

  18. 18
    gpuccio says:

    Origenes:

    “The information-jump during the pre-vertebrate vertebrate transition is: 1241 bits.
    Yes! Easily beating GPuccio’s SATB2 protein (see OP), which means that I am a quick learner and that unguided evolution is a lie.”

    Yes and yes! 🙂 🙂

  19. 19
    gpuccio says:

    Mung (and Origenes):

    “How do we tell which proteins evolved from some other protein and which proteins did not evolve from some other protein?

    Can this database answer those questions?”

    Yes, it can.

    There are proteins which have no detectable homology before a specific transition.

    In that case, we can assume that the protein arose apparently “from scratch” in that transition. Of course, it could still have been engineered from some existing protein, or from some non coding DNA, by full “rewriting”.

    Now, in the vertebrate transition we don’t expect to have many of those “completely de novo” proteins, because at that point most of the protein domains have already appeared. However, I have made a quick search in my database, and there are definitely some good examples.

    I will just mention the first I found:

    Activity-dependent neuroprotector homeobox protein (Q9H2P0)

    Seems to appear in cartilaginous fish, with a best hit of only 52.4 bits (expect 0.007) in bees, and even lower ones in some other insects, limited to a very short part of the sequence (about 100 AAs).

    The protein is 1102 AAs long, and the human – cartilaginous fish homology is 1267 bits (1204 + 62.8 in two non overlapping alignments).

    The function: “Potential transcription factor. May mediate some of the neuroprotective peptide VIP-associated effects involving normal growth and cancer proliferation.”

    Of course, neo-darwinists will certainly find potential precursors, if really motivated, maybe using extra-sensitive approaches.

    But the simple truth is that this protein seems to arise in vertebrates, as a real novelty.

    And, of course, there are others like it. Not many, in the vertebrate transition. And each case must be manually verified, to avoid false positives.

    Of course, we can find a lot more in the eukaryote transition.

    But I cannot do that directly from my database, because it is only about metazoa. I could not include full queries on all known bacteria and archaea and single celled eukaryotes, because that would have needed months of computations, with my resources, because of the huge number of sequences in those groups.

    Moreover, as pointed out to Origenes, there are many false data in those groups, and each case should be reviewed manually.

    However, Origenes has already pointed to one very promising molecule, SNRNP200.

    In doing so, he has stimulated my interest in the spliceosome, a specific eukaryotic construct, and an incredibly complex molecular machine!

    If you really like an extreme experience, I would suggest that you give a look to the main partner of SNRNP200 in the U5 construct of the spliceosome: PRP8 (Q6P2Q9), length 2335 AAs.

    Origenes, that’s homework for you! 🙂

  20. 20
    gpuccio says:

    Mung:

    However, an important concept, IMO, is that, from a strict ID point of view, it is not so important if a protein appears apparently from scratch, or if it shares part of its functional information with other pre-existing proteins: what really matters is the amount of new functional information added in the transition.

    IOWs, a protein which arises de novo, but has only, say, 300 bits of new functional information, is less amazing, from an engineering point of view, than a protein which shares 500 bits of functional information with other already existing proteins, but to which 1000 bits of new functional information are added in the transition.

    Rewiring of old proteins, or the generation of new proteins which share modules with other existing proteins, is as important as the generation of a completely new protein, from a design point of view. New functional information is new functional information, however it is distributed.

    After all, we know that apparently almost 50% of protein domain superfamilies were already present in LUCA!

  21. 21
    Dionisio says:

    gpuccio,

    Is the eukaryotic chaperonin TRiC (TCP-1 Ring Complex, also called CCT for chaperonin containing TCP-1) an interesting candidate for the kind of “functional information” analysis you have demonstrated here?

    BTW, does the eukaryotic chaperonin require 3D folding too?
    Does it use a chaperone?

    IOW, could this be another “chicken-egg” conundrum case?

    Just curious. Thanks.

  22. 22
    Dionisio says:

    gpuccio @15:

    “You are my best adversary!”

    Well, I was a little disappointed by the conspicuous absence of real opponents in this thread where you’ve publicly revealed the ‘secrets’ of how you analyze the proteins in order to write your OPs and follow-up commentaries.

    I recall a couple of years ago somebody encouraged a distinguished biochemistry professor to respond to a challenge I have posted –imitating professor Tour’s challenge– and thus teach me (and the rest of us) a lesson or two. We know the rest of the story.

    Perhaps looking back that distinguished biochemistry professor regrets having kneejerk reacted to that commenter who encouraged him to engage in that discussion which turned so embarrassing. Too late now.

    I thought maybe I could encourage some polite dissenters to jump into this discussion? But apparently I lack the persuasive skills that ‘encouraging’ commenter had back then?

    🙂

  23. 23
    Dionisio says:

    Origenes @16:

    No, what happens is that both gpuccio and you don’t understand evolution. 🙂

    You may want to learn Biology 101 first. 🙂

  24. 24
    Dionisio says:

    gpuccio @19:

    There are proteins which have no detectable homology before a specific transition.

    In that case, we can assume that the protein arose apparently “from scratch” in that transition. Of course, it could still have been engineered from some existing protein, or from some non coding DNA, by full “rewriting”.

    Or it could have been “co-opted” by RV+NS+HGT+T+… and so on… right?

    🙂

  25. 25
    gpuccio says:

    Dionisio:

    The complex you refer to is formed by two identical rings, one above the other, each made of eight different proteins, each of them about 550 AAs long.

    The eight monomers are related, but not identical: they share about 220-300 bits of homology and 33% identity.

    Usually, this complex is said to exist in archaea but not in bacteria, but I have found significant homologies in archaea, and in some bacteria too, especially clostridia, of the order of 150 – 400 bits and 30 – 40% identities.

    Of course, in single celled eukaryotes we find much higher homology, about 800 bits in fungi, 750 in cellular slime molds.

    The sequences are, of course, highly conserved in metazoa.

    I don’t know if these sequences require chaperones or chaperonins to fold. Many proteins do not need that.

  26. 26
    Origenes says:

    GPuccio @19

    GPuccio: If you really like an extreme experience, I would suggest that you give a look to the main partner of SNRNP200 in the U5 construct of the spliceosome: PRP8 (Q6P2Q9), length 2335 AAs.

    Uniprot informs us:

    Protein: Pre-mRNA-processing-splicing factor 8; Gene: PRPF8; Organism: Homo sapiens (Human); Status: Reviewed; Function: functions as a scaffold that mediates the ordered assembly of spliceosomal proteins and snRNAs. Required for the assembly of the U4/U6-U5 tri-snRNP complex. Functions as scaffold that positions spliceosomal U2, U5 and U6 snRNAs at splice sites on pre-mRNA substrates, so that splicing can occur. Interacts with both the 5′ and the 3′ splice site; Length: 2335.

    Okay, let’s blast this protein!

    First a general blast; not excluding anybody.

    The ‘taxonomy report’ tells us that the protein is extremely well-preserved in vertebrates. Primates, rodents, whales, bats, birds, snakes, crocodiles, frogs and bony fishes all score above 97% homology. Even Pundamilia nyererei, a colorful cichlid fish from Lake Victoria, scores 4780, which means 97% homology.

    So, let’s blast while excluding vertebrates.

    Here the taxonomy report tells us that the protein is also well preserved in ‘pre-vertebrates.’ Starfish, termites, lice, horseshoe crabs, bees, ants, beetles, mosquitos and flies all score in the range of 4441 to 4596, which means between 92% and 94% homology.

    Now what? Where does this huge protein come from?

    In E. coli, for example, it seems to be non-existent.

  27. 27
    Dionisio says:

    gpuccio @20:

    “After all, we know that apparently almost 50% of protein domain superfamilies were already present in LUCA!”

    Where is this LUCA positioned relative to bacteria, prokaryotes, etc.?
    Before, in between, after?
    Thanks.

    PS. Sorry for the dumb questions

  28. 28
    Dionisio says:

    gpuccio @25:

    Thanks for the explanation.

  29. 29
    Dionisio says:

    Origenes @26:

    You’re ahead of the class in this technical course taught by gpuccio.

    Well done!

    I’m still behind and can use your example too.

    Thanks.

  30. 30
    gpuccio says:

    Dionisio:

    “Or it could have been “co-opted” by RV+NS+HGT+T+… and so on… right?”

    Of course, how could I forget? 🙂

  31. 31
    gpuccio says:

    Origenes:

    Yes, it is extraordinarily conserved in all Metazoa.

    But also in single celled eukaryotes and plants (3900 – 4200 bits).

    But it is really, really absent in prokaryotes.

    We have the usual wrong hit in Chlamydia trachomatis (2958 bits!), and two extremely suspicious “hits” with two different “proteins” in E. coli (one 72 AAs long, the other 58 AAs long) which show complete and almost complete identity with two different parts of our protein. Definitely, some contamination here too. And absolutely nothing in all the rest of the bacterial world.

    Nothing at all in Archaea (3 “hits” of 37 bits, expect value 5 – 6.9).

    It’s exciting, isn’t it? If we are looking for a really new protein in eukaryotes, this is one of the best candidates I have ever seen! 🙂

  32. 32
    gpuccio says:

    Origenes:

    Let’s say it together:

    You are a quick learner and unguided evolution is a lie!

    (Dionisio, you can join us if you like, but please don’t shout too loud! 🙂 )

  33. 33
    Origenes says:

    GPuccio @31

    GPuccio: Yes, it is extraordinarily conserved in all Metazoa.

    But also in single celled eukaryotes and plants (3900 – 4200 bits).

    But it is really, really absent in prokaryotes. …

    Nothing at all in Archaea (3 “hits” of 37 bits, expect value 5 – 6.9).

    It’s exciting, isn’t it? If we are looking for a really new protein in eukaryotes, this is one of the best candidates I have ever seen! 🙂

    Let’s see. What is the evolutionary explanation of 3900 bits of functional information? Yes, indeed, “functional”, because it is conserved for many hundreds of million years.

    Dembski’s Universal Probability Bound informs us that the probabilistic resource of our whole universe, from the Big Bang to now, is 10^150, or roughly 2^500, i.e. about 500 bits. 3900 bits correspond to a search space of 2^3900.
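
    As a rough check of that arithmetic, here is a minimal Python sketch. It assumes, as this comment does, that a BLAST bitscore can be read directly as bits of functional information; the names `UPB_EVENTS` and `bits_beyond_bound` are made up for illustration.

```python
import math

# Dembski's Universal Probability Bound as quoted in the comment: 10^150 events.
UPB_EVENTS = 10 ** 150

# The same bound expressed in bits (base-2 logarithm of the number of events).
upb_bits = math.log2(UPB_EVENTS)


def bits_beyond_bound(bitscore, bound_bits=upb_bits):
    """Return how many bits a score lies above the bound (negative if below)."""
    return bitscore - bound_bits


# 3900 is the PRP8 figure discussed in the thread.
excess = bits_beyond_bound(3900)
print(f"bound = {upb_bits:.1f} bits, excess = {excess:.1f} bits")
```

    The bound comes out slightly below 500 bits, so “500 bits” is a convenient rounding rather than an exact conversion.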

    Just let that sink in for a moment … breathe calmly … and then you just know:

    unguided evolution is a lie!

  34. 34
    Origenes says:

    // follow up #33 //

    To add some more perspective to the 3900 bits of information in the PRP8 protein, which are in need of an explanation:

    From this recent article by GPuccio:

    … any sequence with 160 bits of functional information is, by far, beyond any reasonable probability of being the result of RV in the system of all bacteria in 4 billion years of natural history, even with the most optimistic assumptions.

    However, the reader may object that bacteria are not involved in finding this particular 3900 bits of information. And the reader would be correct, since the sequence is non-existent in bacteria.
    But, surely, this doesn’t help the hypothesis of unguided evolution.

  35. 35
  36. 36
    gpuccio says:

    Origenes:

    Well, they can always try to deconstruct the function into about 950 naturally selectable steps of 1 AA! 🙂

    Ah, I forgot that they have never taken my challenge…

    May I paste it again here? One never knows.

    Will anyone on the other side answer the following two simple questions?

    1) Is there any conceptual reason why we should believe that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases?

    2) Is there any evidence from facts that supports the hypothesis that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases?

    And let’s remember that the function that should be deconstructed, here, is the whole spliceosome, an amazingly huge molecular machine.

    Made of…

    Something like about 140 different proteins + 5 specific RNAs?

    And that we have already analyzed two of those proteins (OK, two of the biggest!):

    SNRNP200, which requires at least 2000 bits of new functional information in eukaryotes

    PRPF8, which requires at least 3900 bits of new functional information in eukaryotes

    and that those two sequences are part of the U5 component, which is part of the U6-U4-U5 component, which is part of the whole cycle of the spliceosome?

    Is anyone thinking of possible irreducible complexity here?

    Let’s see. This is from an article of 2002:

    “Comprehensive proteomic analysis of the human spliceosome”

    https://www.nature.com/articles/nature01031

    Using nanoscale microcapillary liquid chromatography tandem mass spectrometry, we identify 145 distinct spliceosomal proteins, making the spliceosome the most complex cellular machine so far characterized. Our spliceosomes comprise all previously known splicing factors and 58 newly identified components. The spliceosome contains at least 30 proteins with known or putative roles in gene expression steps other than splicing. This complexity may be required not only for splicing multi-intronic metazoan pre-messenger RNAs, but also for mediating the extensive coupling between splicing and other steps in gene expression.

    (emphasis mine)

    Irreducible complexity?

    But… wait: the paper says that:

    “The spliceosome contains at least 30 proteins with known or putative roles in gene expression steps other than splicing.”

    (emphasis mine)

    So, what’s the problem? We are in Ken Miller’s tie clip ideological space! No problem at all. Who cares if 115 (145 – 30) proteins with no “known or putative role” elsewhere have also been found in the machine? Neo-darwinism is safe! 🙂

    OK, maybe Ken will have to use something better than a tie clip, this time. What about cuff links? 🙂

  37. 37
    Dionisio says:

    Maybe the ‘third way’ folks are going to include gpuccio’s functional deconstruction challenge into their things to do list?

    🙂

  38. 38
    gpuccio says:

    Dionisio:

    A third way to deconstruction?

    Sounds great! 🙂

  39. 39
    Dionisio says:

    gpuccio,
    cuff links might do the trick 🙂

  40. 40
    gpuccio says:

    Origenes and Dionisio:

    By the way, there is a very interesting ID page about the spliceosome, here, by Jonathan M.:

    https://evolutionnews.org/2013/09/the_spliceosome_1/

    He has perfectly caught the importance of this unique molecular machine from an ID point of view. He also mentions the amazing PRP8 component! 🙂

  41. 41
    Mung says:

    I don’t know why anyone listens to gpuccio. The Noble Prize for fraudulent internet blogging is not an award to be respected.

    And anyone could write a database that makes it appear as if new proteins pop into existence out of nothing. That doesn’t prove anything.

    And all this talk about information jumps? Information is continuous, not discrete, it doesn’t make jumps. The jumps are only apparent jumps, they are not real jumps, they are just an artifact of our models.

    I could go on for days but I doubt that it would have any impact on the cult of gpuccio here. BLAST ON!

  42. 42
    gpuccio says:

    Mung:

    Now, who is my best adversary? You or Dionisio? 🙂

  43. 43
    Dionisio says:

    Mung is correct.
    gpuccio is making up all those values so they look favorable to his ideas which nobody else shares because they lack theoretical and empirical confirmation.
    🙂

  44. 44
    Mung says:

    What we have here is merely the appearance of information jumps. Nature does not make jumps. If any jumps are required then my theory would absolutely break down and I would give nothing for it.

  45. 45
    jstanley01 says:

    “Lie” is such an ugly word… 🙂

  46. 46
    Origenes says:

    GPuccio @40

    Thank you for that link.
    It sparked my interest in introns — non-coding sections of DNA by some considered to be junk-DNA.
    However there are “UCRs” — ultraconserved regions, which are regions over 200 bp in length with 100% identity across species.

    WIKI: “It is still not fully understood why the negative selective pressure on these regions is so much stronger than the selection in protein-coding regions.”

    One thing is for sure: conservation –> functional information. So, I thought, why not blast an intron sequence ?
    My choice is intron “ESRRG_Apollo” with a length of 757 nt. Then I found ccg.vital-it.ch/UCNEbase/ a website solely dedicated to UCRs.

    “UCNEbase provides information on the evolution and genomic organization of ultra-conserved non-coding elements (UCNEs) in multiple vertebrate species. It currently covers 4351 such elements in 18 different species.”

    They did the work for me.
    On this page we see conservation from mouse to frog (bit-score 419.4). Next we see the zebrafish Bit-score: 255.9. And that is it.

    I blasted the nucleotide sequence and excluded vertebrates, there was no score higher than 46. This might be interesting.

  47. 47
    Dionisio says:

    What’s all that excitement about big jumps?

    Big jumps have been recorded in nature before:

    https://www.youtube.com/embed/QUdVteq8XBs

    🙂

  48. 48
    gpuccio says:

    Origenes:

    Very interesting site, thank you for pointing to it. You are becoming better and better! 🙂

    Indeed, zebrafish gives 528 bits total, in two non overlapping alignments.

    This interesting intronic sequence seems to be located between exon 40 and 41, nearer to exon 41, in the Usherin gene.

    Usherin (O75445) is a 5202 AAs long protein, whose gene is divided into 72 exons. The protein has a rather standard evolutionary history, similar to the mean in metazoa. Its function in Uniprot is rather briefly described as “Involved in hearing and vision”.

    How did you blast the nucleotide sequence in pre-vertebrates? I have tried, and could not get any hit before bony fish. Not even your 46 bits.

    My experience with blasting nucleotides is very limited, but it seems that this ultra-conserved intronic sequence exhibits a more distinct jump in vertebrates (apparently in bony fish) than the protein itself where it is located (at least in terms of conservation density).

  49. 49
    Origenes says:

    GPuccio @48

    Good to hear that you find it interesting.

    GPuccio: How did you blast the nucleotide sequence in pre-vertebrates? I have tried, and could not get any hit before bony fish. Not even your 46 bits.

    (1) ‘Apollo’ Fasta sequence from this website:

    GTTCTGTTTATCATTCTAATCCATGTTTTGCAATTTATCTACTCCCTGTT
    AATATTATAGGCGATTTTTTACTGTGGCTGTGACAGAAGCTGCCGATTTA
    GCTCTTTCACCTACTGATAAATAAACAATGCACAGATCTGACCTTTAGGT
    TAACAGGTTTTATGCTTGCTCCACTCAGCACTCTAACTGATTCAATTATC
    ATAAAGGTTCAGGAGGCTCCATGAATACTGAAAAAGGCCCACCATATGCC
    TGCATAGGTGTTGTGGAACAGCAAATATTCTGCAGCCCTCCAGAGAAATT
    CCTTAATTGTAAATAATTTCACCATGCGACACAATCAAGTCACCTTGAAT
    GCAAACCCCTCAGCCTGCGGGGGCAAAGTGTTATTTAAGCTTTACTGGGC
    TGCGTTAAATTCTGCAATTTGAAGGGCTGTTAAGTTTTTCAATTGAAATT
    TCATTTAAAATGCAGGTGCTTTTTATTATATTGAGGCTTTACTGCTCTCT
    AGGTACAAGCAAGAACATGGTGCAATAACACAAATCTGGCTCAATCACTG
    ATCAGTAACAGCTGTAATTCCAGAACATTTAGCATTCTTATAAACCACGG
    CCTGAAATCTATAAATTGCTAAAACAGATCAAGAAAATACTGTATCCCCC
    CTTTTCTGCCCACAGCAATTTTGACATTTATGAGATTTTTCTGTGAACAT
    TAGATTTTATTGAAAATCTTTAAAAAAGATATACTTGGATTTAGTAATTG
    TTTAAAA

    (2) Go to nucleotide blast
    (3) Insert fasta code & exclude vertebrates.

    – – – – – – –
    BTW here the UCNEs are listed. They all have a short blast summary.

  50. 50
    gpuccio says:

    Origenes:

    That’s what I did, but I got the:

    “No significant similarity found”

    result.

    However, maybe I understand the reason. I blasted with the default option (megablast), which is optimized for higher similarities. I repeated the query choosing blastn, which is more sensitive, and I got a few hits, the best of them being 51.8 bits in Populus trichocarpa (a plant).
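
    For readers who prefer scripting to the web form, the same choice can be sketched in Python. This only builds the request parameters and sends nothing. It assumes the parameter names of NCBI’s BLAST URL API (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `MEGABLAST`, `ENTREZ_QUERY`), the `nt` database, and an Entrez filter excluding Vertebrata (taxonomy ID 7742); the function name and the exact filter string are illustrative, not something used in this thread.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(sequence, megablast=False, exclude_taxid=7742):
    """Build (but do not send) parameters for a nucleotide BLAST submission."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastn",   # nucleotide query against nucleotide database
        "DATABASE": "nt",      # assumed database name
        "QUERY": sequence,
    }
    if megablast:
        # Megablast is tuned for highly similar sequences; leaving it off
        # gives the more sensitive plain blastn search described above.
        params["MEGABLAST"] = "on"
    if exclude_taxid is not None:
        # Assumed Entrez syntax for "everything except this taxon".
        params["ENTREZ_QUERY"] = f"all[filter] NOT txid{exclude_taxid}[ORGN]"
    return params


# Short placeholder query (the first bases of the sequence posted above).
params = build_blast_request("GTTCTGTTTATCATTCTAATCC")
body = urlencode(params)
print(BLAST_URL + "?" + body)
```

    Switching `megablast=True` reproduces the default web behaviour that returned “No significant similarity found”.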

    Even in cartilaginous fish, the best hit was 41 bits.

    So, it really seems that the ultra conserved nucleotide sequence appears in bony fish.

  51. 51
    Dionisio says:

    Very insightful discussion between Origenes and gpuccio.

  52. 52
    Dionisio says:

    Ok, enough pretending being an opponent.

    I mistakenly thought that my stupid comments @43 & @47 could encourage some of gpuccio’s polite dissenters -so conspicuously absent from his threads- to jump into this discussion, but now I realized that they won’t, simply because they lack what is required for serious technical discussions: valid arguments and desire to find the truth.

    It is difficult for me to imitate writing so much nonsense.
    Perhaps it’s easier when they believe that it’s true?

    If gpuccio wants to have politely dissenting interlocutors in his discussion threads, he would have to stop making references to theoretical and empirical evidences that scare the potential dissenters away, because they seem to prefer pseudo-philosophical speculative gossiping, not detailed technical scientific discussions with so much real data leaving no room for troll hogwash.

    I prefer the technical discussions, even though they are more difficult for me to understand well. If that keeps the dissenters away, so be it. Still gpuccio’s threads attracted a relatively large number of anonymous readers.

  53. 53
    Origenes says:

    GPuccio @50

    I forgot to mention that I had included metazoa, which explains why I did not get that plant. Scusa!
    – – – – –
    Browsing through the introns base one notices that the scores among vertebrate species are often rather irregular.
    Also, often we find part of the sequence in pre-vertebrates, but every so often there is no trace.
    For instance the ultra conserved intron IRXA_Ruby. Very conserved in vertebrates. When I blasted it, the lowest score is 351 in bony fishes. And then “No significant similarity found”.
    That is 351 bits of functional information out of nowhere. Surely, it is not as spectacular as the 1000s of unaccounted bits wrt protein sequences, but, still, it is not what unguided-evolution-fans want to hear. 🙂

  54. 54
    gpuccio says:

    Dionisio:

    Yes, certainly I am not happy that no serious discussion about real data can be achieved with our interlocutors.

    That nobody has even tried to answer my challenge is the most disappointing fact of all.

    How can they deny that deconstruction of complex functions is a basic requirement for their theory to be acceptable?

    How can they not even try to give arguments in its support?

    If Corey is all that is left on the opponent side, the situation is really sad! 🙂

    However, be happy! I officially relieve you from having to pretend to be my opponent!

    Mung can probably go on. He does it so naturally and graciously! 🙂

  55. 55
    gpuccio says:

    Origenes:

    Yes, the amazing examples of thousands of bits in protein information jumps have probably intoxicated us, but we must not forget that even a couple of hundreds of bits is an amazing result.

    Moreover, these examples of ultra-conserved information in non coding sequences, like introns, are especially fascinating.

    However, I think that non coding DNA can be functional even when no relevant conservation is observed.

    In principle, there are two possible explanations for that:

    a) The function is extremely specific to the species, and therefore is scarcely conserved.

    b) The relationship between function and sequence is different than in proteins, and allows much greater sequence variation.

    Both things are possible, and they are not mutually exclusive. But we still know too little.

    Another fascinating aspect of non coding DNA is that its function can well be non local. Transcribed DNA can work in a multitude of different regulatory ways, and it can influence processes that are apparently unrelated to the location where the DNA sequence is found. Moreover, complex variations in DNA structure, still little understood, can generate important interactions between very distant DNA sites.

    How does Dionisio put it?

    “Complex functionally specified informational complexity.”

    I suppose it’s an understatement! 🙂

  56. 56
    gpuccio says:

    Dionisio at #27:

    “Where is this LUCA positioned relative to bacteria, prokaryotes, etc.?
    Before, in between, after?”

    The idea is that it is the common ancestor of both bacteria and archaea (and therefore, of all other living things). So, those domains or structures that are common to the two would have likely already been present in LUCA.

    LECA would be, instead, the last eukaryotic common ancestor. This entity, however, is much more elusive, given the many uncertainties about eukaryotic appearance.

    Of course, neo-darwinists believe that there was a FUCA (first universal common ancestor), and that OOL proceeded in some way from FUCA to LUCA.

    But there is no real evidence that a FUCA ever existed as separated from LUCA. Only fairy tales.

    Instead, LUCA at least can be defined in some way from empirical observations (the information common to bacteria and archaea), and therefore has some scientific status, IMO, with all the necessary cautions.

  57. 57
    Origenes says:

    GPuccio: Yes, certainly I am not happy that no serious discussion about real data can be achieved with our interlocutors.

    That nobody has even tried to answer my challenge is the most disappointing fact of all.

    To be frank, I am disgusted. Very angry.

  58. 58
    daveS says:

    gpuccio,

    GPuccio: Yes, certainly I am not happy that no serious discussion about real data can be achieved with our interlocutors.

    That nobody has even tried to answer my challenge is the most disappointing fact of all.

    This is a fairly small community; is it possible that there are no ID critics here with the background necessary to address your challenge? Or perhaps those who do have the background do not have the time or interest at the moment?

  59. 59
    gpuccio says:

    daveS:

    Of course, it is possible.

    However, a number of critics seem to be ready to debate the most disparate arguments, ranging from philosophy to religion to politics to morals to physics to quantum mechanics to probability and so on. And, of course, anything about biology, even when they seem not to understand the basics of it.

    My two questions are about a very simple and fundamental requirement for the neo-darwinian algorithm to work: that complex functions should be, as a rule, deconstructable into simple naturally selectable steps.

    Without that, all the neo-darwinian theory has no foundation.

    I have simply asked if somebody can give reasons, any reason, why we should believe such a thing to be true. Either logical reasons, or empirical reasons.

    So, such a complete silence is not really understandable, from people who are so certain of what they believe, and from a whole world that is so certain that that belief is absolute truth. 🙂

  60. 60
    Mung says:

    Mung can probably go on. He does it so naturally and graciously!

    All of you remind me of a bunch of monkeys banging away at typewriters. My cat produces more interesting content when he sits on my keyboard.

    You guys stumble across a few sites on the interweb and it’s like you’ve discovered the fountain of truth. Frankly, it’s embarrassing.

    Let me know when you understand the science. Then perhaps we can have an intelligible conversation.

    Until then …

  61. 61
    daveS says:

    gpuccio,

    However, a number of critics seem to be ready to debate the most disparate arguments, ranging from philosophy to religion to politics to morals to physics to quantum mechanics to probability and so on. And, of course, anything about biology, even when they seem not to understand the basics of it.

    Perhaps true, but quite a few of those subjects are accessible to the layman (at least at a superficial level). Most of us have had some exposure to religion and politics, for example.

    On the other hand, I suspect that if you posed a problem involving deriving the position wave equation for a particle in a box, you would also get relatively few responses.

    So, such a complete silence is not really understandable, from people who are so certain of what they believe, and from a whole world that is so certain that that belief is absolute truth.

    I wonder if your interlocutors actually feel so certain about their beliefs. I don’t feel very confident in many of my beliefs, and in fact am sometimes surprised at how readily some here can answer what I consider to be very difficult questions.

  62. 62
    gpuccio says:

    Mung:

    Quod erat demonstrandum! 🙂

    And my cats strolling on my keyboard are a serious problem indeed. 🙂

  63. 63
    gpuccio says:

    daveS:

    Maybe some details in my OPs are very technical. I can understand that.

    But I think that my two questions, my challenge, are not technical at all.

    We all have to do with complex functions: in language, in software, in all kinds of machines.

    Wondering if they can be, as a rule, deconstructed into simpler steps does not seem so far fetched.

    And statements about proteins and their functional space are made daily by our friends on the other side, most of them without any foundation.

    How many of them have invoked Keefe and Szostak as the final evidence that functional proteins are abundant on the market? How many of them have invoked Wagner’s n dimensional cubes or what else, without even trying to explain what they meant? How many of them have invested all their personal credibility in the reality of the RNA world?

    Complex issues, indeed. Nobody seems to fear their complexity.

    My two questions are not so complex. We have a protein with hundreds of conserved functional residues. Do they really believe that there are hundreds of gradual 1 AA steps to that sequence? Do they really believe that each of those steps confers a reproductive advantage, and is therefore naturally selectable?

    It seems that they do believe that.

    I am only asking: why? Please, give me your reasons. I am not asking to be convinced. Just to know if they have reasons, reasons that can be expressed and explained.

  64. 64
    gpuccio says:

    daveS:

    “I wonder if your interlocutors actually feel so certain about their beliefs.”

    Well, I am not in their minds and in their hearts. They act as though they were, however.

    “I don’t feel very confident in many of my beliefs,”

    Me too, but I have a good trick, and a rather simple one too. When I don’t feel very confident, I simply try to not consider those things as beliefs, but rather as tentative hypotheses. And I rarely speak of them.

    On the other hand, when I feel confident… 🙂

  65. 65
    Origenes says:

    GPuccio

    Would you be so kind as to tell me how to interpret the following? I blasted the human gene “sonic hedgehog (SHH)”. It codes for a protein with (of course) the same name, which, for one thing, plays a key role in the organization of the brain. Rather a long gene: 10034 nucleotides.

    More info here. Also, on that page, you can find the fasta sequence — under the heading “Genomic regions, transcripts, and products.”

    Okay, so I blasted it:

    The top scores: Homo sapiens scores 18530; Chlorocebus sabaeus (green monkey) 6514; Macaca fascicularis (crab-eating macaque) 3722; Gorilla 2710.
    Those were some unexpected jumps, but here is another: most animals (e.g. rodents, bats, whales & dolphins) score about 1000.
    Excluding vertebrates resulted in 2 hits: a cricket 231 and a millipede 326.

  66. 66
    gpuccio says:

    Origenes:

    Blasting nucleotides is rather tricky. That’s why I usually stick to proteins.

    Now, the gene you blasted is 12289 nucleotides long, and it is a genomic sequence. Most of it is introns, and there are at least 3 exons, and many different transcripts.

    In chimp, with megablast, you have 99% identity and 22957 bits.

    In gorilla I get 2821 (max) and 4181 (total). You see homology mainly in the exons, while the introns are already different.
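
    To make the difference between “max” and “total” concrete: BLAST reports the best single aligned segment as the max score and the sum over all aligned segments of the same subject as the total score. A minimal Python sketch follows; the split of the gorilla hit into two segments is hypothetical, chosen only so that the totals match the figures above.

```python
def summarize_hit(segment_bit_scores):
    """Return (max score, total score) for one subject sequence."""
    if not segment_bit_scores:
        raise ValueError("a hit needs at least one aligned segment")
    return max(segment_bit_scores), sum(segment_bit_scores)


# Hypothetical segment scores: only the max (2821) and the total (4181)
# come from the thread; the second value is simply the difference.
gorilla_segments = [2821, 1360]
max_score, total_score = summarize_hit(gorilla_segments)
print(max_score, total_score)
```

    When homology is confined to separate exons, the total can be much higher than the max, which is why both numbers are worth reading.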

    In mouse I get about 2000 bits, and it is still possible to distinguish 3 zones of homology (probably the 3 main exons, and some other near regions).

    In cartilaginous fish, you barely distinguish the exons.

    Introns seem not to be conserved, in large part, except for very near species (like chimp).

    One thing that we must consider is that, at nucleotide level, there is the strong effect of neutral variation in exons (synonymous mutations), which does not affect the protein sequence, but does change the nucleotide sequence, proportionally to evolutionary distance. Remember that 200 – 400 million years can alter greatly any homology in synonymous sites.
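
    A toy Python example may help to see how strong this effect can be. The two short DNA sequences below are invented for illustration; they differ at most positions, yet they translate to the same peptide under the standard genetic code. Only the handful of codons needed here are included in the table.

```python
# Partial standard genetic code: only the codons used in this toy example.
CODON_TABLE = {
    "CTG": "L", "TTA": "L",
    "AGC": "S", "TCT": "S",
    "CGT": "R", "AGA": "R",
    "GGC": "G", "GGA": "G",
}


def translate(dna):
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented sequences that differ only by synonymous substitutions.
seq_a = "CTGAGCCGTGGC"
seq_b = "TTATCTAGAGGA"

nucleotide_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))
print(f"nucleotides: {nucleotide_identity:.0%}, protein: {protein_identity:.0%}")
```

    Here the nucleotide identity is only one third while the protein is fully conserved, which is the reason a protein blast can still see homology that a nucleotide blast has lost.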

    In general, also, non coding DNA is not much conserved, as we have seen, even if there are some important exceptions (like the ultra-conserved segments).

    In blasting nucleotides, we have to pay attention also to the type of sequence we are blasting: DNA, cDNA, mRNA, and so on. There is a lot of heterogeneity, and that makes the task more difficult.

    Again, I do not have great experience in blasting nucleotides. For my purposes, blasting protein sequences is more useful.

  67. 67
    Origenes says:

    GPuccio

    Thank you very much for your time.
    I must have made several mistakes — including choosing the wrong fasta sequence.
    Good that you point out the many problems with blasting nucleotides. It is clearer now. For all those reasons it is best to stick with protein sequences, as you say.

  68. 68
    gpuccio says:

    Origenes:

    Yes, protein sequence conservation in spite of neutral variation is the best measure of functional information, which is what we are really interested in.

    I have worked a little with nucleotides when I made some evaluations of the Ka/Ks ratio, and it was really hard work. In general, it is not necessary to do that for our purposes.

  69. 69
    ET says:

    The only experts at TSZ are those expert in obfuscation, equivocation and denial.

  70. 70
    Origenes says:

    Yes, let’s blast a protein!

    SYNE1 – Nesprin-1 is a rather large human protein.
    Function:

    Multi-isomeric modular protein which forms a linking network between organelles and the actin cytoskeleton to maintain the subcellular spatial organization. As a component of the LINC (LInker of Nucleoskeleton and Cytoskeleton) complex involved in the connection between the nuclear lamina and the cytoskeleton. The nucleocytoplasmic interactions established by the LINC complex play an important role in the transmission of mechanical forces across the nuclear envelope and in nuclear movement and positioning. May be involved in nucleus-centrosome attachment and nuclear migration in neural progenitors implicating LINC complex association with SUN1/2 and probably association with cytoplasmic dynein-dynactin motor complexes; SYNE1 and SYNE2 may act redundantly. Required for centrosome migration to the apical cell surface during early ciliogenesis. May be involved in nuclear remodeling during sperm head formation in spermatogenesis; a probable SUN3:SYNE1/KASH1 LINC complex may tether spermatid nuclei to posterior cytoskeletal structures such as the manchette.
    [uniprot]

    I blasted isoform 1, also known as Nesprin-1 Giant or Enaptin. Length: 8,797 AAs.

    What we end up seeing is an enormous information jump during the pre-vertebrate-vertebrate transition:

    Extreme conservation in primates: homo sapiens 17954, gorilla 17801, rhesus macaque 17589.
    Also very well conserved in: dog 16355, horse 16293, elephant 15969, gray short-tailed opossum 14830, crocodile 13575, Blue-crowned manakin (bird) 13010, Central bearded dragon (lizard) 12872, frog 11872, cichlid fish (bony fish) 9965

    And now, as promised, things get really jumpy:

    Whale shark 6791; West Indian Ocean coelacanth (fish) 6069; acorn worm 1732 (!), ant 853.

  71. 71
    Mung says:

    So there you have it. Expose gpuccio to serious criticism and he folds like a worn out accordion. No wonder it sounds like he plays the same tune over and over.

    Does he have a doctorate in BLASTing? I think not.

    Are his publications on bioinformatics examples of leading edge in the field? Laughable.

    I bet he’s a handsome guy though, with a nice smile, and an easygoing manner. Mothers, protect your daughters.

  72. 72
    gpuccio says:

    Origenes:

    Yes, that is indeed a huge protein, and it certainly shows a very big jump in vertebrates.

    The best pre-vertebrate hit is 1732 bits, while in cartilaginous fish we have 10730 bits in Callorhinchus milii: a jump of 8998 bits, or 1.02 baa (bits per amino acid site).
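
    The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function name `information_jump` is invented for the example, and the inputs are the values quoted in this thread (best pre-vertebrate hit, best cartilaginous fish hit, and the length of human isoform 1 in amino acids).

```python
def information_jump(older_bits, newer_bits, query_length):
    """Return (jump in bits, jump in bits per amino acid site)."""
    jump = newer_bits - older_bits
    return jump, jump / query_length

# Values quoted in the thread for human SYNE1 isoform 1
pre_vertebrate_best = 1732   # acorn worm, best pre-vertebrate hit
cartilaginous_best = 10730   # Callorhinchus milii
query_length = 8797          # length of the human query, in amino acids

jump_bits, jump_baa = information_jump(
    pre_vertebrate_best, cartilaginous_best, query_length
)
print(f"jump: {jump_bits} bits, {jump_baa:.2f} baa")
```

    With these inputs the sketch prints a jump of 8998 bits and 1.02 baa.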

    One caution. This protein is a good example of a specific problem we have to consider, especially with these huge molecules: they often include many repetitions of some domain.

    For example, SYNE1 includes 74 instances of the Spectrin domain! They make up the greatest part of the protein.

    IOWs, they are highly modular proteins.

    Of course, that can in some way amplify sequence similarities or differences. However, the single modules are not really repetitive. For example, I have blasted the last module in SYNE1 (the C terminal one) against the protein itself, and homologies with all the other modules in the protein are rather small (20-30% identities). That means that each module is in some way differently engineered (if you consider that the conserved homology of the whole protein in Callorhinchus milii is 62% identities).

    However, in principle these highly modular proteins, which are often proteins involved in the cytoskeleton, should be considered with caution, because they certainly have great inner informational redundancy, if compared with most other proteins where such domain repetitions are not present.

    That can also probably explain the rather big difference in human homology that we observe between the two sequenced sharks:

    Callorhinchus milii: 10730

    Rhincodon typus: 6791

    For most proteins, those two organisms are rather concordant. Again, the high modularity of this protein can amplify differences (or similarities).

  73. 73
    gpuccio says:

    Mung:

    “I bet he’s a handsome guy though, with a nice smile, and an easygoing manner. Mothers, protect your daughters.”

    If only all my critics were like you! 🙂 🙂

  74. 74
    Dionisio says:

    gpuccio @72:

    Of course, that can in some way amplify sequence similarities of differences.

    Of course, that can in some way amplify sequence similarities or differences.

    “of” ‘or’?

  75. 75
    Dionisio says:

    gpuccio,

    What do you think of this paper?

    https://www.frontiersin.org/articles/10.3389/fmicb.2015.01144/pdf

    Thanks.

  76. 76
    gpuccio says:

    Dionisio:

    or, of course or! 🙂

  77. 77
    Origenes says:

    GPuccio: One caution. This protein is a good example of a specific problem we have to consider, especially with these huge molecules: they often include many repetitions of some domain.

    For example, SYNE1 includes 74 instances of the Spectrin domain! They make up the greatest part of the protein.

    Thank you for taking a look at it.
    I see that I can find this information (74 instances of the Spectrin domain) at Uniprot under the heading ‘Family & Domains’.
    Twelve different isoforms are offered. Is there any way to tell which isoform contains many instances of the Spectrin domain? Or does each isoform contain 74 instances of the Spectrin domain?
    – – – – –
    / / / / /
    Mung is really good at this. 🙂

  78. 78
    gpuccio says:

    Dionisio:

    It is interesting for some aspects, but really vague. The idea that LUCA, or any other step of OOL, could have been a complex system of interacting organisms is interesting. I would probably support that point.

    The problem remains, of course: how did such a complex system of interactions arise? 🙂

  79. 79
    gpuccio says:

    Origenes:

    Isoform 1 is the reference sequence for the protein.

    As you can see in Uniprot, the other isoforms are mostly partial sequences, with missing parts, and some changes in other parts. I really don’t know what their biological value is. One should look at the relative publications, but probably little is understood in most cases.

    For most practical purposes, what is known is about the reference sequence.

  80. 80
    gpuccio says:

    Origenes:

    Of course, the partial sequences cannot include all the spectrin domains.

    For example, take isoform 11, which is very short.

    If you blast the whole protein (Q8NF91) against the FASTA sequence of isoform 11 (from Uniprot), using the “Align two or more sequences” tool, you will see that isoform 11 is identical to the 7659-8797 part of the whole protein.

    Therefore, it includes only the last 8 spectrin domains.

    The extremely low partial homologies with the rest of the protein (the black alignments under 50 bits and the one green alignment of 52 bits) are all expressions of the low homology existing between different spectrin domains in the molecule, as I have previously mentioned.

  81. 81
    Dionisio says:

    gpuccio @78,
    Interesting point. Thanks.

  82. 82
    Origenes says:

    Here is a large human protein ‘Fibrous sheath-interacting protein 2’ — FSIP2 — that meets at least a few of GPuccio’s standards:
    (1) It is reviewed.
    (2) It is non-modular.
    Information can be found at Uniprot
    Function:

    This gene encodes a protein associated with the sperm fibrous sheath. Genes encoding most of the fibrous-sheath associated proteins genes are transcribed only during the postmeiotic period of spermatogenesis. The protein encoded by this gene is specific to spermatogenic cells. Copy number variation in this gene may be associated with testicular germ cell tumors. Pseudogenes associated with this gene are reported on chromosomes 2 and X. [provided by RefSeq, Aug 2016]
    [source]

    There are two isoforms. I blasted isoform 1; Length:6,907.

    The results are astonishing if I am correct. But it seems very likely that I made some mistake (again). Anyway, here a summary of the results I got:
    homo sapiens: 14163
    bonobo: 13951
    gorilla: 13840
    ….
    whales & dolphins: 9536

    domestic cat: 9237
    bats: 9120
    ….
    the Cape golden mole: 8255

    Here is where things are getting weird.

    Rhincodon typus: 226

    Surely, I have made a few mistakes.

  83. 83
    EugeneS says:

    GPuccio,

    Thank you very much! It is absolutely essential to have a reference like this for a non-expert reader.

  84. 84
    gpuccio says:

    Origenes:

    No mistakes. That is simply one of the many proteins that were engineered later, in particular in mammals.

    The first big hit can be found in marsupialia:monodelphis domestica, 4627 bits.

    And, of course, in afrotheria we have a further jump: 9264 bits in Trichechus manatus latirostris.

    A lot of human proteins are engineered in the transition to mammals. In particular, sperm and testis related proteins are often late-engineered.

    So, you are perfectly right.

    The transition to mammals certainly deserves to be analyzed. I have thought many times to write something about it.

    The reason why I stick usually to the vertebrate transition is very simple: it is much older.

    There, we have 400+ million years.

    With mammals, much less. Maybe 100 – 130 million years.

    Which is not a short time, certainly.

    But 400 is better.

    400 million years guarantees complete and full exposure to neutral variation. That can be easily seen when Ka/Ks ratios are computed. The Ks value reaches what is called “saturation” after 400 million years: IOWs, any initial homology between synonymous sites is completely undetectable after that time.

    That means that what is conserved after that time is certainly conserved because of functional constraint.

    While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation.

    IOWs, let’s say that we have 95% identity between humans and mouse, for some protein. Maybe some of that homology is simply due to the fact that the split was 80 million years ago: IOWs, some AA positions could be neutral, but still be the same only because there was not enough time to change them.

    Of course, the bulk of conserved information will still be functionally constrained, but probably not all of it.

    However, that does not prevent us from analyzing more recent transitions. 100 million years are still an interesting split time for that.

    Of course, there is scarce meaning in analyzing primates that way, for example, because there most of the similarity will be passive, especially in species very near to humans, like chimp and gorilla.

    Another way to say that is that we can assume that conserved information accurately measures functional information, provided that a long enough evolutionary time separates the branches we are analyzing.

  85. 85
    gpuccio says:

    EugeneS:

    You are welcome!

    You were the real inspiration for this OP and discussion. 🙂

  86. 86
    Origenes says:

    GPuccio @84

    While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation.

    Oh, that’s good to know. I didn’t realize that.

    The first big hit can be found in marsupialia:monodelphis domestica, 4627 bits.

    I missed that one. 4627 bits is an astounding information jump, but, as you explained, a jump during mammal transition is not so convincing as a jump during vertebrate transition.
    – – – – –
    One technical question:

    GPuccio: And, of course, in afrotheria we have a further jump: 9264 bits in Trichechus manatus latirostris.

    I take it that you reject the Cape golden mole (8255). Why is that?

  87. 87
    gpuccio says:

    Origenes:

    “I take it that you reject the Cape golden mole (8255). Why is that?”

    No, I don’t reject it. But my methodology is to take the best hit in each class of organisms. The best hit represents the best conservation in that class. In each class, especially if numerous, you have a lot of lower hits, which can be explained in many different ways:

    a) Different functional constraints in different species

    b) Loss of functionality in specific species (a very common issue, as explicitly admitted by neo-darwinists)

    c) Hits with different but similar proteins, or with different isoforms of the same protein

    The best hit is what represents the class in all my reasonings.

    In Afrotheria, the best hit is 9264 bits. That means that, at the split between Afrotheria and Boreoeutheria, that level of information had to be present in the common precursor.
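
    As a rough illustration of the "best hit per class" rule, here is a minimal Python sketch. The function name and the grouping labels are assumptions made for the example; the bitscores are the FSIP2 values quoted in this thread.

```python
# (class, organism, bitscore) for human FSIP2, as quoted in the thread.
# The class labels are an illustrative grouping, not BLAST output.
hits = [
    ("Afrotheria", "Trichechus manatus latirostris", 9264),
    ("Afrotheria", "Cape golden mole", 8255),
    ("Marsupialia", "Monodelphis domestica", 4627),
    ("Cartilaginous fish", "Rhincodon typus", 226),
]

def best_hit_per_class(hits):
    """Keep only the highest-scoring hit for each class of organisms."""
    best = {}
    for group, organism, bits in hits:
        if group not in best or bits > best[group][1]:
            best[group] = (organism, bits)
    return best

best = best_hit_per_class(hits)
for group, (organism, bits) in best.items():
    print(f"{group}: {bits} bits ({organism})")
```

    The Cape golden mole hit is not discarded as wrong; it is simply not the value that represents Afrotheria in the comparison.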

  88. 88
    gpuccio says:

    Origenes:

    If you look at this article:

    https://www.ncbi.nlm.nih.gov/books/NBK21946/

    in particular at Fig. 26.17 and 26.18, you will see that 100 million years could be just enough to ensure about 1 mutation per synonymous nucleotide, which is certainly a lot.

    With 400 million years, we are almost at 3 mutations per site, which ensures complete degradation of any passive homology!
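
    To give a rough feeling for why about 3 substitutions per site amounts to saturation, here is a small sketch. It assumes the simple Jukes-Cantor model (four equally frequent nucleotides, all substitutions equally likely), which is a simplification chosen for illustration and is not taken from the linked chapter; the function name is invented. Under that model the expected fraction of neutral sites that still match, after d substitutions per site separating two sequences, is 1/4 + 3/4·exp(−4d/3), where 1/4 is what unrelated sequences show by chance.

```python
import math

def expected_identity(d):
    """Expected fraction of identical neutral sites under Jukes-Cantor,
    for d substitutions per site separating two sequences."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

# d = 1 and d = 3 are the approximate values mentioned for 100 and 400 million years
for d in (0.0, 1.0, 3.0):
    print(f"d = {d:.1f}: expected identity = {expected_identity(d):.3f}")
```

    Under this assumption the expected identity at neutral sites is about 0.45 for d = 1 and about 0.26 for d = 3, which is barely distinguishable from the chance level of 0.25.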

  89. 89
    Origenes says:

    GPuccio,

    Thanks again for your time. I feel a bit guilty by taking advantage of your expertise. Hopefully there are others, besides me, who profit from my attempts and questions.

    I take it that when you write …

    While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation.

    … your concern is that the functionality of the sequence is not thoroughly established.

    If so, what if there is independent way to establish that the protein sequence is functional? What if functionality is established by e.g. disabling the protein in the organisms? Would that restore the claim that thousands of bits of functional information popped into existence?

  90. 90
    gpuccio says:

    Origenes:

    No, the problem is not the functionality of the proteins. Of course, there is no reason to believe that each protein is not fully functional, in its own organism.

    The problem is rather: how much does the conserved information correspond to specific functional information? IOWs, how much does the functionality depend on the conserved information?

    In that sense, the longer the evolutionary time that the protein has been exposed to variation, the more we can assume that conservation is a good measure of functional information.

    Of course, that is absolutely sure for very long evolutionary conservation, for example what we see in the alpha and beta chains of ATP synthase, which can flaunt billions of years of conservation.

    The 400+ million years are a safe guarantee, too.

    200 million years, 100 million years… I believe those are valid time frames too, but in those cases some part of the conservation could be less related to functionality.

    Less than 100 million years: conservation retains functional value, of course, but passive conservation is probably an important factor too.

    I hope this is clear.

  91. 91
    Origenes says:

    GPuccio

    So, information which originated 400 million years ago and is conserved up till humans is without a reasonable doubt truly functional information. Not only is it functional information, we also know that there is no alternative functional information in the proximity of the sequence in search space. IOWs the functionality depends heavily on the information of the conserved sequence.

    Which means that, in order to show an information jump that is inexplicable by unguided evolution, we need low homology in pre-vertebrates, but not too low. It must be a functional protein in pre-vertebrates, but there must also be a 500+ bits information jump at the vertebrate transition.

    Good examples can be found in your article about SATB1 and SATB2.

    Is that correct?

  92. 92
    Dionisio says:

    Origenes,

    Perhaps other readers, including myself, could learn from the interesting friendly discussion you have with gpuccio.
    Keep it going.
    Thanks.

  93. 93
    gpuccio says:

    Origenes:

    Yes, that’s the idea. You are correct.

    However, it is not strictly necessary that low homology in pre-vertebrates should be found. We can also have a jump “from scratch”, IOWs a completely new functional protein in vertebrates, practically absent in pre-vertebrates. It’s rarer, but there are a few examples.

    See for example the Activity-dependent neuroprotector homeobox protein (Q9H2P0), that I have mentioned in post #19, in answer to Mung.

    A greater number of new proteins can certainly be found in the transition to eukaryotes.

    The point is: a big information jump points to design, either if it rests on already existing homology in previous existing proteins, or if it is “from scratch”. There is no difference, from an ID point of view: it is simply the amount of functional complexity linked to the transition that counts.

    In the same way, it is not important if we write a 2 Megabytes module to add it to an existing program, or just as a standalone program. It is, always, a new original 2 Megabyte module of functional information. A similar amount of design is necessary to write it, whether it is as an addition to an existing program or as a standalone piece of software.

  94. 94
    Dionisio says:

    “a big information jump points to design, either if it rests on already existing homology in previous existing proteins, or if it is “from scratch”. There is no difference, from an ID point of view: it is simply the amount of functional complexity linked to the transition that counts.”

    Can “the Neo-Darwinism of the gaps” claim that they don’t know how those information jumps appeared, but eventually science will advance and researchers will discover NW evidences that will help them figure out how that happened?
    Or maybe the ‘third way of evolution’ folks hope to reach such a breakthrough moment?

  95. 95
    Dionisio says:

    Error correction

    Sorry, I meant

    “…discover new evidences…”

  96. 96
    Dionisio says:

    @94 follow-up

    On the other hand, the source of complex functionally specified information has been empirically known for years.

  97. 97
    EugeneS says:

    GP

    Why is the max baa about 2.2, and not about log2(20) = 4.3? I seem to recall you mentioning that bitscore was an underestimate. Thanks.

  98. 98
    gpuccio says:

    EugeneS:

    Yes, the bitscore usually assigns a maximum of about 2.2 for identity.

    The way bitscore is computed is rather complex. It is detailed here:

    https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

    The main purpose of the bitscore is to detect homologues. In comparison to the whole informational potential of about 4 bits per AA, it very much underestimates identities.
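
    To put numbers on that comparison, here is a small Python sketch. It computes the theoretical maximum of log2(20) bits per site and the bits per amino acid (baa) for three human proteins aligned with their own human sequence, using the bitscores and lengths quoted in this thread. The variable names are invented for the example, and the SYNE1 and FSIP2 human hits are assumed to be self-hits.

```python
import math

# Theoretical maximum information per site with 20 possible amino acids
max_bits_per_site = math.log2(20)
print(f"log2(20) = {max_bits_per_site:.2f} bits per site")

# (bitscore of the human hit, length in amino acids), values from the thread
self_hits = {
    "SYNE1": (17954, 8797),
    "FSIP2": (14163, 6907),
    "SATB1": (1577, 763),
}

# Bits per amino acid site for a protein compared with itself
baa = {name: bits / length for name, (bits, length) in self_hits.items()}
for name, value in baa.items():
    print(f"{name}: {value:.2f} baa")
```

    All three values come out a little above 2 baa, roughly half of the 4.32 bits per site that log2(20) would allow.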

    However, the bitscore is normalized with respect to the scoring system. My idea is that it represents a valid objective reference.

    In the end, we can never know how precisely any score measures functional information, unless we have precise top-down methods of computing the functional information in proteins. Which, at present, we do not have.

    Sequence conservation is an indirect way of measuring functional information. And, IMO, a very good one.

    There is no doubt that sequence conservation, with the cautions that I have highlighted in the OP and in the discussion, measures functional information. There are very good correspondences with what we know of protein function: for example, the simple fact that many proteins with great information jump in vertebrates are involved in regulation of neuronal differentiation is in itself amazing.

    At present we cannot safely assess if the bitscore underestimates the absolute functional information (which is my opinion), or overestimates it, or simply gives a reliable estimate of it. To know that, we should have a direct measure of functional information and use it as a gold standard. That will come, in time, But we are still far away from such a result.

    In the meantime, the bitscore is the best tool we have. And it is a very powerful and useful tool.

  99. 99
    Mung says:

    hi gpuccio,

    Most of the discussion seems to be revolving around comparison of proteins across taxa. Are you aware of any protein database that focuses on protein homology within species?

    What I mean is, every protein existing in humans must have evolved from some other protein within the human lineage. So we ought to be able to create a tree of protein phylogeny within the human lineage (or any other) alone, without looking at any other taxon.

    I know our unguided evolutionist friends like to concentrate on relatedness of species, but what about relatedness of proteins within species?

    Doesn’t the same common ancestry argument apply to both?

    Try not to muck up your answer. You already look silly enough. Thanks.

    😉

  100. 100
    gpuccio says:

    Mung:

    Here is my silly answer! 🙂

    I am not sure that I understand well your question.

    Let’s say that we consider the human species, or any other.

    Of course we can blast each human protein against the whole human proteome. It’s very easy.

    For example, if we take the famous SATB1, Q01826, 763 AAs, and blast it against organism “homo sapiens”, what we get is:

    a) One hit with itself, 100% identity, 1577 bits

    b) 4 hits with isoforms of SATB1, ranging from 1558 to 1085 bits

    c) 5 hits with the sister protein SATB2 and its isoforms, ranging from 854 to 514 bits

    d) A few more hits, 208 – 108 bits, with crystal structures of some domain of SATB1

    Nothing else.

    IOWs, SATB1 is homologous only to itself and, at a lower level, to SATB2.

    I am not sure what you mean when you say:

    “What I mean is, every protein existing in humans must have evolved from some other protein within the human lineage.”

    No, proteins are often isolated islands at sequence, structure and function level. That’s why we have 2000 protein superfamilies.

    Of course, that’s not always the case. Many proteins are part of vast families, and so when you blast the protein you will find many homologies, of different levels, with other proteins which are members of the same family.

    But, if you want to understand the evolutionary history of one protein, you have to look at older taxa, not to the same species.

    For example, for SATB1 and SATB2, you can see that both proteins practically start their existence in vertebrates, and that they remain mainly similar to themselves in all the successive evolutionary history. See my OP here:

    https://uncommondescent.com/intelligent-design/interesting-proteins-dna-binding-proteins-satb1-and-satb2/

    in particular Fig. 1 and Fig. 6

    You say:

    “So we ought to be able to create a tree of protein phylogeny within the human lineage (or any other) alone, without looking at any other taxon.”

    I don’t think that is possible. Phylogenies are created by following the same protein through evolution (orthologs). In the same species, you can find paralogs: genes which share some homology, but are different proteins. Or just different isoforms of the same protein.

    A proteome is made of different proteins, most of them unrelated one to the other. Of course, domains are often shared between a few, or even many, proteins. But remember, we have 2000 different protein superfamilies. And even in the same superfamily, many proteins can be apparently unrelated at sequence level, while sharing some structure similarity.

    You say:

    “Doesn’t the same common ancestry argument apply to both?”

    No. A new protein superfamily, or just a new protein, often appears at certain definite points of evolutionary history, like SATB1 and SATB2 in vertebrates. When it appears, it has no antecedents. It is a novelty.

    In many cases, as I have tried to show, new proteins, appearing for example in vertebrates, share some basic information, maybe one or two domains, with other, different proteins which existed before. But also in that case, the bulk of the information in the new protein is a complete novelty, a huge jump in functional information, functional sequence information that did not exist before.

    The reason is simple: proteins are engineered, they don’t simply descend from other proteins. As you know, I believe in Common descent of proteins, indeed all my reasonings are based on that assumption. But that means only that the new engineering, in some way, happens starting from something that already exists, be it some other protein or, as I believe, some non coding sequence.

    IOWs, new proteins are engineered, they don’t simply “descend”.

    But the proteins which remain the same, or are only changed a little, do descend from existing proteins, with or without some minor engineering. And they also bear the mark of neutral variation, especially in their synonymous sites, as unequivocal evidence of their descent.

    (I hope I have not mucked up too much my answer! 🙂 )

  101. 101
    Mung says:

    gpuccio,

    For every species we ought to be able to construct a phylogenetic tree showing the relationship of that species to one or more other species. Do we agree so far?

    If we go back far enough the phylogenetic tree should be rooted in the LUCA. This is standard evolutionist thinking and I don’t think you disagree.

    Are we on the same page so far?

    How many protein families were present in the LUCA, wild guess. I won’t hold you to it. 🙂

  102. 102
    Mung says:

    love you man, you know it!

  103. 103
    gpuccio says:

    Mung:

    There are different estimates of how many protein superfamilies were already present in LUCA. I have found numbers ranging from a minimum of about 150 to a maximum of almost 1000.

    However, almost everyone agrees that LUCA was already a very complex organism, or set of organisms.

    Of course, many things about LUCA are still very tentative. But it is a very interesting subject, which can be addressed with some empirical consistency.

    Love you too, of course! 🙂

  104. 104
    Mung says:

    gpuccio,

    So let’s assume 150 protein superfamilies already present in the LUCA. Does that point to 150 proteins present in the LUCA or from which modern proteins descended or does that point to more than 150 proteins from which modern proteins descended?

    IOW, were there already protein superfamilies present in the LUCA. or were there 150 proteins from which all extant proteins later evolved, after the LUCA?

    Do protein superfamilies arise after the LUCA or are they already present in the LUCA? Does this make sense?

    If not the fault is all yours, I’m sure. 😉

  105. 105
    Mung says:

    Asking gpuccio about protein superfamilies is like asking gpuccio about Santa Claus. We have the same reason for believing in both.

  106. 106
    gpuccio says:

    Mung:

    “Do protein superfamilies arise after the LUCA or are they already present in the LUCA? Does this make sense?”

    Of course they do arise after LUCA! Only some of the current protein superfamilies were present in LUCA, be they 150 or 800.

    A great number of new superfamilies arise in the course of natural history, almost up to humans.

    How do they arise?

    Neo-darwinists believe that they arise by RV + NS. Which, of course, is impossible.

    But guided descent is a perfectly feasible explanation. Guided descent simply means that:

    a) All that can remain the same remains the same, or is just slightly tweaked to be adapted to the new design.

    b) All that is needed as a novelty is engineered, using some physical already existing material (non coding sequences, or duplicated and inactivated genes, for example), provided by descent, and reshaping it completely according to a designed plan. IOWs, a lot of new functional information is added to build the new features, be they new proteins, or protein regulations, or networks of any type.

    How is that accomplished? As well known, my favourite scenario is transposon driven re-shaping of existing stuff. But other types of implementation, of course, are possible.

    For example, the N terminal domain in SATB1 is a new superfamily of its own, the SATB1_N superfamily. That superfamily does not exist before the appearance of SATB1 which, as we have seen, is a vertebrate protein.

    So, how did SATB1 and its specific domain arise in vertebrates?

    Of course, by engineering.

    But does that mean that vertebrates arose from scratch? Of course not.

    Vertebrates share a lot of protein sequences with pre-vertebrates, in particular with non vertebrate chordates, and more in general with all deuterostomes.

    So, descent + reuse of what can be reused + engineering of all novelty is the only feasible scenario, according to what we empirically observe.

    “Asking gpuccio about protein superfamilies is like asking gpuccio about Santa Claus. We have the same reason for believing in both.”

    What’s your problem with Santa Claus? 🙂

    “If not the fault is all yours, I’m sure.”

    Of course! 🙂

  107. 107
    gilthill says:

    gpuccio @90

    you said “In that sense, the longer the evolutionary time that the protein has been exposed to variation, the more we can assume that conservation is a good measure of functional information.
    Of course, that is absolutely sure for very long evolutionary conservation, for example what we see in the alpha and beta chains of ATP synthase, which can flaunt billions of years of conservation.”

    But in 1990, Behe wrote a commentary in TIBS entitled « Histone deletion mutants challenge the molecular clock hypothesis » in which he casts doubt on the idea that conservation indicates functional constraints (https://www.ncbi.nlm.nih.gov/pubmed/2251727).

    Below is a piece I found here http://www.arn.org/docs/reviews/rev009.htm that summarizes Behe’s commentary

    “Early in the development of the molecular clock hypothesis, it was discovered that not all proteins “ticked” at the same rate. When compared across a range of species, the fibrinopeptides, for instance, were much “faster clocks” (i.e., having a higher rate of amino acid substitution) than the very conservative, “slowly ticking” histones. These differences, writes Michael Behe (Chemistry, Lehigh University), required a modification to the clock hypothesis: the postulate of functional constraints. Thus, for example, histone H4 would diverge less rapidly than fibrinopeptides if a larger percentage of H4 amino acid residues were critical for the function of the molecule. (p. 374)
    The problem with the notion of functional constraint, Behe argues, is an absence of experimental support:
    Although plausible, it has long been realized that no direct experimental evidence has been obtained ‘showing rigorously that histone function is especially sensitive to amino acid substitution or that fibrinopeptide function is especially insensitive to amino acid substitution.’ (p. 374)
    “Recent experiments,” writes Behe, “now indicate that the key assumption of functional constraints may not be valid.”
    Since the histones are so highly conserved — “the H4 sequence of the green pea differs from that of mammals by only two conservative substitutions in 102 residues” — one might expect that “few, if any, substitutions could be tolerated in the H4 sequence” (p. 374). However, experiments (reported in detail by Behe) have shown that large parts of the histone molecule may be deleted without significantly affecting the viability of the organism (in this instance, yeast) — results which, Behe argues, should trouble defenders of the molecular clock hypothesis:
    [The experimental] results pose a profound dilemma for the molecular clock hypothesis: although the theory needs the postulate of functional constraints to explain the different degrees of divergence in different protein classes, how can one speak of ‘functional constraints’ in histones when large portions of H2A, H2B and H4 are dispensable for yeast viability? And if functional constraints do not govern the accumulation of mutations in histones, how can they be invoked with any confidence for other proteins? (p. 375)
    The resolution of the dilemma, Behe contends, must “as far as possible be grounded in quantitative, reproducible experiments, rather than in simple correlations with time that are its current basis” (p. 375). Otherwise, he concludes:
    [T]he time-sequence correlation may end up as a curiosity, like the tracking of stock market prices with hemline heights, where correlation does not imply a causal relationship.”

    So what do you think of Behe’s contention that the idea that conservation indicates functional constraints not only lack experimental evidences but is also contradicted by them ?
    I realize that he developed his argument a long time ago (1990) and that since that time, some new experimental results may have been produced that may weaken Behe’s contention on this issue. But is it the case ?

  108. 108
    gpuccio says:

    gilthill:

    Thank you for raising this issue.

    I did not know Behe’s paper, but I was familiar with the same argument as expressed by Cornelius Hunter in his blog.

    You ask:

    “So what do you think of Behe’s contention that the idea that conservation indicates functional constraints not only lacks experimental evidence but is also contradicted by it?”

    The answer is easy: I simply disagree. Completely.

    I have already said that in the past, about Cornelius Hunter’s version of the argument (which is essentially similar to Behe’s). Knowing that Behe has said similar things in the past does not change my mind.

    A few important reasons why I disagree:

    a) While we can certainly accept some variance in the molecular clock, like in all biological phenomena, we have a rather consistent amount of data coming from synonymous mutations in protein coding genes, at least for time windows of a few hundred million years (after that, as I have explained, saturation of variation is achieved). You can look at this article:

    https://www.ncbi.nlm.nih.gov/books/NBK21946/

    and in particular Fig. 26-17, where you can see that, with some variance of course, about 0.7 substitutions per synonymous site are fixed in 100 million years.

    I don’t think that this simple fact can be denied. If you look at the alignment of the nucleotide sequences coding for highly homologous proteins that are distant in evolutionary time (for example, the same conserved protein in humans and cartilaginous fish), you see immediately that most of the genetic variation is in the third nucleotide of each codon, and it is synonymous. You don’t find anything like that amount of synonymous variation if you compare the same protein in humans and chimps. Why? Because the time separation is too short.
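
    The comparison described above can be sketched in a few lines of Python. This is only a rough illustration, not the procedure used for any figure in the thread: it counts differing codons rather than estimating substitutions per synonymous site, it assumes the two coding sequences are already aligned in frame with no gaps, and the toy sequences and the function name `count_codon_differences` are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both inputs must be in-frame, gap-free coding sequences of equal length.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length, a multiple of 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid, different codon
        else:
            nonsynonymous += 1   # amino acid changed
    return synonymous, nonsynonymous


# Toy example (made-up sequences): Met-Ala-Leu-Lys vs Met-Ala-Leu-Arg.
toy_a = "ATGGCTCTGAAA"
toy_b = "ATGGCCTTGAGA"
print(count_codon_differences(toy_a, toy_b))
```

    In the toy pair, two codons differ without changing the amino acid and one changes lysine to arginine, which is the kind of contrast the paragraph above refers to.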

    Neutral variation happens. It cannot be denied.

    b) If neutral variation happens, and we can see it everywhere, why does it happen so much less at non-synonymous sites, and so differently in different proteins?

    Again, look at the cited article, but this time Fig. 26-18. You can see that hemoglobin, at 400 million years, shows about 70% variation, but Cytochrome C only 20%. However, both show less variation than synonymous sites, which in 400 million years are beyond any detectable homology.
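
    The saturation effect mentioned here can be illustrated with a small calculation. The sketch below uses the Jukes-Cantor correction, which is a substitute model chosen for simplicity and is not taken from the cited article; it assumes four equally likely nucleotides at every site, which is only a crude approximation for synonymous sites. The rate of 0.7 substitutions per site per 100 million years is the figure quoted above.

```python
import math


def expected_observed_difference(subs_per_site):
    """Jukes-Cantor: expected fraction of sites that differ between two
    sequences after `subs_per_site` substitutions per site have accumulated."""
    return 0.75 * (1.0 - math.exp(-4.0 * subs_per_site / 3.0))


RATE_PER_100_MY = 0.7  # substitutions per synonymous site, as quoted above

for million_years in (10, 100, 400):
    d = RATE_PER_100_MY * million_years / 100.0
    p = expected_observed_difference(d)
    print(f"{million_years:>4} My: {d:.2f} subs/site -> {p:.1%} sites differ")
```

    Under this model the observed difference can never exceed 75%, the level expected between two unrelated random sequences, so once the accumulated substitutions per site are large the sequences become indistinguishable from random ones.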

    Why is that? Of course it is because functional constraints, and in particular negative selection, tend to preserve functional sites. And functional specificity is different in different proteins, hemoglobin being for example a simple globular protein, where the function-structure relationship is more flexible.

    IOWs, as neutral variation happens at synonymous sites, why does it happen less at non-synonymous sites? I can see no other explanation than the one that everybody accepts.

    c) That said, why do experiments in AA substitution apparently show a much greater tolerance to substitutions than apparently expected, in highly conserved proteins like histones?

    The answer, again, is rather simple. There are two reasons, strictly connected:

    1) Those experiments are made with single substitutions (if I remember well). Now, while a single substitution can be apparently tolerated, it generates important changes in how other residues behave (epistasis). That can, for example, make the protein much less tolerant to future changes, IOWs much less robust.

    2) The same experiments measure fitness in a very gross way, usually as immediate survival in the lab (if I remember well).

    Evolutionary history measures fitness in the wild, in long times of observation. In that context, many deleterious effects of a mutation can be relevant, while we could never see them in a short observation in the lab, and evaluating only gross immediate survival.

    That’s also the reason why a few polymorphisms are sometimes apparently tolerated, in a few individuals, in a population, even in conserved sites. But, if they are really slightly deleterious, they will probably never be fixed. Moreover, sometimes one single slightly deleterious mutation can be “stabilized” by another appropriate mutation (epistasis, again), but that happens rarely, and has a much lower probability (two, or more, coordinated mutations).

    IOWs, conservation is a much more sensitive evaluation of function than gross human experiments in single substitutions.

    So, to sum up I fully disagree with Behe and Hunter on this point: conservation through long evolutionary periods is due to function, and can be explained only by function.

  109. 109
    EugeneS says:

    GP

    We have basically two things:

    1. definition of functional information as per your earlier OP
    2. Your statistical homology analysis relying on bitscore.

    Could you explain the reasons why the bitscore measure is an approximation of 1 in a bit more detail. What I can see is that 1 and 2 are kind of similar concepts. Correct me if I am wrong. The bitscore is just a measure of primary structure similarity between two proteins A and B. The problem is that without knowing the nitty-gritties of the bitscore matrix and how it is produced, it is hard to be definite.

    Thanks.

  110. 110
    gpuccio says:

    EugeneS:

    The absolute definition of functional information requires a knowledge of both the search space (which is easy) and the target space (which, instead, is very difficult to achieve).

    That’s why all practical ways to measure functional information, at least for long sequences, must rely on indirect ways to measure the target space, and the target space/search space ratio.
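
    As a minimal sketch of the definition, functional information can be written as minus the base-2 logarithm of the target space/search space ratio. The function name and the example numbers below are hypothetical; in practice the target space size is exactly the quantity that cannot be measured directly, so it is simply supplied here as an assumed input.

```python
import math


def functional_information_bits(length, target_size, alphabet_size=20):
    """-log2(target space / search space) for sequences of a given length.

    The search space is alphabet_size ** length; the computation is done in
    log space so that long sequences do not produce huge numbers.
    """
    search_bits = length * math.log2(alphabet_size)
    return search_bits - math.log2(target_size)


# Hypothetical numbers: a 150 AA protein with an assumed target space
# of 10**60 functional sequences.
print(round(functional_information_bits(150, 10**60), 1))
```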

    For example, in my OP about English language, I have used an indirect method to have a higher threshold of the target space estimate, with good results, I suppose (nobody has found any real flaw in my procedure).

    The same is true for the protein target space. As we are not able to measure directly the target space for long sequences (it is practically impossible, unless we develop a perfect and detailed understanding of the biochemical nature of protein folding and function), we need an indirect approach.

    The idea of using conservation as a measure for functionality is not mine. It is indeed inherent in all the biological thought in the last decades. However, the first to use it in an ID context has been, as far as I know, Durston, in his famous paper that has certainly inspired all my further reasonings:

    “Measuring the functional sequence complexity of proteins”

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2217542/

    However, the method used by Durston is slightly different from my approach, but the basic principle is the same.

    The idea is simply that conservation through long evolutionary periods is proportional to functional specificity.

    Durston uses an alignment between many instances of the same protein, and scores each site against a maximum information potential of about 4.32 bits per amino acid position (log2 of 20). That method is valid, but is rather complex to implement as a standard measure in general cases.
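
    A simplified sketch of this kind of alignment-based score follows. It is not Durston's actual procedure: it assumes a gap-free alignment, takes the null state to be 20 equally likely amino acids, and applies no correction for small sample size. The function names and the toy alignment are made up.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def functional_bits(alignment):
    """Sum over columns of (log2(20) - observed entropy), in bits."""
    max_bits = math.log2(20)  # about 4.32 bits per site
    return sum(max_bits - column_entropy(col) for col in zip(*alignment))


# Toy alignment: first two columns fully conserved, third column variable.
toy_alignment = ["MKA", "MKG", "MKS", "MKT"]
print(round(functional_bits(toy_alignment), 2))
```

    Fully conserved columns contribute the full 4.32 bits, while a column in which four residues are equally frequent contributes 2 bits less.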

    I have simply used the already existing bitscore, which can be easily used for any protein and any set of sequences.

    Now, the original purpose of the bitscore is to allow an evaluation of homologies, to decide if they can be due to random effects.

    That’s why the gross score (which is essentially computed by specific matrices of AA transformation) is adjusted to a normalized score. That’s where the “halving” of the potential information takes place.

    The main reason is that the raw score is adjusted according to two constants, K and lambda, which are derived from the “extreme value distribution” of random variables:

    Just as the sum of a large number of independent identically distributed (i.i.d) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution

    To simplify the concept, as we are interested here in deciding if some observed homology can be due to chance, and we have multiple random variables (if we compare one sequence to many other sequences), a probability distribution is used that gives the probability of having a result as the best result among all variables. The tail of that probability distribution is used to assess E (the expect value).

    In the end, the idea is that the bitscore, and the derived E value, tell us how likely it is to have that level of homology, or more, by chance, in that context.
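
    The relationship between raw score, bitscore and E value described above can be sketched with the standard Karlin-Altschul formulas: the bitscore is (lambda * S - ln K) / ln 2, and E is m * n * 2^(-bitscore), where m and n are the query and database lengths. The lambda and K values below are illustrative assumptions (values often quoted for BLOSUM62 with gap costs 11/1), as are the raw score and the lengths; real BLAST also applies length corrections that are omitted here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using the constants lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bitscore."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative, assumed parameters (not taken from the thread).
LAMBDA, K = 0.267, 0.041
bits = bit_score(100, LAMBDA, K)
print(round(bits, 1), expect_value(bits, query_len=300, db_len=1e7))
```

    With lambda around 0.267, each unit of raw score is worth roughly 0.39 bits, so the bitscore comes out well below the raw score.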

    That’s the connection with functional information.

    If we assume that the observed homology is necessary to retain function, as it has been conserved for hundreds of millions of years by negative selection, then of course the probability of observing that level of homology by chance is very similar to the probability of some random state to express that level of homology, and therefore of functional information.

    IOWs, it can be considered as an indirect measure of getting that level of functional information by chance in one attempt (one tested state).

    We could work directly with the E value, but unfortunately it is flattened to 0 when the probability becomes too small, so using the bitscore is more feasible.
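
    The flattening can be reproduced with ordinary floating-point arithmetic. The sketch below only demonstrates double-precision underflow; the exact threshold at which BLAST itself reports 0 is not shown and is not claimed here. The bitscore and lengths are made-up values, and `log10_expect` is a hypothetical helper that keeps the E value in log space.

```python
import math


def log10_expect(bits, query_len, db_len):
    """log10 of E = m * n * 2**(-bits), computed without underflow."""
    return math.log10(query_len) + math.log10(db_len) - bits * math.log10(2)


# Made-up example: a very high bitscore against a large database.
bits, query_len, db_len = 1200.0, 500, 1e7

naive = query_len * db_len * 2.0 ** (-bits)  # underflows to 0.0
print(naive)
print(round(log10_expect(bits, query_len, db_len), 1))
```

    The naive product is 0.0, while the log-space version still distinguishes this hit from one with a bitscore of, say, 2000. The bitscore carries the same information and stays usable at any scale.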

    Of course, in my reasoning it’s not the homology itself that is important, but the homology that is conserved through long evolutionary times. The longer the time, the more likely it is that the observed homology corresponds well to the real functional information.

    Of course, the only objection that neo-darwinists can make is that the level of function that we observe in the proteins is an optimized function, and that such an optimized wildtype function can be reached through a gradual ladder of simple transitions, each of them naturally selectable vs the previous one.

    That’s why I consider so important the challenge that I have made, some three threads, 6000+ visualizations and 700+ comments ago, and that nobody has even tried to answer. I paste it again here (last time was at post #36 in this thread):

    Will anyone on the other side answer the following two simple questions?

    1) Is there any conceptual reason why we should believe that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases?

    2) Is there any evidence from facts that supports the hypothesis that complex protein functions can be deconstructed into simpler, naturally selectable steps? That such a ladder exists, in general, or even in specific cases?

  111. 111
    EugeneS says:

    GPuccio

    Thank you very much for taking the time and pains to explain the basic reasoning and the mathematical details behind your work. I am sure it is widely appreciated by the readers 🙂

  112. 112
    gpuccio says:

    EugeneS:

    Thank you to you! 🙂

  113. 113
    EugeneS says:

    Origines

    “unguided evolution is a lie!”

    I totally agree. However, I must add that, strictly speaking, guided evolution is non-existent. It is an oxymoron. As soon as there is intelligent guidance, evolution ‘evaporates’ as a concept… I must stress that I view it as something greater than just a matter of terminology. I strongly believe that in this context we should stop using the word “evolution” at all. What we deal with here is a completely different concept, i.e. artificial selection.

    The authors of the ‘glorious’ wikipedia are prone to the same error when they discuss the problems of the OOL and evolution. They quote results of “artificial evolution” as something that, in their opinion, supports Darwinist claims about evolution. It does not! It is the same as quoting the work of the Institute of Protein Design as a counter-argument against Intelligent Design.

  114. 114
    gpuccio says:

    EugeneS:

    Well, designed evolution is a way of speaking that can be used, IMO. We can speak, for example of the guided evolution of the Windows operating system. The design of objects can be said to evolve. But in the end, it’s only a question of words. The only important thing with words is not so much which words we use, but rather that we are clear and explicit about what they mean.

    That said, I would add that there is more than one way to implement design in the biological world. You mention artificial selection, and that is certainly a powerful tool.

    But there is also the important possibility of designed variation. For example, transposon activity could be guided by the designer’s consciousness to realize exactly those variations which can lead to the desired result.

    Of course, these two modalities, designed variation and artificial selection, are not exclusive, and they can well act together to implement the desired information in biological beings.

  115. 115
    EugeneS says:

    GPuccio,

    I beg to differ.

    I interpret the word ‘evolution’ as something that unfolds what already exists in the hidden form of potentiality. In this way, OS Windows does not evolve but, in the strictest possible sense, is being designed. Even if some aspects of OS Windows were engineered by designed variation (which I don’t think is the case), it would be bona fide design, not evolution. ‘Designed evolution’ is not evolution in the above sense unless one subscribes only to the weak form of ID (fine-tuning at the start), which both you and I do not 🙂 We both go a lot further.

    The first example that I know of where this analogy between technology and evolution goes over the top is Stanislaw Lem’s “Summa Technologiae”. The error lies exactly in the failure to see that technology is completely dominated and driven in every aspect by intelligent design. In contrast, evolution by definition merely unfolds what is already there in potential and is therefore just a curious result of “frozen accidents” of interactions between initial/boundary conditions and the laws of nature.

    IMO, it is a category error.

  116. 116
    gpuccio says:

    EugeneS:

    I agree with you. I only thought that the word “evolution”, as it is commonly used, can have a broader meaning, and be applied also to design that re-uses some features, and also generates completely new features.

    Only a problem of how to use words. On the concepts, we absolutely agree.

  117. 117
    Mung says:

    Here’s a tough one for you gpuccio,

    What is the shortest known protein?

    And something else to think about. Given that the random mutation mechanism is constantly producing and testing novel polypeptides, must not the cell be just chock full of them?

    What is preventing the constant production of new useless amino acid sequences that the cell then needs to get rid of?

  118. 118
    gpuccio says:

    Mung:

    I can easily tell you what is the shortest protein in my human proteome database:

    Dolichyl-diphosphooligosaccharide–protein glycosyltransferase subunit 4 (OST4, P0C6T2)

    37 AAs.
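
    For anyone who wants to repeat this kind of lookup, here is a minimal sketch that finds the shortest entry in FASTA-formatted text. It is not the query actually used above; the toy records and their sequences are made up, and a real run would read a proteome FASTA file downloaded from Uniprot instead of the inline string.

```python
def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def shortest_record(records):
    """Return the (header, sequence) pair with the shortest sequence."""
    return min(records.items(), key=lambda item: len(item[1]))


# Made-up toy records, for illustration only.
toy_fasta = """>toy_protein_1
MKTAYIAKQRQISFVKSHFSRQ
>toy_protein_2
MSDLV
AALLK
>toy_protein_3
MAAAKKKLLLEEEDDD
"""

name, seq = shortest_record(parse_fasta(toy_fasta))
print(name, len(seq))
```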

    “Given that the random mutation mechanism is constantly producing and testing novel polypeptides, must not the cell be just chock full of them?”

    That’s what I often wondered about! According to most neo-darwinian scenarios, that should be the case. But that’s nowhere to be seen! 🙂

    “What is preventing the constant production of new useless amino acid sequences that the cell then needs to get rid of?”

    Not enough imagination and faith on the part of neo-darwinists?

  119. 119
    EugeneS says:

    GPuccio

    “On the concepts, we absolutely agree”.

    Excellent!

  120. 120
    Dionisio says:

    Interesting discussion between EugeneS and gpuccio.
    Thanks.

  121. 121
    Mung says:

    Looks like gpuccio finally figured out that he doesn’t know what he’s talking about and decided to shut up. Maybe miracles can happen after all.

  122. 122
    Mung says:

    After the beating gpuccio took in this thread I hope he is ok.

  123. 123
    Mung says:

    Looks like gpuccio managed to survive!
