Uncommon Descent Serving The Intelligent Design Community

Exon Shuffling, and the Origins of Protein Folds


A frequently made claim in the scientific literature is that protein domains can be readily recombined to form novel folds. In Darwin’s Doubt, Stephen Meyer addresses this subject in detail (see Chapter 11). Over the course of this article, I want to briefly expand on what was said there.

Defining Our Terms

Before going on, it may be useful for me to define certain key terms and concepts. I will be referring frequently to “exons” and “introns.” Exons are sections of genes that code for proteins, whereas introns are sections of genes that don’t code for proteins.

Proteins have multiple structural levels. Primary structure refers to the linear sequence of amino acids comprising the protein chain. When segments within this chain fold into structures such as helices and loops, this is referred to as secondary structure. Common units of secondary structure include α-helices and β-strands. Tertiary structure is the biologically active form of the protein, and refers to the packing of secondary structural elements into domains. Since a protein’s tertiary structure optimizes the forces of attraction between amino acids, it is the most stable form of the protein. When multiple folded domains are arranged in a multi-subunit complex, it is referred to as a quaternary structure.

A further concept is domain shuffling. This is the hypothesis that fundamentally new protein folds can be created by recombining already-existing domains. This is thought to be accomplished by moving exons from one part of the genome to another (exon shuffling). There are various ways in which exon shuffling might be achieved, and it is to this subject that I now turn.

The Mechanisms of Exon Shuffling

There are several ways in which exon shuffling may occur. Exon shuffling can be transposon-mediated, or it can occur as a result of crossover during meiosis and recombination between non-homologous or (less frequently) short homologous DNA sequences. Alternative splicing is also thought to play a role in facilitating exon shuffling.

When domain shuffling occurs as a result of crossover during sexual recombination, it is hypothesized that it takes place in three stages (called the “modularization hypothesis”). First, introns are gained at positions that correspond to domain boundaries, forming a “protomodule.” Introns are typically longer than exons, and thus the majority of crossover events take place in the noncoding regions. Second, within the inserted introns, the newly formed protomodule undergoes tandem duplication. Third, intronic recombination facilitates the movement of the protomodule to a different, non-homologous, gene.

Another hypothesized mechanism for domain shuffling involves transposable elements such as LINE-1 retroelements and Helitron transposons, as well as LTR retroelements. LINE-1 elements are transcribed into an mRNA that specifies proteins called ORF1 and ORF2, both of which are essential for the process of transposition. LINE-1 frequently associates with 3′ flanking DNA, transporting the flanking sequence to a new locus somewhere else on the genome (Ejima and Yang, 2003; Moran et al., 1999; Eickbush, 1999). This association can happen if the weak polyadenylation signal of the LINE-1 element is bypassed during transcription, causing downstream exons to be included on the RNA transcript. Since LINE-1s are “copy-and-paste” elements (i.e. they transpose via an RNA intermediate), the donor sequence remains unaltered.

Long-terminal repeat (LTR) retrotransposons have also been established to facilitate exon shuffling, notably in rice (e.g. Zhang et al., 2013; Wang et al., 2006). LTR retrotransposons possess a gag and a pol gene. The pol gene is translated into a polyprotein composed of an aspartic protease (which cleaves the polyprotein), and various other enzymes including reverse transcriptase (which reverse transcribes RNA into DNA), integrase (used for integrating the element into the host genome), and RNase H (which serves to degrade the RNA strand of the RNA-DNA hybrid, resulting in single-stranded DNA). Like LINE-1 elements, LTR retrotransposons transpose in a “copy-and-paste” fashion via an RNA intermediate. There are a number of subfamilies of LTR retrotransposons, including endogenous retroviruses, Bel/Pao, Ty1/copia, and Ty3/gypsy.

Alternative splicing by exon skipping is also believed to play a role in exon shuffling (Keren et al., 2010). Alternative splicing allows the exons of a pre-mRNA transcript to be spliced into a number of different isoforms to produce multiple proteins from the same transcript. This is facilitated by the joining of a 5′ donor site of one intron to the 3′ site of another intron downstream, resulting in the “skipping” of exons that lie in between. This process may result in introns flanking exons. If this genomic structure is reinserted somewhere else in the genome, the result is exon shuffling.

There are of course other mechanisms that are hypothesized to play a role in exon shuffling. But this will suffice for our present purposes. Next, we will look at the evidence for and against domain shuffling as an explanation for the origin of new protein folds.

Introns Early vs. Introns Late

It was hypothesized fairly early, after the discovery of introns in vertebrate genes, that they could have contributed to the evolution of proteins. In a 1978 article in Nature, Walter Gilbert first proposed that exons could be independently assorted by recombination within introns (Gilbert, 1978). Gilbert also hypothesized that introns are in fact relics of the original RNA world (Gilbert, 1986). According to the “introns early” hypothesis, all protein-coding genes were created from exon modules (coding for secondary structural elements such as α-helices, β-sheets, signal peptides, or transmembrane helices, or for folding domains) by a process of intron-mediated recombination (Gilbert and Glynias, 1993; Dorit et al., 1990).

The alternative “introns late” scenario proposed that introns only appeared much later in the genes of eukaryotes (Hickey and Benkel, 1986; Sharp, 1985; Cavalier-Smith, 1985; Orgel and Crick, 1980). Such a scenario renders exon shuffling moot in accounting for the origins of the most ancient proteins.

The “introns early” hypothesis was the dominant view in the 1980s. The frequently cited evidence for this was the then widespread belief in the general correspondence between exon-intron structure and protein secondary structure.

From the mid-1980s, however, this view became increasingly untenable as new information came to light (e.g. see Palmer and Logsdon, 1991; Patthy, 1996, 1994, 1991, 1987) that raised doubts about a general correlation between protein structure and intron-exon structure. Such a correspondence is not borne out in many ancient protein-coding genes. Moreover, the apparently clearest examples of exon shuffling all took place fairly late in the evolution of eukaryotes, becoming significant only at the time of the emergence of the first multicellular animals (Patthy, 1996, 1994).

In addition, analysis of intron splicing junctions suggested a similar pattern of late-arising exon shuffling. The location where introns are inserted and interrupt the protein’s reading frame determines whether exons can be recombined, duplicated or deleted by intronic recombination without altering the downstream reading frame of the modified protein (Patthy, 1987). Introns can be grouped according to three “phases”: Phase 0 introns insert between two consecutive codons; phase 1 introns insert between the first and second nucleotide of a codon; and phase 2 introns insert between the second and third nucleotide.
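
The phase definitions above can be made concrete with a minimal Python sketch. It is only an illustration: the helper names (intron_phase, exon_is_symmetric, preserves_frame) and the exon lengths are hypothetical and do not come from any of the cited papers. It shows why an exon flanked by introns of the same phase can be inserted, duplicated or deleted without shifting the downstream reading frame.

```python
def intron_phase(coding_bases_before: int) -> int:
    """Phase of an intron, given the number of coding bases upstream of it.

    0: the intron lies between two consecutive codons
    1: the intron lies between the first and second nucleotide of a codon
    2: the intron lies between the second and third nucleotide of a codon
    """
    if coding_bases_before < 0:
        raise ValueError("coding_bases_before must be non-negative")
    return coding_bases_before % 3


def exon_is_symmetric(upstream_phase: int, downstream_phase: int) -> bool:
    """An exon is symmetric (0-0, 1-1 or 2-2) when its flanking introns share a phase."""
    return upstream_phase == downstream_phase


def preserves_frame(exon_length: int) -> bool:
    """Adding or removing an exon keeps the downstream frame only if its length
    is a multiple of three, which is the same condition as symmetric flanking phases."""
    return exon_length % 3 == 0


# Hypothetical gene: coding lengths of three consecutive exons (made-up values).
exon_lengths = [100, 90, 47]

# Phase of the intron that follows each exon except the last.
boundaries = []
total = 0
for length in exon_lengths[:-1]:
    total += length
    boundaries.append(intron_phase(total))

# The middle exon is flanked by boundaries[0] (upstream) and boundaries[1] (downstream).
middle_symmetric = exon_is_symmetric(boundaries[0], boundaries[1])
```

In this made-up gene the middle exon is 90 bases long and sits between two phase 1 introns, so it is a 1-1 exon and could in principle be moved as a unit without disturbing the frame of whatever follows it.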

Thus, if exon shuffling played a major role in protein evolution, we should expect a characteristic intron phase distribution. But the hypothetical modules of ancient proteins do not conform to such expectations (Patthy, 1991, 1987).

It is clear, then, that exon shuffling (at the very least) is unlikely to explain the origins of the most ancient proteins that have emerged in the history of life. But is this mechanism adequate to explain the origins of later proteins such as those that arise in the evolution of eukaryotes? I now turn to evaluate the evidence for and against the role of exon shuffling in protein origins.

The Case for Exon Shuffling

What, then, are the best arguments for exon shuffling? If the thesis is correct, a prediction would be that exon boundaries should correlate strongly with protein domains. In other words, one exon should code for a single protein domain. One argument, therefore, points to the fact that there is a statistically significant correlation between exon boundaries and protein domains (e.g., see Liu et al., 2005 and Liu and Grigoriev, 2004).

However, there are many examples where this correspondence does not hold. In many cases, single exons code for multiple domains. For instance, protocadherin genes typically involve large exons coding for multiple domains (Wu and Maniatis, 2000). In other cases, multiple exons are required to specify a single domain (e.g. see Ramasarma et al., 2012; or Buljan et al., 2010).

A further argument for the role of exon shuffling in protein evolution is the intron phase distributions found in the exons coding for protein domains in humans. In 2002, Henrik Kaessmann and colleagues reported that “introns at the boundaries of domains show high excess of symmetrical phase combinations (i.e., 0-0, 1-1, and 2-2), whereas nonboundary introns show no excess symmetry” (Kaessmann et al., 2002). Their conclusion was thus that “exon shuffling has primarily involved rearrangement of structural and functional domains as a whole.” They also performed a similar analysis on the nematode worm Caenorhabditis elegans, finding that “Although the C. elegans data generally concur with the human patterns, we identified fewer intron-bounded domains in this organism, consistent with the lower complexity of C. elegans genes.”
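
One simple way to put a number on an “excess of symmetrical phase combinations” is to compare the observed share of symmetric exons with the share expected if upstream and downstream intron phases were paired independently. The Python sketch below does this. The function name and the counts are invented for illustration only, and the statistic actually used by Kaessmann and colleagues may well differ.

```python
from collections import Counter


def symmetric_excess(phase_pairs):
    """Return (observed, expected) fractions of symmetric exons.

    phase_pairs: list of (upstream_phase, downstream_phase) tuples, one per exon.
    The expected fraction assumes the two flanking phases are independent,
    each following its own observed frequency distribution.
    """
    n = len(phase_pairs)
    if n == 0:
        raise ValueError("phase_pairs must not be empty")
    upstream = Counter(u for u, _ in phase_pairs)
    downstream = Counter(d for _, d in phase_pairs)
    observed = sum(1 for u, d in phase_pairs if u == d) / n
    expected = sum(upstream[p] * downstream[p] for p in (0, 1, 2)) / (n * n)
    return observed, expected


# Made-up counts for ten hypothetical domain-bounding exons.
pairs = [(0, 0)] * 4 + [(1, 1)] * 3 + [(2, 2)] + [(0, 1)] + [(1, 2)]
observed, expected = symmetric_excess(pairs)
```

With these invented counts, eight of ten exons are symmetric, against an expectation of 0.38 under independent pairing. A real analysis would also need a significance test and a comparison set of nonboundary introns, as in the quoted result.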

Another line of evidence relates to genes that appear to be chimeras of parent genes. These are typically associated with signs indicative of their mode of origin. One famous example is the jingwei gene in Drosophila, which may have arisen when “the sequence of the processed Adh [alcohol dehydrogenase] messenger RNA became part of a new functional gene by capturing several upstream exons and introns of an unrelated gene” (Long and Langley, 1993).

We must take care, however, not to confuse the observed pattern of intron phase distribution, or exon/domain mapping, with proof that exon shuffling is actually the process by which this pattern arose.

Perhaps common ancestry is the cause, but this must be demonstrated and not assumed. It is the biologist’s duty to determine whether unintelligent chance-based mechanisms actually can produce novel genes in this manner. It is to this question that I now turn.

The Problems with Domain Shuffling as an Explanation for Protein Folds

While the hypothesis of exon shuffling does, taken at face value, have some attractive elements, it suffers from a number of problems. For one thing, the model at its core presupposes the prior existence of protein domains. A protein’s lower-level secondary structures (α-helices and β-strands) exist stably only in the context of the tertiary structures in which they are found. In other words, the domain level is the lowest level at which self-contained stable structural modules exist. This leaves the origins of these domains in the first place unaccounted for. But stable and functional protein domains are demonstrably rare within amino-acid sequence space (e.g. Axe, 2010; Axe, 2004; Taylor et al., 2001; Keefe and Szostak, 2001; Reidhaar-Olson and Sauer, 1990; Salisbury, 1969).

A fairly recent study examined many different combinations of E. coli secondary structural elements (α-helices, β-strands and loops), assembling them “semirandomly into sequences comprised of as many as 800 amino acid residues” (Graziano et al., 2008). The researchers screened 10^8 variants for features that might suggest folded structure. They failed, however, to find any folded protein structures. Reporting on this study, Axe (2010) writes:

“After a definitive demonstration that the most promising candidates were not properly folded, the authors concluded that “the selected clones should therefore not be viewed as ‘native-like’ proteins but rather ‘molten-globule-like’”, by which they mean that secondary structure is present only transiently flickering in and out of existence along a compact but mobile chain. This contrasts with native-like structure, where secondary structure is locked-in to form a well defined and stable tertiary fold. Their finding accords well with what we should expect in view of the above considerations. Indeed, it would be very puzzling if secondary structure were modular.”

“For those elements to work as robust modules,” explains Axe, “their structure would have to be effectively context-independent, allowing them to be combined in any number of ways to form new folds.” In the case of protein secondary structure, however, this requirement is not met.

The model also seems to require that the diversity and disparity of functions carried out by proteins in the cell can in principle originate by mixing and matching prior existing domains. But this presupposes the ability of blind evolutionary processes to account for a specific “toolbox” of domains that can be recombined in various ways to yield new functions. This seems unlikely, especially in light of the estimation that “1000 to 7000 exons were needed to construct all proteins” (Dorit et al., 1990). In other words, a primordial toolkit of thousands of diverse protein domains needs to be constructed before the exon shuffling hypothesis even becomes a possibility. And even then there are severe problems.

A further issue relates to interface compatibility. The domain shuffling hypothesis in many cases requires the formation of new binding interfaces. Since the amino acids that comprise polypeptide chains are distinguished from one another by the specificity of their side-chains, however, the binding interfaces that allow units of secondary structure (i.e. α-helices and β-strands) to come together to form elements of tertiary structure are dependent upon the specific sequence of amino acids. That is to say, they are non-generic in the sense that they are strictly dependent upon the particulars of the components. Domains that must bind and interact with one another can’t simply be pieced together like Jenga tiles.

In his 2010 paper in the journal BIO-Complexity, Douglas Axe reports on an experiment conducted using β-lactamase enzymes which illustrates this difficulty (Axe, 2010). Take a look at the following figure, excerpted from the paper:

[Figure: Beta lactamase comparison]

The top half of the figure (labeled “A”) reveals the ribbon structure of the TEM-1 β-lactamase (left) and the PER-1 β-lactamase (right). The bottom half of the figure (labeled “B”) reveals the backbone alignments for the two corresponding domains in the two proteins. Note the high level of structural similarity between the two enzymes. Axe attempted to recombine sections of the two genes to produce a chimeric protein from the domains colored green and red. Since the two parent enzymes exhibit extremely high levels of structural and functional similarity, this should be expected to work. No detectable function was identified in the chimeric construct, though, presumably as a consequence of the substantial dissimilarity between the respective amino-acid sequences and the interface incompatibility between the two domains.

This isn’t by any means the only study demonstrating the difficulty of shuffling domains to form new functional proteins. Another study by Axe (2000) described “a set of hybrid sequences” from “the 50%-identical TEM-1 and Proteus mirabilis β-lactamases,” which were created such that the “hybrids match[ed] the TEM-1 sequence except for a region at the C-terminal end, where they [were] random composites of the two parents.” The results? “All of these hybrids are biologically inactive.”

In fact, in the few cases where protein chimeras do possess detectable function, they work only because the researchers used an algorithm (developed by Meyer et al., 2006) to carefully select the sections of a protein structure that possess the fewest side-chain interactions with the rest of the fold, and chose parent proteins with relatively high sequence identity (Voigt et al., 2002). This only serves to underscore the problem. Even in the Voigt study, under highly favorable circumstances, the success rate was quite low, with only one in five chimeras possessing discernible functionality.

Conclusion

To conclude, although there is some indirect inferential evidence for the role of exon shuffling in protein evolution, a consideration of how such a process might work in reality reveals that the hypothesis itself is fraught with severe difficulties.

This article was originally published at Evolution News & Views (part 1; part 2)

Comments
Wilson & Tucker, Fgf and Bmp signals repress the expression of Bapx1 in the mandibular mesenchyme and control the position of the developing jaw joint, Developmental Biology 2004.
Zachriel
February 23, 2015, 04:10 AM PDT
See also Singer et al., A wide variety of DNA sequences can functionally replace a yeast TATA element for transcriptional activation, Genes & Development 1990. It's an oldie, but shows the basic process. But the Petri dishes are not natural!
Zachriel
February 23, 2015, 03:44 AM PDT
gpuccio: Very simply, do you believe that anything that is “selectable” is “naturally selectable”?
Depends what it means to select. If it means selecting for an intermediate structure rather than function, then it would not mimic natural selection. Natural selection works without regard to any knowledge of the structure, selecting only according to function. However, the Szostak experiment selected for ATP-binding, a common biological enzymatic function.
Zachriel
February 23, 2015, 03:33 AM PDT
gpuccio: if you want to state that a protein is naturally selectable, you have to show (not hypothesize) that it confers some reproductive advantage in a biological system.
Oh gee whiz.
gpuccio: The protein analyzed in Szostak’s paper was selected and engineered through laboratory tools which were very much intelligently designed to recognize a specific biochemical property even at low levels of expression, ...
That is incorrect. The sequences were random, not engineered for a specific biochemical property.
gpuccio: and then amplify that property by cycles of random variation and intelligent selection.
That is correct. They were artificially selected for the specified function.
gpuccio: That is not natural selection.
No. It's called an *experiment*. It tests whether random sequence proteins can fold into a protein with a basic enzymatic function. They can and do. Our use of the term "selectable" referred to the experimental ability to distinguish the active enzyme from other sequences. Whether you consider it "natural selection" is immaterial to the fact that active enzymes are not that uncommon in random sequences.
Zachriel
February 23, 2015, 03:25 AM PDT
JonathanM, if you have time, could you briefly comment on this following study and how it might impact the thesis of 'random' exon shuffling? From my nose-bleed section, it seems pretty devastating to me:
Duality in the human genome - November 28, 2014 Excerpt: The results show that most genes can occur in many different forms within a population: On average, about 250 different forms of each gene exist. The researchers found around four million different gene forms just in the 400 or so genomes they analysed. This figure is certain to increase as more human genomes are examined. More than 85 percent of all genes have no predominant form which occurs in more than half of all individuals. This enormous diversity means that over half of all genes in an individual, around 9,000 of 17,500, occur uniquely in that one person - and are therefore individual in the truest sense of the word. The gene, as we imagined it, exists only in exceptional cases. "We need to fundamentally rethink the view of genes that every schoolchild has learned since Gregor Mendel's time.,,, According to the researchers, mutations of genes are not randomly distributed between the parental chromosomes. They found that 60 percent of mutations affect the same chromosome set and 40 percent both sets. Scientists refer to these as cis and trans mutations, respectively. Evidently, an organism must have more cis mutations, where the second gene form remains intact. "It's amazing how precisely the 60:40 ratio is maintained. It occurs in the genome of every individual – almost like a magic formula," says Hoehe. http://medicalxpress.com/news/2014-11-duality-human-genome.html
bornagain77
February 23, 2015, 03:19 AM PDT
Zachriel: By the way, just a simple question. Why did you (who are so careful with words) state: "They found selectable proteins" and not: "They found naturally selectable proteins"? Very simply, do you believe that anything that is "selectable" is "naturally selectable"? IOWs, do you believe that natural systems exist that can select for any property that an engineered system can recognize? Just to give an old example, correct English words? Are there natural systems which can do that?
gpuccio
February 23, 2015, 01:27 AM PDT
Zachriel: The distinction between Intelligent Selection and Natural Selection, which I have discussed in detail many times, including recently with DNA_Jock. In brief, if you want to state that a protein is naturally selectable, you have to show (not hypothesize) that it confers some reproductive advantage in a biological system. The protein analyzed in Szostak's paper was selected and engineered through laboratory tools which were very much intelligently designed to recognize a specific biochemical property even at low levels of expression, and then amplify that property by cycles of random variation and intelligent selection. That is not natural selection. The protein is selected by the active measurement of a biochemical affinity, and engineered laboratory cycles amplify that initial affinity. It is protein engineering, not natural selection. To show that naturally selectable proteins were present in his initial library, Szostak had to introduce those proteins in real biological systems (like bacterial cultures) and let those systems select new functions from the initial library. That's not what he did.
gpuccio
February 23, 2015, 01:22 AM PDT
gpuccio: And, beyond all the rest, one thing is certain: they were not naturally selectable proteins.
What distinction are you attempting to draw?
Zachriel
February 22, 2015, 04:50 PM PDT
Zachriel: And, beyond all the rest, one thing is certain: they were not naturally selectable proteins.
gpuccio
February 22, 2015, 02:27 PM PDT
Zachriel: Old story. Debated many times here. Shall we start again? (OK, it was Jonathan who quoted it, I know... :) )
gpuccio
February 22, 2015, 02:25 PM PDT
Jonathan M: Keefe and Szostak, 2001
They found selectable proteins in about 1 in 10^11 random sequences. That isn't exactly common, but is well within the posited limitations.
Zachriel
February 22, 2015, 12:12 PM PDT
Joe: I agree. In the measure that it is confirmed, it is an example of modular design and Object Oriented Programming. :)
gpuccio
February 22, 2015, 11:55 AM PDT
Exon shuffling- more evidence for Intelligent Design...
Joe
February 22, 2015, 11:53 AM PDT
JonathanM: Thank you for your wonderful OP. This is a very important topic, and it deserves a lot of attention. :)
gpuccio
February 22, 2015, 11:47 AM PDT
Hangonasec: I suppose Jonathan was using "folded protein structures" in the sense highlighted by Axe. However, Jonathan can answer himself, certainly.
gpuccio
February 22, 2015, 11:45 AM PDT
gpuccio, it wasn't Axe's statement, but Jonathan's I was taking issue with. 'Failed to find any folded protein structures' is at odds with numerous statements in the paper, from its title ("Selecting Folded Proteins from a Library of Secondary Structural Elements") onwards.
Hangonasec
February 22, 2015, 08:12 AM PDT
Hi JonathanM, long time no see, welcome back. :) This topic is important, but was lacking in substantive analysis. I am very glad you have now addressed the topic more fully than you did before. It plugs a huge hole in my references.
bornagain77
February 22, 2015, 08:06 AM PDT
Hangonasec, Thanks for finding that -- the blog editor actually had the 8 as a superscript but it apparently didn't appear as such on the blog post. I have now fixed it. Jonathan
Jonathan M
February 22, 2015, 08:03 AM PDT
Hangonasec: To be precise, they found: "Of the 1149 clones screened 4 sequences (0.3%; clones 5.1, 5.6, 5.26 and 5.31) were identified as proteins with significant amounts of secondary structure." Emphasis added.
So, Axe's comment remains valid: “After a definitive demonstration that the most promising candidates were not properly folded, the authors concluded that “the selected clones should therefore not be viewed as ‘native-like’ proteins but rather ‘molten-globule-like’”, by which they mean that secondary structure is present only transiently flickering in and out of existence along a compact but mobile chain. This contrasts with native-like structure, where secondary structure is locked-in to form a well defined and stable tertiary fold. Their finding accords well with what we should expect in view of the above considerations. Indeed, it would be very puzzling if secondary structure were modular.”
gpuccio
February 22, 2015, 07:22 AM PDT
I'm lukewarm about the idea that exons have any particular relevance to early protein evolution. After all, exon boundaries only manifest themselves at transcription/edit/translate time, and are not visible to mechanisms of replication and recombination. But what they do show is that, contrary to the thrust of this article, folded soluble peptides can be produced in multiple ways from the same basic elements.
Hangonasec
February 22, 2015, 01:31 AM PDT
A fairly recent study examined many different combinations of E. coli secondary structural elements (α-helices, β-strands and loops), assembling them “semirandomly into sequences comprised of as many as 800 amino acid residues” (Graziano et al., 2008). The researchers screened 108 variants for features that might suggest folded structure. They failed, however, to find any folded protein structures.
That's not what the paper says! (A typo, they screened 10^8 not 108. But they got a 0.3% hit rate).
Hangonasec
February 22, 2015, 01:20 AM PDT