Today’s guest post comes to us from one of our regular commenters, Dr.JDD. All that follows is his:
I would like to start by emphasising that this post is not meant to be seen in any way as a “disproof”, nor as an “attempt to disprove”, the appearance of complex proteins in eukaryotic cells through proposed unguided evolutionary mechanisms. Rather, I hope it will stir up discussion and engage thought, particularly among those who wish to better understand the real complexities and challenges that would need to be overcome if we were to accept such proposed mechanisms as genuine and real.
We all know that mutations in DNA can result in a different amino acid appearing in a protein. For example, the DNA triplet codon “CTT” would be translated to the amino acid Leucine (L; obviously via the mRNA intermediate). However, if there was a mutation from the C to a G, the codon would read “GTT” and this would be translated into a Valine (V). As we all further know, we can have deleterious, neutral, and beneficial mutations (in a given context). Additionally, a mutation in the third letter of the DNA triplet codon often leaves the amino acid unchanged because of the redundant nature of the genetic code (“perfectly optimised” many would say). By contrast, removing or inserting a single DNA base shifts the reading frame for everything downstream, so it has a much greater impact on the sequence and is therefore usually deleterious.
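For readers who prefer to see this in code, here is a minimal sketch in Python, assuming the standard genetic code. The names `CODON_TABLE` and `translate`, and the 12-base stretch used for the deletion, are my own illustrative choices rather than anything from a particular library.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("CTT"))  # L: leucine
print(translate("GTT"))  # V: a first-base change alters the amino acid
print(translate("CTC"))  # L: a third-base change is silent for this codon

# Illustrative 12-base stretch: deleting one base shifts every later codon.
original = "ATGCTTCAATGC"
one_base_deleted = original[:3] + original[4:]
print(translate(original))          # MLQC
print(translate(one_base_deleted))  # MFN
```

The single deletion changes every amino acid after the first, which is why frameshifts are usually far more disruptive than substitutions.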
Now I would ask not to be attacked for over-simplifying this concept, but to put evolutionary change very simply: mutations occur at random positions in the DNA sequence and may be inherited (germline mutations), with consequences that are deleterious, neutral or beneficial, and with most “thought to be near-neutral.”
There remains a question, though, that has fascinated me for a while and led me to look at some examples. What if we discovered other layers of code within the same gene? What would be the impact of a mutation on this other code, relative to the foremost code? How much would this then limit the ability of more than one code to co-evolve, realistically?
Now these are not questions I personally can easily answer nor have the capacity to answer, to any full degree. But I think it is something interesting that others who are perhaps smarter and of a more “code-orientated” training and mind-set should consider, especially in the context of the ID paradigm.
Just to make a “disclaimer” as well: I do not profess to be an expert in this area. My PhD and some of my first post-doctoral work was in the endocytic pathway and protein trafficking, and I then moved on to human Immunology. I am no longer in academia but rather in the pharmaceutical world, so the way I approach research and scientific questions is perhaps a little different from the academic approach, but personally I do not see that as a bad thing. The main point I am making is that I am not a geneticist, although I obviously have some training in that field (not that this is an exceptionally useful thing).
Now with the advance of proteomics and our ability to detect peptides and “map” the human proteome, a lot of information has come to light. In particular, it is apparent that we are “missing” a lot of proteins that are found in cells but are not annotated as genes in our databases. Surprisingly, for quite some time the field has held to the dogma of one gene, one Open Reading Frame (ORF), with potentially many different proteins arising through alternative splicing events, for example. Yet recent studies mapping the human proteome (“A draft map of the human proteome.” Kim et al. 2014. Nature. 509, 575-581) have yielded many MS spectra that cannot be assigned to annotated genes in the human genome. In response to that publication in the prestigious journal Nature, one researcher made a very insightful comment which I would like to focus on (emphasis mine):
Xavier Roucou 2014 Jul 15
Among several significant contributions in this work, the discovery of 44 novel protein-coding open reading frames (ORFs) illustrates the complexity of the human proteome. Recently, we reported the discovery of 83,886 previously undescribed ORFs termed alternative ORFs (AltORFs) Vanderperre B, 2013. AltORFs are defined as ORFs present in the transcriptome that are different from annotated ORFs. We detected 1,259 proteins translated from AltORFs in human biological samples Vanderperre B, 2013. While the role and importance of this “alternative proteome” will require substantial further validation, there can be no doubt that a comprehensive description of the human proteome must include the distinct possibility of a vastly greater number of functional proteins than has been traditionally considered. Given the existence of the alternative proteome, it is not surprising that Kim et al. found that nearly 50% of the 35 million MS/MS spectra of human proteins did not match proteins in the NCBI’s RefSeq human protein sequence database. In an attempt to identify these novel proteins, the authors translated the human reference genome, RefSeq transcript sequences, non-coding RNAs, and pseudogenes. Among the 193 newly identified proteins, 44 were translated from novel uORFs, ORFs located in an alternate reading frame within coding regions of annotated genes, or ORFs located in 3’-UTRs. The astonishing failure to have detected the alternative proteome years ago results from the fact that MS-based proteomic methods rely on existing protein sequence databases that are far from complete and therefore do not allow the assignment of all MS/MS spectra. Recent ribosome profiling and footprinting approaches have suggested the significant use of unconventional translation initiation sites in mammals Ingolia NT, 2011 Lee S, 2012 Michel AM, 2012, and these alternative proteins should have been detected. 
In order to better define the human proteome, we generated a new database of alternative ORFs (AltORFs) present in NCBI’s RefSeq human mRNA sequence database. AltORFs overlap the annotated or reference protein coding ORF (RefORF) in an alternate reading frame, are located in the 5′- and 3′-UTR regions of an mRNA, or partially overlap with both the RefORF and an UTR region. This approach led to the discovery of 83,886 unique AltORFs with a minimum size of 40 codons Vanderperre B, 2013. The majority of mRNAs (87%) have at least one predicted AltORF, with an average of 3.88 AltORFs per mRNA. Additionally, the evolutionary conservation of many of these reading frames suggests functional importance. These AltORFs were translated in silico and included in an alternative protein database we used to interpret unmatched MS/MS spectra. So far, we and others have identified nearly 1300 alternative proteins in different human cell lines and tissues Vanderperre B, 2013, Klemke M, 2001 Oyama M, 2004 Vanderperre B, 2011 Bergeron D, 2013 Slavoff SA, 2013 Menschaert G, 2013, including certain of the 44 new proteins mentioned in the Kim et al. study: the alternative protein translated from the AltORFs mapping to the 5’-UTR of the SLC35A4 gene (or AltSLC35A4), was detected in Hela cells and lung tissue; the AltC11orf48 was detected in Hela cells, colon, lung and ovary tissues; and the AltCHTF8 was detected in Hela cells Vanderperre B, 2013. Twenty four of the 44 novel ORFs detected by Kim et al. were, in fact, already present in our AltORF database, and 9 of the 44 proteins translated from these novel ORFs were previously detected: AltASNSD1, AltSLC35A4, AltMKKS, AltSMCR7L, AltCHTF8, AltRPP14, AltSF1, AltC11orf48, AltHNRNPUL12. In this sense, Kim et al.’s study strongly supports the existence of the alternative proteome. Clearly, the alternative proteins detected by Kim et al. and by our team are the proverbial tip of the iceberg.
A full map of the human proteome is thus still years away, and will require several important changes in our current thinking concerning the proteome and the concept that each mature mRNA only codes for one protein.
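To make the idea of an AltORF concrete, here is a rough Python sketch of how one might scan an mRNA for ORFs in all three forward reading frames. This is only my own illustration, not the pipeline used by Roucou's group: the function name `find_orfs`, the toy 35-base sequence, and the choice to count codons without the stop codon are all assumptions. The 40-codon minimum comes from the comment above, but it is lowered here so that a toy sequence can show anything.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(mrna, min_codons=40):
    """Return (start, end, frame) for each ATG-to-stop ORF on one strand.

    Positions are 0-based, end is exclusive and includes the stop codon.
    min_codons counts coding codons only (an assumption of this sketch).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy mRNA; real AltORF searches use full-length transcripts.
toy_mrna = "ATGCTTCAATGCAGATTCCCGGTTTCTTAGAATAA"
print(find_orfs(toy_mrna, min_codons=5))
# [(0, 30, 0), (8, 35, 2)]
```

The second ORF starts inside the first one but is read in a different frame, which is the kind of overlap the rest of this post is concerned with.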
I could spend quite a long time talking about how fascinating this is, how little we know about proteins at present and how dogma has led us down a path to ignore an abundance of proteins just because they do not fit the standard model of thinking. It is truly amazing how little attention this line of work receives. For example, >90% of people I work with are PhD-level molecular and cellular biologists, and I have not yet met one who, when I have spoken of these things to them, is aware such layers of complexity exist. The dogma changes very slowly.
However, what I wish to focus on are these AltORFs that are present in different reading frames of an already existing gene. Hopefully some of you will find this as utterly fascinating as I have. Hopefully some of you will even be able to think about the probabilistic implications for the evolutionary paradigm that this may (or may not) raise. I think most of us are fascinated by biology in one way or another, so hopefully at least the first purpose will see some fulfilment.
As already discussed, a protein product is translated from an initial DNA code (via messenger RNA). This code is read in triplets: 3 bases code for 1 amino acid. However, the reading frame is important. For example, consider the set of 3-letter words below:
THE CAT WAS NOT FAT
That makes sense and gives a message. Now let us change the way we read that by starting the 3-letter words from a reading frame shifted 1 letter over:
HEC ATW ASN OTF
What you notice is that the message is completely lost: looking at the triplets, you cannot see any similarity to the first message. Note that this is not one of those naïve attempts to use language to represent what happens with DNA code. Language is different, because you need particular combinations of letters to make the message viable. With DNA code, any combination of 3 of the four bases will give a message: either one of the 20 (usually) amino acids, or a stop codon (or a start, but that is the same as an amino acid, Met). That does not mean the polypeptide is functional, but there is still a message. So this is just to illustrate that you change the message, not that you lose it!
Equally then, let us shift the reading frame over one again:
ECA TWA SNO TFA
All 3 messages are quite different. So what about with a DNA code and reading in different frames? Here is an example:
ATG CTT CAA TGC AGA TTC CCG GTT TCT TAG
Now ATG in DNA is the start codon and codes for a Methionine (M). TAG is one of the three stop codons (TAA, TAG and TGA). So the translated result of such a code would be:
M-L-Q-C-R-F-P-V-S-*(STOP)
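This translation can be checked with a few lines of Python. The sketch below assumes the standard genetic code; `CODON_TABLE` and `translate_orf` are illustrative names of my own.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(dna, start=0):
    """Translate from `start` until a stop codon ('*') or the last full codon."""
    peptide = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        peptide.append(aa)
        if aa == "*":
            break
    return "-".join(peptide)


seq = "ATG CTT CAA TGC AGA TTC CCG GTT TCT TAG".replace(" ", "")
print(translate_orf(seq))  # M-L-Q-C-R-F-P-V-S-*
```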
However, if you were to read in an alternative frame (shifted over by 2 letters from the original), you would see that, starting at the 9th letter, a potential ATG start codon now appears:
(AT) GCT TCA ATG CAG ATT CCC GGT TTC TTA (G)
This would then translate to (caveat – no stop codon here):
M-Q-I-P-G-F-L-…
As you can see, this looks quite different to the first peptide sequence (ignoring the unavoidable starting Met). Given such a vastly different sequence, one may expect quite a different looking protein to be produced: a protein with different folds, structure and function (obviously this case is an example and neither sequence is long enough to be considered a “protein” as such, but rather a short peptide; this is merely to illustrate a principle).
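The same comparison can be sketched in Python by looking for every ATG in the sequence, noting which frame it sits in, and translating from there. As before, this assumes the standard genetic code, and the function names are my own illustrative choices.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from(dna, start):
    """Translate from `start` until a stop codon ('*') or the last full codon."""
    peptide = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        peptide.append(aa)
        if aa == "*":
            break
    return "-".join(peptide)


def potential_starts(dna):
    """List (letter_number, frame_shift, peptide) for every ATG in the sequence."""
    found = []
    for pos in range(len(dna) - 2):
        if dna[pos:pos + 3] == "ATG":
            # letter_number is 1-based, frame_shift is relative to letter 1
            found.append((pos + 1, pos % 3, translate_from(dna, pos)))
    return found


seq = "ATGCTTCAATGCAGATTCCCGGTTTCTTAG"
for letter, shift, peptide in potential_starts(seq):
    print(letter, shift, peptide)
# 1 0 M-L-Q-C-R-F-P-V-S-*
# 9 2 M-Q-I-P-G-F-L
```

The second peptide has no `*` at the end because this short example runs out of sequence before a stop codon is reached in that frame.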
So again, for those who like to be fascinated by these things and to consider paradigms, let us consider a few questions:
1) How does this affect the evolution of a protein when proposed to be through unguided processes?
2) What constraints are placed on apparent “neutral” or near-neutral mutations?
3) How does this affect the way you interpret an apparent “redundant” mutation in the DNA code?
4) Given the vastly different nature of the amino acid sequence (and thus the strong chance of differing structure, folds and function), what is the likelihood of such an alternative ORF (AltORF) encoding a protein that plays a role closely related to that of the common ORF it is found within?
Just to speak to some of those questions, in particular points 1-3, I think that broadly speaking this makes unguided evolution a lot harder. The reason is that any evolutionary change to a region of overlapping ORFs in different frames has to be tolerated by BOTH proteins simultaneously. A mutation that would have been neutral or near-neutral for the standard ORF alone now also has to be neutral (or beneficial) for the AltORF. As you are in a completely different reading frame, a conservative point mutation in the DNA code could very easily insert an aberrant stop codon into the AltORF, for example. Suddenly, the layers are complicated (and this is just considering one single AltORF that overlaps). Thus, without full understanding of the potential AltORFs present in a gene, one cannot simply state that a mutation is a) neutral/near-neutral/beneficial, b) redundant or c) conservative.
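To get a feel for this constraint, the sketch below takes the short example sequence from earlier, treats the frame starting at letter 1 as the main ORF and the frame starting at letter 9 as the AltORF, and lists every single-base substitution that is silent in the main ORF but lands inside an AltORF codon. This is a toy calculation under my own assumptions (standard genetic code, hypothetical function names, a 30-base example), so it says nothing about how often this happens in real genes.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_at(dna, pos, orf_start):
    """Return the complete codon of an ORF that covers `pos`, or None."""
    if pos < orf_start:
        return None
    begin = pos - (pos - orf_start) % 3
    codon = dna[begin:begin + 3]
    return codon if len(codon) == 3 else None


def silent_in_main(dna, main_start, alt_start):
    """Substitutions that are silent in the main ORF and hit an AltORF codon.

    Each hit is (position, old_base, new_base, alt_aa_before, alt_aa_after),
    with 0-based positions.
    """
    hits = []
    for pos, old in enumerate(dna):
        main_old = codon_at(dna, pos, main_start)
        alt_old = codon_at(dna, pos, alt_start)
        if main_old is None or alt_old is None:
            continue  # base is not covered by a full codon in both frames
        for new in "ACGT":
            if new == old:
                continue
            mutant = dna[:pos] + new + dna[pos + 1:]
            main_new = codon_at(mutant, pos, main_start)
            alt_new = codon_at(mutant, pos, alt_start)
            if CODON_TABLE[main_old] == CODON_TABLE[main_new]:
                hits.append((pos, old, new,
                             CODON_TABLE[alt_old], CODON_TABLE[alt_new]))
    return hits


seq = "ATGCTTCAATGCAGATTCCCGGTTTCTTAG"
hits = silent_in_main(seq, main_start=0, alt_start=8)
changed = [h for h in hits if h[3] != h[4]]
new_stops = [h for h in hits if h[4] == "*"]
print(len(hits), len(changed), new_stops)
# 14 13 [(11, 'C', 'T', 'Q', '*')]
```

In this toy case, 13 of the 14 substitutions that are silent in the main ORF change the AltORF peptide, and one of them (TGC to TGT, still a Cysteine in the main ORF) turns the AltORF codon CAG into the stop codon TAG.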
For example, in the above case the 7th triplet of the original ORF is CCG, which translates to a P (Proline). Let us say that we have a single point mutation from CCG to CCC. In the original ORF this is redundant: you still encode a P. However, in the AltORF that same mutation has now changed a GGT to a CGT, which is a Glycine (G) to an Arginine (R). Those amino acids are not even close to being conservative: glycine is the smallest amino acid, whereas arginine is large and positively charged (by contrast, an L to a V might be considered a conservative change, as these are both small hydrophobic residues). So you can see that a single mutation which, under the usual accepted paradigm of DNA code, is seen as conservative or even redundant suddenly becomes the opposite of this.
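Here is that single mutation applied in a short Python sketch, with both reading frames translated before and after. It assumes the standard genetic code, and the helper name `translate_from` is my own.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from(dna, start):
    """Translate from `start` until a stop codon ('*') or the last full codon."""
    peptide = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        peptide.append(aa)
        if aa == "*":
            break
    return "-".join(peptide)


seq = "ATGCTTCAATGCAGATTCCCGGTTTCTTAG"
# Change the 21st letter from G to C: the 7th triplet CCG becomes CCC.
mutant = seq[:20] + "C" + seq[21:]

print(translate_from(seq, 0))     # M-L-Q-C-R-F-P-V-S-*
print(translate_from(mutant, 0))  # M-L-Q-C-R-F-P-V-S-*  (unchanged)
print(translate_from(seq, 8))     # M-Q-I-P-G-F-L
print(translate_from(mutant, 8))  # M-Q-I-P-R-F-L        (G became R)
```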
In part 2, I will try to review and summarise a paper that describes one such AltORF that overlaps with an existing normal gene, with implications in disease. Putting it into this context will, I think, further fascinate those who are interested, and also demonstrate some of the challenges unguided evolution must overcome.