So, in the first part of this discussion, I tried to show, with real data from the scientific literature, how much of the human genome is conserved, and how that conservation is evaluated and expressed. Then I argued that we already have good, credible evidence for function in a substantial part of the human genome (let’s say about 20%), that most of that functional part is non coding, and that a great part of it is not conserved. While some may disagree on the exact figures, I think that it is really difficult to reject the argument as a whole.
But, as I anticipated, there are two more important aspects of the issue that I want to discuss in detail. I will do it now.
3) Conserved function which does not imply conserved sequence.
Sequence is conserved when function is present because function imposes specific constraints on the sequence itself.
For example, in a protein sequence with a well defined biochemical function, some variation will be possible without affecting the protein function, while other kinds of variation will affect it more or less.
We have many examples of important loss of function caused by the change of even one amino acid: Mendelian diseases in humans are a well known, unpleasant example of that.
We also have many examples of important variation in the sequence of functional proteins which does not affect the function: the so called neutral variations in proteins. For example, there are many variants of human hemoglobin, more than 1000, most of them caused by a single amino acid substitution. While many of them cause some disease, or at least some functional modification of the protein, at least a few of them are completely silent clinically, both in the heterozygous and in the homozygous state.
Now, there is an important consequence of that. Neutral variation also happens in functional sequences, although it happens less in those sequences. How much neutral variation a functional sequence can tolerate depends on the sequence. For proteins, it is well known that some of them can vary a lot while retaining the same structure and function, while others are much more functionally constrained. Therefore, over the same span of time, different functional proteins will be conserved to different degrees.
What about non coding genes? While we understand much (but not all) of the sequence-structure-function relationship for proteins, here we are almost wholly ignorant. Non coding genes, when they are functional, act in very different ways, most of them not well understood. Many of them are transcribed, and we don’t understand much of the structure of the transcribed RNAs, least of all their sequence-structure-function relationship. IOWs, we have no idea of how functionally constrained the sequence of a functional non coding DNA element is.
While searching for pertinent literature about this issue, I have found this very recent, interesting paper:
Evolutionary conservation of long non-coding RNAs; sequence, structure, function.
The abstract (all emphasis is mine):
Recent advances in genomewide studies have revealed the abundance of long non-coding RNAs (lncRNAs) in mammalian transcriptomes. The ENCODE Consortium has elucidated the prevalence of human lncRNA genes, which are as numerous as protein-coding genes. Surprisingly, many lncRNAs do not show the same pattern of high interspecies conservation as protein-coding genes. The absence of functional studies and the frequent lack of sequence conservation therefore make functional interpretation of these newly discovered transcripts challenging. Many investigators have suggested the presence and importance of secondary structural elements within lncRNAs, but mammalian lncRNA secondary structure remains poorly understood. It is intriguing to speculate that in this group of genes, RNA secondary structures might be preserved throughout evolution and that this might explain the lack of sequence conservation among many lncRNAs.
SCOPE OF REVIEW:
Here, we review the extent of interspecies conservation among different lncRNAs, with a focus on a subset of lncRNAs that have been functionally investigated. The function of lncRNAs is widespread and we investigate whether different forms of functionalities may be conserved.
Lack of conservation does not imbue a lack of function. We highlight several examples of lncRNAs where RNA structure appears to be the main functional unit and evolutionary constraint. We survey existing genomewide studies of mammalian lncRNA conservation and summarize their limitations. We further review specific human lncRNAs which lack evolutionary conservation beyond primates but have proven to be both functional and therapeutically relevant.
Pioneering studies highlight a role in lncRNAs for secondary structures, and possibly the presence of functional “modules”, which are interspersed with longer and less conserved stretches of nucleotide sequences. Taken together, high-throughput analysis of conservation and functional composition of the still-mysterious lncRNA genes is only now becoming feasible.
So, what are we talking about here? The point is simple. Function in non coding DNA can be linked to specific structures in RNA transcripts, and those structures, and therefore their function, can be conserved across species even in the absence of sequence conservation. Why? Because the sequence/structure/function relationship in this kind of molecule is completely different from what we observe in proteins, and we still understand very little of those issues.
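To make the point concrete, here is a minimal sketch of how two RNA sequences can share very little sequence and still be compatible with the same secondary structure, thanks to compensatory changes on both sides of a stem. The two "species" sequences and the hairpin are hypothetical, invented only for illustration; they do not come from the paper. Compatibility with a dot-bracket structure is also a much weaker statement than a real folding prediction, which would need a thermodynamic model.

```python
# Toy illustration: two hypothetical RNA hairpins with different sequences
# that are both compatible with the same secondary structure.

# Base pairs allowed in RNA stems: Watson-Crick pairs plus the G-U wobble pair.
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def paired_positions(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def is_compatible(sequence, structure):
    """True if every bracketed pair in the structure can form a base pair."""
    return all(
        (sequence[i], sequence[j]) in ALLOWED_PAIRS
        for i, j in paired_positions(structure)
    )


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions carrying the same nucleotide."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical hairpin: a 5-bp stem closing a 4-nt loop.
STRUCTURE = "(((((....)))))"
SPECIES_A = "GGCACUUCGGUGCC"  # hypothetical sequence
SPECIES_B = "CAUGGUUCGCCAUG"  # hypothetical sequence, every stem pair swapped

# A single uncompensated change in the stem of SPECIES_A breaks one pair.
BROKEN_A = "AGCACUUCGGUGCC"

if __name__ == "__main__":
    print(is_compatible(SPECIES_A, STRUCTURE), is_compatible(SPECIES_B, STRUCTURE))
    print(round(percent_identity(SPECIES_A, SPECIES_B), 1))
    print(is_compatible(BROKEN_A, STRUCTURE))
```

In this toy case only the four loop positions are identical between the two sequences, so a purely sequence-based conservation score would be low even though the stem-loop is preserved.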
As the authors say:
In contrast to microRNAs, almost all of which are post-transcriptional repressors, the diverse functions of lncRNAs include both positive and negative regulations of protein-coding genes, and range from lncRNA:RNA and lncRNA:protein to lncRNA:chromatin interactions [8–11]. Due to this functional diversity, it seems reasonable to presume that different evolutionary constraints might be operative for different RNAs, such as mRNAs, microRNAs, and lncRNAs.
Which is exactly my point.
The authors examine a few cases where the sequence/structure/function relationship of some lncRNAs has been studied in more detail. They conclude:
Tens of thousands of human lncRNAs have been identified during the first genomic decade. Functional studies for most of these lncRNAs are however still lacking with only a handful having been characterized in detail [8,10,11,87]. From these few studies it is apparent that some lncRNAs are important cellular effectors ranging from splice complex formation to chromatin and chromosomal complex formation [43,46] to epigenetic regulators of key cellular genes.
It is becoming increasingly apparent that lncRNAs do not show the same pattern of evolutionary conservation as protein-coding genes. Many lncRNAs have been shown to be evolutionary conserved; but they do not appear to exhibit the same evolutionary constraints as mRNAs of protein-coding genes.
While certain regions of the lncRNAs appear to maintain the regulatory function, such as bulges and loops, the exact sequence in other regions of lncRNAs appear less important and possibly act as spacers in order to link functional units or modules. Depending on the function, e.g., whether the RNA sequence is a linker or a functional module, different patterns of conservation might be expected.
It is important to remember that lncRNA genes are only a part of non coding DNA. If someone wonders how big a part, I would suggest the following paper:
The Vast, Conserved Mammalian lincRNome
which estimates the number of human lncRNA genes at 53,649, more than twice the number of protein coding genes, corresponding to about 2.7% of the whole genome (Figure 2). It’s an important part, but only a part. And it is a part which, while probably functional in many cases, is still poorly conserved at sequence level.
Other parts of the non coding genome will have different types of function, structure, and therefore sequence conservation. For example, the following paper:
Integrated genome analysis suggests that most conserved non-coding sequences are regulatory factor binding sites
argues that most conserved non coding regions “serve as promoter-distal regulatory factor binding sites (RFBSs) like enhancers”, rather than encoding non-coding RNAs. These regions make up about 3.5% of the genome, are conserved across vertebrate phylogeny (which strongly suggests functional importance), and cluster into >700 000 unannotated conserved islands, 90% of which are <200 bp. IOWs, these short sequences in the non coding genome, which make up another 3.5% of the total, would be functional not because of their RNA transcript, but directly as binding sites (enhancers and other distal regulatory elements). Now, these sequences are conserved. That illustrates the general point: different functions, different relationships between sequence and function, different conservation of functional elements. In general, it seems that function which expresses itself through non coding RNA transcripts is less conserved at sequence level.
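As a quick sanity check on the figures quoted from the two papers, the percentages can be turned into base pairs. This is only back-of-the-envelope arithmetic: the genome size of about 3.1 billion bp is my assumption, not a number taken from either paper, and I am not assuming anything about how each paper defines the sequence it counts.

```python
# Back-of-the-envelope conversion of the quoted genome fractions into base pairs.
GENOME_BP = 3.1e9  # ASSUMPTION: approximate size of the haploid human genome

# Conserved non coding islands (figures quoted in the text).
CONSERVED_FRACTION = 0.035   # about 3.5% of the genome
ISLAND_COUNT = 700_000       # quoted as ">700 000", so this is a lower bound

# lncRNA genes (figures quoted in the text).
LNCRNA_FRACTION = 0.027      # about 2.7% of the genome
LNCRNA_GENES = 53_649

conserved_bp = CONSERVED_FRACTION * GENOME_BP
# Because ISLAND_COUNT is a lower bound, this mean is an upper bound.
mean_island_bp = conserved_bp / ISLAND_COUNT

lncrna_bp = LNCRNA_FRACTION * GENOME_BP
mean_lncrna_bp = lncrna_bp / LNCRNA_GENES

if __name__ == "__main__":
    print(f"conserved islands: {conserved_bp / 1e6:.1f} Mb, mean <= {mean_island_bp:.0f} bp")
    print(f"lncRNA genes: {lncrna_bp / 1e6:.1f} Mb, mean about {mean_lncrna_bp:.0f} bp")
```

Under that assumption the mean island comes out at roughly 155 bp or less, which fits the statement that 90% of the islands are <200 bp, while the lncRNA figure corresponds to something in the order of 1.5 kb per gene.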
And now, the last point, maybe the most important of all.
4) Function which requires non conservation of sequence.
When we analyze conservation of sequences across species as an indicator of function, we are forgetting a fundamental point: in the course of natural history, species change, and function changes with them.
IOWs, the reason why species are different is that they have different molecular functions.
So, there is some implicit contradiction in equating conservation with function. A conserved sequence is very likely to be functional, but it is not true that a function needs a conserved sequence, if it is a new function, or a function which has changed.
Now, we know that protein coding genes have not changed a lot in the more recent parts of natural history. It is usually recognized that the greatest change, especially in more recent taxa, is probably regulatory. And the functions which have been identified in various parts of non coding DNA are exactly that: regulatory.
So, to sum up:
– Species evolve and change
– The main tool for that change is, realistically, a change in regulatory functions
– If a function changes, the sequences on which the function is based must change too
– Therefore, those important regulatory functions which change for functional reasons will not be conserved across species
This point is different from the previous point discussed here.
In point 3, the reasoning was that the same function can be conserved even if the sequence changes, provided that the structure is conserved.
In point 4, we are saying that in many cases the sequence must change for the function to change with it.
Now, although this reasoning is quite logical and convincing, I will try to back it up with empirical observations. To that purpose, I will use two different models: HARs and the results of the recent FANTOM5 paper about the promoterome.
4a) Human Accelerated Regions (HARs).
What are HARs? Let’s take it from Wikipedia:
Human accelerated regions (HARs), first described in August 2006, are a set of 49 segments of the human genome that are conserved throughout vertebrate evolution but are strikingly different in humans.
IOWs, they are sequences which were conserved up to the primates, and which have changed in humans.
Are they functional? That’s what is believed for some of them. Again, Wikipedia:
Several of the HARs encompass genes known to produce proteins important in neurodevelopment. HAR1 is an 106-base pair stretch found on the long arm of chromosome 20 overlapping with part of the RNA genes HAR1F and HAR1R. HAR1F is active in the developing human brain. The HAR1 sequence is found (and conserved) in chickens and chimpanzees but is not present in fish or frogs that have been studied. There are 18 base pair mutations different between humans and chimpanzees, far more than expected by its history of conservation.
HAR2 includes HACNS1 a gene enhancer “that may have contributed to the evolution of the uniquely opposable human thumb, and possibly also modifications in the ankle or foot that allow humans to walk on two legs”. Evidence to date shows that of the 110,000 gene enhancer sequences identified in the human genome, HACNS1 has undergone the most change during the evolution of humans following the split with the ancestors of chimpanzees. The substitutions in HAR2 may have resulted in loss of binding sites for a repressor, possibly due to biased gene conversion
Now, for brevity, I will not go into details, but… “active in the developing human brain” and “may have contributed to the evolution of the uniquely opposable human thumb, and possibly also modifications in the ankle or foot that allow humans to walk on two legs” are provocative enough thoughts, and I believe that I don’t need to comment on them.
The important point is: what makes us humans different from chimps? Logic says: something which is different. Not something which is conserved.
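To give an idea of what “far more than expected” means for HAR1, here is a rough calculation using the figures from the Wikipedia quote (18 differences in a 106 bp stretch). The background rate of about 1.2% differences per site between human and chimp is my assumption, used only as a generous reference, and so is the simple binomial model with independent sites. For a sequence with a long history of conservation, the real expectation would be even lower.

```python
from math import comb

HAR1_LENGTH_BP = 106       # length as quoted from Wikipedia above
HAR1_SUBSTITUTIONS = 18    # human-chimp differences as quoted above
# ASSUMPTION: rough genome-wide human-chimp divergence per site, not from the quote.
ASSUMED_BACKGROUND_RATE = 0.012


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), treating sites as independent."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


observed_fraction = HAR1_SUBSTITUTIONS / HAR1_LENGTH_BP
expected_substitutions = HAR1_LENGTH_BP * ASSUMED_BACKGROUND_RATE
tail_probability = binomial_tail(
    HAR1_LENGTH_BP, HAR1_SUBSTITUTIONS, ASSUMED_BACKGROUND_RATE
)

if __name__ == "__main__":
    print(f"observed differences per site: {observed_fraction:.3f}")
    print(f"expected differences at the assumed rate: {expected_substitutions:.2f}")
    print(f"P(at least {HAR1_SUBSTITUTIONS} differences): {tail_probability:.2e}")
```

With these numbers the observed fraction is about 17% of sites, against an expectation of little more than one difference in the whole stretch, so under this simple model the excess is extremely unlikely to be a chance effect.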
4b) The results from FANTOM5 about the promoterome.
FANTOM5 has very recently published a series of papers with very important results. One of the most important is probably the following article in Nature:
A promoter-level mammalian expression atlas
Unfortunately, the article is paywalled. I have access to it, so I will try to sum up the points which are needed for my reasoning.
So, what did they do? In brief, they used a very powerful technology, cap analysis of gene expression (CAGE), to study various aspects of the transcriptome in different human cells from different tissues and states. This is probably the most important analysis of the human transcriptome ever carried out.
This particular paper focuses on a “promoter atlas”, IOWs an atlas of the expression of promoters (transcription start sites, TSSs, which control the transcription of target genes) in different tissues.
So, according to the level of expression of those promoters in different tissues and cells, they classify genes (both protein coding and non protein coding) into:
– ubiquitous-uniform (‘housekeeping’, 6%): those genes which are expressed at similar levels in most cell types
– ubiquitous non-uniform (14%): expressed in most cell types, but at different levels
– non-ubiquitous (cell-type restricted, 80%)
Each of those types includes both C (protein coding genes) and N (non protein coding genes).
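The three categories above can be sketched as a simple rule on the expression profile of a promoter across cell types. This is only a toy version: the two thresholds, the function name and the example profiles are hypothetical, chosen by me for illustration, and should not be read as the exact criteria used in the paper.

```python
from statistics import median

# HYPOTHETICAL thresholds, for illustration only.
MIN_MEDIAN_EXPRESSION = 0.2   # below this median, call the promoter non-ubiquitous
MAX_FOLD_OVER_MEDIAN = 10.0   # at or above this max/median ratio, call it non-uniform


def classify_breadth(expression_by_cell_type):
    """Assign a promoter to one of the three expression-breadth categories."""
    values = list(expression_by_cell_type.values())
    med = median(values)
    if med < MIN_MEDIAN_EXPRESSION:
        return "non-ubiquitous"
    if max(values) < MAX_FOLD_OVER_MEDIAN * med:
        return "ubiquitous-uniform"
    return "ubiquitous non-uniform"


# Hypothetical expression profiles (arbitrary units) for three promoters.
HOUSEKEEPING = {"fibroblast": 50, "neuron": 60, "hepatocyte": 45, "t_cell": 55}
VARIABLE = {"fibroblast": 5, "neuron": 4, "hepatocyte": 300, "t_cell": 6}
RESTRICTED = {"fibroblast": 0, "neuron": 0, "hepatocyte": 120, "t_cell": 0.1}

if __name__ == "__main__":
    for name, profile in [
        ("housekeeping", HOUSEKEEPING),
        ("variable", VARIABLE),
        ("restricted", RESTRICTED),
    ]:
        print(name, "->", classify_breadth(profile))
```

The median separates promoters that are expressed in most cell types from those that are not, and the ratio between maximum and median separates uniform from non-uniform expression among the former.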
Now, that’s very interesting. We now know that most genes (80%), both coding and non coding, are expressed only in some cell types.
But the most interesting thing, for our discussion about conservation, is that they studied the promoter expression both in human cells and in other mammals.
Now, we must look at Figure 3 in the paper. For those who cannot access the article, there is a low resolution version of this figure here (just click on Figure 3 in the “at a glance” box; OK, OK, it’s better than nothing!).
The figure is divided into two parts, a and b. In each part, the x axis shows the evolutionary divergence from humans (from 0 to 0.8; the grey vertical lines correspond to macaque, dog and mouse). The y axis shows “Human TSS with aligning orthologous sequence (%)”, IOWs the % conservation of each group of genes in the graph at various points of evolutionary divergence. Each line represents a different group of genes. So, the lines which remain more “horizontal” represent groups of genes which are more conserved, while those which “go down” from left to right are those less conserved. I hope it’s clear.
On the left (part a) genes are grouped as above: ubiquitous-uniform, etc., each category divided into C or N (coding or non coding).
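For those who prefer to see how such a curve is built, here is a minimal sketch. For each group of TSSs, it computes the percentage that has an aligning orthologous sequence in each species, with species ordered by increasing divergence from humans. The group names and the alignment data are hypothetical toy values invented by me, not data from the paper.

```python
# Species ordered by increasing evolutionary divergence from human,
# matching the grey vertical lines described for the figure.
SPECIES = ["macaque", "dog", "mouse"]

# HYPOTHETICAL toy data: for each human TSS in a group, the set of species
# in which an aligning orthologous sequence was found.
TOY_GROUPS = {
    "flat_group": [
        {"macaque", "dog", "mouse"},
        {"macaque", "dog", "mouse"},
        {"macaque", "dog", "mouse"},
        {"macaque", "dog"},
    ],
    "steep_group": [
        {"macaque"},
        {"macaque", "dog"},
        set(),
        {"macaque"},
    ],
}


def percent_aligned(tss_alignments, species):
    """Percentage of TSSs in a group with an orthologous alignment in one species."""
    hits = sum(species in aligned_in for aligned_in in tss_alignments)
    return 100.0 * hits / len(tss_alignments)


def conservation_curve(tss_alignments):
    """One y value per species: the points of one line in the graph."""
    return [percent_aligned(tss_alignments, sp) for sp in SPECIES]


if __name__ == "__main__":
    for group, alignments in TOY_GROUPS.items():
        print(group, conservation_curve(alignments))
```

A group whose values stay high from macaque to mouse gives a nearly horizontal line (more conserved), while a group whose values drop quickly gives a line which goes down steeply (less conserved).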
What are the conserved groups? In order: Non-ubiquitous C (green line); Ubiquitous uniform C (orange line); Ubiquitous non-uniform C (purple line).
IOWs, coding genes are more conserved, and among them the non-ubiquitous ones are the most conserved.
That is not news.
Conversely, non coding genes are less conserved, in this order: Non-ubiquitous N (lighter green); Ubiquitous non-uniform N (lighter purple); Ubiquitous uniform N (lighter orange). This last line is definitely less conserved than the random reference (the dotted line).
This part is “Conservation by expression breadth and annotation”.
Well, what is on the right (part b)? It is “Conservation by cell-type biased expression”.
IOWs, the graph is the same, but genes are grouped in different lines according to the cell type where they are preferentially expressed.
The most conserved groups? Those with preferential expression in: Fibroblast of periodontium, Fibroblast of gingiva, Preadipocyte, Chondrocyte, Mesenchymal cell.
The least conserved? Those with preferential expression in: Astrocyte, Hepatocyte, Neuron, Sensory epithelial cell, Macrophage, T-cell, Blood vessel endothelial cell. In decreasing conservation order.
Does that mean something? I leave it to you to decide. For me, I definitely see a pattern. With all due respect for fibroblasts and adipocytes, neurons and T cells smell more of specialized cells which must change in higher taxa (excuse me, Piotr, mice will accuse me of not being politically correct).
So, my humble suggestion is: the things that change more are not necessarily those less functional. In many cases, they could be exactly the opposite: the bearers of new, more complex functions.
And non coding genes are very good candidates for that role.