Conservation of sequence in the course of natural history has always been considered a sign of function. But does function always coincide with sequence conservation? And are there other important aspects which must be considered? This topic has been discussed recently with some passion here, so I will dedicate a series of two posts to it, in the hope that we can base our discussions on reliable data. I apologize in advance if some of the following discussion is necessarily rather technical.
In general, in evolutionary analysis, conservation is considered a sign of function. Protein coding genes which are more strictly conserved in the course of time are usually considered as having greater functional constraint than those genes which change more. The same is supposed to be true for non coding sequences, although the topic is much more controversial.
So, we start here by considering how much of the human genome is conserved, and how that conservation relates to function. These will be the first two points in the discussion.
1) How much of the human genome is made of conserved sequences?
Luckily, this is a point which is well understood. After all, conservation can be evaluated objectively by aligning the genomes of different species, and that has already been done with enough precision.
However, it is important to remember that the result can be somewhat different according to how we define conservation, and according to the method we use to measure it. That is perfectly normal.
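To make the point about definitions concrete, here is a toy sketch in Python. It is not the method used in the literature on genome conservation, which relies on much more sophisticated statistical models. It simply assumes two sequences that are already aligned, scores identity in sliding windows, and calls a column conserved if it falls inside at least one window above a chosen threshold. The function names, the window size, the thresholds and the two short sequences are all my own illustrative choices.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns in which the two sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # columns with a gap are not scored
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def conserved_fraction(a: str, b: str, window: int, threshold: float) -> float:
    """Fraction of columns covered by at least one window with identity >= threshold."""
    n = len(a)
    covered = [False] * n
    for start in range(n - window + 1):
        end = start + window
        if percent_identity(a[start:end], b[start:end]) >= threshold:
            for i in range(start, end):
                covered[i] = True
    return sum(covered) / n if n else 0.0


# Hypothetical aligned fragments: the first 8 columns match, the last 8 do not.
human = "ACGTACGTAAGGCTTA"
other = "ACGTACGTTTCCGAAT"

strict = conserved_fraction(human, other, window=4, threshold=100.0)
relaxed = conserved_fraction(human, other, window=4, threshold=75.0)
print(strict, relaxed)
```

With the same two sequences, the strict definition marks half of the columns as conserved, while the relaxed one marks slightly more. Nothing in the data has changed, only the definition.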
A very complete paper about sequence conservation in genomes is the following:
Adam Siepel, Gill Bejerano, Jakob S. Pedersen, et al.
“Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes”
In that paper, the authors evaluate conservation in vertebrate genomes. In short, they find that about 4.3% of the human genome is conserved across vertebrates, while allowing that:
These numbers are somewhat sensitive to the methods used for parameter estimation. Various different methods produced coverage estimates of 2.8% – 8.1% for the vertebrates, 36.9% – 53.1% for the insects, 18.4% – 36.6% for the worms, and 46.5% – 67.6% for the yeasts (see Supplemental material). Note that the vertebrate coverage is similar to recent estimates of 5% – 8% for the share of the human genome that is under purifying selection (Chiaromonte et al. 2003; Roskin et al. 2003; Cooper et al. 2004), despite the use of quite different methods and datasets.
So, we can say that with most methods the percentage of the human genome which is conserved is about 3 – 8%.
If we look carefully at Figure 3 in the same paper, and in particular at the data about vertebrates, we find other interesting information:
a) Protein coding regions (exons, in red) are highly conserved: about 68% of their bases fall in conserved elements. But they are only 18% of the conserved regions.
b) Introns are less conserved, almost 5%, and they are 28.5% of the conserved regions.
c) Unannotated regions (the rest of non coding DNA) are even less conserved, about 2.5%, and they are 41.2% of the conserved regions.
Other gene associated sequences (5’ UTR, 3’ UTR, etc.) represent smaller fractions.
So, there is no doubt that non coding DNA is less conserved than coding DNA (less than 5% versus 68%), but there is equally no doubt that most of the conserved DNA is non coding (about 70%).
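The arithmetic behind the "about 70%" can be checked in a few lines of Python. The sketch below also derives, as a back-of-envelope consistency check of my own, how much of the genome each class would have to occupy for the quoted figures to hold together. Those implied shares are not taken from the paper, and since the inputs are rounded ("almost 5%", "about 2.5%"), they should be read as rough orders of magnitude only.

```python
# Figures quoted above for vertebrates (Figure 3 of the paper); all approximate.
overall_conserved = 4.3  # % of the human genome found in conserved elements

# class name: (% of the class that is conserved, % of all conserved bases in the class)
classes = {
    "exons": (68.0, 18.0),
    "introns": (5.0, 28.5),
    "unannotated": (2.5, 41.2),
}

# Share of the conserved regions that is non coding (introns + unannotated).
noncoding_share = classes["introns"][1] + classes["unannotated"][1]


def implied_genome_share(conserved_within: float, share_of_conserved: float) -> float:
    """Genome share (%) a class must occupy, given that
    (conserved_within % of class) x (class size) = (share_of_conserved % of conserved) x (overall conserved)."""
    return share_of_conserved * overall_conserved / conserved_within


implied = {name: implied_genome_share(*values) for name, values in classes.items()}

print(f"non coding share of conserved regions: {noncoding_share:.1f}%")
for name, share in implied.items():
    print(f"{name}: roughly {share:.1f}% of the genome")
```

With these inputs, exons come out at roughly 1% of the genome, introns at about a quarter, and unannotated regions at about 70%. The small class is densely conserved, but the large classes still contribute most of the conserved bases.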
Another important point is that we are discussing here general conservation. The same paper also analyzes highly conserved elements (HCEs). They cover only 0.14% of the human genome, a much smaller fraction. About 42% of these are in gene coding or gene associated regions, and about 58% in non coding regions.
Finally, there is an even more restricted category, ultra conserved elements (UCEs), with 100% identity, which is described in this paper:
G. Bejerano et al.: “Ultraconserved elements in the human genome”.
2004 May 28;304(5675):1321-5. Epub 2004 May 6.
There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
This is an even smaller fraction of the genome.
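The definition in the abstract above (100% identity, no insertions or deletions, above a minimum length, in all the species compared) can be sketched as a simple scan over the columns of a multiple alignment. This is only an illustration under strong assumptions: the sequences are already aligned, uppercase, and use "-" for gaps. The function name and the three toy sequences are hypothetical, and the actual study worked on whole genome alignments, not on a snippet like this.

```python
def ultraconserved_runs(seqs, min_len=200):
    """Return half-open (start, end) column ranges where every aligned sequence
    carries the same base, with no gaps, for at least min_len columns."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    run_start = None
    # Iterate one step past the end so that a run reaching the last column is closed.
    for i in range(length + 1):
        identical = (
            i < length
            and seqs[0][i] != "-"
            and all(s[i] == seqs[0][i] for s in seqs)
        )
        if identical:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    return runs


# Hypothetical aligned fragments standing in for human, mouse and rat.
human = "ACGTTGCAAGCTTA"
mouse = "ACGTTGCAAGCATA"
rat = "ACGTTGCA-GCTTA"

# The real criterion is 200 bp; a small threshold is used here for the toy data.
print(ultraconserved_runs([human, mouse, rat], min_len=5))
```

A single gap or a single substitution in any one species is enough to break a run, which is why so few segments pass the test at 200 bp.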
2) Is there functional DNA which is not conserved, in the human genome?
Certainly, and a lot of it!
Everybody knows that the ENCODE project has found that most of the human genome is transcribed. That does not necessarily mean that it is functional, as many have pointed out.
A very recent paper from the people at ENCODE discusses the problem of function. It is:
“Defining functional DNA elements in the human genome”
The authors in that paper use three different approaches to infer function in the human genome:
a) Evolutionary approach. That means conservation. They start with what we have already discussed at point 1, but they refer to mammalian conservation, which can be expected to be somewhat higher than vertebrate conservation. They comment:
The lower bound estimate that 5% of the human genome has been under evolutionary constraint was based on the excess conservation observed in mammalian alignments (2, 3, 87) relative to a neutral reference (typically ancestral repeats, small introns, or fourfold degenerate codon positions). However, estimates that incorporate alternate references, shape-based constraint (88), evolutionary turnover (89), or lineage-specific constraint (90) each suggests roughly two to three times more constraint than previously (12–15%), and their union might be even larger as they each correct different aspects of alignment-based excess constraint. Moreover, the mutation rate estimates of the human genome are still uncertain and surprisingly low (91) and not inconsistent with a larger fraction of the genome under relatively weaker constraint (92). Although still weakly powered, human population studies suggest that an additional 4–11% of the genome may be under lineage-specific constraint after specifically excluding protein coding regions (90, 92, 93), and these numbers may also increase as our ability to detect human constraint increases with additional human genomes. Thus, revised models, lineage-specific constraint, and additional datasets may further increase evolution-based estimates.
Now, let’s look at Fig. 1 in the paper, a Venn diagram which sums up the results of a detailed analysis of available data. I have checked the exact numbers on which the figure is based in the Supporting Information file. The purple circle is the protein coding fraction in the genome, about 1.25%. The evolutionarily conserved fraction of the human genome is the red circle, and it is 7.38% of the whole genome. The greater part of it (6.33%) is non coding DNA. That is in good accord with what was reported at point 1.
b) Genetic approach. By that, the authors mean evidence that genetic alterations of the sequence modify the phenotype. This is the “gold standard” of function. It means that function is certainly there.
The subset of the genome for which there is genetic confirmation of function is the green area. I have not found the exact numbers for it in the paper, but I would say that it is about 15%. It can be seen that it somewhat overlaps the conserved circle, but at least 50% of it is not conserved and is not protein coding. As this is the gold standard, we have here a significant portion of non coding DNA which is not conserved while being certainly functional.
c) Biochemical approach. This is the traditional ENCODE approach, the one which indicates possible function in 80% of the genome. It is based on many lines of biochemical evidence, which are explained in the paper. The blue areas indeed include about 80% of the whole genome.
However, the authors divided the blue area into three subsets, according to the level of activity detected. The dark blue area is the area with a high level of activity. So, let’s consider only that subset, leaving the other two as controversial, at present.
For the dark blue area (15.56%), evidence at transcription level and at other biochemical levels is very high. So, the inference of function can be considered very reliable. As can be seen, the dark blue area overlaps the green area and the red circle, but still about two thirds of it lies outside both.
If we consider the union of these three different subsets (red circle, green area, dark blue area) we have the total portion of the genome for which there is convincing evidence of function, at the present state. It is about 24%, and most of it is non coding.
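As a quick sanity check, the union of overlapping sets can never be smaller than the largest set, nor larger than the sum of all of them. The short sketch below applies those bounds to the three areas, using the sizes given in this post. Note that the 15% for the green area is only my visual estimate, not a number from the paper, and that the exact union depends on overlaps which are not computed here.

```python
# Sizes as % of the genome, as given in this post.
red = 7.38         # evolutionarily conserved
green = 15.0       # genetic evidence (visual estimate, not an exact figure)
dark_blue = 15.56  # high-level biochemical activity

areas = [red, green, dark_blue]

lower_bound = max(areas)  # reached if the smaller sets lie fully inside the largest
upper_bound = sum(areas)  # reached if the three sets do not overlap at all

reported_union = 24.0
print(f"union must lie between {lower_bound:.2f}% and {upper_bound:.2f}%")
print("reported union is within bounds:", lower_bound <= reported_union <= upper_bound)
```

The reported 24% sits well inside that interval, closer to the lower bound, which is what one expects when the three areas overlap substantially.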
Moreover, the percentage of functional genome can only increase over time. While the red circle (conserved elements) and the purple circle (protein coding genes) are more or less final, the green area (gold standard) can only expand, and it can potentially confirm the function of parts of the blue areas (including those with lower activity).
So, to sum up:
– At the present state of knowledge, function is extremely likely for 24% of the human genome.
– For about 15% (green area) it is certain.
– Most of that functional genome (about 95% of it) is non coding.
– Most of that functional genome (about 70% of it) is not conserved.
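The last two percentages follow from the numbers already given, and the derivation can be sketched in Python. It assumes that the whole protein coding fraction (1.25%) is counted inside the 24%, which is a simplification on my part. The conserved fraction (7.38%) is inside the 24% by construction, since the red circle is one of the three areas in the union.

```python
# Sizes as % of the whole genome, taken from the discussion above.
functional = 24.0      # union of red circle, green area and dark blue area
protein_coding = 1.25  # purple circle (assumed here to lie entirely inside the union)
conserved = 7.38       # red circle (part of the union by construction)

# Share of the functional genome that is non coding.
noncoding_share = 100.0 * (functional - protein_coding) / functional

# Share of the functional genome that is not conserved.
nonconserved_share = 100.0 * (functional - conserved) / functional

print(f"non coding:    {noncoding_share:.1f}% of the functional genome")
print(f"not conserved: {nonconserved_share:.1f}% of the functional genome")
```

Both results round to the figures in the summary: about 95% non coding and about 70% not conserved.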
But that is not all. There are two other important points which must be addressed, and which are even more intriguing. They are:
3) Conserved function which does not imply conserved sequence.
4) Function which requires non conservation of sequence.
I will deal with them in the second part of this discussion.