Uncommon Descent Serving The Intelligent Design Community

Request for help verifying non-random 3mer pattern in Human Chromosome 1

3-base periodicity is a well-known non-random feature of DNA: a base is repeated 3 nucleotides away more often than chance predicts. If all four bases were equally represented and independent, a given base would match the base 3 positions away about 25% of the time, but I got something that was slightly away from random.

The pattern seems to identify exonic regions. For lack of a better word, I use the word “3mer” whenever I encounter the same base 3 nucleotides away. 3mer is a term Dr. Sanford’s DNA Skittle uses, but I have to confirm with him whether that is what he means.

I tried to see how frequently A, T, C and G repeated every 3 bases. It seems the Adenine and Thymine 3mers appeared about twice as frequently as Cytosine or Guanine 3mers, and this seems partly due to the higher A and T frequency. I think this is a legitimate non-random pattern. Here were my numbers for Human Chromosome 1:

guanine count 47,016,562
cytosine count 47,024,413
adenine count 65,570,891
thymine count 65,668,756
guanine 3mer count = 10,798,024
cytosine 3mer count = 10,805,795
adenine = 3mer count 20,297,310
thymine = 3mer count 20,355,586
cg_at_3mer_ratio = 0.5314214023030487
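
As a quick arithmetic check on the figures above, here is a minimal Python sketch (not the Java program referenced in this post). It recomputes the ratio and compares each base's overall frequency with its "repeat rate", a name introduced here for the fraction of occurrences of a base that are matched 3 positions away. The comparison assumes that, if positions were independent, the repeat rate would simply equal the base frequency.

```python
# Counts copied from the post above (Human Chromosome 1, hg19).
base_counts = {"G": 47_016_562, "C": 47_024_413, "A": 65_570_891, "T": 65_668_756}
repeat_counts = {"G": 10_798_024, "C": 10_805_795, "A": 20_297_310, "T": 20_355_586}

total_bases = sum(base_counts.values())

# Ratio of C/G period-3 repeats to A/T period-3 repeats.
cg_at_3mer_ratio = (repeat_counts["G"] + repeat_counts["C"]) / (
    repeat_counts["A"] + repeat_counts["T"]
)

for base in "GCAT":
    frequency = base_counts[base] / total_bases            # chance baseline
    repeat_rate = repeat_counts[base] / base_counts[base]  # observed
    print(f"{base}: frequency {frequency:.4f}, repeat rate {repeat_rate:.4f}")
print("cg_at_3mer_ratio =", cg_at_3mer_ratio)
```

Comparing each repeat rate with that base's own frequency, rather than with a flat 25%, separates "there is more A and T" from "bases repeat at distance 3 more than their abundance explains".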

This was a follow-on to Jean Claude Perez's work. I didn’t get the golden ratio, but instead I explored a well-acknowledged phenomenon, namely 3-base periodicity. I want to make sure my numbers are correct. Why should this non-random pattern emerge? Is it codon bias or something? Do I have a bug in my code?

I provided the Java code that I used here:

http://creationevolutionuniversity.com/forum/viewtopic.php?f=3&t=91.

I got the Chromosome 1 fasta file from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

Any insights and corrections are especially welcome. Thanks in advance.

Comments
Dear Pietr, Joe and Gordon, I'm sorry about your inappropriate comments on SANDWALK regarding the Sal Cordova entry entitled Vodka! Jean Claude Perez, the golden ratio, dragon curve fractals and musical design in “junk DNA”... The reason is that all (ALL) of those comments were made without reading the original article. I suggest reading the original peer-reviewed article of 2010 published in Interdisciplinary Science: http://fr.scribd.com/doc/95641538/Codon-Populations-in-Single-stranded-Whole-Human-Genome-DNA-Are-Fractal-and-Fine-tuned-by-the-Golden-Ratio-1-618 and my 2013 peer-reviewed article: http://www.scirp.org/journal/PaperInformation.aspx?paperID=37457#.U2Mwlfl_trA jean-claude perez
Gordon Davisson I modified the script according to your specs:
guanine => 46956489
cytosine => 46964756
adenine => 65491918
thymine => 65586556
guanine 3mer => 11002184
cytosine 3mer => 11010648
adenine 3mer => 20626672
thymine 3mer => 20683917
cg_at_3mer_ratio = 0.53286173189155
The remaining differences between our counts could be due to the different data file. I don't have time to download yours right now. niwrad
niwrad, I tried running your perl script on the chromosome 1 data file I've been using, and (after some data tweaking) get base counts consistent with mine and Sal's; based on that, I'm pretty sure you're using a different data file. You can get the one we used from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/, or from http://genome.ucsc.edu follow the Downloads link in the sidebar, then under "VERTEBRATES - Complete annotation sets" click Human, then under "Feb. 2009 (hg19, GRCh37)" click "Data set by chromosome", then download "chr1.fa.gz". You'll need to either remove the first line (the description) of the data file and uppercase the rest, or modify the program to skip descriptions and use case-insensitive matching. Once that's done, the individual base counts match mine. But the repeat counts are still off, and there are several problems causing this: First, this program has the same line-ending bug Sal's original had -- it treats the line end characters as part of the data stream, and therefore misses repeats across lines (but counts some period-2 repeats that cross lines instead). Second, it still doesn't count multiple repeats right. It does ok through 3 repeats, but it undercounts 4 (AxxAxxAxxAxxA should count as 4 repeats, but it only gets 3) as well as 6 and above. Third, it also fails to count interleaved repeats, for example AAxAA and AxAAxA should both count as 2 repeats. And interleaved repeats, and... Basically, regular expressions are the wrong tool for the sort of data analysis we're trying to do here. Gordon Davisson
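
The counting rules described in the comment above (skip the description line, ignore line endings and case, and count every position whose base matches the base 3 positions earlier, so that chained and interleaved repeats are all counted) can be sketched without regular expressions. This is a minimal Python illustration, not the Java or Perl code discussed in the thread; the function names are made up for the example, and it assumes the whole sequence fits in memory.

```python
def read_fasta_bases(lines):
    """Return the sequence as one uppercase string.

    Lines starting with '>' (FASTA descriptions) are skipped and line
    endings are stripped, so repeats that cross line breaks are seen.
    """
    parts = []
    for line in lines:
        if line.startswith(">"):
            continue
        parts.append(line.strip().upper())
    return "".join(parts)


def count_period_repeats(seq, period=3):
    """Count bases and same-base matches at the given distance.

    Every position i with seq[i] == seq[i - period] counts once, so
    ANNANNANNANNA yields 4 repeats and AANAA yields 2.
    Characters other than A, C, G, T (for example N) are ignored.
    """
    base_counts = dict.fromkeys("ACGT", 0)
    repeat_counts = dict.fromkeys("ACGT", 0)
    for i, base in enumerate(seq):
        if base not in base_counts:
            continue
        base_counts[base] += 1
        if i >= period and seq[i - period] == base:
            repeat_counts[base] += 1
    return base_counts, repeat_counts


# Tiny in-memory example; a real run would pass an open file object instead.
example = [">chr_example\n", "acg\n", "ATT\n"]
bases, repeats = count_period_repeats(read_fasta_bases(example))
print(bases, repeats)
```

For a real chromosome file the same two functions could be applied to `open("chr1.fa")`, at the cost of holding the full sequence in memory.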
You’re going to want them to support ID and/or creation, right? That’s really your motivation for looking into this, isn’t it?
Actually not so much this time. Some in the ID community felt it would be a good project to explore just to get me practiced in doing basic research and learn the art of proposing a hypothesis etc. An idea popped up as to whether the 3mers had any functional significance. The 3mer signal now appears so weak to me personally that I wonder if it is of any functional use to the cell. As of today, it's looking like a mostly null result, and if I were to attempt to publish anything, it would actually be rather critical of the prior literature that claimed to have found something of genuine significance. When I saw the use of Fourier transforms, my eyes rolled. I thought, "Why?" That's senseless; straightforward code analysis is more accurate. I was also trained in electrical engineering and the Frequency/Fourier domain was the order of the day. But from my vantage point, using Fourier analysis is totally inappropriate to study 3-base periodicity. Fourier analysis is good when you have sinusoids of varying amplitudes superimposed on each other. It works well when you have amplitudes in a wide range over time. And then you can actually decompose the composite wave form into sinusoidal components with amplitudes and phases. But in the case of genomes, at best you have one possible amplitude (either on or off for a given nucleotide as you read left to right in artificial "time"). So what was the point of using Fourier transforms? None that I see except maybe to make a not-so-interesting result (perhaps substance free) look a little more profound. Maybe the researchers of 3mers over the last 30 years just needed to get something published to pad their CVs. That sounds cynical, but I can't say the suspicion hasn't crossed my mind. Simple Java code can detect periodic patterns like: A x x A x x A.... No need for a Fourier transform, that just muddles and confuses the analysis, imho.
It's better to approach the problem like someone trying to detect a grammar rather than some sort of "audio" signal. 3mers are a non-random pattern, but "stuck keys" on a computer create non-random patterns as well. What I was hoping to find were grammatical constructs that might help identify some of the structure of proteins rather than just their sequences. But in the process I realized I have way too little basic knowledge of chemistry and molecular biology. WAAAAY too little to even venture an educated guess. I do know a protein engineer who was able to identify catalytic regions of a protein just by sequence comparisons between species. That helped her in her work tremendously. That is one case I know where study of the pure DNA sequences actually informed someone of protein structure. There was a chance 3mer study might have led to something in ID, but now it does not look very promising. Maybe someone else can take off with 3mers, but as of today, I might only complete research on the topic just to publish a mostly negative null result. It's a non-random pattern, but not much of one, imho. scordova
Gordon Davisson @41: Thank you for the well-thought out and detailed comments. While I obviously disagree with you about ID generally, I agree with many of your thoughts, and cautions. With you, I'm also rather skeptical that 3-peat patterns will have much substantive meaning, but I'm willing to go exploring with Sal. :) Thank you also for the updated code you posted on the other site. I've used your version this afternoon to run through a couple of chromosomes. Eric Anderson
Sal @33: On second thought, you must be excluding the long repeating sequences, because (I believe) they are set to 'N' in the FASTA files? Eric Anderson
I found a problem with the way Threemer1 reads the file: it treats everything in the file as DNA bases, including description line and more importantly line break characters. This means that it treats ">chr1" as containing a cytosine base, looks for matches between the newline character and the bases in columns 3 and 47 (3 from the end of line), and period-2 repeats broken between lines. I also added code to calculate the GC ratios (not just the repeat ratios), and handle being passed the filename as a command-line parameter. Updated code is at Creation Evolution University. Here are the results I get for the first two chromosomes. Note that the cytosine counts have decreased by one (because of the description line) and the 3mer counts have increased a little (because of better matching across line breaks):

input file = chr1.fa
guanine count 47,016,562
cytosine count 47,024,412
adenine count 65,570,891
thymine count 65,668,756
gc/at ratio = 0.716559
gc content = 41.743927%
guanine 3mer count = 11,015,663
cytosine 3mer count = 11,024,227
adenine = 3mer count 20,650,395
thymine = 3mer count 20,709,117
cg_at_3mer_ratio = 0.5328856394630574

input file = chr2.fa
guanine count 47,947,042
cytosine count 47,915,465
adenine count 71,102,632
thymine count 71,239,379
gc/at ratio = 0.673466
gc content = 40.243782%
guanine 3mer count = 10,831,713
cytosine 3mer count = 10,808,637
adenine = 3mer count 22,751,514
thymine = 3mer count 22,830,306
cg_at_3mer_ratio = 0.47475835760836227

But I still have a few gripes with this. I'll start with a superficial one: I don't think 3mer is an appropriate name. In chemistry, the -mer suffix refers to chemicals made of repeating subunits: a monomer is one subunit, a dimer is two, a trimer is three, a polymer is many,... When talking about DNA, I'd expect trimer, threemer, or 3mer to refer to three bases (I think three base pairs would technically be considered a hexamer).
The best alternative term I can think of is "period-3 repeat", which isn't nearly as catchy. Sorry... Now for a more important gripe, let me back off and look at what we're actually seeing in the statistics here. The repeat counts are significantly higher than would be expected based on the individual base content, as you (Sal) calculated; but that doesn't actually mean there's anything special happening with a period of 3 bases. Suppose we had a chromosome that started with a region consisting of just A and T (in random order), followed by a region of just C and G. Within these regions, roughly half of the bases will be followed immediately by another of the same base, and half will be followed by a match 2 away, and half will be followed by a match 3 away, and... In other words, we'd see a hugely elevated repeat count at any period, not just 3. Real chromosomes are not so neatly segregated, of course; but they do have regions with elevated AT content, and other regions with elevated GC content, which'll produce a milder form of the same effect. So the results we'll get will show a mix of repeat elevation due to AT and GC segregation, and due to period-3 repeats -- all mixed together, with no way to tell how much of which we're seeing. I think in order to fix this, you need to count period-1 (adjacent) repeats, period-2, period-3, etc and look for spikes at periods 3, 6, 9, etc. If you want to get really fancy, you could experiment with Fourier analysis, but that's probably overkill. But let me back off even further, and ask what you're really looking for. Patterns in the genome that support ID, right? But what kinds of pattern support ID? You shouldn't assume that all patterns support ID, because evolution is a complex process that involves both random and non-random processes, and complex processes like that are expected to produce patterns. Distinguishing patterns that support ID from those that support evolution is ... nontrivial. 
Let me give a couple of examples. Example pattern 1: codon usage bias You mentioned these biases in this earlier posting here. The short summary is that there are generally several codons that code for each amino acid (i.e. they're synonymous, like "gray" and "grey"). But some codons can be translated into amino acids more efficiently, and in some organisms the more efficient synonyms are more common. You quoted wikipedia on the subject:
It is generally acknowledged that codon preferences reflect a balance between mutational biases and natural selection for translational optimization. Optimal codons in fast-growing microorganisms, like Escherichia coli or Saccharomyces cerevisiae (baker’s yeast), reflect the composition of their respective genomic tRNA pool. It is thought that optimal codons help to achieve faster translation rates and high accuracy. As a result of these factors, translational selection is expected to be stronger in highly expressed genes, as is indeed the case for the above-mentioned organisms. In other organisms that do not show high growing rates or that present small genomes, codon usage optimization is normally absent, and codon preferences are determined by the characteristic mutational biases seen in that particular genome. Examples of this are Homo sapiens (human) and Helicobacter pylori. Organisms that show an intermediate level of codon usage optimization include Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode worm), Strongylocentrotus purpuratus (sea urchin) or Arabidopsis thaliana (thale cress).
There are actually three layers of pattern here: some synonyms are used more than others, but only in some organisms, and more so in highly expressed genes. But these layers of pattern fit what we'd expect from evolution! The codon usage bias is strongest where selection would favor it the most, and weak or missing where selection would favor it less. Humans, for example, do not favor the more efficient synonyms, which is exactly what you'd expect from a lineage that hasn't "paid the cost of maintenance" in quite a long time (on the order of 100 million years if I'm doing the math right). Isn't this exactly what you'd expect from "genetic entropy"? Now, I should clarify that this pattern isn't a problem for ID scenarios that also involve long periods of evolution (e.g. front-loading, guided evolution, etc), but as far as I can see it's a big problem for anything like a recent creation scenario. As you mentioned earlier, intelligent agents can be capricious, so it's certainly possible a creator could have created things in a way that happens to mimic the effects of "genetic entropy"... But in science, a theory that explains why things are the way they are beats a theory that explains why things could be the way they are. Unless you have a predictive theory of creation that explains the existence of this pattern, this pattern winds up supporting evolution. Example pattern 2: isochores (/compositional domains) These are the regions of different CG content I mentioned above (and you described in "Too much of something can be a good thing for ID"). I'm not familiar enough with the subject to give a detailed explanation, but as I understand it they're a somewhat messy subject (i.e. it's clear that variations exist, but not clear how to describe and categorize them), and their evolution is not well understood (maybe due to biased gene conversion?).
If you argue that "not well understood in evolutionary terms" means that this pattern supports ID, you're basically making an argument from ignorance (or ID-of-the-gaps). This is a really really weak form of argument. If a solid evolutionary explanation for GC content variations develops (and is tested and supported), then this pattern joins codon usage bias in supporting evolution. If anything, this then becomes evidence against ID (or at least creation). If you develop (and test and support) an ID- or creation-based explanation for these patterns (and no good evolutionary explanation develops), then this becomes clear evidence for ID and/or creationism. But as with codon usage bias, this is going to be very difficult -- you need something like a predictive model of the ID/creation process, and intelligent agents can be capricious. If you can make a solid case that evolution cannot explain this pattern, then... well, you'd have evidence that evolution is incorrect, or at least incomplete. But that doesn't necessarily equate to support for ID, just for something other than (or in addition to) evolution. Note that in the meantime, if/while we're in this we-don't-understand-this-well state, evolutionary theory provides a framework for developing and testing hypotheses about where this pattern came from. How can you do this in the ID/creation framework? Example pattern 3: a new pattern you just found... I'm not trying to dampen your curiosity here, but I'm worried that you're setting a trap for yourself. Suppose you indeed find patterns in genomes -- period-3 repeats, fractal somethingorothers, whatever. You're going to want them to support ID and/or creation, right? That's really your motivation for looking into this, isn't it? If so, that means that every time you spot a pattern (or possible pattern), you're going to be biased toward both thinking the pattern is real (not just an artifact of the analysis, statistical quirk, etc), and thinking it's evidence for ID.
And I'm pretty sure you'll (at least mostly) be disappointed. The various genome data has been looked over in many ways by a great many people (many of whom know far far more about biochemistry etc than either of us do), which means all of the easy-to-spot patterns have been spotted. Your probability of spotting something new (and real) is pretty small. If you do find something new, you're going to try to interpret it as supporting ID and/or creation. But it may not (see codon usage bias), and even if it does showing that it does will be very hard (see isochores). You run the risk of spending a lot of time finding nothing, then finding something interesting (it has been said that the most exciting thing to hear in science is not "Eureka!", but "Huh, that's interesting."), investigating it, getting emotionally invested in it... and then having some spoilsport evolutionist come along and say "oh, yeah, that's a side effect of this other thing we've known about forever and understand completely." So what should you do? If you do continue looking for patterns in the genome, please do it without the expectation that you're going to find something important. But what I'd really love to see you do is to develop a predictive model of ID and/or creation, make predictions from that of what specific patterns you'd expect to see based on that, and then go looking for those specific patterns. This will, of course, be very hard (capriciousness, etc); but if you really want to support ID and/or creation, I think this has far more potential than casting around for patterns and hoping they can be shown to support ID. Mind you, I don't expect this approach to work, because I don't think ID is actually correct; but if I'm wrong, this approach is far more likely to produce something that'll convince me to change my mind. Gordon Davisson
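
The thought experiment earlier in this comment (a chromosome with an A/T-only region followed by a C/G-only region) and the suggested remedy (compare repeats at periods 1, 2, 3, ... and look for spikes at 3, 6, 9) can be tried on synthetic data. The Python sketch below is only an illustration: the toy sequence, its length, and the function names are assumptions made for the example, not anything from the programs discussed in the thread.

```python
import random


def match_fraction(seq, period):
    """Fraction of positions i where seq[i] equals seq[i + period]."""
    pairs = len(seq) - period
    hits = sum(1 for i in range(pairs) if seq[i] == seq[i + period])
    return hits / pairs


def expected_fraction(seq):
    """Match fraction if bases were independent: sum of squared frequencies."""
    n = len(seq)
    return sum((seq.count(base) / n) ** 2 for base in "ACGT")


# Toy "chromosome": a random A/T-only region followed by a random C/G-only region.
rng = random.Random(0)
at_region = "".join(rng.choice("AT") for _ in range(20000))
cg_region = "".join(rng.choice("CG") for _ in range(20000))
toy = at_region + cg_region

baseline = expected_fraction(toy)  # near 0.25: overall composition is balanced
scan = {period: match_fraction(toy, period) for period in range(1, 10)}

print(f"expected from composition alone: {baseline:.3f}")
for period, fraction in scan.items():
    print(f"period {period}: {fraction:.3f}")
```

In this toy case the match fraction sits near one half at every period, about twice the composition-only baseline, with no spike at 3. That is the reason a raised period-3 count by itself does not show that period 3 is special; only a period-3 value that stands out from the neighboring periods would.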
Thank you, niwrad, my friend! scordova
# scordova.pl v1.1
# search for 3mer
use strict;
use warnings;
my $p1 = "G";
my $p2 = "C";
my $p3 = "A";
my $p4 = "T";
my $p1m = "G..G";
my $p2m = "C..C";
my $p3m = "A..A";
my $p4m = "T..T";
my $p1mc = "G..G..G";
my $p2mc = "C..C..C";
my $p3mc = "A..A..A";
my $p4mc = "T..T..T";
open(IN,"human\\fa\\chr01.fa"); # from http://genome.ucsc.edu
undef $/; # slurp mode: read the whole file at once
my $d = <IN>;
close IN;
my $c1 = my $c2 = my $c3 = my $c4 = my $c1m = my $c2m = my $c3m = my $c4m = 0;
my $c1mc = my $c2mc = my $c3mc = my $c4mc = 0;
while ($d =~ /$p1/g) {$c1++;}
while ($d =~ /$p2/g) {$c2++;}
while ($d =~ /$p3/g) {$c3++;}
while ($d =~ /$p4/g) {$c4++;}
while ($d =~ /$p1m/g) {$c1m++;}
while ($d =~ /$p2m/g) {$c2m++;}
while ($d =~ /$p3m/g) {$c3m++;}
while ($d =~ /$p4m/g) {$c4m++;}
while ($d =~ /$p1mc/g) {$c1mc++;}
while ($d =~ /$p2mc/g) {$c2mc++;}
while ($d =~ /$p3mc/g) {$c3mc++;}
while ($d =~ /$p4mc/g) {$c4mc++;}
$c1m+=$c1mc;
$c2m+=$c2mc;
$c3m+=$c3mc;
$c4m+=$c4mc;
my $r = ($c1m + $c2m) / ($c3m + $c4m);
open(OUT,">scordova.txt");
print OUT "guanine => $c1\n";
print OUT "cytosine => $c2\n";
print OUT "adenine => $c3\n";
print OUT "thymine => $c4\n";
print OUT "guanine 3mer => $c1m\n";
print OUT "cytosine 3mer => $c2m\n";
print OUT "adenine 3mer => $c3m\n";
print OUT "thymine 3mer => $c4m\n";
print OUT "cg_at_3mer_ratio = $r\n";
close OUT;
1;
niwrad
niwrad, I'm trying to learn Perl partly because Jeff Tomkins uses it a lot, and so does WD400. Can you post your code for the alternate way of counting to help me understand Perl a little better? :-) Thank you again for doing this. This is a conversation that needs to happen, and I will have to discuss this with various parties in person. These topics are a little beyond my present understanding. Thank you again for your time and assistance. scordova
scordova Yes. If they must count 2 then my report becomes:
guanine => 46956489
cytosine => 46964756
adenine => 65491918
thymine => 65586556
guanine 3mer => 9933673
cytosine 3mer => 9940925
adenine 3mer => 17470366
thymine 3mer => 17515230
cg_at_3mer_ratio = 0.568079446181223
niwrad
Pattern AxxAxxA counts 2 for you?
Yes. Does your PERL script count differently? I don't know how to read PERL. I also don't know how other authors do their counting. Thank you again for doing this. I have to figure out how authors claiming 3mers do their counting. scordova
scordova #33 Pattern AxxAxxA counts 2 for you? To be precise it should count 1 (they aren't two independent AxxA patterns, because they share the central A). niwrad
For your information, I demonstrate that ALL DNA code is located in "G" bases from the double strand, then "CG" bases from the single strand. T and A bases play the role of spatial shifts, like silences in music... Please see details in my BEIJING conference here: http://fr.scribd.com/doc/57828.....jing032011 jean-claude perez
Does your count exclude long repeating sequences, or does it include them?
It includes them. Of concern is niwrad and I are getting different numbers. I will try to see why, I think it could be the different Fasta files. His code is above, my code is here (it's only 1 page): http://creationevolutionuniversity.com/forum/viewtopic.php?f=3&t=91 scordova
Eric, I made a somewhat humorous take on the situation here: "Too much of something can be a good thing for ID". It is based on the more mundane issue discussed here: "No Universal Transition-Transversion Mutation Bias". Whether one accepts mutation bias or not, it leads to huge contradictions at every level! What do I mean? Say you saw a pattern of numbered coins: HHHHTTTTHHHHTTTT..... It can't be explained by unfairness or bias in the coins. The same problem arises in biology because of creatures with different CG ratios and even differing CG ratios in certain DNA regions. I discuss this in detail in "Too much of something can be a good thing for ID". scordova
Sal: Following up on my comment #30. I could well be doing something wrong, but based on the genetic code, it seems to me that the ratio of nucleotides showing up (calculated on an assumed equal ratio of amino acids occurring), is approximately, listed by TCAG: 77:63:85:74 Thus, T and A do have a somewhat higher occurrence naturally (although T is only about 3 points higher than G), with C being the lowest. Note, I'm not talking about 3mers yet, just the ratio of nucleotides showing up in a stretch of DNA. Of course, if there is any chemical or mutational tendency toward any particular nucleotide, then the ratios could tend toward a different number, without impacting the amino acid production. Not to mention a possible greater tendency for particular amino acids to show up in proteins, which would certainly change the numbers. ----- Anyway, I don't know if this has anything to do with what you are exploring, but it seems it might impact the numbers. Eric Anderson
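
The kind of calculation described in the comment above can be sketched as follows. This Python example assumes the standard genetic code, gives each of the 20 amino acids equal weight, and splits that weight equally among its synonymous codons. The comment does not spell out its exact method, so the output need not reproduce the 77:63:85:74 figures digit for digit.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order
# (first base, then second, then third); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codons_by_amino_acid():
    """Map each amino acid letter to its list of codons (stops excluded)."""
    table = {}
    index = 0
    for first in BASES:
        for second in BASES:
            for third in BASES:
                amino_acid = AMINO_ACIDS[index]
                index += 1
                if amino_acid != "*":
                    table.setdefault(amino_acid, []).append(first + second + third)
    return table


def nucleotide_frequencies():
    """Base frequencies if all amino acids and all synonyms are equally used."""
    table = codons_by_amino_acid()
    frequencies = dict.fromkeys(BASES, 0.0)
    for codons in table.values():
        # Equal weight per amino acid, shared by its codons and their 3 bases.
        weight = 1 / len(table) / len(codons) / 3
        for codon in codons:
            for base in codon:
                frequencies[base] += weight
    return frequencies


frequencies = nucleotide_frequencies()
for base in BASES:
    print(f"{base}: {frequencies[base]:.4f}")
```

Under these assumptions A comes out most frequent and C least, the same ordering as in the comment. Unequal amino acid usage or codon bias would shift the values.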
Sal, Interesting issue. Does your count exclude long repeating sequences, or does it include them?
I posted the computer code to first see if I’m miscalculating (which I did at first, and hopefully the bugs are gone), but it seems there is too much adenine and thymine.
Yes, it looks like too much adenine and thymine. At least for a random distribution. But there are a couple of possible things that could account for higher A and T. Things that don't necessarily indicate a meaningful/intentional reason. Incidentally, rather than a 25%-25%-25%-25% ratio, it seems to me that even if we assume an equal 1/20 likelihood of each particular amino acid, the nucleotide ratio will be slightly off. This is because some amino acids require a specific sequence, while others have several ways to make the same amino acid.* ----- * Incidentally, this flexibility in the code, among other things, allows for greater possibility of creating overlapping strings, backward strings and so on. Lots of interesting stuff there to explore. Eric Anderson
jerry, here there is no consideration of parents because we analyse the REFERENCE HUMAN GENOME, which is a mix of individual ones: http://en.wikipedia.org/wiki/Reference_genome jean-claude perez
I just want to clarify the bases being used in these analyses. In any chromosome there are two sets of bases. If one considers both chromosomes there are four sets of bases, two from the mother and two from the father. So is a single strand whole genome just one side of the DNA for just one of the parents? That would be one of the four strands of DNA. Actually it would be 23 of the 92 strands of DNA since there are four strands for each chromosome. jerry
and we consider data from the HUMAN GENOME REFERENCE http://genome.ucsc.edu/ We think that analysing PERSONAL GENOMES would improve the results. jean-claude perez
JERRY, - for the codon population analysis papers, the single-stranded whole genome is considered - for the others (equation of life and master code) we consider the double-stranded whole-genome DNA jean-claude perez
http://fr.scribd.com/doc/57828784/jcperezBeijing032011 jean-claude perez
Are the numbers in the OP for all bases in the double helix of both the mother's and father's chromosomes? If they are for both strands of the DNA then I would expect the G and C to be roughly the same, and likewise the T and the A bases. If all the bases are being considered then one would expect a skew. The human genome is about 98% non protein coding. Within this 98% the majority of the elements are repetitive. So if one repetitive element is more represented than another, one would expect a skew of the bases. I am just trying to understand what the numbers are. jerry
SCORDOVA my MASTER CODE analysis of the whole human genome matches chromosome banding perfectly! See my BEIJING conference on the web: http://fr.scribd.com/doc/57828.....jing032011 jean-claude perez
The work reported in chapter 3 of CODEX BIOGENESIS is for GENES or gene-rich genomes like HIV. For whole chromosomes there are 2 remarks: 1/ my published results on codon populations (independently of NNN bases) show a meta-structure at the whole level of chromosomes and genomes 2/ at the DNA level, I show with the EQUATION OF LIFE and the MASTER CODE that DNA is not only a genetic code but more: an INFORMATIONAL FIELD unifying and overlapping the 3 genetic languages, the DNA, RNA and AMINO ACID languages... See my BEIJING conference on the web: http://fr.scribd.com/doc/57828784/jcperezBeijing032011 jean-claude perez
Before I forget to mention, Guanine rich regions are most certainly way above random in their distribution. It's so blatantly obvious we have what is known as Chromosome Banding. That's actually a very strong non-random pattern you can see with your own eyes. http://en.wikipedia.org/wiki/File:NHGRI_human_male_karyotype.png scordova
Meanwhile, in whole chromosomes from the human genome, periodicities are very different. I have done a large piece of work on this topic, to be published... https://plus.google.com/103572438711329205534/posts/A6xo6vK1F8R jean-claude perez
I searched the literature more, and where the 3-periodicity is significantly above random is for patterns involving 10 bases or more. Let X be any base: G X X G X X G X X G is significantly above random. I ran it with Java and a pattern is there. The problem, as Joe Felsenstein pointed out, is we might be getting a "stuck keys" effect where lots of tandem repeats cluster. If 3 periodicity is more pronounced than say 4,5,6,7 periodicity, the statistics should bear this out. Fourier transforms have been used to detect the various periodicities since the 1980s. If the 3 periodicity is substantially stronger than the others, the question is why? My preliminary run with 10 bases:

guanine count 47,016,562
cytosine count 47,024,413
adenine count 65,570,891
thymine count 65,668,756
guanine 3mer count = 979,644
cytosine 3mer count = 984,582
adenine = 3mer count 2,643,554
thymine = 3mer count 2,657,609
cg_at_3mer_ratio = 0.37052737295570803

Expectation based on weighted A,T,C,G densities:

guanine 3mer count = 427,379
cytosine 3mer count = 427,683
adenine = 3mer count 1,616,861
thymine = 3mer count 1,626,535

What will be difficult is filtering out the "stuck keys" pattern of tandem repeats. scordova
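
The "expectation based on weighted densities" figures in the comment above appear consistent with a simple independence model: a pattern such as G X X G X X G X X G fixes four positions, so its expected count is roughly the number of bases times the fourth power of that base's frequency. The Python sketch below checks that reading against the posted numbers. The formula is an inference from the figures, not something stated in the comment, and the function name is made up for the example.

```python
def expected_pattern_count(base_count, total_bases, fixed_positions):
    """Expected hits if each position were an independent draw.

    total_bases approximates the number of possible start positions;
    the few positions lost at the end of the sequence are ignored.
    """
    frequency = base_count / total_bases
    return total_bases * frequency ** fixed_positions


# Counts copied from the comment above (chromosome 1).
base_counts = {"G": 47_016_562, "C": 47_024_413, "A": 65_570_891, "T": 65_668_756}
observed = {"G": 979_644, "C": 984_582, "A": 2_643_554, "T": 2_657_609}
total = sum(base_counts.values())

# G X X G X X G X X G has four fixed positions.
expected = {
    base: expected_pattern_count(count, total, 4)
    for base, count in base_counts.items()
}

for base in "GCAT":
    enrichment = observed[base] / expected[base]
    print(f"{base}: expected {expected[base]:,.0f}, "
          f"observed {observed[base]:,}, ratio {enrichment:.2f}")
```

The observed-to-expected ratio is a convenient single number for comparing period 3 with periods 4, 5, 6 or 7, once the same count is run for those spacings.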
SCORDOVA welcome also... jean-claude perez
figures are available in my book jean-claude perez
details: Chapter 3 – A "fossil wave" of period "3" vibrates in the DNA of all genes.
3.1 – CHAPTER SUMMARY: Here I demonstrate what could explain the formation of DNA into codon triplets, leading to the emergence of genes and coding structures such as messenger RNA. To do so, I demonstrate the existence of a kind of wave of period 3 that emerges from a certain mode of analysis of any coding DNA sequence, such as RNA, "splicings", and genes. The method of analysis is simple and universal: it consists of "restructuring" the linear (one-dimensional) DNA sequence into 2-dimensional matrices of 2 3 4 5 6 7 8 9… columns. It is therefore a kind of modulo analysis (modulo 2 3 4 5 6 7 8 9…) of the sequence. I then bring out kinds of "organization peaks" around only those structures whose modulo is a multiple of 3: for example 3 6 9 etc… I thus demonstrate that DNA, well before the triplet codons of 3 bases are formed during the transcription and translation of DNA into genes, already implicitly prefigures this period 3. Besides the universality of this property, extended here to any coding sequence from which genes originate, I demonstrate by contrast the disappearance of this property in genomic DNA (human genome, eukaryotic genomes, chromosomes, arbitrary DNA "contigs" such as junk DNA, etc…) and even in genes made up of introns and exons.
3.2 – INTRODUCTION: So be it, dear author… You have just demonstrated, by luck or by chance, that the entire human genome is organized around the number 13… O.K.! You have also just demonstrated that everything at the origin of non-living matter, the elements of Mendeleev's periodic table, is organized around the number 2… So be it again! But will you be able to say as much about what makes us living beings made of flesh, bone and, sometimes even, intelligence?
I just wanted to talk about the few percent of the human genome that code for genes and proteins: the DNA of the few tens of thousands of genes… Do they too contain the omnipresent trace of NUMBERS? Well! Rather than crossing swords in polemic, the author will step aside here to leave to the first among us (others would have said "the best among us") the task of stating the question precisely, as he already did as early as 1968, while your dear author was still lulled by naive Quixotic utopias and was just becoming a sixty-eighter before entering a not entirely definitive "normalization": Francis Crick wrote in 1968: “Why a Triplet? We have argued that the code must have been basically a triplet code from a very early stage, so that one is not entitled to use sophisticated arguments which would apply only to a later stage, although one could argue that early organisms with doublet or quadruplet codes actually existed but became extinct, only the triplet code surviving. …/... It must have, to some extent, a definite structure and this is likely to be based on stretches of double-helix. Thus, the diameter of a double-helix (since two may have to lie side by side) may have dictated the size of the codon, in that a doublet-code (moving along two bases at a time) would present an impossible recognition problem.” F. H. C. CRICK in The Origin of the Genetic Code, published in 1968, J. Mol. Biol. (1968) 38, 367-379. In other words: one of the greatest open problems of genetic science is that of the emergence of codons from the linear DNA sequence during the transcription and translation phases of genes. Why and how do the TCAG nucleotides self-organize into substructures of 3 bases, the codons, rather than into substructures of 2 4 5 6 or 7 bases? Is this "order" of the codons already encrypted, latently and implicitly, in the DNA sequence that will form the gene?
I discovered, by luck or by chance, the answer to this fundamental question of biology! I discovered a simple law, one any child would understand, demonstrating that every DNA sequence destined to form a gene is structured by a kind of fossil wave of period 3… This law is universal: it holds in every sequence of DNA, RNA or "splicing" destined to form genes and proteins. It disappears completely in DNA rich in introns or in non-coding regions, such as genes with introns, junk DNA, chromosomes or genomes. However, small genomes that are very rich in genes and poor in non-coding regions, such as the genomes of viruses like the AIDS virus (see figures 3.4 and 3.5) or SARS, continue to show this property at the scale of the whole genome.
3.3 – METHODOLOGY: Consider a DNA sequence coding for a gene. This TCAG sequence is linear and one-dimensional, of length "s". Form the n two-dimensional matrices of type: Modulo 2: array "s/2, 2" (note: any remainder is ignored). Modulo 3: array "s/3, 3" Modulo 4: array "s/4, 4" Modulo 5: array "s/5, 5" Modulo 6: array "s/6, 6" Etc... In each of these arrays, total the respective T C A G populations of each column. If "p" is the index of the modulo, we thus obtain p lists of 4 numbers, the respective populations of bases T C A or G in each of the p columns. For each of these quadruplets, compute the ratio of the most numerous population (for example: base C) to the mean of the 3 remaining bases (for example: bases T A G). This calculation is carried out for each of the p quadruplets. Apply the "max norm", retaining for each modulo-p group only the largest ratio. In summary, we obtain a vector of n values corresponding to the best ratios for the respective modulos: 2 3 4 5 6 7 8 9 …n.
We then observe that the ratios for modulos 3 6 9… are far higher than the other ratios 2 4 5 7 8… Note: as the figures below show, the same method can also be applied while keeping the 4 bases T C A and G distinct. This gives a more detailed and analytical form that nevertheless leads to the same conclusions: a period 3 structures and underlies the DNA of every gene.
3.4 – EXAMPLE OF 2 "FAMOUS GENES", BRCA1 and DMD: Below is an illustration of the method described above applied to two famous genes: BRCA1, involved in breast and ovarian cancers, on the one hand, and the very large DMD gene (10689 bases in its RNA form made up of the exons alone, versus 1.7 million bases in its genomic form including exons and introns). It is one of the largest genes in the human genome, and mutations or malformations in it cause the terrible degenerative disease Duchenne Muscular Dystrophy (DMD). Figures 3.1 (BRCA1 gene), then 3.2 and 3.3 (DMD gene) all demonstrate a predominance of every period that is a multiple of 3. Figure 3.1. Evidence of a period = 3 structuring the spliced RNA of the BRCA1 gene involved in breast or ovarian cancers (RNA gene referenced: Alternate splicing BRCA1.a). Figure 3.2. Evidence of a period = 3 structuring the DNA coding for the large DMD gene (Duchenne Muscular Dystrophy). Figure 3.3. Another representation of the evidence of a period = 3 structuring the DNA coding for the large DMD gene (Duchenne Muscular Dystrophy).
3.5 – UNIVERSALITY OF THIS DISCOVERY, EXTENSION TO ALL AIDS GENOMES: In figures 3.4 and 3.5 below we demonstrate the universality of this discovery, applied here to the entirety of all known AIDS genomes: 169 HIV1 and HIV2 genomes from humans and SIV genomes from monkeys. Figure 3.4.
Evidence d’une période = 3 structurant les génomes entiers de toutes les souches des 169 virus du SIDA HIV1-HIV2et SIV (ceci est dû au fait le la quasi-totalité de cet ADN code pour des gènes). Figure 3.5. Une autre représentation mettant en évidence une période = 3 structurant les génomes entiers de toutes les souches des 169 virus du SIDA HIV1-HIV2et SIV (ceci est dû au fait le la quasi-totalité de cet ADN code pour des gènes). 3.6 – CONCLUSIONS : C’est un fait désormais établi : une sorte d’ONDE de période 3 structure l’ADN formant chaque gène. En d’autres termes cela signifie que parmi l’ensemble des restructurations en matrices de ces séquences, seules les restructurations en matrices comportant un nombre de colonnes multiple de 3 vont mettre en avant une position dans la séquence ainsi que toutes les positions relatives suivantes situées à 3 6 9… bases. Ainsi, ce seront, ici, les positions 1 4 7 10… Ou bien là les positions 2 5 8 11… Ou encore les positions 3 6 9 12… Non, la notion de CODON n’est pas un concept « sorti du panier » et découvert de manière empirique par les biologistes : Le codon existe, pré-existe déjà à l’état de « trace » dans toute séquence d’ADN d’un gène… Quel est le support biophysique de cette découverte ? Cela reste à découvrir. Mais je suggère que ces sortes d’ondes se traduisent par de véritables résonances, une sorte de « souffle » ou plutôt de « rythme »… jean-claude perez
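The methodology in section 3.3 of the comment above can be sketched in a few lines of Python. This is only an illustration of the steps as described there, not the author's own program: the function names `max_ratio_for_modulo` and `modulo_profile` are invented here, and treating a column in which only one base occurs as an infinite ratio is an assumption the comment does not address.

```python
from collections import Counter

BASES = "TCAG"

def max_ratio_for_modulo(seq: str, p: int) -> float:
    """Largest 'dominant base / mean of the other three' ratio over the p columns."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % p          # ignore the remainder, as in the comment
    best = 0.0
    for col in range(p):
        counts = Counter(seq[col:usable:p])   # column `col` of the (s/p, p) array
        pops = sorted((counts.get(b, 0) for b in BASES), reverse=True)
        mean_rest = sum(pops[1:]) / 3         # mean of the three less frequent bases
        if mean_rest > 0:
            ratio = pops[0] / mean_rest
        elif pops[0] > 0:
            ratio = float("inf")              # assumption: a single-base column
        else:
            continue                          # empty column (sequence shorter than p)
        best = max(best, ratio)               # "max norm" over the p columns
    return best

def modulo_profile(seq: str, max_p: int = 9) -> dict:
    """Best ratio for each modulo 2..max_p."""
    return {p: max_ratio_for_modulo(seq, p) for p in range(2, max_p + 1)}
```

Calling `modulo_profile(sequence)` on a coding sequence and on a genomic fragment would be one way to check the claim that the peaks at 3, 6, 9 appear in the first and fade in the second.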
Welcome to Uncommon Descent Dr. Perez. scordova
On the specific period-3 analysis, you can read chapter 3 of my book CODEX BIOGENESIS: Chapter 3 – A "fossil wave" of period "3" vibrates in the DNA of all genes. 3.1 – CHAPTER SUMMARY: Here I demonstrate what could explain the organization of DNA into codon triplets, leading to the emergence of genes and coding structures such as messenger RNA. To do so, I demonstrate the existence of a kind of period-3 wave that emerges from a particular way of analyzing any coding DNA sequence, such as RNA, "splicings", and genes. The method of analysis is simple and universal: it consists of "restructuring" the linear (one-dimensional) DNA sequence into two-dimensional matrices of 2, 3, 4, 5, 6, 7, 8, 9… columns. It is thus a kind of modulo analysis (modulo 2, 3, 4, 5, 6, 7, 8, 9…) of the sequence. I then bring out "peaks of organization" around only those structures whose modulo is a multiple of 3: for example 3, 6, 9, etc. I thus demonstrate that DNA, well before the 3-base codon triplets are formed during the transcription and translation of DNA into genes, already implicitly prefigures this period 3. Beyond the universality of this property, extended here to any coding sequence at the origin of genes, I demonstrate on the contrary the disappearance of this property in genomic DNA (human genome, eukaryotic genomes, chromosomes, arbitrary DNA "contigs" such as junk DNA, etc.) and even in genes made up of introns and exons. jean-claude perez
more on http://fr.scribd.com/doc/200166060/Why-Human-and-Chimp-whole-genomes-are-99-99-close-pdf and http://fr.scribd.com/doc/186894835/jcperezEvolutionFibonacciPrimatesChromosomes4UK jean-claude perez
Dear sir, if you plan to demonstrate possible evidence of GOD in the human genome, please search HUMAN CHROMOSOME 4! https://plus.google.com/103572438711329205534/posts/26WczM9aQbS jean-claude perez
scordova #8

38 Perl lines:

# scordova.pl
# search for 3mer
use strict;
use warnings;
my $p1 = "G";
my $p2 = "C";
my $p3 = "A";
my $p4 = "T";
my $p1m = "G..G";   # same base 3 positions apart; "." does not match a newline
my $p2m = "C..C";
my $p3m = "A..A";
my $p4m = "T..T";
open(IN,"human\\fa\\chr01.fa") or die "cannot open input: $!"; # from http://genome.ucsc.edu
undef $/;           # slurp mode: read the whole file in one go
my $d = <IN>;
close IN;
my $c1 = my $c2 = my $c3 = my $c4 = my $c1m = my $c2m = my $c3m = my $c4m = 0;
while ($d =~ /$p1/g) {$c1++;}
while ($d =~ /$p2/g) {$c2++;}
while ($d =~ /$p3/g) {$c3++;}
while ($d =~ /$p4/g) {$c4++;}
while ($d =~ /$p1m/g) {$c1m++;}   # /g resumes after each match, so overlapping hits (GxxGxxG) count once
while ($d =~ /$p2m/g) {$c2m++;}
while ($d =~ /$p3m/g) {$c3m++;}
while ($d =~ /$p4m/g) {$c4m++;}
my $r = ($c1m + $c2m) / ($c3m + $c4m);
open(OUT,">scordova.txt") or die "cannot open output: $!";
print OUT "guanine => $c1\n";
print OUT "cytosine => $c2\n";
print OUT "adenine => $c3\n";
print OUT "thymine => $c4\n";
print OUT "guanine 3mer => $c1m\n";
print OUT "cytosine 3mer => $c2m\n";
print OUT "adenine 3mer => $c3m\n";
print OUT "thymine 3mer => $c4m\n";
print OUT "cg_at_3mer_ratio = $r\n";
close OUT;
1;

execution time: ~1 minute niwrad
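One point that may be worth checking when comparing 3mer counts from different programs is whether matches are allowed to overlap. A global regex scan for `G..G` resumes after the end of each match, so a run like GxxGxxG yields one hit, whereas testing every start position yields two. The sketch below shows both conventions in Python; the function names are made up for illustration, and it assumes the sequence has already had its FASTA header and line breaks removed.

```python
import re

def count_nonoverlapping(seq: str, base: str) -> int:
    """Count matches of base..base; each match consumes 4 characters."""
    return len(re.findall(f"{base}..{base}", seq))

def count_overlapping(seq: str, base: str) -> int:
    """Count every start position where base..base matches (zero-width lookahead)."""
    return len(re.findall(f"(?={base}..{base})", seq))
```

Whether this accounts for any particular difference between two sets of counts would still need to be confirmed against the actual programs and input files.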
Using the numbers above for the given densities of the bases:

p(G) = 20.870%
p(C) = 20.874%
p(A) = 29.106%
p(T) = 29.150%

p(3merG) = p(G) * p(G) = 4.36%
p(3merC) = 4.36%
p(3merA) = 8.47%
p(3merT) = 8.50%

Given that G is encountered at the beginning of the string:

E(3merG) = number of G nucleotides * p(G) = 47,016,562 * 20.87% = 9,812,460 G 3mers
Actual number of 3merG = 10,798,024
Deviation from expectation = 985,563

Using the normal approximation for the binomial distribution:

1 std deviation = sqrt(47,016,562 * .2087 * (1 - .2087)) = 2,786
Deviation from expectation in std deviations = 985,563 / 2,786 = 353 sigma

The pattern seems to be confirmed as non-random for guanine. Similarly for adenine, I got 329 standard deviations from expectation for the number of adenine 3mers, given that the 1st base in the reading frame is adenine. scordova
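The arithmetic in this comment can be reproduced with a short Python sketch. The counts are the Chromosome 1 figures from the original post; the function name `three_mer_z` is invented here, and the model is the same simple one used in the comment (each base drawn independently with the observed frequencies, normal approximation to the binomial), which is itself an assumption about the null hypothesis.

```python
from math import sqrt

# Chromosome 1 counts as reported in the post
base_counts = {"G": 47_016_562, "C": 47_024_413, "A": 65_570_891, "T": 65_668_756}
observed_3mers = {"G": 10_798_024, "C": 10_805_795, "A": 20_297_310, "T": 20_355_586}

def three_mer_z(base: str):
    """Return (expected 3mers, standard deviation, z-score) for one base."""
    total = sum(base_counts.values())
    n = base_counts[base]            # number of positions that start with `base`
    p = n / total                    # chance the base 3 positions on is the same
    expected = n * p
    sd = sqrt(n * p * (1 - p))       # binomial standard deviation
    z = (observed_3mers[base] - expected) / sd
    return expected, sd, z

for b in "GCAT":
    expected, sd, z = three_mer_z(b)
    print(f"{b}: expected {expected:,.0f}, sd {sd:,.0f}, z = {z:.0f}")
```

Under these assumptions the sketch should land on roughly the same 353 and 329 sigma quoted for guanine and adenine.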
Geneticist Joe Felsenstein weighed in: Sandwalk Comment
There is reason to expect at least a weak signal of 3-base autocorrelation. Genomes have large numbers of tandem repeat families, and many of those are 3-base repeats such as CGACGACGA ... CGA. Also, in exons in coding sequences, differences of base composition among the three codon positions would be expected to create a weak signal of correlation of bases 3 apart.
scordova
niwrad, Thank you for your input. What language did you use to calculate the numbers and what data file? I posted my computer code here. It is hardly a page long. http://creationevolutionuniversity.com/forum/viewtopic.php?f=3&t=91 Sal scordova
Law-like patterns are very rare to find (http://www.sciencedaily.com/releases/2010/05/100527013329.htm, http://dx.doi.org/10.1371/journal.pone.0010613). They normally give further insight into human psychology. WGalbraith
Sal, this is a little off-topic, but here's something you might add to your list to look into. A linguistic model for the rational design of antimicrobial peptides (Nature, 2006):
Our preliminary studies of natural AmPs [anti-microbial peptides] indicated that their amphipathic structure gives rise to a modularity among the different AmP amino-acid sequences. The repeated usage of sequence modules—which might be a relic of evolutionary divergence and radiation—is reminiscent of phrases in a natural language, such as English. For example, the pattern QxEAGxLxKxxK (where ‘x’ is any amino acid) is found in more than 90% of the insect AmPs known as cecropins. On the basis of this observation, we modelled the AmP sequences as a formal language—a set of sentences using words from a fixed vocabulary. In this case, the vocabulary is the set of naturally occurring amino acids, represented by their one-letter symbols. We conjectured that the ‘language of AmPs’ could be described by a set of regular grammars. Regular grammars are, in essence, simple rules that describe the allowed arrangements of words. These grammars, such as the cecropin pattern mentioned previously, are commonly written as regular expressions and are widely used to describe patterns in nucleotide and amino-acid sequences. To find a set of regular grammars to describe AmPs we used the Teiresias pattern discovery tool [11]. With Teiresias, we derived a set of 684 regular grammars that occur commonly in 526 well-characterized eukaryotic AmP sequences from the Antimicrobial Peptide Database (APD) (see Methods). Together, these ~700 grammars describe the ‘language’ of the AmP sequences. In this linguistic metaphor, the peptide sequences are analogous to sentences and the individual amino acids are analogous to the words in a sentence. Each grammar describes a common arrangement of amino acids, similar to popular phrases in English.
For example, the frog AmP brevinin-1E contains the amino-acid sequence fragment PKIFCKITRK, which matches the grammar P[KAYS][ILN][FGI]C[KPSA][IV][TS][RKC][KR] from our database (the bracketed expression [KAYS] indicates that, at the second position in the grammar, lysine, alanine, tyrosine or serine is equally acceptable). On the basis of this match, we would say that the brevinin-1E fragment is ‘grammatical’. By design, each grammar in this set of ~700 grammars is ten amino-acids long and is specific to AmPs—at least 80% of the matches for each grammar in Swiss-Prot/TrEMBL [13] (the APD is a subset of Swiss-Prot/TrEMBL) are found in peptides annotated as AmPs.
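The two grammars quoted above translate directly into ordinary regular expressions, which makes the "grammatical" test easy to try out. The sketch below uses Python's standard `re` module; the helper name `is_grammatical` and the constant names are invented for illustration, and writing the paper's "x" as the regex wildcard `.` is an interpretation of the quoted notation.

```python
import re

# Grammar quoted for the brevinin-1E fragment (ten residues long)
BREVININ_GRAMMAR = re.compile(r"P[KAYS][ILN][FGI]C[KPSA][IV][TS][RKC][KR]")

# Cecropin pattern QxEAGxLxKxxK, with "x" (any amino acid) written as "."
CECROPIN_PATTERN = re.compile(r"Q.EAG.L.K..K")

def is_grammatical(fragment: str, grammar: re.Pattern) -> bool:
    """True if the whole fragment matches the grammar."""
    return grammar.fullmatch(fragment) is not None

print(is_grammatical("PKIFCKITRK", BREVININ_GRAMMAR))
```

To scan a longer peptide for any occurrence of a grammar, `grammar.search(sequence)` would be used in place of `fullmatch`.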
I haven't read it past the first page. I wonder if Teiresias would be useful to you in finding patterns? I've never used it myself. JoeCoder
no one wanted to publish it.
No kidding, if this is real. Who would want to publish it unless it had some medical or biotech significance, and even then. Thanks for the info! I really appreciate the help. scordova
I read that for 3-base periodicity all 3n-distances count, not only the first one (https://www.sciencedirect.com/science/article/pii/S0014579306012853?np=y, Abstract). For the correct derivation, take a look at http://arxiv.org/pdf/1305.5524 under "Computing the 3-base periodicity of a DNA sequence". I was involved myself in a research project about genetically created patterns. In the end the pattern supported ID and no one wanted to publish it. Luckily, journals like BIO-Complexity exist that can accept such results. Publication may still take some months, but a first print is already available at http://vixra.org/abs/1404.0436. Like Perez's work, the patterns were also non-random, with patterns within the pattern. WGalbraith
Dr. JDD, Thank you so much for weighing in. It is possible I'm totally misunderstanding things. There was a paper which mentioned 3-base periodicity. http://www.ncbi.nlm.nih.gov/pubmed/22100873
Abstract Genomes of almost all organisms have been found to exhibit several periodicities, the most prominent one is the three base periodicity. It is more pronounced in the gene coding regions and has been exploited to identify the segments of a genome that code for a protein. The reason for this three base periodicity in the gene-coding region has been attributed to inhomogeneous nucleotide compositions in the three codon positions. However, this reason cannot explain the three base periodicity present at the level of the whole genome where the codon concept is not applicable. Even though the distribution of each nucleotide is uniform at the positions 0(mod 3), 1(mod 3) and 2(mod 3) when the whole genome data is considered, our analysis reveals that the three base periodicity is arising because of higher correlations among the nucleotides separated by three bases.
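The "correlations among the nucleotides separated by three bases" mentioned in this abstract can be probed with a simple count of how often the same base recurs at a given distance, compared with what base composition alone would predict. The Python sketch below is a generic illustration, not the method of the cited paper; the function names are invented here, and it assumes a plain sequence string in which anything other than A, C, G, T (for example N) is skipped.

```python
from collections import Counter

BASES = "ACGT"

def same_base_fraction(seq: str, lag: int) -> float:
    """Fraction of positions whose base is repeated exactly `lag` positions later."""
    seq = seq.upper()
    same = total = 0
    for a, b in zip(seq, seq[lag:]):
        if a in BASES and b in BASES:   # skip N and other symbols
            total += 1
            same += (a == b)
    return same / total if total else 0.0

def expected_same_fraction(seq: str) -> float:
    """Repeat fraction expected if bases were independent: sum of p(base)^2."""
    counts = Counter(c for c in seq.upper() if c in BASES)
    n = sum(counts.values())
    return sum((k / n) ** 2 for k in counts.values()) if n else 0.0

def lag_profile(seq: str, max_lag: int = 9) -> dict:
    """Repeat fraction for each distance 1..max_lag."""
    return {lag: same_base_fraction(seq, lag) for lag in range(1, max_lag + 1)}
```

Comparing `lag_profile(sequence)` with `expected_same_fraction(sequence)` would show whether distances 3, 6 and 9 stand out from the others, rather than looking at distance 3 in isolation.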
I'm not even sure I'm reading the paper correctly. 3mers were mentioned in DNA Skittle. I will talk to Dr. Sanford in a couple of days, but I wanted to get spooled up on some of the concepts. I posted the computer code to first see if I'm miscalculating (which I did at first, and hopefully the bugs are gone), but it seems there is too much adenine and thymine.

For chromosome 2:
guanine count 47,947,042
cytosine count 47,915,466
adenine count 71,102,632
thymine count 71,239,379
guanine 3mer count = 10,617,392
cytosine 3mer count = 10,595,383
adenine 3mer count = 22,365,385
thymine 3mer count = 22,446,412
cg_at_3mer_ratio = 0.47337479012502

For chromosome 3:
guanine count 38,670,110
cytosine count 38,653,198
adenine count 58,713,343
thymine count 58,760,485
guanine 3mer count = 8,406,956
cytosine 3mer count = 8,415,831
adenine 3mer count = 18,593,998
thymine 3mer count = 18,636,472
cg_at_3mer_ratio = 0.45185534858947524

I agree the excess largely drives the differences in the 3mer pattern. Or I'm thinking I'm doing something wrong. Do I have a bad data file? scordova
scordova

My counts:
guanine => 46956489
cytosine => 46964756
adenine => 65491918
thymine => 65586556
guanine 3mer => 7883937
cytosine 3mer => 7887020
adenine 3mer => 13296189
thymine 3mer => 13325970
cg_at_3mer_ratio = 0.592399624688591

niwrad
Hi, I may be well off here, but the numbers do not surprise me for randomness. But as said, I may be over-simplifying this. If we say the true ratio is not 1:1:1:1 but is ~ 47:47:65.5:65.5 then you get a frequency for G or C of 47/[47+47+65.5+65.5] = ~21%. If you do the same for A or T you get ~29%. Now go back to the original numbers and calculate the likelihood of a 3mer occurring. For G or C let’s say 47 x 0.21 = 9.87 – not far off the 10.8 million you get perhaps? 9% out. For A or T you get 65.5 x 0.29 = 19 – not far off the 20.3 (within 6-7%). So there is slight bias perhaps, i.e. less than completely random but that seems to be largely driven by the differences in starting points of each base. Or am I oversimplifying it? JD Dr JDD
