Request for help verifying non-random 3mer pattern in Human Chromosome 1

scordova
April 28, 2014

Categories: 'Junk DNA'


3-base periodicity is a well-known non-random feature of DNA. That is to say, a base will sometimes be repeated 3 nucleotides away. If all four bases were equally represented and placed at random, this would happen about 25% of the time, but I got something slightly away from random.

3-base periodicity is a well-known pattern that seems to identify exonic regions. For lack of a better word, I use the word “3mer” whenever I encounter the same base 3 nucleotides away. 3mer is a term Dr. Sanford’s DNA Skittle uses, but I have to confirm with him whether that is what he means.

I tried to see how frequently A, T, C, and G repeated every 3 bases. It seems the adenine and thymine 3mers appeared about twice as frequently as the cytosine or guanine 3mers, and this seems partly due to the higher A and T frequency. I think this is a legitimate non-random pattern. Here were my numbers for Human Chromosome 1:

guanine count 47,016,562
cytosine count 47,024,413
adenine count 65,570,891
thymine count 65,668,756
guanine 3mer count = 10,798,024
cytosine 3mer count = 10,805,795
adenine = 3mer count 20,297,310
thymine = 3mer count 20,355,586
cg_at_3mer_ratio = 0.5314214023030487

This was a follow-on to Jean Claude Perez. I didn’t get the golden ratio, but instead I explored a well-acknowledged phenomenon, namely 3-base periodicity. I want to make sure my numbers are correct. Why should this non-random pattern emerge? Is it codon bias or something? Do I have a bug in my code?
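One rough way to judge whether the 3mer counts above exceed chance is to compare each with the count expected if every position were drawn independently at the observed base frequencies. The sketch below is not the Java program used for this post. The function name `expected_repeats` is hypothetical, the independence model is an assumption, and the small edge effect at the ends of the sequence is ignored.

```python
# Base and 3mer counts as reported in the post for human chromosome 1.
base_counts = {
    "G": 47_016_562,
    "C": 47_024_413,
    "A": 65_570_891,
    "T": 65_668_756,
}
observed_3mers = {
    "G": 10_798_024,
    "C": 10_805_795,
    "A": 20_297_310,
    "T": 20_355_586,
}


def expected_repeats(counts, base):
    """Expected same-base matches at a fixed offset under independence.

    With N positions and base frequency p, roughly N * p * p positions
    hold `base` and also see `base` again at the chosen offset.
    """
    total = sum(counts.values())
    p = counts[base] / total
    return total * p * p


if __name__ == "__main__":
    for base in "GCAT":
        exp = expected_repeats(base_counts, base)
        obs = observed_3mers[base]
        print(f"{base}: observed={obs:,} expected={exp:,.0f} ratio={obs / exp:.3f}")
```

With these inputs the observed counts come out modestly above the independence expectation for all four bases. That shows only that the sequence is not independent at offset 3; it does not by itself say why.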

I provided the Java code that I used here:

http://creationevolutionuniversity.com/forum/viewtopic.php?f=3&t=91.

I got the Chromosome 1 fasta file from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

Any insights and corrections are especially welcome. Thanks in advance.

Comments

Dear Pietr, Joe, and Gordon, I regret your inappropriate comments on SANDWALK about the Sal Cordova entry entitled Vodka! Jean Claude Perez, the golden ratio, dragon curve fractals and musical design in “junk DNA”... The reason is that all (ALL) of those comments were made without reading the original article. I suggest you read the original peer-reviewed article of 2010 published in Interdisciplinary Science: http://fr.scribd.com/doc/95641538/Codon-Populations-in-Single-stranded-Whole-Human-Genome-DNA-Are-Fractal-and-Fine-tuned-by-the-Golden-Ratio-1-618 and my 2013 peer-reviewed article: http://www.scirp.org/journal/PaperInformation.aspx?paperID=37457#.U2Mwlfl_trA
jean-claude perez, May 1, 2014, 10:54 PM PDT

Gordon Davisson, I modified the script according to your specs:

guanine => 46956489
cytosine => 46964756
adenine => 65491918
thymine => 65586556
guanine 3mer => 11002184
cytosine 3mer => 11010648
adenine 3mer => 20626672
thymine 3mer => 20683917
cg_at_3mer_ratio = 0.53286173189155

The remaining differences between our counts could be due to the different data file. I don't have time to download yours right now.
niwrad, April 30, 2014, 09:01 AM PDT

niwrad, I tried running your Perl script on the chromosome 1 data file I've been using, and (after some data tweaking) I get base counts consistent with mine and Sal's; based on that, I'm pretty sure you're using a different data file. You can get the one we used from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/, or from http://genome.ucsc.edu follow the Downloads link in the sidebar, then under "VERTEBRATES - Complete annotation sets" click Human, then under "Feb. 2009 (hg19, GRCh37)" click "Data set by chromosome", then download "chr1.fa.gz". You'll need to either remove the first line (the description) of the data file and uppercase the rest, or modify the program to skip descriptions and use case-insensitive matching. Once that's done, the individual base counts match mine.

But the repeat counts are still off, and there are several problems causing this:

First, this program has the same line-ending bug Sal's original had -- it treats the line end characters as part of the data stream, and therefore misses repeats across lines (but counts some period-2 repeats that cross lines instead).

Second, it still doesn't count multiple repeats right. It does ok through 3 repeats, but it undercounts 4 (AxxAxxAxxAxxA should count as 4 repeats, but it only gets 3) as well as 6 and above.

Third, it also fails to count interleaved repeats; for example, AAxAA and AxAAxA should both count as 2 repeats. And interleaved repeats, and...

Basically, regular expressions are the wrong tool for the sort of data analysis we're trying to do here.
Gordon Davisson, April 29, 2014, 11:05 PM PDT
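The three problems listed above (line endings, chained repeats, interleaved repeats) all go away if every position is compared directly with the position a fixed distance earlier. The following is a minimal sketch of that idea, not the updated code posted at Creation Evolution University. The function name `count_period_repeats` is hypothetical, and it assumes that characters other than A, C, G, T (such as N) still occupy a position but are never counted.

```python
from collections import deque


def count_period_repeats(lines, period=3):
    """Count positions where a base equals the base `period` positions earlier.

    `lines` is any iterable of FASTA text lines. Description lines (starting
    with '>') are skipped, line endings are dropped, and case is folded, so
    a repeat that spans a line break is still seen.
    """
    base_counts = dict.fromkeys("ACGT", 0)
    repeat_counts = dict.fromkeys("ACGT", 0)
    window = deque(maxlen=period)  # the last `period` characters, oldest first
    for line in lines:
        if line.startswith(">"):
            window.clear()  # new record: do not match across records
            continue
        for ch in line.strip().upper():
            if ch in base_counts:
                base_counts[ch] += 1
                if len(window) == period and window[0] == ch:
                    repeat_counts[ch] += 1
            window.append(ch)  # the deque discards the oldest character itself
    return base_counts, repeat_counts


if __name__ == "__main__":
    # Usage on the file discussed in the thread (path is an assumption):
    # with open("chr1.fa") as handle:
    #     print(count_period_repeats(handle))
    print(count_period_repeats(["AxxAxxAxxAxxA"]))
```

Because each position is tested once against exactly one earlier position, a chain such as AxxAxxAxxAxxA contributes one count per adjacent pair, and interleaved cases such as AAxAA need no special handling.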

You’re going to want them to support ID and/or creation, right? That’s really your motivation for looking into this, isn’t it?
Actually not so much this time. Some in the ID community felt it would be a good project to explore just to get me practiced in doing basic research and learn the art of proposing a hypothesis etc. An idea popped up as to whether the 3mers had any functional significance. The 3mer signal now appears so weak to me personally that I wonder if it is of any functional use to the cell. As of today, it's looking like a mostly null result, and if I were to attempt to publish anything, it would actually be rather critical of the prior literature that claimed to have found something of genuine significance.

When I saw the use of Fourier transforms, my eyes rolled. I thought, "Why?" That's senseless; straightforward code analysis is more accurate. I was also trained in electrical engineering, and the Frequency/Fourier domain was the order of the day. But from my vantage point, using Fourier analysis is totally inappropriate to study 3-base periodicity. Fourier analysis is good when you have sinusoids of varying amplitudes superimposed on each other. It works well when you have amplitudes in a wide range over time. And then you can actually decompose the composite waveform into sinusoidal components with amplitudes and phases. But in the case of genomes, at best you have one possible amplitude (either on or off for a given nucleotide as you read left to right in artificial "time"). So what was the point of using Fourier transforms? None that I see, except maybe to make a not-so-interesting result (perhaps substance free) look a little more profound. Maybe the researchers of 3mers over the last 30 years just needed to get something published to pad their CVs. That sounds cynical, but I can't say the suspicion hasn't crossed my mind. Simple Java code can detect periodic patterns like: A x x A x x A.... No need for a Fourier transform; that just muddles and confuses the analysis, imho.

It's better to approach the problem like someone trying to detect a grammar rather than some sort of "audio" signal. 3mers are a non-random pattern, but "stuck keys" on a computer create non-random patterns as well. What I was hoping to find were grammatical constructs that might help identify some of the structure of proteins rather than just their sequences. But in the process I realized I have way too little basic knowledge of chemistry and molecular biology. WAAAAY too little to even venture an educated guess. I do know a protein engineer who was able to identify catalytic regions of a protein just by sequence comparisons between species. That helped her in her work tremendously. That is one case I know of where study of the pure DNA sequences actually informed someone about protein structure. There was a chance 3mer study might have led to something in ID, but now it does not look very promising. Maybe someone else can take off with 3mers, but as of today, I might only complete research on the topic just to publish a mostly negative null result. It's a non-random pattern, but not much of one, imho.
scordova, April 29, 2014, 11:02 PM PDT
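For readers unfamiliar with the Fourier approach mentioned above, the usual construction turns each base into a 0/1 indicator sequence and measures the spectral power at frequency 1/3. The sketch below only illustrates that construction and takes no position on which method is preferable. The function name `period3_power` is hypothetical and is not taken from any of the cited literature.

```python
import cmath


def period3_power(seq, base):
    """Power of the 0/1 indicator sequence of `base` at frequency 1/3.

    Each occurrence of `base` at position n contributes exp(-2*pi*i*n/3);
    the power is the squared magnitude of the sum of those contributions.
    """
    total = sum(
        cmath.exp(-2j * cmath.pi * n / 3)
        for n, ch in enumerate(seq.upper())
        if ch == base
    )
    return abs(total) ** 2


if __name__ == "__main__":
    print(period3_power("ACGACGACG", "A"))  # A always in the same phase
    print(period3_power("AAAAAA", "A"))     # A spread evenly over all phases
```

The value depends only on how many occurrences of the base fall at positions 0, 1, and 2 modulo 3, so it can also be obtained from three plain counters without any transform.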

Gordon Davisson @41: Thank you for the well-thought-out and detailed comments. While I obviously disagree with you about ID generally, I agree with many of your thoughts and cautions. Like you, I'm also rather skeptical that 3-peat patterns will have much substantive meaning, but I'm willing to go exploring with Sal. :) Thank you also for the updated code you posted on the other site. I've used your version this afternoon to run through a couple of chromosomes.
Eric Anderson, April 29, 2014, 10:33 PM PDT

Sal @33: On second thought, you must be excluding the long repeating sequences, because (I believe) they are set to 'N' in the FASTA files?
Eric Anderson, April 29, 2014, 10:24 PM PDT

I found a problem with the way Threemer1 reads the file: it treats everything in the file as DNA bases, including the description line and, more importantly, the line break characters. This means that it treats ">chr1" as containing a cytosine base, looks for matches between the newline character and the bases in columns 3 and 47 (3 from the end of line), and counts period-2 repeats broken between lines. I also added code to calculate the GC ratios (not just the repeat ratios), and to handle being passed the filename as a command-line parameter. Updated code is at Creation Evolution University.

Here are the results I get for the first two chromosomes. Note that the cytosine counts have decreased by one (because of the description line) and the 3mer counts have increased a little (because of better matching across line breaks):

input file = chr1.fa
guanine count 47,016,562
cytosine count 47,024,412
adenine count 65,570,891
thymine count 65,668,756
gc/at ratio = 0.716559
gc content = 41.743927%
guanine 3mer count = 11,015,663
cytosine 3mer count = 11,024,227
adenine = 3mer count 20,650,395
thymine = 3mer count 20,709,117
cg_at_3mer_ratio = 0.5328856394630574

input file = chr2.fa
guanine count 47,947,042
cytosine count 47,915,465
adenine count 71,102,632
thymine count 71,239,379
gc/at ratio = 0.673466
gc content = 40.243782%
guanine 3mer count = 10,831,713
cytosine 3mer count = 10,808,637
adenine = 3mer count 22,751,514
thymine = 3mer count 22,830,306
cg_at_3mer_ratio = 0.47475835760836227

But I still have a few gripes with this. I'll start with a superficial one: I don't think 3mer is an appropriate name. In chemistry, the -mer suffix refers to chemicals made of repeating subunits: a monomer is one subunit, a dimer is two, a trimer is three, a polymer is many,... When talking about DNA, I'd expect trimer, threemer, or 3mer to refer to three bases (I think three base pairs would technically be considered a hexamer).
The best alternative term I can think of is "period-3 repeat", which isn't nearly as catchy. Sorry... Now for a more important gripe, let me back off and look at what we're actually seeing in the statistics here. The repeat counts are significantly higher than would be expected based on the individual base content, as you (Sal) calculated; but that doesn't actually mean there's anything special happening with a period of 3 bases. Suppose we had a chromosome that started with a region consisting of just A and T (in random order), followed by a region of just C and G. Within these regions, roughly half of the bases will be followed immediately by another of the same base, and half will be followed by a match 2 away, and half will be followed by a match 3 away, and... In other words, we'd see a hugely elevated repeat count at any period, not just 3. Real chromosomes are not so neatly segregated, of course; but they do have regions with elevated AT content, and other regions with elevated GC content, which'll produce a milder form of the same effect. So the results we'll get will show a mix of repeat elevation due to AT and GC segregation, and due to period-3 repeats -- all mixed together, with no way to tell how much of which we're seeing. I think in order to fix this, you need to count period-1 (adjacent) repeats, period-2, period-3, etc and look for spikes at periods 3, 6, 9, etc. If you want to get really fancy, you could experiment with Fourier analysis, but that's probably overkill. But let me back off even further, and ask what you're really looking for. Patterns in the genome that support ID, right? But what kinds of pattern support ID? You shouldn't assume that all patterns support ID, because evolution is a complex process that involves both random and non-random processes, and complex processes like that are expected to produce patterns. Distinguishing patterns that support ID from those that support evolution is ... nontrivial. 
Let me give a couple of examples.

Example pattern 1: codon usage bias

You mentioned these biases in an earlier posting here. The short summary is that there are generally several codons that code for each amino acid (i.e. they're synonymous, like "gray" and "grey"). But some codons can be translated into amino acids more efficiently, and in some organisms the more efficient synonyms are more common. You quoted Wikipedia on the subject:
It is generally acknowledged that codon preferences reflect a balance between mutational biases and natural selection for translational optimization. Optimal codons in fast-growing microorganisms, like Escherichia coli or Saccharomyces cerevisiae (baker’s yeast), reflect the composition of their respective genomic tRNA pool. It is thought that optimal codons help to achieve faster translation rates and high accuracy. As a result of these factors, translational selection is expected to be stronger in highly expressed genes, as is indeed the case for the above-mentioned organisms. In other organisms that do not show high growing rates or that present small genomes, codon usage optimization is normally absent, and codon preferences are determined by the characteristic mutational biases seen in that particular genome. Examples of this are Homo sapiens (human) and Helicobacter pylori. Organisms that show an intermediate level of codon usage optimization include Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode worm), Strongylocentrotus purpuratus (sea urchin) or Arabidopsis thaliana (thale cress).
There are actually three layers of pattern here: some synonyms are used more than others, but only in some organisms, and more so in highly expressed genes. But these layers of pattern fit what we'd expect from evolution! The codon usage bias is strongest where selection would favor it the most, and weak or missing where selection would favor it less. Humans, for example, do not favor the more efficient synonyms, which is exactly what you'd expect from a lineage that hasn't "paid the cost of maintenance" in quite a long time (on the order of 100 million years if I'm doing the math right). Isn't this exactly what you'd expect from "genetic entropy"?

Now, I should clarify that this pattern isn't a problem for ID scenarios that also involve long periods of evolution (e.g. front-loading, guided evolution, etc), but as far as I can see it's a big problem for anything like a recent creation scenario. As you mentioned earlier, intelligent agents can be capricious, so it's certainly possible a creator could have created things in a way that happens to mimic the effects of "genetic entropy"... But in science, a theory that explains why things are the way they are beats a theory that explains why things could be the way they are. Unless you have a predictive theory of creation that explains the existence of this pattern, this pattern winds up supporting evolution.

Example pattern 2: isochores (/compositional domains)

These are the regions of different CG content I mentioned above (and you described in "Too much of something can be a good thing for ID"). I'm not familiar enough with the subject to give a detailed explanation, but as I understand it they're a somewhat messy subject (i.e. it's clear that variations exist, but not clear how to describe and categorize them), and their evolution is not well understood (maybe due to biased gene conversion?).
If you argue that "not well understood in evolutionary terms" means that this pattern supports ID, you're basically making an argument from ignorance (or ID-of-the-gaps). This is a really really weak form of argument. If a solid evolutionary explanation for GC content variations develops (and is tested and supported), then this pattern joins codon usage bias in supporting evolution. If anything, this then becomes evidence against ID (or at least creation). If you develop (and test and support) an ID- or creation-based explanation for these patterns (and no good evolutionary explanation develops), then this becomes clear evidence for ID and/or creationism. But as with codon usage bias, this is going to be very difficult -- you need something like a predictive model of the ID/creation process, and intelligent agents can be capricious. If you can make a solid case that evolution cannot explain this pattern, then... well, you'd have evidence that evolution is incorrect, or at least incomplete. But that doesn't necessarily equate to support for ID, just for something other than (or in addition to) evolution. Note that in the meantime, if/while we're in this we-don't-understand-this-well state, evolutionary theory provides a framework for developing and testing hypotheses about where this pattern came from. How can you do this in the ID/creation framework?

Example pattern 3: a new pattern you just found...

I'm not trying to dampen your curiosity here, but I'm worried that you're setting a trap for yourself. Suppose you indeed find patterns in genomes -- period-3 repeats, fractal somethingorothers, whatever. You're going to want them to support ID and/or creation, right? That's really your motivation for looking into this, isn't it? If so, that means that every time you spot a pattern (or possible pattern), you're going to be biased toward both thinking the pattern is real (not just an artifact of the analysis, statistical quirk, etc), and thinking it's evidence for ID.
And I'm pretty sure you'll (at least mostly) be disappointed. The various genome data have been looked over in many ways by a great many people (many of whom know far far more about biochemistry etc than either of us do), which means all of the easy-to-spot patterns have been spotted. Your probability of spotting something new (and real) is pretty small. If you do find something new, you're going to try to interpret it as supporting ID and/or creation. But it may not (see codon usage bias), and even if it does, showing that it does will be very hard (see isochores). You run the risk of spending a lot of time finding nothing, then finding something interesting (it has been said that the most exciting thing to hear in science is not "Eureka!", but "Huh, that's interesting."), investigating it, getting emotionally invested in it... and then having some spoilsport evolutionist come along and say "oh, yeah, that's a side effect of this other thing we've known about forever and understand completely."

So what should you do? If you do continue looking for patterns in the genome, please do it without the expectation that you're going to find something important. But what I'd really love to see you do is to develop a predictive model of ID and/or creation, make predictions from that of what specific patterns you'd expect to see, and then go looking for those specific patterns. This will, of course, be very hard (capriciousness, etc); but if you really want to support ID and/or creation, I think this has far more potential than casting around for patterns and hoping they can be shown to support ID. Mind you, I don't expect this approach to work, because I don't think ID is actually correct; but if I'm wrong, this approach is far more likely to produce something that'll convince me to change my mind.
Gordon Davisson, April 29, 2014, 08:32 PM PDT
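The thought experiment in the comment above (an A/T-only region followed by a C/G-only region) and the suggested remedy of scanning several periods can be tried on a toy sequence. This is a sketch under stated assumptions: the function names, the region length, and the random seed are all hypothetical choices for illustration, and a real chromosome is far less cleanly segregated.

```python
import random


def repeat_fraction_by_period(seq, max_period=9):
    """For each period k, the fraction of positions i with seq[i] == seq[i + k]."""
    fractions = {}
    for k in range(1, max_period + 1):
        pairs = len(seq) - k
        matches = sum(1 for i in range(pairs) if seq[i] == seq[i + k])
        fractions[k] = matches / pairs
    return fractions


def segregated_sequence(region_len=5000, seed=0):
    """Toy chromosome: a random A/T region followed by a random C/G region."""
    rng = random.Random(seed)
    at_region = "".join(rng.choice("AT") for _ in range(region_len))
    cg_region = "".join(rng.choice("CG") for _ in range(region_len))
    return at_region + cg_region


if __name__ == "__main__":
    for k, frac in repeat_fraction_by_period(segregated_sequence()).items():
        print(f"period {k}: {frac:.3f}")
```

A uniformly random four-letter sequence would give about 0.25 at every period, whereas the segregated toy gives about 0.5 at every period. An elevated count at period 3 alone therefore says little; a genuine period-3 signal should show up as a spike at 3, 6, 9 relative to the neighboring periods.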

Thank you, niwrad, my friend!
scordova, April 29, 2014, 06:05 PM PDT

    # scordova.pl v1.1
    # search for 3mer
    use strict;
    use warnings;

    # single bases
    my $p1 = "G"; my $p2 = "C"; my $p3 = "A"; my $p4 = "T";
    # same base three positions apart
    my $p1m = "G..G"; my $p2m = "C..C"; my $p3m = "A..A"; my $p4m = "T..T";
    # two chained period-3 repeats
    my $p1mc = "G..G..G"; my $p2mc = "C..C..C"; my $p3mc = "A..A..A"; my $p4mc = "T..T..T";

    open(IN,"human\\fa\\chr01.fa"); # from http://genome.ucsc.edu
    undef $/;      # slurp mode: read the whole file in one go
    my $d = <IN>;
    close IN;

    my $c1 = my $c2 = my $c3 = my $c4 = my $c1m = my $c2m = my $c3m = my $c4m = 0;
    my $c1mc = my $c2mc = my $c3mc = my $c4mc = 0;

    # base counts
    while ($d =~ /$p1/g) {$c1++;}
    while ($d =~ /$p2/g) {$c2++;}
    while ($d =~ /$p3/g) {$c3++;}
    while ($d =~ /$p4/g) {$c4++;}
    # period-3 repeats (each scan resumes after the previous match)
    while ($d =~ /$p1m/g) {$c1m++;}
    while ($d =~ /$p2m/g) {$c2m++;}
    while ($d =~ /$p3m/g) {$c3m++;}
    while ($d =~ /$p4m/g) {$c4m++;}
    # chained repeats, added on so that AxxAxxA counts as 2
    while ($d =~ /$p1mc/g) {$c1mc++;}
    while ($d =~ /$p2mc/g) {$c2mc++;}
    while ($d =~ /$p3mc/g) {$c3mc++;}
    while ($d =~ /$p4mc/g) {$c4mc++;}
    $c1m+=$c1mc; $c2m+=$c2mc; $c3m+=$c3mc; $c4m+=$c4mc;

    my $r = ($c1m + $c2m) / ($c3m + $c4m);

    open(OUT,">scordova.txt");
    print OUT "guanine => $c1\n";
    print OUT "cytosine => $c2\n";
    print OUT "adenine => $c3\n";
    print OUT "thymine => $c4\n";
    print OUT "guanine 3mer => $c1m\n";
    print OUT "cytosine 3mer => $c2m\n";
    print OUT "adenine 3mer => $c3m\n";
    print OUT "thymine 3mer => $c4m\n";
    print OUT "cg_at_3mer_ratio = $r\n";
    close OUT;
    1;
niwrad, April 29, 2014, 02:20 PM PDT

niwrad, I'm trying to learn Perl partly because Jeff Tomkins uses it a lot, and so does WD400. Can you post your code for the alternate way of counting to help me understand Perl a little better? :-) Thank you again for doing this. This is a conversation that needs to happen, and I will have to discuss this with various parties in person. These topics are a little beyond my present understanding. Thank you again for your time and assistance.
scordova, April 29, 2014, 02:13 PM PDT

scordova, yes. If they must count 2, then my report becomes:

guanine => 46956489
cytosine => 46964756
adenine => 65491918
thymine => 65586556
guanine 3mer => 9933673
cytosine 3mer => 9940925
adenine 3mer => 17470366
thymine 3mer => 17515230
cg_at_3mer_ratio = 0.568079446181223
niwrad, April 29, 2014, 01:35 PM PDT

Pattern AxxAxxA counts 2 for you?
Yes. Does your Perl script count differently? I don't know how to read Perl. I also don't know how other authors do their counting. Thank you again for doing this. I have to figure out how the authors claiming 3mers do their counting.
scordova, April 29, 2014, 01:17 PM PDT

scordova #33: Pattern AxxAxxA counts 2 for you? To be precise, it should count 1 (they aren't two independent AxxA patterns, because they share the central A).
niwrad, April 29, 2014, 01:01 PM PDT
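The two counting conventions discussed above (AxxAxxA as 1 or as 2) correspond to two ways of scanning with a regular expression. The sketch below uses Python's `re` module to show the difference; the function names are hypothetical, and this is neither the Perl script posted in the thread nor the Java program.

```python
import re


def count_non_overlapping(seq, base):
    """Scan that resumes after each match, so a shared base is used only once."""
    return len(re.findall(f"{base}..{base}", seq))


def count_overlapping(seq, base):
    """Lookahead scan that tests every start position, so matches may share a base."""
    return len(re.findall(f"(?={base}..{base})", seq))


if __name__ == "__main__":
    for seq in ("AxxAxxA", "AxxAxxAxxAxxA"):
        print(seq, count_non_overlapping(seq, "A"), count_overlapping(seq, "A"))
```

Which convention is appropriate depends on the question being asked, but the two give different totals on the same data, so counts should only be compared between programs that use the same one.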

For your information, I demonstrate that ALL DNA code is located in "G" bases from the double strand, then "CG" bases from the single strand. T and A bases play the role of spatial shifts, like silences in music... Please see details in my BEIJING conference here: http://fr.scribd.com/doc/57828.....jing032011
jean-claude perez, April 29, 2014, 01:00 PM PDT

Does your count exclude long repeating sequences, or does it include them?
It includes them. Of concern is that niwrad and I are getting different numbers. I will try to see why; I think it could be the different FASTA files. His code is above, my code is here (it's only 1 page): http://creationevolutionuniversity.com/forum/viewtopic.php?f=3&t=91
scordova, April 29, 2014, 12:20 PM PDT

Eric, I made a somewhat humorous take on the situation here: Too much of something can be a good thing for ID. It is based on the more mundane issue here: No Universal Transition-Transversion Mutation Bias. Whether one accepts mutation bias or not, it leads to huge contradictions at every level! What do I mean? Say you saw a pattern of numbered coins: HHHHTTTTHHHHTTTT..... It can't be explained by unfairness or bias in the coins. The same problem arises in biology because of creatures with different CG ratios and even differing CG ratios in certain DNA regions. I discuss this in detail in: Too much of something can be a good thing for ID
scordova, April 29, 2014, 12:17 PM PDT

Sal: Following up on my comment #30. I could well be doing something wrong, but based on the genetic code, it seems to me that the ratio of nucleotides showing up (calculated on an assumed equal ratio of amino acids occurring) is approximately, listed by TCAG: 77:63:85:74. Thus, T and A do have a somewhat higher occurrence naturally (although T is only about 3 points higher than G), with C being the lowest. Note, I'm not talking about 3mers yet, just the ratio of nucleotides showing up in a stretch of DNA. Of course, if there is any chemical or mutational tendency toward any particular nucleotide, then the ratios could tend toward a different number, without impacting the amino acid production. Not to mention a possible greater tendency for particular amino acids to show up in proteins, which would certainly change the numbers.

-----

Anyway, I don't know if this has anything to do with what you are exploring, but it seems it might impact the numbers.
Eric Anderson, April 29, 2014, 10:43 AM PDT
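The kind of calculation described above can be sketched from the standard genetic code. The weighting used here is an assumption, since the comment does not spell out its method: each of the 20 amino acids is taken as equally likely, each amino acid uses its synonymous codons equally, and stop codons are ignored. The names `CODON_TABLE` and `nucleotide_fractions` are hypothetical, and the output need not reproduce the quoted 77:63:85:74 figures exactly.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nucleotide_fractions(table=CODON_TABLE):
    """Base composition under equal amino acid and equal synonymous codon usage."""
    by_amino_acid = {}
    for codon, aa in table.items():
        if aa != "*":  # skip the three stop codons
            by_amino_acid.setdefault(aa, []).append(codon)
    totals = dict.fromkeys(BASES, 0.0)
    for codons in by_amino_acid.values():
        # Each amino acid gets equal weight, split evenly over its codons
        # and over the three positions of each codon.
        weight = 1 / (len(by_amino_acid) * len(codons) * 3)
        for codon in codons:
            for base in codon:
                totals[base] += weight
    return totals


if __name__ == "__main__":
    for base, frac in nucleotide_fractions().items():
        print(f"{base}: {frac:.4f}")
```

Under this particular weighting A comes out highest and C lowest, which agrees with the ordering described in the comment. Weighting by real amino acid frequencies or by actual codon usage would shift the numbers.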

Sal, Interesting issue. Does your count exclude long repeating sequences, or does it include them?
I posted the computer code to first see if I’m miscalculating (which I did at first, and hopefully the bugs are gone), but it seems there is too much adenine and thymine.
Yes, it looks like too much adenine and thymine, at least for a random distribution. But there are a couple of possible things that could account for higher A and T, things that don't necessarily indicate a meaningful/intentional reason. Incidentally, rather than a 25%-25%-25%-25% ratio, it seems to me that even if we assume an equal 1/20 likelihood of each particular amino acid, the nucleotide ratio will be slightly off. This is because some amino acids require a specific sequence, while others can be made in several ways.*

-----

* Incidentally, this flexibility in the code, among other things, allows for greater possibility of creating overlapping strings, backward strings and so on. Lots of interesting stuff there to explore.
Eric Anderson, April 29, 2014, 10:19 AM PDT

jerry, here there is no consideration of parents, because we analyse the REFERENCE HUMAN GENOME, which is a mix of individual genomes: http://en.wikipedia.org/wiki/Reference_genome
jean-claude perez, April 29, 2014, 10:03 AM PDT

I just want to clarify the bases being used in these analyses. In any chromosome there are two sets of bases. If one considers both chromosomes there are four sets of bases, two from the mother and two from the father. So is a single strand whole genome just one side of the DNA for just one of the parents? That would be one of the four strands of DNA. Actually it would be 23 of the 92 strands of DNA, since there are four strands for each chromosome.
jerry, April 29, 2014, 08:44 AM PDT

And we consider data from the HUMAN GENOME REFERENCE, http://genome.ucsc.edu/. We think that analysing PERSONAL GENOMES would strengthen the results.
jean-claude perez, April 29, 2014, 08:07 AM PDT

JERRY,
- for the codon population analysis papers, the single-stranded whole genome is considered
- for the others (equation of life and master code), we consider the double-stranded whole-genome DNA
jean-claude perez, April 29, 2014, 08:00 AM PDT

http://fr.scribd.com/doc/57828784/jcperezBeijing032011
jean-claude perez, April 29, 2014, 07:52 AM PDT

Are the numbers in the OP for all bases in the double helix of both the mother's and father's chromosomes? If they are for both strands of the DNA, then I would expect the G and C to be roughly the same, and likewise the T and the A bases. If all the bases are being considered, then one would expect a skew. The human genome is about 98% non-protein-coding. Within this 98% the majority of the elements are repetitive. So if one repetitive element is more represented than another, one would expect a skew of the bases. I am just trying to understand what the numbers are.
jerry, April 29, 2014, 07:51 AM PDT

SCORDOVA, my MASTER CODE analysis of the whole human genome matches chromosome banding perfectly! See my BEIJING conference on the web: http://fr.scribd.com/doc/57828.....jing032011
jean-claude perez, April 29, 2014, 07:48 AM PDT

The work reported in chapter 3 of CODEX BIOGENESIS is for GENES or gene-rich genomes like HIV. For whole chromosomes there are 2 remarks:
1/ my published results on codon populations (independently NNN bases) show meta-structure at the whole level of chromosomes and genomes
2/ at the DNA level, with the EQUATION OF LIFE and MASTER CODE, I show DNA not only as genetic code but more: as an INFORMATIONAL FIELD unifying and overlapping the 3 genetic languages, the DNA, RNA and AMINO ACID languages...
See my BEIJING conference on the web: http://fr.scribd.com/doc/57828784/jcperezBeijing032011
jean-claude perez, April 29, 2014, 07:25 AM PDT

Before I forget to mention, guanine-rich regions are most certainly way above random in their distribution. It's so blatantly obvious that we have what is known as chromosome banding. That's actually a very strong non-random pattern you can see with your own eyes. http://en.wikipedia.org/wiki/File:NHGRI_human_male_karyotype.png
scordova, April 29, 2014, 07:18 AM PDT

Meanwhile, in whole chromosomes of the human genome, the periodicities are very different. I have done a large body of work on this topic, to be published... https://plus.google.com/103572438711329205534/posts/A6xo6vK1F8R
jean-claude perez, April 29, 2014, 07:06 AM PDT

I searched the literature more, and where the 3-periodicity is significantly above random is for patterns involving 10 bases or more. Let X be any base: G X X G X X G X X G is significantly above random. I ran it with Java and a pattern is there. The problem, as Joe Felsenstein pointed out, is that we might be getting a "stuck keys" effect where lots of tandem repeats cluster. If 3-periodicity is more pronounced than, say, 4, 5, 6, or 7 periodicity, the statistics should bear this out. Fourier transforms have been used to detect the various periodicities since the 1980s. If the 3-periodicity is substantially stronger than the others, the question is why?

My preliminary run with 10 bases:

guanine count 47,016,562
cytosine count 47,024,413
adenine count 65,570,891
thymine count 65,668,756
guanine 3mer count = 979,644
cytosine 3mer count = 984,582
adenine = 3mer count 2,643,554
thymine = 3mer count 2,657,609
cg_at_3mer_ratio = 0.37052737295570803

Expectation based on weighted A,T,C,G densities:

guanine 3mer count = 427,379
cytosine 3mer count = 427,683
adenine = 3mer count 1,616,861
thymine = 3mer count 1,626,535

What will be difficult is filtering out the "stuck keys" pattern of tandem repeats.
scordova, April 29, 2014, 07:04 AM PDT
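The expectation figures in the comment above appear consistent with a simple independence model, in which a pattern that fixes the same base at four positions (G X X G X X G X X G) is expected about N * p^4 times. That the Java run used exactly this formula is an assumption; the sketch below, with hypothetical function names, shows the calculation and a direct counter for the pattern.

```python
def expected_pattern_count(counts, base, occurrences=4):
    """Expected hits for a pattern fixing `base` at `occurrences` positions,
    assuming independent positions at the observed base frequencies."""
    total = sum(counts.values())
    p = counts[base] / total
    return total * p ** occurrences


def count_pattern(seq, base, occurrences=4, period=3):
    """Count start positions where `base` sits at every `period`-th offset."""
    span = period * (occurrences - 1)
    return sum(
        1
        for i in range(len(seq) - span)
        if all(seq[i + j] == base for j in range(0, span + 1, period))
    )


if __name__ == "__main__":
    chr1_counts = {
        "G": 47_016_562,
        "C": 47_024_413,
        "A": 65_570_891,
        "T": 65_668_756,
    }
    for base in "GCAT":
        print(base, round(expected_pattern_count(chr1_counts, base)))
```

With the chromosome 1 base counts this gives values close to the expectation figures quoted in the comment. To separate a period-3 effect from the "stuck keys" effect, the same counter can be run with `period` set to 2, 4, 5 and so on for comparison.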

SCORDOVA welcome also...
jean-claude perez, April 29, 2014, 07:02 AM PDT
