Uncommon Descent Serving The Intelligent Design Community

The alignment nightmare (part 1)


Sleep of Reason

Alignment is probably the most difficult and least understood component of a phylogenetic analysis from sequence data.

— David L. Swofford and Gary J. Olsen, chapter on Phylogeny Reconstruction, in Molecular Systematics (Sinauer, 1990, eds. D.M. Hillis and C. Moritz), p. 417.

Twenty years ago, as a 2nd-year graduate student, I attended the first Molecular Evolution Workshop at the Marine Biological Laboratory-Woods Hole. (There’s a comical Expelled-type story from my two weeks there, involving the workshop director Mitchell Sogin, which I might tell here some time. I did design the workshop t-shirt, however, which most of the participants bought.) The overwhelming lesson I brought home from the workshop, aside from the pricey beauty of that part of Cape Cod, was the utterly seat-of-the-pants, lost-in-space nature of molecular sequence alignment methods.

Thus, a couple of years later, when I heard Scott Lanyon (at the time, on the staff at the Field Museum in Chicago) state, “The first thing I ask myself, when I see a published molecular phylogeny, is, ‘How much of this should I believe?’” I said to myself, man do I know what you’re talking about, and let’s start with the alignments. An alignment is the first analytical step, after raw molecular data (typically, DNA sequences converted to amino acid sequences) are obtained.

Those who live and work with molecular systematics don’t need any introduction to the alignment nightmare. They already know about it, and have known for a long time; see the cautionary note from Swofford and Olsen, above. But now, after the accumulation of literally tens of thousands of molecular phylogenies — a vast literature groaning on library shelves — a larger audience is waking up to the fact that something may be very amiss in the house of molecular evolution.

This is a complicated story, so I’m splitting it into two parts. To get a flavor for what’s at stake, consider the advice that Swofford and Olsen offered in the first edition of the widely-used textbook Molecular Systematics:

When regions of the sequences are so divergent that a reasonable alignment cannot be attained by manual methods using a sequence editor (“by eye”), those regions should probably be eliminated from the analysis. Of course, selectively eliminating regions opens the door to the criticism that the researcher is arbitrarily “throwing away data.” We would counter this argument with the rebuttal that the researcher has already discarded data, in a sense, by choosing to sequence one molecule rather than another because the latter did not evolve at an appropriate rate for the problem of interest. (p. 417)

So how does one know which portions of the sequence to omit? Well, start by aligning the homologous sites, and go from there. But which are the homologous sites? Weren’t the sequences themselves supposed to provide us with that signal?

Actually, the sequences don’t tell one that (for reasons I’ll explore in part 2): for most investigators, an alignment software package such as ClustalW or MUSCLE does the job. But never turn your thinking over to software.

That should be a bumper sticker; and now, news of the nightmare is getting around. The latest issue of Science (25 January 2008) carries a report by Wong et al., and an accompanying Perspectives article by Antonis Rokas, about the problem:

To assess whether the choice of alignment method affects evolutionary analyses, [Wong et al.] generated gene phylogenies and predicted the amino acid changes driven by selection for every possible gene by alignment combination. They report that a staggering 46.2% of the genes examined exhibit variation in the phylogeny produced dependent on the choice of alignment method…
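
For readers who have never watched an alignment change under their feet, here is a minimal sketch of the effect in Python. It is a textbook global alignment (Needleman-Wunsch dynamic programming), not the procedure used by Wong et al. or by any of the packages named above. The function name `global_align`, the toy sequences, and the scoring values are all assumptions chosen purely for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Toy Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). All scoring values are
    illustrative assumptions, not settings from any real package.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the corner, preferring the diagonal move on ties.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


if __name__ == "__main__":
    seq_a, seq_b = "ACGTACGT", "CGTACGTA"  # made-up sequences
    for gap_penalty in (-1, -10):
        top, bottom, total = global_align(seq_a, seq_b, gap=gap_penalty)
        print(f"gap penalty {gap_penalty}: score {total}")
        print("  " + top)
        print("  " + bottom)
```

With these toy inputs, the mild gap penalty slides one sequence over by a position and pairs seven identical columns, while the steep penalty forbids gaps and leaves eight mismatched columns. The data are the same; the "homologous sites" differ, and the only thing that changed was a number the investigator picked.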

Wong et al. don’t agree with Swofford and Olsen’s advice, BTW. Jettisoning data is not a good idea:

We also do not believe that the alignment uncertainty problem is one that can be resolved by simply throwing away genes, or portions of genes, for which alignment differs. (p. 474)

But this ain’t the half of it. Neither Wong et al. nor Rokas cites the most disturbing recent analysis of the alignment problem, nor do they probe the deeper reasons the problem will not be solved any time soon. For that, see part 2 of this blog series.

Might have to go to three parts…

bFast: "I personally find this to be an ID platitude every bit as much as “overwhelming evidence” is an evolutionist’s platitude." It has nothing to do with ID. Just a strongly growing skepticism on my part regarding the explanatory power of traditional theory. bFast, I'm trying to make sure I understand how the neo-Darwinian model would approach the alignment question. Specifically, how does the neo-Darwinian model determine the amount of time between current organisms and their common ancestor? Is that done independently of molecular phylogeny? Eric Anderson
A little off-topic to the OP. Dealing with networks, I am interested in fault tolerance and redundancy when it comes to data transmission. I saw a video of the bacterial flagellum and how the flagellar filament is built up starting from the base (the outer membrane) to the hook, which creates rotation at an angular velocity to propel the flagellum, thus creating the momentum to move it. A few things I noticed (these are all very obvious, of course):
- the flagellar filament (the propeller) needs to be a specified length in order to move the flagellum efficiently, perhaps even to balance out the load
- multiple flagellar filaments are used to propel
- too short, and the ratio in weight distribution between the "body" and the "propellers" thrust would be insufficient
- the specified angle on the hook must be there to provide the momentum
I know there must be incredible amounts of information and specs on the flagellum. But I was wondering whether it is possible to test for fault tolerance in the flagellum: if the flagellar filament were cut off short of its initial length, would a sensory mechanism detect this, perhaps through some kind of "stability" check on the weight distribution or another feedback mechanism, and then signal to build the filament back to its initial size?
The bacterial flagellum obviously is an irreducibly complex system, where all the parts have to be there for it to function. I think that is definitely a characteristic of intelligent design, but what if something went wrong during its operation? What steps does it take to repair or prevent any sort of failure? If the bacterial flagellum is reducible on a certain level, then that should tell us there is more design than previously thought, no? It would signal some sort of fault-tolerant and redundant back-up mechanism.
(BTW, I have read about half of "The Design of Life" and it's incredibly well written and clear-cut) godslanguage
Eric Anderson, "Setting aside for a moment the plasticity of the “neo-Darwinian model” and the general lack of predictive capability..." I personally find this to be an ID platitude every bit as much as “overwhelming evidence” is an evolutionist's platitude. Darwinism is plenty predictive, and if it is false, it will fall as genetics continues. That said, Darwinian evolution has a very clear and obvious prediction wrt this issue. If the neo-Darwinian model is correct, then the difficulty of aligning DNA sequences will generally correlate with the amount of time between the current organisms and their common ancestor. So, for any given gene (different genes are likely to decorrelate at different rates) one should be able to guesstimate the amount of time that separates two organisms from their common ancestor by observing how hard the DNA is to align. While this phenomenon may also be consistent with an ID scenario, especially one that is based on common descent, if this phenomenon is not valid, it would be a significant challenge to the neo-Darwinian model. bFast
There is not just this frame shift. See: Decoding DNA. Note particularly the discovery by Segal et al. of the nucleosome code superimposed on DNA associated with nucleosomes. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J. A genomic code for nucleosome positioning. Nature. 2006 Aug 17;442(7104):772-8. Epub 2006 Jul 19. Such multiple superimposed codes are a natural fit for ID, but a major challenge for Darwinism. This combination of nucleosome coding coupled with the physical nucleosomes suggests an irreducibly complex system. Does anyone know of proteins in nucleosomes that have been shown to be essential? DLH
bFast wrote: "As the neo-Darwinian model is counting on DNA being, well, stirred up by random mutation, I don’t find this report to be particularly unexpected by the theory." Setting aside for a moment the plasticity of the "neo-Darwinian model" and the general lack of predictive capability, I think a related question is whether molecular systematics is the panacea it is often made out to be. For example, just a couple of weeks ago I had a geneticist friend over for dinner who, as conversation turned, fell back on the oft-repeated refrain of "overwhelming evidence" confirming the so-called tree of life -- based, so he told me, on the molecular phylogenies. Personally, I need to spend a bit more time on this specific issue, but based on a relatively detailed review of other aspects of evolutionary theory, as well as comments from individuals such as those cited by Paul above, I am skeptical that my friend's molecular phylogenies actually provide the "overwhelming evidence" for traditional evolutionary theory he thinks they do. I applaud efforts to rationally understand and systematize molecular sequences, and while I agree with Paul that this is not an ID vs. Darwinian evolution question, I am confident that individuals who approach the issue without an a priori Darwinian mindset can contribute significantly to the field. Eric Anderson
This isn't really an ID vs. Darwinian evolution question, at least not prima facie. IDers would (and do) face many if not all of the same questions about how best to align molecular sequences, when doing systematics or comparative biology generally. Some (most?) hard problems in biology are challenges for everyone. Paul Nelson
A compounding factor may be that an intelligent designer might use multiple codings by selecting different alignment frames, for more compact, efficient storage. DLH
As the neo-Darwinian model is counting on DNA being, well, stirred up by random mutation, I don't find this report to be particularly unexpected by the theory. A more meaningful question would be: is the stirring effect more pronounced when looking at disparate species, or is it just as pronounced no matter what species you compare? The neo-Darwinian theory would suggest that comparing, say, a mouse and a trout should reveal much more stirring than comparing a mouse and a rat. If you are implying that the lack of DNA consistency is evidence for the "common design" model, then you would expect no correlation between the amount of stirring and the amount of taxonomic separation between two species. bFast
I always felt there was something not so clear in that alignment stuff... Thank you, I'm looking forward to part two! gpuccio
