Alignment is probably the most difficult and least understood component of a phylogenetic analysis from sequence data.
— David L. Swofford and Gary J. Olsen, chapter on Phylogeny Reconstruction, in Molecular Systematics (Sinauer, 1990, eds. D.M. Hillis and C. Moritz), p. 417.
Twenty years ago, as a second-year graduate student, I attended the first Molecular Evolution Workshop at the Marine Biological Laboratory in Woods Hole. (There’s a comical Expelled-type story from my two weeks there, involving the workshop director Mitchell Sogin, which I might tell here some time. I did design the workshop t-shirt, however, which most of the participants bought.) The overwhelming lesson I brought home from the workshop, aside from the pricey beauty of that part of Cape Cod, was the utterly seat-of-the-pants, lost-in-space nature of molecular sequence alignment methods.
Thus, a couple of years later, when I heard Scott Lanyon (at the time, on the staff at the Field Museum in Chicago) state, “The first thing I ask myself, when I see a published molecular phylogeny, is, ‘How much of this should I believe?’” I said to myself, man, do I know what you’re talking about, and let’s start with the alignments. An alignment is the first analytical step, after raw molecular data (typically, DNA sequences converted to amino acid sequences) are obtained.
Those who live and work with molecular systematics don’t need any introduction to the alignment nightmare. They already know about it, and have known for a long time; see the cautionary note from Swofford and Olsen, above. But now, after the accumulation of literally tens of thousands of molecular phylogenies — a vast literature groaning on library shelves — a larger audience is waking up to the fact that something may be very amiss in the house of molecular evolution.
This is a complicated story, so I’m splitting it into two parts. To get a flavor for what’s at stake, consider the advice that Swofford and Olsen offered in the first edition of the widely used textbook Molecular Systematics:
When regions of the sequences are so divergent that a reasonable alignment cannot be attained by manual methods using a sequence editor (“by eye”), those regions should probably be eliminated from the analysis. Of course, selectively eliminating regions opens the door to the criticism that the researcher is arbitrarily “throwing away data.” We would counter this argument with the rebuttal that the researcher has already discarded data, in a sense, by choosing to sequence one molecule rather than another because the latter did not evolve at an appropriate rate for the problem of interest. (p. 417)
So how does one know which portions of the sequence to omit? Well, start by aligning the homologous sites, and go from there. But which are the homologous sites? Weren’t the sequences themselves supposed to provide us with that signal?
Actually, the sequences don’t tell one that (for reasons I’ll explore in part 2). For most investigators, an alignment software package such as ClustalW or MUSCLE does the job. But never turn your thinking over to software.
That should be a bumper sticker; and now, news of the nightmare is getting around. The latest issue of Science (25 January 2008) carries a report by Wong et al., and an accompanying Perspectives article by Antonis Rokas, about the problem:
To assess whether the choice of alignment method affects evolutionary analyses, [Wong et al.] generated gene phylogenies and predicted the amino acid changes driven by selection for every possible gene by alignment combination. They report that a staggering 46.2% of the genes examined exhibit variation in the phylogeny produced dependent on the choice of alignment method…
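To see how an alignment can depend on the method's settings rather than on the data alone, here is a toy sketch in Python. It is not what ClustalW, MUSCLE, or Wong et al. actually do; it is a bare-bones Needleman-Wunsch global alignment of two made-up four-letter sequences, with match, mismatch, and gap scores that I picked arbitrarily for illustration. The only thing that changes between the two runs is the gap penalty.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Toy scoring scheme with a linear gap penalty; all values are
    illustrative assumptions, not defaults of any real aligner.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back one optimal path (ties are broken in favor of the diagonal).
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences, chosen only to make the effect visible.
seq1, seq2 = "AAGT", "AGTT"
for gap_penalty in (-1, -3):
    s, x, y = needleman_wunsch(seq1, seq2, gap=gap_penalty)
    print(f"gap={gap_penalty}: score={s}")
    print("  " + x)
    print("  " + y)
```

With the cheap gap penalty the best alignment opens gaps to line up three matching letters; with the expensive one the best alignment has no gaps at all, and two of the four columns pair different letters. Same sequences, different statements about which sites are "homologous". Note also that the cheap-gap case has several equally good alignments, and the one you get depends on how ties happen to be broken in the traceback.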
Wong et al. don’t agree with Swofford and Olsen’s advice, BTW. Jettisoning data is not a good idea:
We also do not believe that the alignment uncertainty problem is one that can be resolved by simply throwing away genes, or portions of genes, for which alignment differs. (p. 474)
But this ain’t the half of it. Neither Wong et al. nor Rokas cites the most disturbing recent analysis of the alignment problem, nor do they probe the deeper reasons the problem will not be solved any time soon. For that, see part 2 of this blog series.
Might have to go to three parts…