Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories
Posted on: Wednesday, 6 April 2005, 03:00 CDT
Abstract.-
Phylogenetic tree reconstruction is difficult in the presence of lateral gene transfer and other processes generating conflicting signals. We develop a new approach to this problem using ideas borrowed from algorithmic information theory. It selects the hypothesis that simultaneously minimizes the descriptive complexity of the tree(s) plus the data when encoded using those tree(s). In practice this is the hypothesis that can compress the data the most. We show not only that phylogenetic compression is an efficient method for encoding most phylogenetic data sets and is more efficient than compression schemes designed for single sequences, but also that it provides a clear information theoretic rule for determining when a collection of conflicting trees is a better explanation of the data than a single tree. By casting the parsimony problem in this more general framework, we also conclude that the so- called total-evidence tree-the tree constructed from all the data simultaneously-is not always the most economical explanation of the data. [Compression; information; Kolmogorov complexity; phylogenetics; total evidence.]
Recombination, lateral gene transfer, hybridization, and other biological processes generate conflict between phylogenetic trees constructed from different loci or different partitions of a sequence data set. Lateral gene transfer in bacteria provides many examples (Kurland et al., 2003; Lerat et al., 2003), and transfers between mitochondrial genomes of distantly related plants are now well documented (Bergthorsson et al., 2003). Hybridization and introgression is another source of conflict, as suggested by numerous disagreements between trees based on nuclear and organellar genes (Cronn and Wendel, 2003; Doyle et al., 2003). Conflicts can also emerge when different partitions have the same phylogenetic history but very different patterns of molecular evolution, causing biased inferences in one or more partitions (Rokas et al., 2003). This can arise, for example, if one partition is subject to long- branch attraction but another is not (e.g., Sanderson et al., 2000). No conceptual issue has generated more discussion in phylogenetics in the last decade than the treatment of these conflicts (Bull et al., 1993; Cunningham, 1997a, 1997b; de Queiroz et al., 1995; Farris et al., 1995; Huelsenbeck et al., 1996; Thornton and DeSaIIe, 2000).
The problem can be reduced to deciding when a collection of trees- a "forest"-is a better explanation for evolutionary relationships among a set of sequences than is a single tree. In this article we present a new framework for addressing this issue based on algorithmic information theory (Li and Vitanyi, 1997). An important tool in that field is an elegant formulation of parsimony as a form of data compression, a tool general enough to apply equally well to forests and to a single tree. This perspective places the question of "forest versus tree" within a unified inferential setting. It also provides a deterministic decision rule for choosing between these two hypotheses that does not depend on randomization tests such as the widely used incongruence length difference (ILD) test (Farris et al., 1995). Finally, it settles the long-standing phylogenetic controversy over whether the maximum parsimony tree based on the entire data set, the "total evidence" (Kluge, 1989), is always preferable to separate analyses of subsets of the data.
Maximum parsimony (MP) finds the tree for a given data matrix for which the sum across characters of the number of evolutionary changes required on that tree is minimized. Initially introduced as a heuristic approximation to likelihood methods (Edwards and Cavalli- Sforza, 1964), MP eventually would become the method of choice for phylogeneticists, until the more recent ascendancy of model-based inference methods beginning in the early 1990s (Felsenstein, 2001). In a widely cited paper, Farris (1979) proposed two rationales for constructing the MP tree: it is the most economical evolutionary explanation of the data, and it is the most efficient summary of its information content-the summary that requires literally the fewest symbols. Later workers (Kluge, 1989; Kluge and Wolf, 1993; Nixon and Carpenter, 1996) took up this view, arguing that subdividing data sets and constructing separate phylogenetic trees is inappropriate, because trees from partitions of any data set will tend to be suboptimal by the parsimony criterion relative to the entire data set. This spawned the so-called total evidence view of the problem (Kitching et al., 1998), which argues that "pooling the data and determining the most parsimonious solution for all the data... maximizes information content" (Nixon and Carpenter, 1996). Here we show that this statement cannot be universally true in complex data sets with conflicting signals.
Algorithmic information theory provides a very general framework for assessing the information content of hypotheses. It rests on the concept of Kolmogorov complexity, defined as the length of the shortest computer program that faithfully describes an object (Li and Vitanyi, 1997). A complex object requires a long program; a simple object with much regularity does not. A long random string of digits is complex because it cannot be coded as a computer program any shorter than the trivial program that just prints out all its digits: it cannot be compressed. The irrational number, π, has an infinite number of digits but is not very complex, because a short computer program can be written that calculates this long string to any desired accuracy: it can be compressed dramatically.
This idea can be extended to hypothesis selection via the minimum description length principle (Li and Vitanyi, 1997). The best hypothesis is the one that minimizes the length of the description of the hypothesis plus the length of the description of the data when encoded or compressed with the help of that hypothesis. Scientific inference based on some version of this idea has been applied to many biological problems, including sequence alignment (Allison et al., 1999, 2000; Allison and Yee, 1990), phylogeny reconstruction (Cheeseman and Kanefsky, 1993; Li et al., 2001; Milosavljevic et al., 1990; Otu and Sayood, 2003; Ren et al., 1995), and comparison of classifications (Day, 1983), though not to collections of trees constructed from separate data sets. In this framework, a hypothesis is either a tree or a collection of trees and the best hypothesis is the one that permits the maximum compression of that hypothesis plus the sequence alignment.
MATERIALS AND METHODS
Preliminaries
Assume the data set, D, consists of n nucleotide sequences on the state set {a, c, g, t}, each sequence consisting of m sites. These sequences may be associated with a tree, T, with n leaves. A parsimony score, L = L(T, D), for the data set with respect to tree T can be determined by standard algorithms (Semple and Steel, 2003). Write lg(x) for [log^sub 2^(x)], where [x] is the smallest integer larger than x. We frequently need an efficient coding scheme for integers of arbitrary size. The size in bits of an encoding of integer, k, will be denoted des(k) (see Appendix A).
Compression Using a Most-Parsimonious Tree
For phylogenetically related sequences, the information contained at any site in an alignment is partly redundant, because changes in sequence occur only occasionally in its evolutionary history. Many lineages simply retain a conserved site. Presumably, this observation motivated Farris's (1979,1980) claim that the parsimony tree permits the most efficient summary of the data possible. He sketched the outlines of a coding scheme but did not develop it in sufficient detail to derive an expression for its description length. Details of our coding schemes are described in Appendices A to C; here we summarize the results.
FIGURE 1. Phylogenetic compression rate based on Equation 1. The quantity L / m is the ratio of the inferred most parsimonious number of changes on the tree to the number of sites.
Compression Using a Forest
A data set with phylogenetic signal can be compressed using the most parsimonious tree. However, it can sometimes be compressed even further by reference to a collection of trees rather than a single tree. Consider a partition of the sequence alignment, D, into l subsets of characters, the ;'th labeled D^sub i^, each with m^sub i^ characters. This corresponds to a forest, F, of l binary trees, the ith labeled T^sub i^. Define the "total evidence" length as L^sup (TE)^ = L(T, D), where T is the tree constructed from the entire data set, D, and the length of the forest as L^sup forest^ = Σ^sub i^ L(T^sub i^, D^sub i^). Let the incongruence between the total evidence tree and the forest of individual trees be Δ L = L^sup (TE)^ - L^sup forest^. This "incongruence length" is the number of extra evolutionary steps required to fit the individual data sets in the partition to the overall total evidence tree (Farris et al., 1995).
FIGURE 2. Example discussed in text in which it is more efficient to code a data matrix with a forest of two trees than the single MP tree from the combined matrix.
Figure 2 is a contrived da\ta set with some of these surprising properties. It has two partitions, D^sub 1^ and D^sub 2^, each of which includes four informative sites favoring one tree, seven informative sites favoring a different tree, and r invariant sites. However, the trees favored in the two partitions are different, and the total evidence tree is different from either of those (see Page, 1996, for a similar example and a useful visualization). Let T be the total evidence tree, (ad)(bc), T^sub 1^ be the MP tree favored by partition D^sub 1^, (ab)(cd), and T^sub 2^ be the MP tree favored by D^sub 2^, (ac)(bd).
The parsimony scores for the partitions are L (T^sub 1^, D^sub 1^) = L(T^sub 2^, D^sub 2^) = 15, which are each better than the score for the total evidence tree on either of these partitions, L(T, D^sub 1^) = L(T, D^sub 2^) = 18, and are also better than the scores of these trees on the opposite partition: L(T^sub 1^, D^sub 2^) = L(T^sub 2^, D^sub 1^) - 22. This confirms that the MP tree for each partition is indeed better than any other tree for that partition.
In addition, the score of trees T^sub 1^ and T^sub 2^ with respect to the whole matrix, D, is worse than the score of the total evidence tree with respect to D: L(T^sub 1^, D) = L(T^sub 2^, D) = 37, whereas L(T, D) = 36.
Now, the extra steps entailed by the incongruence, ΔL = L(T, D) - (L(T^sub 1^, D^sub 1^) + L(T^sub 2^, D^sub 2^)) = 36 - (15 + 15) = 6, satisfy Equation 4 above, and the forest consisting of T^sub 1^ and T^sub 2^ is preferred over the total evidence tree T in our compression scheme. This is a striking result: a solution composed entirely of "suboptimal" trees-according to the conventional parsimony criterion-is preferred over the more parsimonious total evidence tree. The conclusion also holds using the exact cutoff values provided in Appendix B.
DATA SETS
Compression Efficiency in a Large Sample of Phylogenetic Data Sets
We compared compression efficiency in 638 nucleotide sequence data sets for protein coding genes in green plants. Details of how these data were obtained from GenBank are described elsewhere (Sanderson et al., 2003: data available at http:// ginger.ucdavis.edu). Data sets ranged in size from 4 to 1079 taxa and 72 to 3375 characters. At one extreme a few data sets contain identical or nearly identical sequences; at the other extreme data sets contain sequences that are diverged by up to 40% at the amino acid level from other sequences in that data set.
Available DNA sequence compression programs rarely compress single sequences to less than about 1.6 bits per character (Chen et al., 2002). For comparative purposes, we examined the compression efficiency of a standard DNA sequence compression program, GenCompress (Chen et al., 2002), on the same data sets we compressed with our phylogenetic method. GenCompress uses a type of Lempel-Ziv coding (Cover and Thomas, 1991), tailored to DNA sequence data sets, and looks for approximate substring matches in the file. Both compression schemes are increasingly effective as the number of taxa increases. Both take advantage of the evolutionary redundancy of the similar sequences in phylogenetic data sets (Fig. 3A). With phylogenetic compression, data sets with 100 taxa can typically be compressed 90% to 95%. Large data sets were almost always more efficiently compressed via phylogenetic compression than GenCompress (Fig. 3B). The most compressible data set, consisting of 102 taxa, 2155 sites, and a parsimony length of 278 could be compressed to 0.06 bits/site and generated a file less than half the size of the corresponding GenCompress file. As expected, the poorest performance occurred in small trees with high homoplasy. The worst was a data set of four taxa, 960 sites, and a tree length of 799 steps, which is an extraordinarily high level of homoplasy (nearly one change per site). Phylogenetic compression actually produced a file slightly larger than 2 bits/site for this data set, whereas GenCompress compressed these data twice as much. Although compression methods aimed at single DNA sequences usually do not achieve lower than 1.6 bits/site on single sequences, they perform better than this on data sets with collections of sequences with phylogenetic structure to them. Nonetheless, explicit phylogenetic compression is usually more efficient. It might be expected that GenCompress would perform better on phylogenetic data matrices that are transposed. Transposing rows and columns would, for example, turn constant characters (columns) into rows of constant blocks of identical letters. However, at least in all but the largest data sets (where the program simply would not terminate) most of the data sets actually were compressed less by GenCompress when transposed.
FIGURE 3. (A) Compression efficiency in 638 DNA sequence data sets extracted from GenBank. Filled circles are data sets compressed using phylogenetic compression as described in text. Open circles are data sets compressed using GenCompress (nonphylogenetic) compression. (B) Relative efficiency of these compression procedures. Vertical axis is the compression level obtained for phylogenetic compression minus compression level for GenCompress for a given data set. Values below 0 indicate better compression efficiency for phylogenetic compression.
A Case Study of Conflicting Signals
Sanderson et al. (2000) analyzed phylogenetic relationships in 19 land plant species for two plastid genes, psaA and psbB, using parsimony and maximum likelihood (see also Magallon and Sanderson, 2002). An ILD test (Farris et al., 1995) for data set heterogeneity suggested congruent signals between genes but not between codon positions. Parsimony analysis of the two genes separately each yielded one most parsimonious tree, the "psaA-tree" and "psbB- tree," respectively, which are only slightly different from each other. Third codon positions from both genes combined yielded one most parsimonious tree (the "3-tree"), whereas first and second positions from both genes combined yielded four equally parsimonious alternative MP trees (the "12-tree"), all quite different from the 3- tree. The total evidence (TE) tree was the same as the 3-tree. Figure 4 shows a schematic of the various trees resulting from these analyses with their pairwise NNI distances. These were calculated using the program COMPONENT (Page, 1993). Recent discussions of phylogenetic relationships in seed plants have focused on the dispute between these two basic trees, which disagree strongly about (among other things) the position of the Gnetales, an enigmatic clade of seed plants whose relationships have long been the subject of controversy (Donoghue and Doyle, 2000). Rydin et al. (2002) reiterated older findings that the TE tree and 3-tree, which support the placement of Gnetales as the sister group of the remaining seed plants, is the best estimate of seed plant phytogeny. Other workers argue that the 12-tree is closer to the truth, basing this in part on extensive use of maximum likelihood methods (Aris-Brosou, 2003; Magallon and Sanderson, 2002; Sanderson et al., 2000), which generally favor that result. High rates of substitution in the 3rd position data make long-branch attraction a possible explanation for these differences (Sanderson et al., 2000).
FIGURE 4. Schematic diagram showing trees resulting from parsimony analyses of combined total evidence data described in Table 1. Trees from the total evidence data set and the two different partitioning schemes are shown. One partition divides the data set into two genes (psnA versus psbB); the other divides it into two classes of codon positions (first and second [12] versus third [3]). Integers on internodes indicate the NNI distance between trees found in parsimony analyses (Li et al., 1996).
We excluded any sites having missing or ambiguous data, because the compression scheme does not allow more than four character states (but see Appendix C). The data compressed with respect to the total evidence tree required 50,083 bits (see Table 1). The data compressed with respect to the two-gene partition is larger than this: 50,197 bits when the trees are coded separately, 50,086 bits when they are coded using NNI distances. However, the data compressed with respect to the codon partition trees is smaller than the total evidence case: 49,957 bits using separate trees or 49,880 bits with NNI compression. Thus, compression is improved by using the codon partition, but degraded by using the gene partition. These results are consistent with the approximate predictions of Equation 4. For the total evidence tree to be favored, the incongruence length difference, ΔL, should be less than (approximately) the smaller of n, the number of taxa, 19 (or 17.25 exactly), or k, the NNI distance between the two trees (3 for the gene partition, 9 for the codon partition, approximately, or 3.375 and 7.625, respectively, for the exact cutoff). For the gene partition, ΔL = 3, and the total evidence hypothesis is favored, but for the codon partition ΔL = 33, so the forest is favored. These results agree with the conventional ILD test of homogeneity, which says that the two genes are not significantly different (P = 0.77) but the two codon partitions are (P = 0.01).
TABLE 1. Phylogenetic compression analysis of photosystem gene data sets. All data sets have 19 taxa. Missing or ambiguous sites were removed from alignments. Number of nucleotides in combined matrix is 66,291 (number of bits = 132,582).
The "best" representation of the data is therefore a forest of two trees derived from the separate codon positions. The two trees do not have the same MP score with respect to the entire data matrix. Indeed, the 3-tree is more parsimonious at 4499 steps compared to the 12tree, which has 4611 steps-112 steps longer. Nonetheless, the most economical explanation of all the data is a joint hypothesis that retains both tre\es.
GenCompress (also with ambiguous and missing sites removed) only compressed the alignment to 90,128 bits, almost twice as large as the phylogenetically compressed version.
DISCUSSION
Identifying Conflicting Phylogenetic Signals or Non-Treelike History
Discovery of conflict between phylogenetic data sets provides some of the most compelling evidence for conflicting evolutionary histories. This conflict may be a difference in the branching relationships of the subsets of the data involved, or a difference in the molecular substitution processes tracking the same phylogeny, or both. Application of algorithmic information theory provides a new tool to diagnose these conflicts. Roughly speaking, whenever ΔL > min(n, k) for two data sets, the data support the added complexity entailed by a hypothesis that there is a conflicting history for the sequences.
In the example of the plastid photosystem data, it is unlikely that the true phylogenetic tree for the first two codon positions is different from the tree for the third codon positions. No biological mechanism is known that would allow for such an outcome, yet our compression scheme leads unambiguously to an inference of heterogeneity. This is an example in which the substitution process is clearly heterogeneous but biology suggests that the tree per se is not. We interpret this to imply that tree inference methods would do well to consider different models for the different partitions. Such heterogeneous models might well then lead to an inference of a single tree, as was found when maximum likelihood was applied to the two codon partitions separately (Magallon and Sanderson, 2002). There are parallel generalizations of parsimony methods that allow different "models" (weighting schemes or methods that count site patterns differently than conventional parsimony methods; Willson, 1999).
Relationship to Other Tests for Conflict
Information theory provides a solution to one conundrum of the conventional parsimony approach to incongruence between data sets. Two data sets are incongruent whenever ΔL > 0, which is almost always the case for any two subsets of a data set. Nonetheless, no one has seriously argued that data sets should be regarded as conflicting any time ΔL > 0. Instead, the ILD randomization test is often used to establish a null distribution on AL, and incongruence is accepted only when ΔL is far enough out on the tail of this null distribution. The relative merits of this test have been scrutinized extensively (Barker and Lutzoni, 2002; Darlu and Lecointre, 2002; Dolphin et al., 2000; Hipp et al., 2004; Yoder et al., 2001). Our equation (4) is an alternative condition. It, too, establishes a cutoff value for ΔL, but it does not rely on null models. Instead it suggests that the natural penalty for added complexity of a hypothesis of two trees should be the cost of describing that additional tree. Formal measures of complexity based on compression allow description of the data to be placed on the same footing as descriptions of the tree, thus making it possible to evaluate the overall simplicity of a hypothesis involving multiple evolutionary histories. We have undertaken a small simulation study of the four-taxon case, which indicates that the level and power of the ILD test and our compression test are quite similar when using the separate trees encoding and for large data sets for the NNI encoding (results not shown). However, the ILD test is more conservative than the compression test, which is somewhat too liberal, for small data sets.
The computational advantages of the compression approach may be important with the increasing size of concatenated data matrices drawn from whole genome analyses (Lerat et al., 2003; Rokas et al., 2003) or broad surveys (Bapteste et al., 2002; Murphy et al., 2001). Generalizations of the ILD test or statistical tests like it are not obvious, although a promising Bayesian approach has recently been reported (Vogl et al., 2003). The calculations outlined in Equations 1 to 4 require the construction of parsimony trees, but these need to be done only once per partition, whereas randomization tests require large numbers of replicate searches. This means many combinations of subsets might be examined in the time it takes one typical ILD run.
Implications for the "Total Evidence" School of Inference
An intense debate has surrounded the argument that the maximum parsimony tree is the most faithful reflection of any data set, as opposed to several trees entailed by subsets of the data (Nixon and Carpenter, 1996). Critics of this view have largely assailed it on biological grounds: that disparate evolutionary processes ("process partitions") sometimes underlie a single data set, and combination of these signals into a "total evidence" analysis will thus obscure their distinct histories (Bull et al., 1993). Our argument against the total evidence view is more direct. When simplicity, complexity, and information are cast in a sufficiently general framework, the most economical hypothesis associated with a given data set may well entail multiple conflicting trees, any one of which may be "suboptimal" by the conventional parsimony criterion. In our view, this defeats the fairly abstract but nonetheless compelling argument that the maximum parsimony tree is always the vehicle providing the most efficient summary of a data matrix.
Relationship to Work on Recombination
Hein (1990, 1993) comes close in spirit to the present work in using a parsimony criterion to identify a collection of trees implied by an alignment when recombination occurs between sequences. His method respects the order of sites in a sequence and chooses a tree for every site by minimizing an optimality criterion having two elements: the parsimony score of possible trees at that site and the rearrangement distance between trees at neighboring sites. This penalizes both substitutions and recombination events that are either too numerous among sites or too complex between any two neighboring sites. Different weights can be assigned to substitution events versus recombination events. Hein's method imposes additional structure on the problem, which is appropriate for the special case of recombination, but this is not generally required by our approach and could be removed from his. Perhaps more importantly, by casting our inference problem in terms of data compression and Kolmogorov complexity, we "remove" the issue of relative weights between different kinds of events, which naturally arises both in Hein's approach and other gene-tree-parsimony type methods (Page and Charleston, 1997). Some might argue that this merely gives equal weight to substitution and recombination in that problem, but it is worth noting that the compression approach does not stipulate that the source of data set conflict need be recombination or anything else.
Complexity of a Phylogenetic Data Set and its Evolutionary Implications
A description of the information content or complexity of a phylogenetic data set by the length of its most efficient encoding provides a novel kind of summary of the evolutionary history of a clade. It combines information about sequence divergence with speciation and extinction patterns. For example, a low complexity, highly compressible data set can arise when rates of sequence evolution are very slow (sequences are highly conserved), when most speciation events in the tree are very recent (high redundancy), and/ or when extinction has selectively removed large clades, rather than removed lineages at random across the tree. A high complexity, minimally compressible data set arises when rates of evolution are high and/or the tree is nearly starlike.
Oddly enough, low complexity is ideal for tree-based prediction. One of the triumphs of phylogenetic biology has been the realization that a phylogenetic tree permits prediction about character states not yet observed in taxa whose relationships are known. This works best in areas of the tree with high redundancy or low information content. In other words, high character state diversity (high complexity) is not ideal for prediction. This idea finds formal support in Fano's inequality (Cover and Thomas, 1991), which provides bounds on the error associated with prediction in the presence of different levels of information. Although this is derived using probabilistic notions of entropy, the idea should hold for algorithmic measures of information content.
Entropy
If sites in a sequence were independent and identically distributed (i.i.d.), then the optimal compression rate of an infinite sequence would be the entropy of a random nucleotide ("nucleotide entropy"). This is computed from the base frequencies and is exactly 2 bits/nucleotide when base frequencies are equal. The optimal compression rate for real DNA sequences is actually less than the nucleotide entropy because sites are not i.i.d. Existing compression programs take advantage of repeated patterns to compress close to or below the entropy. For example, in the sequence data from the combined photosystem data set, the nucleotide entropy is 1.36 bits/nucleotide, and the GenCompress program achieves almost exactly this compression rate. Phylogenetic compression improves on this considerably, however, to about 0.76 bits/nucleotide. This is possible because our compression scheme takes advantage of the fact that a sequence alignment is not i.i.d. An alignment can be regarded as a set of nucleotide "patterns" (or "characters"), each of which corresponds to the combination of nucleotides at a given site for all taxa. The ideal compression rate of an infinite sequence of i.i.d. patterns is the entropy of a random pattern ("pattern entropy"). Our compression rate, in terms of bits per pattern, aims at being close to this pattern entropy. As the sequences are phylogenetically dependent, this entropy is muchbelow 2n, yielding an optimal compression rate much below 2n bits/pattern; i.e., 2 bits/ nucleotide. The pattern entropy for the photosystem data implies a lower bound on compression of about 0.20 bits/nucleotide. Our compression comes closer than others to this lower bound, but still could be improved. One way to do this would be to compress the reference sequence used in our compression scheme (Appendix A). This would then account for dependence within a single sequence as well as across the tree.
Compression Optimality and Robustness of Model Selection
Choosing between a forest and a tree is a problem in model selection (Burnham and Anderson, 1998). Our model selection scheme is based on the idea of compression. For any given data set, it is possible that a different compression scheme might lead to a different choice of models. If different schemes favor different models, it would make sense to favor the model associated with the best compression scheme.
On the other hand, we always deal with finite sequences, and the overhead due to the tree or the translation table definition is important. Our code may well perform better than a Huffman code on most data sets of a fixed size. No code has optimal performance on all sequences of all size (Li and Vitanyi, 1997), and no code has optimal performance (on average) on sequences drawn from any distribution and of any size. Making the assumption that the distribution of the data has some relationship to a tree, we built a code aimed at phylogenetically structured data.
Optimality of the code is a desirable property for the purposes of model selection. However, it is important to achieve a balance between the tree coding (overhead due to the model) and the matrix coding (information remaining in the data once the model is known). An excellent tree coding combined with a poor matrix coding will tend to favor complex models, whereas a poor tree coding combined with an excellent matrix coding will favor too simple models. For the purpose of model selection, the balance in model/data coding may be a more desirable property than overall optimality.
Extensions and Generalizations
The methods described here can be extended in several directions. Appendix C describes procedures for including missing data (as well as alignment gaps) and increasing the alphabet size to handle amino acid data or other data with more than four states. Generalization to nonbinary trees is also fairly straightforward.
A more interesting extension is to consider subsets of data that do not share the same taxon set. This is related to the problem of constructing phylogenetic supertrees (Bininda-Emonds et al., 2002), which are trees assembled from smaller trees so as to minimize any conflict between overlapping taxa. A given data set might be best compressed by associating subsets of the data with proper subtrees rather than trees having all the taxa.
This raises the much more general problem of finding the partition that globally minimizes the length of the code among all such partitions. This is obviously a hard problem because even if all subsets have the same taxa, the number of subsets is exponential in the number of characters in the matrix. Finding the MP tree for any element of any partition is already an NP hard problem, so finding it for all elements of all possible partitions is unlikely to be easy. However, it might be possible to impose some biologically relevant constraints on the kinds of partitions to be examined.
Finally, for some data it is conceivable that a graph rather than a tree would provide a better compression scheme. Consensus networks (Holland et al., 2004) have been used to summarize and visualize information about conflicts. However, graphs require more symbols to encode than trees do, and then yet more symbols are required to encode the character state changes associated with a graph, so presumably only data sets with extraordinarily high levels of conflict would benefit from this. Homoplasy can often be reduced by adding enough reticulations, but this seems like it offers too ad hoc a strategy for explaining away homoplasy-perhaps best avoided as a general method of tree inference. However, an information compression approach imposes a rather stiff penalty that the data must overcome before they support the added complexity of a reticulation hypothesis. Only if the savings in description length because of reduced homoplasy exceeds the extra complexity of handling graphs will such an evolutionary inference be warranted.
ACKNOWLEDGMENTS
We thank J. G. Burleigh, O. Eulenstein, and Raul Piaggio for discussion of these issues, and Mike Steel, Ming Li, and an anonymous reviewer for helpful comments.
REFERENCES
Allison, L., D. Powell, and T. I. Dix. 1999. Compression and approximate matching. Comp. J. 42:1-10.
Allison, L., L. Stern, T. Edgoose, and T. I. Dix. 2000. Sequence complexity for biological sequence analysis. Computers and Chemistry 24:43-55.
Allison, L., and C. N. Yee. 1990. Minimum message length encoding and the comparison of macromolecules. Bull. Math. Biol. 52:431-453.
Aris-Brosou, S. 2003. Least and most powerful tests to elucidate the origin of the seed plants in the presence of conflicting signals under misspecified models. Syst. Biol. 52:781-793.
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Durufle, T. Gaasterland, P. Lopez, M. Muller, and H. Philippe. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414-1419.
Barker, F. K., and F. M. Lutzoni. 2002. Utility of the incongruence length difference test. Syst. Biol. 51:625-637.
Bergthorsson, U., K. L. Adams, B. Thomason, and J. D. Palmer. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197-201.
Bininda-Emonds, O. R. P., J. Gittleman, and M. Steel. 2002. The (super)tree of life: procedures, problems, and prospects. Ann. Rev. Ecol. Syst. 33:265-290.
Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384-397.
Burnham, K. P., and D. R. Anderson. 1998. Model selection and inference. Springer, New York.
Cheeseman, P., and B. Kanefsky. 1993. The reconstruction of evolutionary trees using minimal description length. Pages 91-100 in Advances in computer methods for systematic biology: Artificial intelligence, databases, computer vision (R. Fortuner, ed.). Johns Hopkins Press, Baltimore.
Chen, X., M. Li, B. Ma, and J. Tromp. 2002. DNACompress: Fast and effective DNA sequence compression. Bioinformatics (Oxford) 18:1696- 1698.
Cover, T. M., and J. A. Thomas. 1991. Elements of information theory. John Wiley & Sons, New York.
Cronn, R., and J. F. Wendel. 2003. Cryptic trysts, genomic mergers, and plant speciation. New Phytologist 161:133-142.
Cunningham, C. W. 1997a. Can three incongruence tests predict when data should be combined? MoI. Biol. Evol. 14:733-740.
Cunningham, C. W. 1997b. Is congruence between data partitions a reliable predictor of phylogenetic accuracy? Empirically testing an iterative procedure for choosing among phylogenetic methods. Syst. Biol. 46:464-478.
Darlu, P., and G. Lecointre. 2002. When does the incongruence length difference test fail? MoI. Biol. Evol. 19:432.
Day, W. H. E. 1983. The role of complexity in comparing classifications. Math. Biosci. 66:97-114.
de Queiroz, A., M. J. Donoghue, and J. Kim. 1995. Separate versus combined analysis of phylogenetic evidence. Ann. Rev. Ecol. Syst. 26:657-681.
Dolphin, K., R. Belshaw, C. D. L. Orme, and D. L. J. Quicke. 2000. Noise and incongruence: Interpreting results of the incongruence length difference test. MoI. Phylogenet. Evol. 17:401- 106.
Donoghue, M. J., and J. A. Doyle. 2000. seed plant phylogeny: demise of the anthophyte hypothesis? Curr. Biol. 10:R106-R109.
Doyle, J. J., J. L. Doyle, J. T. Rauscher, and A. H. D. Brown. 2003. Diploid and polyploid reticulate evolution throughout the history of the perennial soybeans (Glycine subgenus Glycine). New Phytologist 161:121-132.
Edwards, A. W. F, and L. L. Cavalli-Sforza. 1964. Reconstruction of evolutionary trees. Pages 67-76 in Phenetic and phylogenetic classification (V. H. Heywood and J. McNeill, eds.). Systematics Association Publication, London.
Farris, J. S. 1979. The information content of the phylogenetic system. Syst. Zool. 28:483-519.
Farris, J. S. 1980. THe efficient diagnoses of the phylogenetic system. Syst. Zool. 29:386-101.
Farris, J. S., M. Kallersjo, A. G. Kluge, and C. BuIt. 1994. Testing significance of incongruence. Cladistics 10:315-319.
Farris, J. S., M. Kallersjo, A. G. Kluge, and C. BuIt. 1995. Constructing a significance test for incongruence. Syst. Biol. 44:570-572.
Felsenstein, J. 2001. The troubled growth of statistical phylogenetics. Syst. Biol. 50:465-467.
Hein, J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98:185-200.
Hein, J. 1993. A heuristic methdo to reconstruct the history of sequences subject to recombination. J. MoI. Evol. 36:396-405.
Hipp, A. L., J. C. Hall, and K. J. Sytsma. 2004. Congruence versus accuracy: Revisiting the incongruence length difference test. Syst. Biol. 53:81-89.
Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. MoI. Biol. Evol. 21:1459-1461.
Huelsenbeck, J. P., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analysis. Trends Ecol. Evol. 11:152- 158.
Kitching, I.J., P.L. Forey, C. J. Humphries,and D. Williams. 1998 Cladistics: The theory and practice of parsimony analysis, 2nd edition. Oxford University Press, New York.
Kluge, A. G. 1989. A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Bo\idae, Serpentes). Syst. Zool. 38:7-25.
Kluge, A. G., and A. J. Wolf. 1993. Cladistics: What's in a word? Cladistics 9:183-199.
Kurland, C. G., B. Canback, and O. G. Berg. 2003. Horizontal gene transfer: A critical view. Proc. Natl. Acad. Sci. USA 100:9658- 9662.
Lerat, E., V. Daubin, and A. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of the gamma- Proteobactera. PLoS Biol. 1:1-9.
Li, M., J. Badger, X. Chen, S. Kwong, P. Kearney, and H. Y. Zhang. 2001. An information based distance and its application to whole mitochondrial genome phylogeny. Bioinformatics (Oxford) 17:149154.
Li, M., J. Tromp, and L. Zhang. 1996. On the nearest neighbor interchange distance between evolutionary trees. J. Theor. Biol. 182:463-467.
Li, M., and P. Vitanyi. 1997. An introduction to Kolmogorov complexity and its applications, 2nd edition. Springer-Verlag, New York.
Magallon, S., and M. J. Sanderson. 2002. Relationships among seed plants inferred from highly conserved genes: Sorting conflicting phylogenetic signals among ancient lineages. Am. J. Bot. 89:1991- 2006.
Milosavljevic, A., D. Haussier, and J. Jurka. 1990. Clustering of macromolecular sequences by minimal length encoding. Pages 90-94 in Artificial intelligence and molecular biology: Working notes, Stanford University.
Murphy, W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, and S. J. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618.
Nixon, K. C., and J. M. Carpenter. 1996. On simultaneous analysis. Cladistics 12:221-241.
Otu, H. H., and K. Sayood. 2003. A new sequence distance measure for phylogenetic tree construction. Bioinformatics (Oxford) 19:2122- 2130.
Page, R. D. M. 1993. COMPONENT user's manual (version 2.0). Trustees of The Natural History Museum, London.
Page, R. D. M. 1996. On consensus, confidence and "total" evidence. Cladistics 12:83-92.
Page, R. D. M., and M. A. Charleston. 1997. From gene to organismal phylogeny: Reconciled trees and the gene tree/species tree problem. MoI. Phylogenet. Evol. 7:231-240.
Ren, F., H. Tanaka, and T. Gojobori. 1995. Construction of molecular evolutionary phylogenetic trees from DNA sequences based on minimum complexity principle. Computer methods and programs in biomedicine 46:121-130.
Rokas, A., B. Williams, N. King, and S. Carroll. 2003. Genome- scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Rydin, C., M. Kallersjo, and E. M. Friis. 2002. seed plant relationships and the systematic position of Gnetales based on nuclear and chloroplast DNA: Conflicting data, rooting problems and the monophyly of conifers. Int. J. Plant Sci. 163:197-214.
Sanderson, M.}., A.C.Driskell, R. H. Ree,O. Eulenstein, and S. Langley. 2003. Obtaining maximal concatenated phylogenetic data sets from large sequence databases. MoI. Biol. Evol. 20:1036-1042.
Sanderson, M. J., M. F. Wojciechowski, J. M. Hu, T. S. Khan, and S. G. Brady. 2000. Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. MoI. Biol. Evol. 17:782-797.
Semple, C., and M. Steel. 2003. Phylogenetics. Oxford University Press, New York.
Thornton, J. W., and R. DeSaIIe. 2000. A new method to localize and test the significance of incongruence: detecting domain shuffling in the nuclear receptor subfamily. Syst. Biol. 49:183- 201.
Vogl, C., J. Badger, P. Kearney, M. Li, M. Clegg, and T. Jian. 2003. Probabilistic analysis indicates discordant gene trees in chloroplast evolution. J. MoI. Evol. 56:330-340.
Willson, S. J. 1999. A higher order parsimony method to reduce long-branch attraction. MoI. Biol. Evol. 16:694-705.
Yoder, A. D., J. A. Irwin, and B. A. Payseur. 2001. Failure of the ILD to determine data combinability for slow loris phylogeny. Syst. Biol. 50:408-424.
First submitted 2 April 2004; reviews returned 27 May 2004; final acceptance 10 July 2004
Associate Editor: Mike Steel
CCILE AN1,2 AND MICHAEL J. SANDERSON1
1 Section of Evolution and Ecology, University of California, Davis, California 95626, USA; E-mail: mjsamierson@ucdavis.edu (M.J.S.)
2 Current address: Department of Statistics, University of Wisconsin-Madison, Medical Science Center, 1300 University Ave., Madison, WI53706, USA; E-mail: ane@cs.wisc.edu
APPENDIX
APPENDIX
APPENDIX
Copyright Society of Systematic Biologists Feb 2005
Source: Systematic Biology
Related Articles
- Scientists Set Forth Evidence That Global Warming Has Begun; Surge in Greenhouse Gases is Human Induced and Not Within Normal Fluctuations
- Do You Think Networking is Tough? Try Implementing One in Disaster Areas and Hostile Environments; New Jersey-Based TCPNS Helps Government Establish Data Networks and Communications Systems in Disaster and Hostile Areas
- Parature Offers Talisma Customers Free Migration and Set-Up
- The Dallas Morning News Danielle DiMartino Column: Housing Data May Hold Key for Dow
- Katrina, Rita Kids Light Rockefeller Tree
- Rockefeller Tree Lit by Katrina, Rita Kids
- Embarcadero Technologies Offers Most Comprehensive Data Lifecycle Management Solution for MySQL Network; Embarcadero Announces MySQL Network Version of DBArtisan at MySQL Users Conference 2005
- A Single Measure of FEV^Sub 1^ Is Associated With Risk of Asthma Attacks in Long-Term Follow-Up*
- Molecular Data and the Evolutionary History of Dinoflagellates
User Comments (0)

RSS Feeds