Profiling Phylogenetic Informativeness
By Townsend, Jeffrey P
Abstract.-
The resolution of four controversial topics in phylogenetic experimental design hinges upon the informativeness of characters about the historical relationships among taxa. These controversies regard the power of different classes of phylogenetic character, the relative utility of increased taxonomic versus character sampling, the differentiation between lack of phylogenetic signal and a historical rapid radiation, and the design of taxonomically broad phylogenetic studies optimized by taxonomically sparse genome-scale data. Quantification of the informativeness of characters for resolution of phylogenetic hypotheses during specified historical epochs is key to the resolution of these controversies. Here, such a measure of phylogenetic informativeness is formulated. The optimal rate of evolution of a character to resolve a dated four-taxon polytomy is derived. By scaling the asymptotic informativeness of a character evolving at a nonoptimal rate by the derived asymptotic optimum, and by normalizing so that net phylogenetic informativeness is equivalent for all rates when integrated across all of history, an informativeness profile across history is derived. Calculation of the informativeness per base pair allows estimation of the cost-effectiveness of character sampling. Calculation of the informativeness per million years allows comparison across historical radiations of the utility of a gene for the inference of rapid adaptive radiation. The theory is applied to profile the phylogenetic informativeness of the genes BRCA1, RAG1, GHR, and c-myc from a muroid rodent sequence data set. Bounded integrations of the phylogenetic profile of these genes over four epochs comprising the diversifications of the muroid rodents, the mammals, the lobe-limbed vertebrates, and the early metazoans demonstrate the differential power of these genes to resolve the branching order among ancestral lineages.
This measure of phylogenetic informativeness yields a new kind of information for evaluation of phylogenetic experiments. It conveys the utility of the addition of characters to a phylogenetic study and it provides a basis for deciding whether appropriate phylogenetic power has been applied to a polytomy that is proposed to be a rapid radiation. Moreover, it provides a quantitative measure of the capacity of a gene to resolve soft polytomies.
[Character; information; noise; phylogeny; power; rapid radiation; signal; taxon.]
(ProQuest-CSA LLC: … denotes formulae omitted.)
Phylogenetic analyses seek to reveal the evolutionary relationships of taxa by comparing their characters. Four long-standing phylogenetic controversies hinge upon the informativeness of characters. These debates are: Which types of characters are most informative (Collins et al., 2005; Dequeiroz and Wimberger, 1993; Graybeal, 1994; Naylor and Brown, 1997; Rokas and Holland, 2000; Wiens and Servedio, 1997; Yang, 1998; Zwickl and Hillis, 2002)? Would increased taxonomic or character sampling be more informative (Graybeal, 1998; Hillis, 1998; Kim, 1996, 1998; Poe, 1998; Pollock et al., 2002; Rannala et al., 1998; Rokas and Carroll, 2005; Rosenberg and Kumar, 2001, 2003; Sullivan et al., 1999)? Can we accurately identify historical polytomies that are attributable to rapid radiations (Berbee et al., 2000; Berbee and Taylor, 2001; Nee, 2001; Poe and Chubb, 2004; Ree, 2005; Rokas et al., 2003a, 2005)? What is the optimal procedure for using genome-scale sequence data to empower taxonomically broad phylogenetic studies (Goldman, 1998; Rokas et al., 2003b; Shpak and Churchill, 2000)? One gap in knowledge that perpetuates these debates is the lack of a theory predicting the phylogenetic power of characters for explicit historical epochs. Here it is demonstrated that the informativeness of a character can be quantified over a historical time scale. This formulation may play a role in resolving these controversies.
The phylogenetic informativeness of characters has long been debated. Partisans have variously argued for the utility of morphological, DNA sequence, amino acid sequence, and recently for rare genomic characters (Rokas and Holland, 2000). One school of thought deems useful only those characters that change in state in ways that are unique, irreversible, and indisputable. However, such irreversible states are rare and seldom indisputable. Moreover, disregarding characters that occupy recurrent states dispenses with potentially useful information, including most molecular sequence data. A variety of measures have been proposed to empirically characterize the phylogenetic informativeness of classes of data (Collins et al., 2005; Dequeiroz and Wimberger, 1993; Graybeal, 1994; Naylor and Brown, 1997, 1998; Wiens and Servedio, 1997). For instance, the amount of signal present may be assessed by the skewness of the tree length distribution (Huelsenbeck, 1991a), the consistency index (CI; Farris, 1989), and various other measures. However, genes are known to differ in their informativeness over historical time (Graybeal, 1994). Whole-tree measures fail to reflect the heterogeneity of information across different parts of the tree. Support for individual branches by a given sequence may be characterized with bootstrap values (Felsenstein, 1985), Bremer support (Baker and DeSalle, 1997), or Bayesian posterior probabilities (Huelsenbeck and Ronquist, 2001). These measures are vital to analysis of the validity of an inferred phylogeny. However, their value is critically dependent on the unknown actual branch length(s), and therefore they are ambiguous indicators of phylogenetic informativeness. Graybeal (1994) pioneered the use of empirical saturation plots to evaluate the utility of genes for vertebrate phylogeny.
Empirical saturation plots convey a qualitative sense of temporal utility and feature variable rates of evolution of characters by plotting cumulative taxon-taxon divergence against time. However, theoretical attempts to clarify phylogenetic power have not resulted in explicit quantitative procedures for judicious experimental design (Mossel and Steel, 2004; Shpak and Churchill, 2000) or have been forsaken (Goldman, 1998) due to the intricacy of practical implementation. Molecular phylogenetic studies have instead generally relied on imprecise heuristics for choosing gene sequences to survey among relevant taxa for a given phylogenetic hypothesis. Conventional wisdom recognizes that it is important to select a gene that evolves at an appropriate pace to resolve the unknown ancestral branching order linking particular taxa of interest. Particular genes have become renowned for their perceived utility in resolving ancient (e.g., rDNA, elongation factors) and recent (e.g., cytochrome b, albumin) divergences. Ideally, this general perception could be captured and enhanced by quantitative measures of the utility of specified genes.
Application of such a measure could play a role in the resolution of the longstanding debate regarding the relative utilities of increasing taxonomic versus character sampling in phylogenetic experimental design (Berbee et al., 2000; Graybeal, 1998; Hillis, 1998; Kim, 1996, 1998; Poe, 1998, 2003; Pollock et al., 2002; Rannala et al., 1998; Rokas and Carroll, 2005; Rosenberg and Kumar, 2001, 2003; Zwickl and Hillis, 2002). In this debate it has been demonstrated that the informativeness of increasing taxonomic sampling is critically dependent on the chronology of ancestral linkages of the historical lineages of the taxa added to the data set (Fiala and Sokal, 1985; Huelsenbeck, 1991b; Kim, 1996, 1998; Poe, 2003). However, quantitative procedures for selection of characters that exhibit appropriate rates of evolution to resolve soft polytomies have not been explored. Clearly, combining acquisition of character data for new taxa that branch close to the time of a specified polytomy with acquisition of new characters that are most informative about that time period will yield the greatest phylogenetic resolution. To design such an ideal experiment, it is necessary to identify characters that contribute optimally to inferential power.
Debate within phylogenetic communities has frequently erupted with regard to putative examples of rapid radiations. Debates occur not only because of their biological importance for understanding the causes of evolutionary diversification but also because rapid radiations can be difficult to infer using current phylogenetic methods (Berbee et al., 2000; Poe and Chubb, 2004; Rokas et al., 2005; Slowinski, 2001; Walsh et al., 1999; Weisrock et al., 2005). Rapid radiations are characterized by short internodes with few to zero featured synapomorphies. The inferential difficulty arises because measures of phylogenetic support, such as bootstrap values (Felsenstein, 1985), tree-length distribution skewness (Hillis and Huelsenbeck, 1992; Huelsenbeck, 1991a), Bremer support (Baker and DeSalle, 1997), or posterior probabilities (Huelsenbeck and Ronquist, 2001) only convey the degree to which data support a particular clade or tree. They do not convey the power of the characters examined to have revealed any true internodes (regardless of actual branch length) that define clades during a specific epoch. Thus, current methodologies would be enhanced by a measure of the degree to which the selected characters are sufficiently informative to justify a conclusion of rapid radiation.
Finally, the recent sequencing of multiple whole genomes within major branches of the tree of life has occasioned speculation regarding the best way to employ such genome-wide data sets to inform molecular phylogenetic studies that encompass much broader taxonomic sampling (Dacks and Doolittle, 2001; Delsuc et al., 2005). How can molecular phylogeneticists working with large sets of taxa exploit the breadth of information in genome sequence to improve their chances of conclusively addressing particular phylogenetic hypotheses? With profiles of the phylogenetic informativeness of particular genes during particular epochs, orthologous sequences from genome projects could be used to provide data on the rate of evolution of the sites in many genes. Profiles of phylogenetic informativeness could then be calculated from these data to identify the most informative genes. Here, such a method of profiling informativeness is presented.
THEORY
The Optimal Rate of Change of a Phylogenetic Character
Consider a star phylogeny in which four taxa had a common ancestor at time T (Fig. 1a and b). When parsimony is used to select an optimal tree, only a character that changes along an internode between two sister clades (Fig. 1c and d, segments t^sub 1^ + t^sub 2^) will be informative about the actual branching order underlying the polytomy. Additionally, an informative character that changes during the ancestral internode must thereafter remain unchanged during the subsequent evolution of the four taxa. The longer the tips, and the shorter the internal branch, the less likely it is that such an informative character will be discovered. Both rapid and sluggish rates of change can make characters unfavorable for phylogenetic reconstruction. Characters that evolve too slowly will have negligible probability of change on the short internal branch; characters that evolve too quickly will nearly always change on one or more of the long tips.
Informativeness is maximized at an intermediate rate that optimizes the joint probability of change on the short internal branch and lack of change on the long tips. Assuming that evolutionary changes of a character state are randomly distributed at rate λ across the lineages descending from the common ancestor, the probability P of a random variable X equaling k changes of state on any internode of time length b may be calculated via the Poisson distribution:
…
The probability that at least one change occurs on the short internal branch is
… (1)
The probability that the character would subsequently remain unchanged in the four tips is
… (2)
For simplicity, let t^sub 0^ = t^sub 1^ + t^sub 2^. Then, the probability that a character as described would be informative, π(T, t^sub 0^; λ), is the product of probabilities expressed in Equations 1 and 2,
… (3)
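The joint probability described in words above can be sketched numerically. This is a minimal illustration rather than the author's implementation: because the formulae are omitted in this copy, the summed length of the four tips is passed in as an explicit argument (`tip_total`, a hypothetical name) instead of being assumed, and all function names and numerical values are illustrative.

```python
import math

def prob_change_on_internode(rate, t0):
    """Poisson probability of at least one change on an internode of length t0."""
    return 1.0 - math.exp(-rate * t0)

def prob_no_change_on_tips(rate, tip_total):
    """Poisson probability of zero changes along tips of summed length tip_total."""
    return math.exp(-rate * tip_total)

def prob_informative(rate, t0, tip_total):
    """Joint probability: a change on the internode and none on any tip."""
    return prob_change_on_internode(rate, t0) * prob_no_change_on_tips(rate, tip_total)

# Hypothetical values: a short internode and four long tips (arbitrary time units).
t0, tip_total = 1.0, 100.0
slow, intermediate, fast = 0.001, 0.01, 0.1
probs = {r: prob_informative(r, t0, tip_total) for r in (slow, intermediate, fast)}
```

With these assumed values, the intermediate rate gives a higher probability of informativeness than either the slow or the fast rate, in line with the verbal argument.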
The optimal rate, … maximizes this function, and is revealed by solving
… (4)
Further algebra yields
… (5)
However, for any polytomy we wish to resolve, t^sub 0^ is unknown. Nevertheless, it is frequently known that t^sub 0^ is very small compared to T. Therefore, assuming t^sub 0^ [much less than] T, we may take the limit of Equation 5 as t^sub 0^ approaches zero,
… (6)
Thus, the character that evolves at the optimal rate of character change for resolution of a four-taxon polytomy dated at time T in the past is the one that evolves at a rate of one change along the sum of the lengths of the four branch tips subsequent to the polytomy at time T.
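Because the formulae are omitted in this copy, the following is a sketch of one derivation that is consistent with the verbal result, not a transcription of the original equations. It writes L for the summed length of the four tips (a symbol introduced here for illustration) and assumes only that L approaches 4T as t_0 approaches zero.

```latex
\begin{align*}
\pi(\lambda) &= \left(1 - e^{-\lambda t_0}\right) e^{-\lambda L}, \\
\frac{d\pi}{d\lambda} &= \left[\, t_0\, e^{-\lambda t_0} - L\left(1 - e^{-\lambda t_0}\right) \right] e^{-\lambda L} = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{t_0}\,\ln\!\left(1 + \frac{t_0}{L}\right), \\
\lim_{t_0 \to 0} \hat{\lambda} &= \lim_{t_0 \to 0} \frac{1}{L} = \frac{1}{4T}.
\end{align*}
```

The limit uses ln(1 + x) ≈ x for small x, and it reproduces the statement that the optimal character undergoes, on average, one change along the summed length of the four tips.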
The Phylogenetic Informativeness Profile of a Character
The rate of change for a character that maximizes informativeness is fundamental to phylogenetic theory. Examining five archetypal four-taxon trees with nonzero internodes, Yang (1998) used computer simulations to reveal the utility of an “intermediate” evolutionary rate that is in rough agreement with Equation 6. This intermediate, optimal rate would be expected to be higher for Yang’s trees with nonzero internodes than is predicted by Equation 6, and increasingly so as the ratio of internode length to tip length increases. Consistent with this expectation, the optimal rates lay either at the rate predicted by Equation 6 or slightly higher than that asymptotic prediction.
Yet in phylogenetic practice, one will never discover a set of characters that all evolve at the same rate, let alone a set of characters that all evolve at the optimal rate (Felsenstein, 2001; Yang, 1996). Thus, it is necessary to establish the relative informativeness of characters that evolve at rates that are not optimal. Clearly, characters that evolve at rates close to the optimal rate will be more useful in resolving a polytomy than those that evolve at a dramatically different rate. Here the functional form of the relationship between the optimal and all suboptimal rates is established.
The probability of informativeness of a character evolving at rate λ is given by Equation 3. However, the value of the key parameter t^sub 0^ is unknown, and as t^sub 0^ asymptotically approaches zero, the probability of informativeness (Equation 3) aptly approaches zero as well. To profile the phylogenetic informativeness of a character evolving at rate λ, we must derive an index of the informativeness that, in contrast, approaches a nonzero limit as the length of the internode t^sub 0^ approaches zero. Such an index may be derived from Equation 3 by taking the ratio of the informativeness of a character evolving at the rate λ to the informativeness of a character evolving at the ideal rate …,
… (7)
With t^sub 0^ [much less than] T, as assumed above, we may take
… (8)
The function … ranges from zero to a maximum of one for all real values of λ and T greater than zero. If and only if …. Thus, as expected, Equation 8 is maximized at the optimal rate of character change.
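The limiting ratio can be sketched as follows. This is a reconstruction under stated assumptions, not the omitted original: the probability of informativeness is taken to be a product of the form (1 − e^{−λ t_0}) e^{−λ L}, with L (a symbol introduced here) denoting the summed tip length and approaching 4T as t_0 approaches zero, and the optimal rate is taken to approach 1/(4T) in the same limit.

```latex
\begin{align*}
\lim_{t_0 \to 0} \frac{\pi(T, t_0; \lambda)}{\pi(T, t_0; \hat{\lambda})}
&= \lim_{t_0 \to 0}
\frac{\left(1 - e^{-\lambda t_0}\right) e^{-\lambda L}}
     {\left(1 - e^{-\hat{\lambda} t_0}\right) e^{-\hat{\lambda} L}}
= \frac{\lambda\, e^{-4\lambda T}}{\dfrac{1}{4T}\, e^{-1}}
= 4\lambda T\, e^{\,1 - 4\lambda T}.
\end{align*}
```

This expression lies between zero and one for positive λ and T, equals one exactly when 4λT = 1, and integrates over T from zero to infinity to e/(4λ), which agrees with the properties the text attributes to Equation 8.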
However, integration of the right-hand side of Equation 8 over T from zero to infinity yields e/(4λ), a result that attributes a greater net informativeness to a character that evolves at a slower rate (smaller λ). In contrast, characters should supply net information equivalence for each rate when integrated over all of time. Thus, a normalized profile is generated by obtaining ρ(T; λ) such that …. Such a function is readily computed as
… (9)
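A minimal numerical sketch of the normalized profile follows. The closed form used, 16λ²T·exp(−4λT), is a reconstruction rather than a transcription of the omitted formula: it is what results from dividing an index of the form 4λT·exp(1 − 4λT) by the integral e/(4λ) quoted in the text. The function names, the trapezoid integrator, and the rate value are illustrative assumptions.

```python
import math

def informativeness_profile(T, rate):
    """Normalized informativeness of one character at time T (reconstructed form)."""
    return 16.0 * rate ** 2 * T * math.exp(-4.0 * rate * T)

def trapezoid(f, a, b, steps=20_000):
    """Plain trapezoid rule, used only to check the normalization numerically."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

rate = 0.02  # hypothetical rate, in changes per unit time
area = trapezoid(lambda T: informativeness_profile(T, rate), 0.0, 2000.0)
peak_time = 1.0 / (4.0 * rate)  # time at which this character is most informative
```

Under these assumptions the area is one (to numerical precision) whatever rate is chosen, and the profile peaks at T = 1/(4λ), so slower characters peak deeper in time.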
A Profile of the Phylogenetic Informativeness of a Set of Characters
Here, Equation 9 is generalized to profile the informativeness of n characters to resolve polytomies at sequential depths of a phylogenetic tree. Denoting a rate of change for each character λ^sub 1^,… λ^sub n^, the phylogenetic informativeness profile can then be
… (10)
The informativeness of a particular data set at a continuum of depths of a phylogenetic tree (Fig. 2a) may be conveyed by a plot of Equation 10. Figure 2b shows such a phylogenetic informativeness profile for a set of characters each evolving near the optimal rate to resolve the obscure branchings within the more recent of the two depicted polytomies. Note that Equation 10 is uninformative as to whether there is sufficient data to resolve a particular node, as that depends critically upon the unknown length of the internode t^sub 0^. Rather, Equation 10 provides the degree to which a set of characters will be informative in comparison to another character set for which it may also be evaluated. For instance, the data set resulting in the phylogenetic informativeness profile plotted in Figure 2b would have been a fairly uninformative choice for resolving the ancient polytomy depicted in Figure 2a, because the characters evolve at a rate too likely to result in change along tips, obscuring signal that might have arisen within the time comprising the ancient polytomy.
A different character set underlies the informativeness profile depicted in Figure 2c, composed of the same number of characters as in Figure 2b, and evolving at about the same average rate. However, in this new set, one fifth of the characters are evolving at a rate fivefold faster than, and four fifths are evolving fourfold slower than, the characters that underlie the profile in Figure 2b. Such a bimodal distribution of rates could correspond to synonymous and replacement sites in the DNA sequence of a functional gene. In this scenario, the more slowly evolving replacement sites yield some power for the resolution of the deep polytomy. The more rapidly evolving synonymous sites evolve yet too fast for accurate resolution of the relatively recent polytomy. Thus, the set of characters underlying the profile in Figure 2c would be a poor choice for the resolution of obscure branching events within the more recent polytomy of Figure 2b. Although the average rate of evolution of the two genes is approximately equal, the phylogenetic informativeness profile is radically different.
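The contrast between the two character sets can be illustrated with a short sketch. It assumes a per-character profile of the form 16λ²T·exp(−4λT) (a reconstruction, since the formulae are omitted in this copy) summed over characters; the base rate, the set sizes, and the two time depths are hypothetical values chosen only to mimic the scenario described.

```python
import math

def site_profile(T, rate):
    """Assumed per-character informativeness at time T."""
    return 16.0 * rate ** 2 * T * math.exp(-4.0 * rate * T)

def set_profile(T, rates):
    """Informativeness of a character set: the sum over its characters."""
    return sum(site_profile(T, r) for r in rates)

base = 0.01  # hypothetical rate, optimal for a polytomy at T = 1 / (4 * base)

# One set of 100 characters all evolving at the base rate.
uniform_set = [base] * 100
# Same size: one fifth evolve fivefold faster, four fifths fourfold slower.
bimodal_set = [5.0 * base] * 20 + [base / 4.0] * 80

recent = 1.0 / (4.0 * base)   # depth of the more recent polytomy
ancient = 8.0 * recent        # depth of a much older polytomy

uniform_recent = set_profile(recent, uniform_set)
bimodal_recent = set_profile(recent, bimodal_set)
uniform_ancient = set_profile(ancient, uniform_set)
bimodal_ancient = set_profile(ancient, bimodal_set)
```

With these values the mean rates of the two sets are similar (0.010 versus 0.012), yet the uniform set is far more informative at the recent depth and the bimodal set is more informative at the ancient depth.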
This differential phylogenetic informativeness of character sets among historical epochs can be evaluated quantitatively by integrating Equation 10 over the time period of interest. Specifying that period by its commencement, h^sub 1^, and its terminus, h^sub 2^, calculations of
… (11)
yield measures of the relative utility of character sets for resolving ancestral branching order within that epoch. Assigning h^sub 1^ and h^sub 2^ so as to encompass all branching points of a phylogeny provides a summary of the relative informativeness of the character sets to resolve the whole phylogeny. Assigning h^sub 1^ and h^sub 2^ so that they encompass one polytomy or a subset of sequential weakly supported branches provides a more focused appraisal. To establish character sets that will be most informative for compound hypotheses relating to more than one epoch, integrals over multiple epochs of interest may be calculated and either jointly considered or summed to create a single index of informativeness.
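The bounded integration can be sketched in closed form. The sketch assumes a per-character profile of 16λ²T·exp(−4λT) (a reconstruction, since the formulae are omitted in this copy), whose antiderivative is −(1 + 4λT)·exp(−4λT). The function names are hypothetical, and the bounds are sorted so that either ordering of h1 and h2 is accepted.

```python
import math

def cumulative(T, rate):
    """Integral of the assumed single-character profile from 0 to T."""
    return 1.0 - (1.0 + 4.0 * rate * T) * math.exp(-4.0 * rate * T)

def epoch_informativeness(rates, h1, h2):
    """Net informativeness of a character set over the epoch bounded by h1 and h2."""
    lo, hi = sorted((h1, h2))
    return sum(cumulative(hi, r) - cumulative(lo, r) for r in rates)

rates = [0.01] * 10  # hypothetical site rates

whole_history = epoch_informativeness(rates, 0.0, 1.0e6)
first_epoch = epoch_informativeness(rates, 0.0, 50.0)
second_epoch = epoch_informativeness(rates, 50.0, 100.0)
combined = epoch_informativeness(rates, 0.0, 100.0)
```

Two properties follow from this form: integrating over all of history returns the number of characters, and integrals over adjacent epochs add, which is what allows several epochs of interest to be summed into a single index.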
Example: Profiling the Phylogenetic Informativeness of Genes
To briefly illustrate the theory developed here, I apply it to molecular data to generate profiles of the phylogenetic informativeness of four genes characterized by a DNA sequence data set. Alignments of the DNA sequences of genes c-myc, BRCA1, GHR, and RAG1 were extracted from the data set of Steppan et al. (2004) on the phylogeny of muroid rodents. Taxon sampling for this data set was sufficiently large (Pollock and Bruno, 2000; Sullivan et al., 1999) for rates of evolution of the sites to be estimated using the maximum likelihood program DNARates (by Gary Olsen) on the fossil-calibrated global clock-enforced phylogenetic trees of Steppan et al. (2004). Despite being carefully selected by experts for the purpose of resolving the muroid rodent phylogeny, the nucleotide sequences of these four genes demonstrate differential power for the resolution of ancestral branching order. The sequence of BRCA1 is predicted to be the most likely to be informative, followed by RAG1, GHR, and lastly, c-myc (Table 1).
This differential power is illustrated by phylogenetic informativeness profiles over the history encompassed by the inferred phylogeny of Steppan et al. (2004). The informativeness profiles of the four genes are graphed above the phylogeny in Figure 3. The gene BRCA1 has the greatest informativeness during the epoch of interest, followed by RAG1, GHR, and c-myc. Compared to RAG1, BRCA1 has nearly twice the informativeness over the region of interest and yet is composed of about half the number of nucleotides. This remarkable difference in power is possible because the sites of BRCA1 evolve at nearly uniform rates compared to most genes (Fig. 4; Adkins et al., 2001; Delsuc et al., 2002), including RAG1. The rates of substitution of the majority of nucleotides in GHR and c-myc appear to be slower as well as more diverse.
Despite its highly conserved sequence, the informativeness profile of the relatively slowly evolving gene c-myc still indicates that it has some power to resolve recent divergences. The solution to this paradox is clear: variation of rates among sites. This variation is most readily observed in aggregate by comparing codon positions within the gene. Sites residing at the third position in each codon (that can usually withstand substitutions without changing the amino acid sequence of the protein) yield phylogenetic informativeness for recent divergences, while first and second sites yield only phylogenetic informativeness for very ancient divergences (Fig. 4).
The net informativeness of the four genes during three key ancient radiations is tallied in Table 2. Calculation of the informativeness on a per base pair basis (Tables 1, 2) allows estimation of the cost-effectiveness of character sampling across genes. Calculation of the informativeness on a per million years basis (Table 2) allows comparison of the utility of a gene for the inference of phylogenetic relationships across historical radiations. Regardless of the unit used, the rank order of genes by informativeness varies over history. For example, BRCA1 is uniformly the most powerful gene of these four for resolving the muroid rodents (cf. Adkins et al., 2001) and rivals RAG1 in utility for the mammalian radiation (cf. Scally et al., 2002). Yet, it is uniformly the least powerful gene for resolving the early metazoa.
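The two normalizations can be sketched as simple divisions of the net informativeness of an epoch. Everything here is illustrative: the per-character profile 16λ²T·exp(−4λT) is a reconstruction (the formulae are omitted in this copy), and the two "genes", their site counts, their rates, and the epoch bounds are hypothetical values, not the data of Steppan et al. (2004).

```python
import math

def cumulative(T, rate):
    """Integral of the assumed single-character profile from 0 to T."""
    return 1.0 - (1.0 + 4.0 * rate * T) * math.exp(-4.0 * rate * T)

def net_informativeness(rates, h1, h2):
    """Net informativeness of a set of sites over the epoch bounded by h1 and h2."""
    lo, hi = sorted((h1, h2))
    return sum(cumulative(hi, r) - cumulative(lo, r) for r in rates)

def per_site(rates, h1, h2):
    """Informativeness per base pair: a proxy for cost-effectiveness of sampling."""
    return net_informativeness(rates, h1, h2) / len(rates)

def per_million_years(rates, h1, h2):
    """Informativeness per unit time, comparable across epochs of unequal length."""
    return net_informativeness(rates, h1, h2) / abs(h2 - h1)

# Hypothetical genes (rates in changes per site per million years).
fast_gene = [0.01] * 500
slow_gene = [0.0005] * 1000
recent_epoch = (5.0, 25.0)      # million years ago
ancient_epoch = (500.0, 700.0)  # million years ago
```

In this toy case the faster gene is the more informative per site for the recent epoch and the slower gene for the ancient epoch, which mirrors the observation that the rank order of genes varies over history.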
From this analysis it can be predicted that removal of BRCA1 would have the most adverse effect upon the bootstrap support of the inferred phylogeny of the muroid rodents. Another predicted consequence of removing BRCA1 would be to disproportionately reduce the bootstrap support of more recent nodes in the phylogeny, where its phylogenetic informativeness is particularly large compared to other genes. Removal of RAG1, with more numerous but generally slower evolving sites, would, in contrast, diminish bootstrap support deeper in the phylogeny. These predicted consequences are in fact demonstrated in the analysis of Steppan et al. (2004).
DISCUSSION
Here, I used an asymptotic instance of the four-taxon case to derive the evolutionary rate at which a character would be optimally phylogenetically informative for a given historical time. This result was then extended to formulate a chronological measure of the phylogenetic informativeness corresponding to the rate of evolution of a character. Lastly, informativeness profiles of individual characters were summed to create a profile of the informativeness of a set of characters. An example case, profiling the informativeness of four genes for resolving the muroid rodent phylogeny, demonstrated the application of the method and validated its predictions. The theory provides quantitative characterization of the potential inferential power of character sets for resolving soft polytomies.
The four-taxon case used to derive the analytical results presents a tractable and versatile framework for theoretical study of phylogeny. One concern, however, may be the applicability of theory from the four-taxon case to phylogenies with larger numbers of taxa. In this regard, two intuitive extremes may be noted. With uniformly dense branching and sampling over all epochs, it is possible that faster rates of evolution may contribute to a greater degree to phylogenetic inference. Dense and deep sampling may subdivide tips such that it becomes less probable that rapid evolution would completely obscure ancient signal arising at a deep short internode (Poe, 2003). In contrast, when many taxa are sampled that all have extremely short internodes within a brief epoch of interest, as is the case in a rapid radiation, it seems unlikely that faster or slower rates than that predicted here for the four-taxon case would be optimal. Further work will be required to establish the interaction between taxon sampling and the optimal rate for inference. However, the four-taxon case has a sterling record of theoretical utility (Felsenstein, 1978; Gaut and Lewis, 1995; Huelsenbeck and Hillis, 1993) for revealing optimal phylogenetic methodology for larger data sets, due not only to its analytical and computational tractability but also because results based on analysis of the four-taxon case may be readily extrapolated to trees of more taxa (Cummings et al., 2003).
When profiling phylogenetic informativeness to select character sets to assay for phylogenetic analyses, several other points should be kept in mind. First, the informativeness profile conveys the historical epochs during which a character or set of characters are most likely to provide parsimony-informative phylogenetic signal but does not account for the misleading effects of noise caused by convergence to the same character state in divergent lineages (Collins et al., 2005; Felsenstein, 1978). Such convergence will occur more in faster evolving sites than in slower evolving sites (Grundy and Naylor, 1999). Thus, all else being equal, designers of phylogenetic experiments may prefer to select character sets with phylogenetic informativeness profiles that peak very slightly prior to, rather than subsequent to, the epoch of interest. This choice should minimize selection of characters that may have too frequently evolved to convergent states.
However, the effect of convergence should be negligible when character sets evolve at close to the optimal rate. At the optimal rate, multiple changes of character state will be rare: fewer than 3% of characters will have more than one change in a branch of length T. Consequent convergent characters will be randomly dispersed among taxa and should not be significantly misleading (Wenzel and Siddall, 1999). To the extent that lineages vary in rate of character change, those lineages whose characters change state more rapidly will tend to evolve a greater proportion of convergent states and thus may be positively misleading to phylogenetic analysis (Felsenstein, 1978), producing the phenomenon frequently termed “long branch attraction.” The degree (but not the nature) of this misleading effect depends upon the number of states that a character may adopt. The greater the number of states that are accessible to the character, the lower the potentially misleading noise arising from rapidly evolving characters will be (Mossel and Steel, 2004; Steel and Penny, 2000). Specification of the effect of the evolutionary state space available to the character upon estimates of the phylogenetic utility for particular epochs remains to be performed.
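The figure of fewer than 3% can be checked directly under the Poisson assumption used in the theory. A character evolving at the optimal rate described above, one change per summed length of the four tips, has an expected 1/4 change along a single branch of length T; the variable names below are illustrative.

```python
import math

# Expected number of changes on one branch of length T at the rate 1 / (4T).
mean_changes = 0.25

# Poisson probabilities of zero changes and of exactly one change.
p_zero = math.exp(-mean_changes)
p_one = mean_changes * math.exp(-mean_changes)

# Probability of more than one change on that branch.
p_multiple = 1.0 - p_zero - p_one
```

The result is approximately 0.0265, that is, about 2.65% of characters.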
A difficulty for phylogenetic analysis that is closely related to convergence and long-branch attraction has been the accommodation of genome-wide shifts in substitution rate. Evidence has demonstrated that some clades have experienced elevated or reduced rates of substitution compared to sister clades. For example, rodents are known to have an elevated rate of substitution, compared to most of the mammals (Li et al., 1996; Weinreich, 2001). Thus, to reveal the informativeness of genes previously examined within rodents for the phylogenetic analysis of vertebrates or mammals, the time axis of the phylogenetic informativeness profile derived from rodent sequence evolution may require appropriate scaling by the ratio of the relative substitution rate within the rodent clade to the substitution rate among the nonrodent lineages. Provided that the shape of the site rate distribution remains constant across clades, this procedure may produce an appropriate phylogenetic informativeness profile of those character sets for the new experimental clade.
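The proposed rescaling can be sketched as follows, assuming a per-character profile of 16λ²T·exp(−4λT) (a reconstruction, since the formulae are omitted in this copy) and a single clade-wide ratio of substitution rates. The names `rate_ratio` and `rescale_rates` and all numerical values are hypothetical.

```python
import math

def set_profile(T, rates):
    """Assumed informativeness of a set of characters at time T."""
    return sum(16.0 * r ** 2 * T * math.exp(-4.0 * r * T) for r in rates)

def rescale_rates(rates, rate_ratio):
    """Convert rates estimated in a source clade to a target clade.

    rate_ratio is the source-clade substitution rate divided by the
    target-clade rate (greater than 1 if the source clade evolves faster).
    The shape of the site rate distribution is assumed to be unchanged.
    """
    return [r / rate_ratio for r in rates]

source_rates = [0.002, 0.01, 0.03]  # hypothetical site rates in the source clade
rate_ratio = 2.0                    # hypothetical: source clade evolves twice as fast
target_rates = rescale_rates(source_rates, rate_ratio)

T = 60.0
direct = set_profile(T, target_rates)
# Equivalent view: stretch the time axis of the source profile and divide its height.
via_time_axis = set_profile(T / rate_ratio, source_rates) / rate_ratio
```

Under these assumptions, dividing every rate by the ratio is the same as stretching the time axis of the original profile by that ratio while preserving its total area, so the peak for the slower clade lies proportionally deeper in time.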
The shape of the site rate distribution within genes presumably remains stable when there is retention of functional constraints on the protein products (Naylor and Brown, 1997). However, a shift in the site rate distribution has been inferred for some data sets (Lockhart et al., 1998; Miyamoto and Fitch, 1995; Penny et al., 2001), most clearly in the case of the evolution of gene families after functional divergence (Wang and Gu, 2001). However, violations of the assumption of a static site rate distribution (Susko et al., 2002) and of the assumption of stationarity of nucleotide frequency (Fedrigo et al., 2005) may be readily tested for particular data sets. Specification of phylogenetic power profiles predicted from highly parameterized evolutionary models that may ameliorate such deviations (e.g., Galtier, 2001; Gu, 2001; Whelan et al., 2001) remains to be performed.
In addition to enabling judicious choice of character sets for phylogenetic analysis, profiling phylogenetic informativeness allows evaluation of the relative utility of disparate types of characters for their value in phylogenetic studies. Here, first, second, and third positions of codons were partitioned for four genes, and third positions were shown to result in greater inferential power for recent epochs and lesser inferential power for ancient epochs. The striking third-position effect depicted in Figure 4 is an outcome of the frequently rapid rate of evolution of unconstrained third-position sites within codons; the effect in three of four genes examined is dramatic, despite site-to-site variation of synonymous substitution rates (Pond and Muse, 2005). Other categorizations of characters may be conceived, such as coding versus noncoding sites, or nucleotide versus amino acid characters. Detailed comparisons of the informativeness of diverse characters are currently underway and will guide selection of the most powerful data at hand for the purpose of testing phylogenetic hypotheses.
Although the rate of speciation in a rapid radiation may be readily characterized for well-resolved trees (e.g., Nee, 2001, 2005), poor resolution in a phylogenetic tree presents an inferential dilemma with regard to potential rapid radiation. Consideration of the phylogenetic informativeness profile of a data set conveys new insight into whether unresolved branches (e.g., those with poor bootstrap support) are due to short branches (rapid radiation) or due to poor signal in the genes used. Low support values for a branch can arise from either situation, but only sites evolving at inappropriate rates would result in a low phylogenetic informativeness profile during the epoch of interest. Low informativeness paired with poor resolution calls for more data. High informativeness and poor resolution indicate rapid radiation. As examples of rapid radiation are considered with regard to the phylogenetic informativeness applied, it may become possible to establish a quantitative relationship between the informativeness of the data applied, resolution achieved, and the rapidity of radiation that may be reliably inferred. Because all approaches that use characters from extant taxa to infer evolutionary history are predicated upon an assumption that the evolutionary process has left a recoverable signature of historical parameters in those characters, such a relationship between informativeness, resolution, and rapidity of radiation would ultimately provide quantitative means for evaluating the long-term feasibility of resolving the most recalcitrant ancestral relationships (Table 2).
Most importantly, profiling phylogenetic informativeness, performed after completion of a previous or preliminary investigation, informs the choice of genes for future studies. By furnishing quantitative estimates of the informativeness for specific time periods, profiling phylogenetic power enables optimal experimental design. Historically, the utility of a gene for a study has largely been decided by qualitative heuristics based upon the average rate of evolution of a gene or by experiential impressions of the gene's utility in studies of taxa more or less divergent from the taxa of interest. With the sequence of genomes dispersed across the tree of life, simultaneous estimation of the phylogenetic informativeness profile of many orthologous genes is possible. These estimates may be compared to optimize gene choice. It is hoped that quantitative profiling of the phylogenetic informativeness of candidate genes using genome sequence data, preliminary data, or data from previous genic studies will supplant the contentious opinions of experts with an accurate and precise methodology for choosing character sets during the experimental design phase of a phylogenetic study.
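The comparison of candidate genes over a specified epoch might be sketched as follows. This is a minimal illustration, not the implementation used for the analyses reported here. It assumes that each site contributes a profile of the form 16 * rate^2 * t * exp(-4 * rate * t), that a gene's profile is the sum over its sites, and that site rates are already estimated in substitutions per site per unit of the same time scale used for the epoch bounds. The function names and the example rates are hypothetical.

```python
import math


def site_profile(rate, t):
    """Assumed per-site informativeness at time t for a site with this rate."""
    return 16.0 * rate ** 2 * t * math.exp(-4.0 * rate * t)


def site_informativeness(rate, t_start, t_end):
    """Closed-form integral of site_profile over the epoch [t_start, t_end]."""
    def antiderivative(t):
        # d/dt of -(1 + 4*rate*t) * exp(-4*rate*t) equals site_profile(rate, t)
        return -(1.0 + 4.0 * rate * t) * math.exp(-4.0 * rate * t)
    return antiderivative(t_end) - antiderivative(t_start)


def gene_informativeness(site_rates, t_start, t_end):
    """Sum the bounded per-site integrals across all sites of a gene."""
    return sum(site_informativeness(r, t_start, t_end) for r in site_rates)


def rank_genes(genes, t_start, t_end, per_site=False):
    """Return (gene, score) pairs sorted from most to least informative.

    genes maps a gene name to a non-empty list of per-site rates.
    With per_site=True the score is divided by the number of sites,
    a rough proxy for informativeness per base pair sequenced.
    """
    scores = []
    for name, rates in genes.items():
        score = gene_informativeness(rates, t_start, t_end)
        if per_site:
            score /= len(rates)
        scores.append((name, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical rates for illustration only; not estimates from real genes.
    candidates = {"fast_gene": [0.05] * 10, "slow_gene": [0.001] * 10}
    print(rank_genes(candidates, 0.0, 10.0))     # a recent epoch
    print(rank_genes(candidates, 200.0, 600.0))  # an ancient epoch
```

With these hypothetical rates, the faster gene ranks first for the recent epoch and the slower gene ranks first for the ancient one. Using the closed-form integral rather than numerical quadrature keeps the sketch free of dependencies; real rate estimates would come from a separate likelihood analysis.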
Conclusions about the relative utility of adding characters or taxa to a current phylogenetic study have subtly hinged upon the appropriateness of the rate of evolution of the characters added for resolution of the phylogeny in question. Clearly, the addition of characters evolving at optimal rates will have much greater impact upon accurate phylogenetic analysis than will addition of characters with an inappropriate rate of evolution. Development of practical analytical predictions of the asymptotic impact of adding taxa (cf. Goldman, 1998; Huelsenbeck, 1991b; Kim, 1996, 1998; Poe, 2003) would complement computational investigations of the relative utility of these two methods of expanding acquired data (Graybeal, 1994; Pollock and Bruno, 2000; Rokas and Carroll, 2005). Synthesized with complementary elaboration of the quantitative theory presented herein, such a development could culminate in a rigorous and comprehensive theory for phylogenetic experimental design.
ACKNOWLEDGEMENTS
Thanks to Robert Friedman for bioinformatic assistance converting sequence files. Thanks also to John Taylor, Paul Lewis, Elizabeth Jockusch, Robert Friedman, Peter Gogarten, Alison Galvani, an anonymous reviewer, and associate editor Gavin Naylor for helpful comments on drafts of the manuscript.
REFERENCES
Adkins, R. M., E. L. Gelke, D. Rowe, and R. L. Honeycutt. 2001. Molecular phylogeny and divergence time estimates for major rodent groups: Evidence from multiple genes. Mol. Biol. Evol. 18:777-791.
Baker, R. H., and R. DeSalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian Drosophilids. Syst. Biol. 46:654-673.
Berbee, M. L., D. A. Carmean, and K. Winka. 2000. Ribosomal DNA and resolution of branching order among the ascomycota: How many nucleotides are enough? Mol. Phylogenet. Evol. 17:337-344.
Berbee, M. L., and J. W. Taylor. 2001. Fungal molecular evolution: Gene trees and geologic time. Pages 229-245 in The Mycota. VII. Part B. Systematics and evolution (D. J. McLaughlin, E. G. McLaughlin, and P. A. Lemke, eds.). Springer-Verlag, Berlin Heidelberg.
Collins, T. M., O. Fedrigo, and G. J. P. Naylor. 2005. Choosing the best genes for the job: The case for stationary genes in genome-scale phylogenetics. Syst. Biol. 54:493-500.
Cummings, M. P., S. A. Handley, D. S. Myers, D. L. Reed, A. Rokas, and K. Winka. 2003. Comparing bootstrap and posterior probability values in the four-taxon case. Syst. Biol. 52:477-487.
Dacks, J. B., and W. F. Doolittle. 2001. Reconstructing/deconstructing the earliest eukaryotes: How comparative genomics can help. Cell 107:419-425.
Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361-375.
Delsuc, F., M. Scally, O. Madsen, M. J. Stanhope, W. W. de Jong, F. M. Catzeflis, M. S. Springer, and E. J. P. Douzery. 2002. Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Mol. Biol. Evol. 19:1656-1671.
Dequeiroz, A., and P. H. Wimberger. 1993. The usefulness of behavior for phylogeny estimation: Levels of homoplasy in behavioral and morphological characters. Evolution 47:46-60.
Farris, J. S. 1989. The Retention Index and the Rescaled Consistency Index. Cladistics Int. J. Willi Hennig Soc. 5:417-419.
Fedrigo, O., D. C. Adams, and G. J. Naylor. 2005. DRUIDS: Detection of regions with unexpected internal deviation from stationarity. J. Exp. Zool. B Mol. Dev. Evol. 304:119-128.
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783-791.
Felsenstein, J. 2001. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol. 53:447-455.
Fiala, K. L., and R. R. Sokal. 1985. Factors determining the accuracy of cladogram estimation: Evaluation using computer simulation. Evolution 39:609-622.
Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866-873.
Gaut, B. S., and P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-162.
Goldman, N. 1998. Phylogenetic information and experimental design in molecular systematics. Proc. Biol. Sci. 265:1779-1786.
Graybeal, A. 1994. Evaluating the phylogenetic utility of genes: A search for genes informative about deep divergences among vertebrates. Syst. Biol. 43:174-193.
Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47:9-17.
Grundy, W. N., and G. J. Naylor. 1999. Phylogenetic inference from conserved sites alignments. J. Exp. Zool. 285:128-139.
Gu, X. 2001. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453-464.
Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3-8.
Hillis, D. M., and J. P. Huelsenbeck. 1992. Signal, noise, and reliability in molecular phylogenetic analyses. J. Hered. 83:189-195.
Huelsenbeck, J. P. 1991a. Tree-length distribution skewness: An indicator of phylogenetic information. Syst. Zool. 40:257-270.
Huelsenbeck, J. P. 1991b. When are fossils better than extant taxa in phylogenetic analysis? Syst. Zool. 40:458-469.
Huelsenbeck, J. P., and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.
Huelsenbeck, J. P., and F. Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.
Kim, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363-374.
Kim, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:43-60.
Li, W. H., D. L. Ellsworth, J. Krushkal, B. H. Chang, and D. Hewett-Emmett. 1996. Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol. 5:182-187.
Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183-1188.
Miyamoto, M. M., and W. M. Fitch. 1995. Testing the covarion hypothesis of molecular evolution. Mol. Biol. Evol. 12:503-513.
Mossel, E., and M. Steel. 2004. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187:189-203.
Naylor, G. J., and W. M. Brown. 1997. Structural biology and phylogenetic estimation. Nature 388:527-528.
Naylor, G. J. P., and W. M. Brown. 1998. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 47:61-76.
Nee, S. 2001. Inferring speciation rates from phylogenies. Evol. Int. J. Org. Evol. 55:661-668.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J. Mol. Evol. 53:711-723.
Poe, S. 1998. Sensitivity of phylogeny estimation to taxonomic sampling. Syst. Biol. 47:18-31.
Poe, S. 2003. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Syst. Biol. 52:423-428.
Poe, S., and A. L. Chubb. 2004. Birds in a bush: Five genes indicate explosive evolution of avian orders. Evolution 58:404-415.
Pollock, D. D., and W. J. Bruno. 2000. Assessing an unknown evolutionary process: Effect of increasing site-specific knowledge through taxon addition. Mol. Biol. Evol. 17:1854-1858.
Pollock, D. D., D. J. Zwickl, J. A. McGuire, and D. M. Hillis. 2002. Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. 51:664-671.
Pond, S. K., and S. V. Muse. 2005. Site-to-site variation of synonymous substitution rates. Mol. Biol. Evol. 22:2375-2385.
Rannala, B., J. P. Huelsenbeck, Z. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47:702-710.
Ree, R. H. 2005. Detecting the historical signature of key innovations using stochastic models of character evolution and cladogenesis. Evol. Int. J. Org. Evol. 59:257-265.
Rokas, A., and S. B. Carroll. 2005. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol. 22:1337-1344.
Rokas, A., and P. W. Holland. 2000. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15:454-459.
Rokas, A., N. King, J. Finnerty, and S. B. Carroll. 2003a. Conflicting phylogenetic signals at the base of the metazoan tree. Evol. Dev. 5:346-359.
Rokas, A., D. Kruger, and S. B. Carroll. 2005. Animal evolution and the molecular signature of radiations compressed in time. Science 310:1933-1938.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003b. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Rosenberg, M. S., and S. Kumar. 2001. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. USA 98:10751-10756.
Rosenberg, M. S., and S. Kumar. 2003. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol. 52:119-124.
Scally, M., O. Madsen, C. J. Douady, W. W. de Jong, M. J. Stanhope, and M. S. Springer. 2002. Molecular evidence for the major clades of placental mammals. J. Mammal. Evol. 8:239-277.
Shpak, M., and G. A. Churchill. 2000. The information content of a character under a Markov model of evolution. Mol. Phylogenet. Evol. 17:231-243.
Slowinski, J. B. 2001. Molecular polytomies. Mol. Phylogenet. Evol. 19:114-120.
Steel, M., and D. Penny. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol. Biol. Evol. 17:839-850.
Steppan, S., R. Adkins, and J. Anderson. 2004. Phylogeny and divergence-date estimates of rapid radiations in muroid rodents based on multiple nuclear genes. Syst. Biol. 53:533-553.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:1347-1356.
Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:1514-1523.
Walsh, H. E., M. G. Kidd, T. Moum, and V. L. Friesen. 1999. Polytomies and the power of phylogenetic inference. Evolution 53:932-937.
Wang, Y., and X. Gu. 2001. Functional divergence in the caspase gene family and altered functional constraints: Statistical analysis and prediction. Genetics 158:1311-1320.
Weinreich, D. M. 2001. The rates of molecular evolution in rodent and primate mitochondrial DNA. J. Mol. Evol. 52:40-50.
Weisrock, D. W., L. J. Harmon, and A. Larson. 2005. Resolving deep phylogenetic relationships in salamanders: Analyses of mitochondrial and nuclear genomic data. Syst. Biol. 54:758-777.
Wenzel, J. W., and M. E. Siddall. 1999. Noise. Cladistics Int. J. Willi Hennig Soc. 15:51-64.
Whelan, S., P. Lio, and N. Goldman. 2001. Molecular phylogenetics: State-of-the-art methods for looking into the past. Trends Genet. 17:262-272.
Wiens, J. J., and M. R. Servedio. 1997. Accuracy of phylogenetic analysis including and excluding polymorphic characters. Syst. Biol. 46:332-345.
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis. Syst. Biol. 47:125-133.
Yang, Z. H. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.
Zwickl, D. J., and D. M. Hillis. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51:588-598.
First submitted 7 April 2006; reviews returned 10 August 2006; final acceptance 19 October 2006
Associate Editor: Gavin Naylor
JEFFREY P. TOWNSEND
Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut 06520, USA; E-mail: Jeffrey.Townsend@Yale.edu
Copyright Society of Systematic Biologists Apr 2007
(c) 2007 Systematic Biology. Provided by ProQuest Information and Learning. All rights Reserved.
