Counterexample to a Claim About the Reconstruction of Ancestral Character States
Posted on: Saturday, 1 October 2005, 03:00 CDT
By Lucena, Brian; Haussler, David
Since Pauling and Zuckerkandl first suggested it more than 40 years ago, the idea of reconstructing ancestral proteins and DNA sequences from the information contained in sequences of present day species has held considerable fascination (Pauling and Zuckerkandl, 1963). Such reconstructions can provide a unifying framework for understanding the molecular origins and evolution of key components in living organisms. However, only recently has it become relatively straightforward to perform such reconstructions and then test the reconstructed molecules functionally in the lab. Now there is a surge of activity in this area. Ancestral protein sequences for rhodopsin (Chang et al., 2002), ultraviolet vision gene SWS1 (Shi and Yokoyama, 2003), ribonucleases (Jermann et al., 1995; Zhang and Rosenberg, 2002), Tu elongation factors (Gaucher, 2003), and steroid receptors (Thornton et al., 2003) have been reconstructed and tested (see the reviews by Chang and Donoghue [2000] and Thornton [2004]). In addition, DNA from the common ancestor of placental mammals has been reconstructed for a megabase-sized region containing the cystic fibrosis gene (Blanchette et al., 2004.) for several families of transposons (Adey et al., 1994; Ivies et al., 1997; Smit and Riggs, 1996; Jurka, 2002), and for some complete small genomes like HIV (Hillis et al., 1994). In the latter case the predicted ancestral sequences were compared to the known ones, so the accuracy of the resonstruction could be measured directly. However, in other cases theoretical results (Yang et al., 1995; Schultz et al., 1996; Schultz and Churchill, 1999) and computer simulations (Zhang and Nei, 1997; Blanchette et al., 2004; Cai et al., 2004) are required to assess the accuracy of the reconstructed sequence.
In these investigations it has been observed that the topology of the phylogenetic tree relating the present day species to the target ancestral species can affect the accuracy obtainable in reconstruction of the target ancestral character states. Simulations show that a starlike phylogeny, i.e., a rapid radiation of many different lineages from a common target ancestor, such as occured in the radiation of placental mammals, allows the ancestral character states of that target ancestor to be more accurately reconstructed than those of more recent ancestors in parts of the tree where speciation events are more regularly spaced (Blanchette et al., 2004.). More generally, it has been claimed that the star phylogeny always "represents the best case for ancestral character state reconstruction, because each observation is conditionally independent and yields maximum information about the ancestor" (Schultz et al., 1996), see also Schultz and Churchill (1999). Here we show that the actual situation is more complex and depends on the branch lengths. This complexity occurs even for the simplest evolutionary model with only one parameter: the Poisson model (known also as the Neyman r-state model, generalized Jukes-Cantor model, and the Potts model), where the parameter determines the rate of substitution and all substitutions are equally likely.
Consider the tree topology shown in Figure 1, where A represents the common ancestor of three present day species, designated by C^sub 1^, C^sub 2^, and C^sub 3^. B^sub 2^ represents the common ancestor of C^sub 2^ and C^sub 3^ whereas B^sub 1^ represents the ancestor of C^sub 1^ at the same moment in evolutionary time as B^sub 2^. This tree contains a subtree with the simplest star topology, the two-leaf subtree shown in bold in Figure 1b, and it also contains a subtree with the simplest nonstar topology, the "Y topology" shown in Figure 1c. The question is: Which of these two topologies (b or c) is better for reconstructing the ancestral character at A? To make this more concrete, suppose each character takes a value in the 20-letter alphabet of amino acids. You want to learn about the ancestral character A, which has a discrete uniform marginal distribution. Imagine that your budget only allows you to determine two of the three characters C^sub 1^, C^sub 2^, and C^sub 3^ of the contemporary species. Which two should you choose?
FIGURE 1. (a) The evolutionary tree. (b) The star topology is highlighted. (c) The Y-topology is highlighted.
The information present about a variable X from related variables Y and Z is given by the mutual information I(X; Y, Z) (Cover and Thomas, 1991.). The higher the mutual information, the more reconstructible is X from Y and Z. Because C^sub 2^ and C^sub 3^ are interchangeable, our problem thus reduces to the question of which is the larger mutual information, I(A; C^sub 1^, C^sub 3^) or I(A; C^sub 2^, C^sub 3^). For short branches I(A; C^sub 1^, C^sub 3^) is higher, as claimed in (Schultz et al., 1996; Schultz and Churchill, 1999) and has been observed in simulations (Blanchette et al., 2004). That is, you would prefer a subtopology that is a two-branch star, where the species you observe share no common ancestor except A (Fig. 1b). As a concrete example, suppose A is drawn uniformly from the set of all 20 amino acids and the conservation probability is .75 for each branch of the tree (i.e. from A to B^sub 1^ or B^sub 2^, from B^sub 1^ to C^sub 1^, or from B^sub 2^ to C^sub 2^ or C^sub 3^). This corresponds to about .29 expected substitutions per site (for each branch). Under these conditions I(A, C^sub 1^, C^sub 3^) = 2.419 but I(A;C^sub 2^, C^sub 3^) = 1.949. This is intuitive, because these species give independent evidence about the ancestral character (they are conditionally independent given the ancestor A). But in a long branch setting with a much lower conservation probability, I(A;C^sub 2^, C^sub 3^) is higher. For example, if the conservation probability is .15 (expected number of substitutions [approximate]2.139) , we have I(A; C^sub 1^, C^sub 3^) = .003162 but I(A;C^sub 2^, C^sub 3^) = .003215. Thus, somewhat nonintuitively, for reconstruction of the ancestral character in a long branch setting it is better to have a Y-topology, where there is an intermediate common ancestor, such as in Figure 1c. In this case the observed characters are conditionally dependent given the ancestral character A.
FIGURE 2. (a) The evolutionary tree. (b) The star topology is highlighted. (c) A correlated topology is highlighted.
This effect remains if a fourth (conditionally) independent species branching from A is allowed (Fig. 2) and we are allowed to choose three of the four species. Using the same parameters (conservation probability = .15), we get I(A; C^sub 1^, C^sub 2^, C^sub 3^) = .004796 and I(A;C^sub 1^, C^sub 2^, C^sub 4^) = .004743. This shows that even in cases where the target ancestor for reconstruction is the last common ancestor of all observed present day species, the star topology is not always best.
These results contrast with those of the case of binary characters. There it has been proven that the star topology is always best for reconstruction of the ancestoral character state for a tree with any number of leaves under the generalized Jukes-Cantor model (Evans et al., 2000) (Theorem 6.1), validating the claim of (Schulte et al., 1996; Schulte and Churchill, 1999) for this case. The counterexamples we give here show that there is a fundamental difference in the behavior of this problem when there are two states versus when there are many.
Phenomena which are quite similar (and mathematically deeper) to the results in this paper have been demonstrated in the mathematical literature regarding probability on trees. The asymptotic analysis of the Poisson model in (Mossel, 2001) and (Mossel and Peres, 2003) (there referred to as the Potts model) demonstrates that there exist nonstar topologies that have strictly positive information about the roots at the leaves (asymptotically), whereas the corresponding star topology would have information approaching zero. A thorough understanding of this result makes the results described in this paper somewhat less surprising. It should also be noted that similar phenomenon can occur with a binary state space if the model is asymmetric. This is suggested by the results in (Mossel, 2001). Furthermore, Mossel (2001) demonstrates another example where (in the language of this paper) I(A;C^sub 1^, C^sub 2^, C^sub 4^) = I(A;C^sub 1^, C^sub 2^) = 0 yet I(A;C^sub 2^, C^sub 3^) > 0. However, the model used there is unlikely to be biologically meaningful as it involves a transition matrix that does not correspond to a continuous or reversible markov process.
Biologists may be interested in computing the information some subset of species has about the root of a tree, given some probability distribution on the tree. The typical model specification is given by a marginal distribution of the root of a tree and a conditional distribution for each node given its parent. Although computing the information is theoretically quite straightforward, it effectively requires evaluating the joint probability of every state configuration of the root and the leaves. In a general n-state model on an arbitrary tree with k leaves, this requires n^sup (k+1)^ evaluations of the joint probability function. Symmetry in the model and/or the tree can often make this fairly simple. For exampl\e, for the Poisson model with a star topology with k leaves, the joint probability of a particular configuration depends only on how many leaves are the same as the root, so only k + 1 different probabilities need be computed. It should also be noted that, in the typical model specification, computing a joint probability of the root and leaves requires taking a sum over the internal nodes of the tree. Doing this with efficient methods contributes an additional factor of cn^sup 2^ (or just en if the tree has only two levels), where c is the number of internal nodes to the computational complexity. So, in general, computing the mutual information in these scenarios can become computationally infeasible unless there is a great deal of symmetry. For this reason, theoretical results which shed light on optimal choices can be of great practical value.
ACKNOWLEDGEMENTS
This work was supported in part by an NSF Mathematical Sciences Postdoctoral Fellowship.
REFERENCES
Adey, N. et al. 1994. Molecular resurrection of an extinct ancestral promoter for mouse L1. Proc. Natl. Acad. Sci. USA 91:1569- 1573.
Blanchette, M. et al. 2004. Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 14:2412-2423.
Cai, W. et al. 2004. Reconstruction of ancestral protein sequences and its application. BMC Evol. Biol. 4:33.
Chang, B. and M. Donoghue. 2000. Recreating ancestral proteins. Trends Ecol. Evol. 15:109-114.
Chang, B. et al. 2002. Recreating a functional ancestral archosaur visual pigment. Mol. Biol. Evol. 19:1483-1483.
Cover, T. and J. Thomas. 1991. Elements of information theory. John Wiley & sons. New York.
Evans, W. et al. 2000. Broadcasting on trees and the ising model. Ann. Appl. Prob. 10:410-433.
Gaucher, E. et al. 2003. Inferring the palaeoenvironment of ancient bacteria on the basis of resurrected proteins. Nature 425:285-288.
Hillis, D. M. et al. 1994. Application and accuracy of molecular phylogenies. Science 264:671.
Ivics, Z. et al. 1997. Molecular reconstruction of sleeping beauty, a Tc1-like transposon from fish, and its transposition in human cells. Cell 91:501-510.
Jermann, T. et al. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57-59.
Jurka, J. 2002. Repbase update: A database and an electronic journal of repetitive elements. Trends Genet. 16:418-420.
Mossel, E. 2001. Reconstruction on trees: Beating the second eigenvalue. Ann. Appl. Probab. 11:285-300.
Mossel, E., and Y. Peres. 2003. Information flow on trees. Ann. Appl. Probab. 13:817-844.
Pauling, L., and E. Zuckerkandl. 1963. Chemical paleogenetics, molecular restoration studies of extinct forms of life. Acta Chem. Scand., 17:
Schultz, T. and G. Churchill. 1999. The role of subjectivity in reconstructing ancestral character states: A Bayesian approach to unknown rates, states and transformation asymmetries. Syst. Biol. 48:651-664.
Schultz, T. et al. 1996. The reconstruction of ancestral character states. Evolution 50:504.
Shi, Y., and S. Yokoyama. 2003. Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates. Proc. Natl. Acad. Sci. USA 100:8308-8313.
Smit, A., and A. Riggs. 1996. Tiggers and other DNA transposon fossils in the human genome. Proc. Natl. Acad. Sci. USA 93:1443- 1448.
Thomton, J. 2004. Resurrecting ancient genes: Experimental analysis of extinct molecules. Nat. Rev. Genet. 5:366-375.
Thomton, J. W. et al. 2003. Resurrecting the ancestral steroid receptor: Ancient origin of estrogen signaling. Science 301:1714- 1717.
Yang, Z. et al. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics. 141:1641-1650.
Zhang, J. and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by parsimony, likelihood and distance methods. J. Mol. Evol. 44(S1):S139-S146.
Zhang, J. and H. Rosenberg. 2002. Complementary advantageous substitutions in the evolution of an antiviral Rnase of higher primates. Proc. Natl. Acad. Sci. USA 99:5486-5491.
First submitted 28 October 2004; reviews returned 11 November 2004; final acceptance 14 December 2004
Associate Editor: Mike Steel
BRIAN LUCENA1 AND DAVID HAUSSLER2
1 Division of Computer Science, University of California, Berkeley, California, USA; E-mail: lucena@cs.berkeley.edu
2 Howard Hughes Medical Institute, Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA; E-mail: haussler@cse.ucsc.edu
Copyright Society of Systematic Biologists Aug 2005
Source: Systematic Biology
Related Articles
- Study: Dinosaurs, Ancestors Coexisted
- A Field Study of Host Tree Associations of an Exotic Species, the Asiatic Oak Weevil [Cyrtepistomus Castaneus (Roelofs 1873), Coleoptera: Curculionidae]
- Acacia Mangium, a Nurse Tree Candidate for Reforestation on Degraded Sandy Soils in the Malay Peninsula
- Massachusetts Trees Attacked by Two Moth Species
- Scientists Discover New Species of Coral
- The Contest Between Parsimony and Likelihood
- Trees Versus Characters and the Supertree/Supermatrix "Paradox"
- Quest Begins to Restore 'Champion' Trees
- In Ethiopia, new branch of humans' family tree
- Experts Claim Dogs Likely Originated in Asia
User Comments (0)

RSS Feeds