Re: New paper on Neoaves



You and David hold to the assumption that homoplasy is random, and can be
overcome simply by enlarging the dataset or expanding the taxon sample.

That was a late-night oversight on my part. If some of the taxa in a data matrix have a base-composition bias, a model of evolution estimated from those data can clearly lead to a wrong result.


Adding taxa seems to be a good idea in general, however.

Shannon M. Hedtke, Ted M. Townsend & David M. Hillis: Resolution of Phylogenetic Conflict in Large Data Sets by Increased Taxon Sampling, Systematic Biology 55(3), 522 -- 529 (June 2006)

No abstract, so I'll retype the conclusions:

"No particular number of genes or taxa will guarantee that phylogenetic reconstruction is accurate, even if bootstrap support for that reconstruction is high. If conflicting signals between genes are due to method inconsistency, adding more genes may lead to increasing support for the incorrect phylogenetic reconstruction. [That's the definition of "method inconsistency".] In such cases, increasing taxon representation may improve accuracy more than does increasing gene number. If we incorporate our understanding of sources of inconsistency into study design, resulting phylogenies are more likely to be representative of evolutionary history.
For any given study, how can an investigator know whether it is better to add more characters or add more taxa to a phylogenetic analysis? High support values for individual clades indicate that sufficient characters have been collected to converge on a robust result. Unfortunately, the well-supported result may be wrong, particularly if small trees with long branches are being estimated. This outcome appears to be especially likely when intensively sampled genomes have been selected across relatively few, distantly related species -- as with model organisms. In such cases, any slight systematic bias can become magnified and misinterpreted as phylogenetic signal. High bootstrap or other support values are almost guaranteed with genome-sized character sets: the analyses will tend to converge on some answer, even if the answer has more to do with biases in the analysis than phylogenetic history. Therefore, it is important to investigate possible sources of systematic bias, such as long-branch attraction or model misspecification. Simulation studies can help determine the likelihood of long-branch attraction problems in these situations and suggest where additional taxon sampling should occur."
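
A toy sketch in Python can illustrate the quoted point that more characters only sharpen support for whatever answer a biased analysis converges on. This is not taken from Hedtke et al. and is not a real phylogenetic analysis: it assumes three candidate trees, and it assumes each character independently "votes" for one of them, with the wrong tree slightly favoured by a systematic bias. The function name and the probabilities 0.32 and 0.36 are invented for illustration only.

```python
import random

def bootstrap_support_for_wrong_tree(n_chars, p_true=0.32, p_wrong=0.36,
                                     n_reps=100, seed=1):
    """Toy model: fraction of bootstrap replicates won by the wrong tree.

    Assumption (hypothetical): every character votes for tree 0 (true),
    tree 1 (wrong, favoured by a systematic bias) or tree 2 (another
    wrong tree), independently of all other characters.
    """
    rng = random.Random(seed)
    weights = [p_true, p_wrong, 1.0 - p_true - p_wrong]
    # "Collect" the data set: one vote per character.
    chars = rng.choices([0, 1, 2], weights=weights, k=n_chars)

    wrong_wins = 0
    for _ in range(n_reps):
        # Nonparametric bootstrap: resample characters with replacement.
        sample = rng.choices(chars, k=n_chars)
        votes = [sample.count(tree) for tree in (0, 1, 2)]
        # The replicate supports the wrong tree if it strictly wins the vote.
        if votes[1] > votes[0] and votes[1] > votes[2]:
            wrong_wins += 1
    return wrong_wins / n_reps

if __name__ == "__main__":
    for n in (50, 500, 5000, 20000):
        support = bootstrap_support_for_wrong_tree(n)
        print(f"{n:6d} characters: wrong-tree bootstrap support = {support:.2f}")
```

Under these assumptions the support for the wrong tree is erratic with few characters and climbs toward 100% as characters are added, because the bias is in every character and resampling cannot remove it.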


Here's one case that highlights the problem. Naylor and Brown (1998) used
over 12,000 bases (from 19 mitochondrial genes) and recovered 100% bootstrap
support for a clade that comprised vertebrates and echinoderms to the
exclusion of amphioxus (lancelet). This topology was recovered regardless
of the method of analysis.

Since this was 1998 (you know, when some people honestly believed Rodentia was paraphyletic to the rest of Placentalia, and Passeriformes the sister-group to all other extant birds), I'll simply blame the model, and perhaps the fact that Bayesian analysis was not yet available.


But the thing that is obvious about *morphology*-based phylogenetic analyses is that they are almost always followed by a discussion of which morphological characters (synapomorphies) unite which taxa. In other words, it's plain to see the identity of the characters that diagnose certain clades. This rarely happens with *molecular* clades. Here, the characters are at the level of genes and amino acids, and the structural and functional properties of the sequences are skimmed over.

The reason is obvious: molecular characters all look the same and are thus very boring, with the few exceptions of those that can be treated like morphological ones (e.g. the unique 9-bp deletion in BRCA1 that is an autapomorphy of Afrotheria).


BTW, the "Ontogeny Discombobulates Phylogeny" paper (Wiens et al., Syst. Biol., February 2005) has a couple of clades that have high Bayesian posterior probabilities and are nevertheless clearly spurious. However, these are based on morphological data of paedomorphic salamanders (miscoded as if they were metamorphosed adults).