
Re: New paper on Neoaves




David Marjanovic wrote:

> Why? Of course you can. More data can never hurt, and as long as you haven't reached a certain (high) limit of characters per taxon, it actually adds signal. The idea is that the signal adds up while the noise cancels itself out.

Yes, so runs the theory. As Huelsenbeck et al. (1996) put it: "... it is often argued that different data (e.g. genes) that are evolving at different rates may interact positively to resolve different levels of a phylogenetic tree. A slowly evolving gene might be useful in resolving older evolutionary splits but be of little use for younger groups, whereas rapidly evolving genes might be best for accurately resolving recent speciation events." They go on to examine whether this is always actually true - and how would we know if it were not?


Huelsenbeck, J.P., Bull, J.J. and Cunningham, C.W. (1996). Combining data in phylogenetic analysis. Trends Ecol. Evol. 11: 152-158.
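
Just to put rough numbers on the rate argument, here's a back-of-the-envelope sketch in Python. The Jukes-Cantor expectation is standard; the two rates and the split ages are invented purely for illustration:

import math

def jc_p_distance(rate, time):
    """Expected fraction of differing sites between two lineages that
    split 'time' units ago (total path length = 2 * rate * time)."""
    d = 2.0 * rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Hypothetical rates (subs/site/unit time) and split ages - invented.
for label, rate in [("slow gene", 0.001), ("fast gene", 0.02)]:
    for t in (5, 50, 500):
        print(f"{label}: split at t={t:3d} -> expected p-distance {jc_p_distance(rate, t):.3f}")

# The fast gene is pushing the 0.75 saturation ceiling for the old
# splits (signal mostly erased), while the slow gene has barely moved
# for the young ones - hence the hope that combining them helps.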

I agree that more data probably can't hurt (although see below), but it isn't a panacea for a flawed phylogenetic method. The more-data-is-always-better argument is really an application of the Law of Large Numbers: as more independent trials are conducted, they tend to converge on the correct answer with greater frequency. Thus, as more data are collected, phylogenetic noise should cancel out and the true phylogenetic signal should come through (a toy simulation of this idea follows the Naylor and Brown reference below). By 'noise' I mean homoplasy - the kind of thing that drives long-branch attraction - and I figure you do too.

However, you have to take functional constraints into account. This is especially important for protein-coding genes: individual nucleotide sites are not equivalent in their ability to change, so they can't be regarded as independent characters. On top of this we have codon usage bias, which matters most at six-fold degenerate codons. I know you address this, but it isn't the whole story. Protein sequence is constrained by function: there's only so much wiggle-room at each amino acid site before the protein becomes dysfunctional (or, rarely, takes on a whole new function). As Naylor and Brown (1998) put it:

"...once a sequence sample of reasonable size has been obtained, accurate phylogenetic estimation may be better served by incorporating knowledge of molecular structures and processes into inference models and by seeking additional higher order characters embedded in those sequences, than by gathering ever larger sequence samples from the same organisms in the hope that the historical signal will eventually prevail."

Naylor, G.J.P. and Brown, W.M. (1998). Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparison of sequences. Syst. Biol. 47: 61-76.
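
Here is the toy simulation I promised, to make the law-of-large-numbers picture concrete (Python; the 5% 'signal' figure and everything else are invented - the point is only that unbiased noise washes out):

import random

def recovers_true_split(n_chars, signal=0.05, rng=random):
    """Each character supports the true split (index 0) with
    probability 'signal'; otherwise it supports one of the three
    possible splits of four taxa at random. Returns True if the
    true split wins the plurality vote."""
    counts = [0, 0, 0]
    for _ in range(n_chars):
        if rng.random() < signal:
            counts[0] += 1                      # genuine signal
        else:
            counts[rng.randrange(3)] += 1       # unbiased noise
    return counts[0] > max(counts[1], counts[2])

random.seed(1)
for n in (10, 100, 1000, 10000):
    wins = sum(recovers_true_split(n) for _ in range(2000))
    print(f"{n:6d} characters: true split recovered {wins / 2000:.0%} of the time")

# Recovery climbs toward 100% as characters are added - but only
# because the noise here is unbiased. That assumption is the rub.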

> Why should the noise conspire to create a false signal?

Back to long-branch attraction. Under certain conditions, the incorrect phylogeny may not only be recovered, but may actually become more robust as more data are added! This is covered by Huelsenbeck et al. (1996).
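
For anyone who wants to see this for themselves, here's a minimal simulation of the classic Felsenstein-zone setup under the two-state model (Python; the per-branch change probabilities are invented for effect). For four taxa, parsimony reduces to counting which informative site pattern is most frequent:

import random

P_LONG, P_SHORT = 0.3, 0.05   # invented change probabilities per branch

def evolve(state, p, rng):
    """Flip a binary character state with probability p."""
    return 1 - state if rng.random() < p else state

def simulate_site(rng):
    """One site on the true tree ((A,B),(C,D)): long branches lead
    to A and C, short branches everywhere else."""
    root = 0
    node_ab = evolve(root, P_SHORT, rng)   # internal branch
    a = evolve(node_ab, P_LONG, rng)
    b = evolve(node_ab, P_SHORT, rng)
    c = evolve(root, P_LONG, rng)
    d = evolve(root, P_SHORT, rng)
    return a, b, c, d

def parsimony_winner(n_sites, rng):
    """0 = true tree AB|CD, 1 = long-branch tree AC|BD, 2 = AD|BC."""
    counts = [0, 0, 0]
    for _ in range(n_sites):
        a, b, c, d = simulate_site(rng)
        if a == b and c == d and a != c:
            counts[0] += 1
        elif a == c and b == d and a != b:
            counts[1] += 1
        elif a == d and b == c and a != b:
            counts[2] += 1
    return counts.index(max(counts))

rng = random.Random(42)
for n in (100, 1000, 10000):
    wrong = sum(parsimony_winner(n, rng) == 1 for _ in range(500))
    print(f"{n:6d} sites: long-branch tree wins {wrong / 500:.0%} of replicates")

# The wrong tree doesn't just persist as sites are added - it wins
# more and more decisively. The bias compounds instead of cancelling.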


> Clearly, the new Neoaves analysis is not ideal. But what are 5007 bp? I'm currently writing about one that used the entire mitochondrial genome -- 11069 bp, of which some 6000 are parsimony-informative IIRC -- and still produces a suspect clade whose existence should be tested by better taxon sampling.

Ah yes, mitochondrial genomes. Check out Naylor and Brown (1998) on that one. Also, have you seen Miya's work on reconstructing teleost phylogeny from whole mitochondrial genomes? Some of the new mitochondrial-based teleost clades are just plain weird. And at least teleosts are a highly speciose group - no shortage of taxa to sample there! Reminds me of birds. But what about groups with only a relatively small total number of extant species - crocs, or ratites, or perissodactyls, to name a (very) few? There we really hit the wall when it comes to taxon sampling.


> If there is a signal, it will add up. The noise will cancel itself out. The only way that I can think of to get a fake signal is by having a skewed base frequency distribution, and outside of parsimony, this is among the first things analyses take into account these days.

What kind of skewed base frequencies are you referring to? Base compositional bias, for example, is a real problem: unrelated lineages that converge on similar base compositions can attract one another, so the analysis picks up the 'fake' signal instead of the 'true' one. If pervasive enough, such biases are amplified, not diluted, by adding data - what began as 'noise' gets magnified into a robust but erroneous topology.
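
A toy demonstration of what I mean (Python again; the target frequencies are made up and the terminal branches are deliberately long and saturated): give two non-sister lineages the same G+C-rich target composition, and even simple p-distances start pulling them together.

import random

rng = random.Random(7)
BASES = "ACGT"
UNIFORM = [0.25, 0.25, 0.25, 0.25]
GC_RICH = [0.05, 0.45, 0.45, 0.05]   # invented GC-rich target frequencies

def evolve(seq, n_subs, freqs):
    """Apply n_subs substitutions at random positions, drawing each
    replacement base from the given target frequencies."""
    seq = list(seq)
    for _ in range(n_subs):
        seq[rng.randrange(len(seq))] = rng.choices(BASES, freqs)[0]
    return "".join(seq)

def p_dist(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

L = 5000
root = "".join(rng.choices(BASES, UNIFORM, k=L))
anc_ab = evolve(root, 500, UNIFORM)          # short internal branches
anc_cd = evolve(root, 500, UNIFORM)
tips = {
    "A": evolve(anc_ab, 10000, UNIFORM),     # long, saturated branches
    "B": evolve(anc_ab, 10000, GC_RICH),     # GC-biased; true sister of A
    "C": evolve(anc_cd, 10000, UNIFORM),
    "D": evolve(anc_cd, 10000, GC_RICH),     # GC-biased; true sister of C
}
for x, y in [("A", "B"), ("C", "D"), ("B", "D"), ("A", "C")]:
    print(f"p-distance {x}-{y}: {p_dist(tips[x], tips[y]):.3f}")

# The two GC-biased non-sisters (B and D) come out MORE similar than
# either true sister pair: the shared bias masquerades as shared
# ancestry, and adding more such data only sharpens the artefact.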


Cheers

Tim