Re: New paper on Neoaves
David Marjanovic wrote:
> Why? Of course you can. More data can never hurt, and as long as you
> haven't reached a certain (high) limit of characters per taxon, it actually
> adds signal. The idea is that the signal adds up while the noise cancels
> itself out.
Yes, so runs the theory. As Huelsenbeck et al. (1996) put it: "... it is
often argued that different data (e.g. genes) that are evolving at different
rates may interact positively to resolve different levels of a phylogenetic
tree. A slowly evolving gene might be useful in resolving older evolutionary
splits but be of little use for younger groups, whereas rapidly evolving
genes might be best for accurately resolving recent speciation events." They
go on to examine whether this is always actually true - and how would we
know if it were not?
Huelsenbeck, J.P., Bull, J.J. and Cunningham, C.W. (1996). Combining data in
phylogenetic analysis. TREE 11: 152-158.
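The "signal adds up, noise cancels out" theory is easy to sketch numerically. Here is a toy simulation of my own (the 0.55 signal probability is an arbitrary illustration, not anything from the papers under discussion): each character independently supports the true split slightly more often than a misleading one, and the majority vote converges on the truth as characters accumulate.

```python
import random

def majority_correct(n_chars, p_signal=0.55, trials=200, seed=1):
    """Fraction of trials in which true-signal characters outnumber noise.

    Each character supports the true split with probability p_signal,
    otherwise a misleading one. With p_signal > 0.5, the Law of Large
    Numbers guarantees the true split wins as n_chars grows.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        true_votes = sum(rng.random() < p_signal for _ in range(n_chars))
        if true_votes > n_chars - true_votes:  # signal outvotes noise
            wins += 1
    return wins / trials

for n in (10, 100, 1000):
    print(n, majority_correct(n))
```

With 10 characters the true split wins only about half the time; with 1000 it wins essentially always. The catch, as below, is that this only works when the characters really are independent and the noise really is unbiased.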
I agree that more data probably can't hurt (although see below), but it
isn't a panacea for a flawed phylogenetic method. This
more-data-is-always-better argument can be seen as an application of the Law
of Large Numbers: as more independent trials are conducted, they will tend
to converge on the correct answer with greater frequency. Thus, as more data
are collected, phylogenetic noise will cancel out, and the true phylogenetic
signal will come through. By 'noise' I mean homoplasy (the raw material of
long-branch attraction) - and I figure you do too. However, you have to take
functional constraints into account. This is especially important for
protein-coding genes, given that individual nucleotide sites are not
equivalent in their ability to change, so they can't be regarded as
independent characters. On top of this we have codon usage biases,
especially important for six-fold degenerate codons. I know you address
this, but it isn't the whole story. Protein sequence is constrained by
function: there's only so much wiggle-room at each amino acid site before
the protein becomes dysfunctional (or, rarely, takes on a whole new
function). As Naylor and Brown (1998) put it:
"...once a sequence sample of reasonable size has been obtained, accurate
phylogenetic estimation may be better served by incorporating knowledge of
molecular structures and processes into inference models and by seeking
additional higher order characters embedded in those sequences, than by
gathering ever larger sequence samples from the same organisms in the hope
that the historical signal will eventually prevail."
Naylor, G.J.P. and Brown, W.M. (1998). Amphioxus mitochondrial DNA, chordate
phylogeny, and the limits of inference based on comparison of sequences.
Syst. Biol. 47: 61-76.
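The non-equivalence of sites is easy to see from the genetic code itself. Below is a quick illustration using a hand-entered fragment of the standard code table (just the amino acids needed to make the point): third-position changes in a six-fold degenerate family like leucine are silent, while first-position changes at the same codon are not, so the two site classes carry very different amounts of (and kinds of) signal.

```python
# A fragment of the standard genetic code (hand-entered; illustration only).
CODE = {
    "TTA": "Leu", "TTG": "Leu",
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",  # six-fold Leu
    "TTT": "Phe", "TTC": "Phe",
    "GTT": "Val", "GTC": "Val", "GTA": "Val", "GTG": "Val",
    "ATT": "Ile", "ATC": "Ile", "ATA": "Ile",
    "ATG": "Met",                                            # single-codon Met
}

def silent_fraction(codon, position):
    """(silent, total) single-base changes at `position`, counting only
    mutant codons present in our partial table."""
    silent = total = 0
    for base in "ACGT":
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if mutant in CODE:
            total += 1
            silent += CODE[mutant] == CODE[codon]
    return silent, total

print(silent_fraction("CTC", 2))  # third position: all 3 changes silent
print(silent_fraction("CTC", 0))  # first position: none of the 3 silent
```

Treating a freely-wobbling third position and a constrained first position as equally weighted, independent characters is exactly the kind of assumption that more data alone won't rescue.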
> Why should the noise conspire to create a false signal?
Back to long-branch attraction. Under certain conditions, the incorrect
phylogeny may not only be recovered, but may actually become more robust as
more data are added! This is covered by Huelsenbeck et al. (1996).
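This is the classic four-taxon "Felsenstein zone" result, and it can be made concrete with a toy two-state simulation (the branch probabilities here are my own arbitrary choices, and parsimony is reduced to counting informative site patterns): the true tree is ((A,B),(C,D)), but A and C sit on long branches, and the misleading AC|BD patterns outnumber the true AB|CD patterns by a margin that only grows with the number of sites.

```python
import random

def simulate_patterns(n_sites, p_long=0.4, p_short=0.05, seed=2):
    """Count parsimony-informative site patterns generated on the true
    tree ((A,B),(C,D)) under a symmetric two-state model. A and C evolve
    on long branches (change probability p_long); B, D, and the internal
    branch are short (p_short)."""
    rng = random.Random(seed)
    flip = lambda state, p: state ^ (rng.random() < p)
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        left = 0                      # state at the A-B ancestor
        right = flip(left, p_short)   # across the short internal branch
        a, b = flip(left, p_long), flip(left, p_short)
        c, d = flip(right, p_long), flip(right, p_short)
        if a == b and c == d and a != c:
            support["AB|CD"] += 1     # true signal
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1     # long-branch artifact
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    return support

for n in (100, 10000):
    print(n, simulate_patterns(n))
```

Every added site makes the wrong AC|BD grouping look *better* supported, not worse - parsimony here is statistically inconsistent, and the Law of Large Numbers works against you.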
> Clearly, the new Neoaves analysis is not ideal. But what are 5007 bp? I'm
> currently writing about one that used the entire mitochondrial genome --
> 11069 bp, of which some 6000 are parsimony-informative IIRC -- and still
> produces a suspect clade whose existence should be tested by better taxon
> sampling.
Ah yes, mitochondrial genomes. Check out Naylor and Brown (1998) on that
one. Also, have you seen Miya's work on reconstructing teleost phylogeny
with whole mitochondrial genomes? Some of the new mitochondrial-based
teleost clades are just plain weird. And at least with teleosts we have a
highly speciose group - no shortage of taxa to sample here! Reminds me of
birds. But what about groups with only a relatively small total number of
extant species - like crocs, or ratites, or perissodactyls, to name a
(very) few? Here we really hit the wall when it comes to taxon sampling.
> If there is a signal, it will add up. The noise will cancel itself out. The
> only way that I can think of to get a fake signal is by having a skewed
> base frequency distribution, and outside of parsimony, this is among the
> first things analyses take into account these days.
What kind of skewed base frequencies are you referring to? Base
compositional bias, for example, is a potential problem: the analysis may
pick up the 'fake' signal instead of the 'true' one. If pervasive enough,
such biases will be amplified with increasing data, and what began as
'noise' can be magnified into a robust but erroneous topology.
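A toy example of what I mean (entirely made-up compositions): suppose A and C are GC-rich, B and D are AT-rich, and the sites are so saturated that each taxon's state is an independent draw from its own base composition - i.e. zero genuine historical signal. Shared composition alone then manufactures "support" for an A+C grouping, and the margin only grows with sequence length.

```python
import random

GC_RICH = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}  # composition of taxa A, C
AT_RICH = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # composition of taxa B, D

def draw(dist, rng):
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def fake_signal(n_sites, seed=3):
    """Informative-pattern counts when every site is fully saturated:
    there is no true tree at all, only compositional attraction."""
    rng = random.Random(seed)
    support = {"AC": 0, "AB": 0, "AD": 0}
    for _ in range(n_sites):
        a, c = draw(GC_RICH, rng), draw(GC_RICH, rng)
        b, d = draw(AT_RICH, rng), draw(AT_RICH, rng)
        if a == c and b == d and a != b:
            support["AC"] += 1   # pure compositional artifact
        elif a == b and c == d and a != c:
            support["AB"] += 1
        elif a == d and b == c and a != b:
            support["AD"] += 1
    return support

for n in (500, 50000):
    print(n, fake_signal(n))
```

The AC "clade" dominates at every sequence length and becomes statistically rock-solid as sites are added, despite encoding nothing but nucleotide composition - exactly the kind of robust-but-wrong topology I mean.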
Cheers
Tim