Re: New paper on Neoaves



One question to keep in mind is: Do the analyses that incorporate single gene regions yield the same or a similar topology as the combined dataset, but with less support; or do they yield a very different topology from the combined-dataset tree, with strong support? In other words, does the tree based on a combined dataset (=5007 bp drawn from five different gene regions) improve the resolution of each of the single-gene trees, or does it contradict these single-gene trees? I'd hate to think that there is a belief that you can 'improve' the phylogenetic signal simply by expanding the dataset (=stringing together more and more DNA sequences).

Why? Of course you can. More data can never hurt, and as long as you haven't reached a certain (high) limit of characters per taxon, it actually adds signal. The idea is that the signal adds up while the noise cancels itself out. Why should the noise conspire to create a false signal?
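
To put a number on the "signal adds up, noise cancels" idea, here is a minimal sketch under a deliberately crude toy model of my own (it is not taken from the paper): every character is independent and supports the true grouping with a fixed probability p_signal, otherwise it supports a single wrong alternative, and the grouping backed by a strict majority of characters wins. The function name and the value 0.55 are hypothetical choices for illustration only.

```python
from math import comb

def majority_correct(n_chars: int, p_signal: float) -> float:
    """Probability that a strict majority of n_chars independent characters
    supports the true grouping, when each character does so with
    probability p_signal (toy model, assumed for illustration)."""
    need = n_chars // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_chars, k) * p_signal**k * (1 - p_signal) ** (n_chars - k)
        for k in range(need, n_chars + 1)
    )

if __name__ == "__main__":
    # A weak signal (55% of characters correct) with growing dataset size.
    for n in (11, 101, 501):
        print(n, round(majority_correct(n, 0.55), 3))
```

With p_signal only slightly above 0.5, the probability of recovering the true grouping rises as characters are added. With p_signal exactly 0.5 (pure noise) and an odd number of characters, it stays at one half however many are added.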


With molecular data, adding more and more genes (or regions of the same gene) has an additional advantage: they won't evolve at the same speed. Some genes will evolve fast and thus will be able to resolve recent divergences, while saturation effects like long-branch attraction and long-branch repulsion will abound for deeper splits; other genes will evolve more slowly, resolving deep divergences but failing to find the recent ones because they don't have enough apomorphies yet. Ideally, each of the genes in the dataset will have its own rate of evolution, and because noise cancels out rather than adding up, you will get a fully resolved tree.
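
The fast-versus-slow point can be sketched with the Jukes-Cantor (JC69) model, under which the expected proportion of sites that differ between two sequences approaches 0.75 as substitutions pile up. This is only an illustration: the two rates and the divergence times below are made-up numbers, not estimates for any real gene, and the function name is mine.

```python
from math import exp

def jc69_p_distance(rate: float, time: float) -> float:
    """Expected proportion of differing sites between two lineages that
    split `time` ago, each evolving at `rate` substitutions per site per
    unit time, under the Jukes-Cantor model."""
    d = 2.0 * rate * time  # substitutions per site along both branches
    return 0.75 * (1.0 - exp(-4.0 * d / 3.0))

if __name__ == "__main__":
    # Hypothetical rates (substitutions/site/Myr) and split ages (Myr).
    genes = {"fast gene": 0.02, "slow gene": 0.0005}
    for name, rate in genes.items():
        for age in (1, 100):
            print(name, age, round(jc69_p_distance(rate, age), 4))
```

In this sketch the fast gene is close to the 0.75 ceiling at the deep split, so it carries almost no usable information there, while the slow gene has barely changed at the recent split. That is the complementarity described above.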

Clearly, the new Neoaves analysis is not ideal. But what are 5007 bp? I'm currently writing about an analysis that used the entire mitochondrial genome -- 11069 bp, of which some 6000 are parsimony-informative IIRC -- and that still produces a suspect clade whose existence should be tested by better taxon sampling.

Also, are each of the single-gene trees consistent? Individual gene regions may be better at capturing divergence at a certain level, which means that you're going to get a very muddled phylogenetic signal if you simply string these genes together.

If there is a signal, it will add up. The noise will cancel itself out. The only way that I can think of to get a fake signal is by having a skewed base frequency distribution, and outside of parsimony, this is among the first things analyses take into account these days.


As a general principle, does increased taxon sampling even help? I'm not certain it does. If a gene is unable to capture the phylogenetic signal, or the analysis is unable to tease out this signal, no amount of taxa will improve the situation.

In that case better taxon sampling will at least show that the gene really is incapable of resolving divergences that happened within the time frame in question. :-)