"Common ancestor" in cladistics

To: dinosaur@usc.edu
Subject: "Common ancestor" in cladistics
From: Mar Qos Aker <marqosaker@yahoo.com>
Date: Thu, 29 Jul 2004 02:26:09 -0700 (PDT)
Reply-to: marqosaker@yahoo.com
Sender: owner-dinosaur@usc.edu
I don't know if this answers your question; however, I found this last
night on the internet and have pasted it here (according to my
spellcheck, there are 4417 words and 24317 characters) with reference
at end:

Phylogenetic Tree-Building
Phylogenetic Analysis
Cladistics
The study of evolutionary processes has often been considered to be
unscientific because it deals with historically unique events (Popper
1957). Hypotheses concerning these events are thus not universal (in
either space or time) and they are therefore considered to be
untestable in the contemporary world. The sociological development of
phylogenetic analysis has consequently been based largely on the
erection of what have been called "evolutionary scenarios" describing
the presumed genealogical history of the organisms under study. The
number of such scenarios that may be created is, of course, limited
solely by the imagination of the researcher, and none of the scenarios
are likely to be open to falsification. Cladistic analysis can thus be
seen as an attempt to base phylogenetic analysis on a more objective
footing, where the phylogenetic hypotheses are explicitly stated, along
with the evidence supporting (and contradicting) them, and are then
subjected to quantitative testing. Its practitioners therefore claim
that cladistics is designed to make phylogenetic analysis into an
hypothetico-deductive science, where explicit hypotheses are subjected
to repeatable attempts at falsification. 
Note that the claimed advantages of cladistic analysis are not intended
to denigrate pre-cladistic biologists, nor is there any suggestion that
these biologists did not apply their minds to phylogenetic questions.
However, it is clear that pre-cladistic phylogenetic analyses were not
necessarily based on repeatable methods that produced explicit
hypotheses of evolutionary relationship which could be subjected to
falsification (Nelson & Platnick 1981). Furthermore, the taxonomic
groups produced by pre-cladistic biologists were not necessarily
monophyletic (see below for a definition), and therefore did not always
reflect evolutionary history. Post-Darwinian biologists have had the
unenviable task of producing taxonomic schemes that should, in theory,
reflect evolutionary history, without any theoretical framework for how
they should go about discovering what the evolutionary history actually
was (Stevens 1994). Cladistics is an attempt to provide this
theoretical framework. 
Cladistic analysis as an approach to phylogeny attempts to group taxa
on the basis of their ancestry. In the analysis, taxa are grouped in
such a way that those with historically more recent ancestors form
groups nested within groups of taxa with more distant ancestors. The
analytical technique is based on a widely-held view of the mode of the
evolutionary process:- species are lineages undergoing divergent
evolution with modification of their intrinsic attributes, the
attributes being transformed through time from ancestral to derived
states. Thus, if a species is a group of populations, and one subset of
those populations acquires a new (derived, advanced or apomorphic)
character state while the rest do not (i.e. they retain the ancestral,
primitive or plesiomorphic state), then these populations constitute a
new species. This new species then forms a separate historical lineage
that diverges from the other populations, and maintains its own
historical tendencies and fate. Evolution can thus be thought of as a
branching sequence of historical lineages, and indeed the only diagram
in the work by Darwin (1859) represents precisely such a branching
sequence. 
When a character changes from an ancestral state to a derived state in
a lineage it is historically unique (a novelty), and it will be passed
on to all of the descendants of that lineage (even if the character is
later modified into something else). Therefore, the branching sequence
of evolution can be deduced by searching for nested groups of shared
derived character states (synapomorphies) among the taxa being
analysed. So, if a derived character state is observed in two or more
taxa then we can hypothesise that they share this apomorphy because
they are descended from a common ancestor, and that they inherited the
apomorphy from that ancestor. The possession of a shared ancestral
state (a symplesiomorphy) tells us nothing about the phylogeny of the
taxa, since this state was inherited from an ancestor that is also held
in common with those taxa possessing the derived state. Thus, cladistic
analysis is simply the search for nested sets (a hierarchy) of
synapomorphies among the taxa. Each synapomorphy represents an
ancestral lineage that has diverged from its related lineages, thus
being contemporary evidence for a prior evolutionary event (Table 1).
Clearly, the only characters that are of use for a cladistic analysis
of a group of contemporary taxa are those features that reflect their
evolutionary history. There is, of course, no simple method for
determining which features these are, but at least the method forces
the practitioner to be explicit about the characters that have been
chosen for the analysis. 
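The grouping of taxa by shared derived states can be sketched in a few lines of Python. This is only a minimal illustration: the matrix below is hypothetical toy data (it is not Table 1), the character labels c1 to c3 are placeholders, and the function name is invented for this sketch.

```python
# Hypothetical toy data (not Table 1): 1 = derived state, 0 = ancestral state.
CHARACTERS = ["c1", "c2", "c3"]
MATRIX = {
    "Lungfish":  [0, 0, 0],
    "Frog":      [1, 0, 0],
    "Lizard":    [1, 1, 0],
    "Crocodile": [1, 1, 1],
    "Bird":      [1, 1, 1],
}

def synapomorphy_groups(matrix, characters):
    """Map each character to the set of taxa sharing its derived state."""
    groups = {}
    for index, name in enumerate(characters):
        sharing = frozenset(t for t, states in matrix.items() if states[index] == 1)
        if len(sharing) >= 2:  # a derived state seen in one taxon groups nothing
            groups[name] = sharing
    return groups

GROUPS = synapomorphy_groups(MATRIX, CHARACTERS)
# List the groups from most to least inclusive to expose the nesting.
for name, taxa in sorted(GROUPS.items(), key=lambda item: -len(item[1])):
    print(name, sorted(taxa))
```

Note that the taxa coded 0 are never grouped by the sketch: a shared ancestral state is simply ignored, in line with the argument above about symplesiomorphies.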
Homology
For this analytical technique to work, homologous rather than analogous
character states must be compared across the taxa (de Pinna 1991;
Brower and Schawaroch 1996). That is, for all of the taxa we must be
comparing like with like, particularly with reference to the
evolutionary origin of the attributes. Characters and their states are
thus hypotheses of evolutionary homology. As an example, for Character
16 in Table 1 possession of denticles, dermal scales, epidermal scales,
feathers and hair are all considered to be homologous character states
of a single character. In practice, characters and their states are
postulated as homologous on the basis of their structural, positional,
ontogenetic, compositional and/or functional correspondences; and they
are postulated between different taxa so as to maximise the number of
one-to-one correspondences of their parts (Stevens 1984). That is, the
features are decomposed into their constituent parts, and these are
compared in terms of their positional and connectional relationships
(i.e. topology) (Rieppel 1994). This concept may be problematic (for
example, deciding on what constitutes the ultimate parts to be
compared), but it can be put into practice through a detailed
comparative and/or experimental study of the organisms concerned (what
has traditionally been called comparative anatomy), where the constancy
of the topological relationships is used as the main criterion for
recognising homology (Rieppel 1994). In particular, structural and
positional similarity of complex structures has traditionally been
taken as good evidence of homology (Rieppel 1994). The choice of
characters to be included in a cladistic analysis may be somewhat
arbitrary, but can include intrinsic attributes such as morphology,
anatomy, embryology, behaviour, physiology, ultrastructure, cytology,
biochemistry, and immunology. The important points are that the
characters used in the analysis are hypothesised to reflect the
evolutionary history of the taxa, and that the character states of a
single character are hypothesised to have a unique evolutionary origin.

Concepts of homology are often intuitively obvious when dealing with,
for example, morphological data (e.g. the homologies postulated in
Table 1 are thus mostly unproblematic). However, when dealing with
molecular data, concepts of homology have often been rather confused
(Winter et al. 1968; Reeck et al. 1987; Patterson 1988; Hillis 1994),
with the word homology being used to mean several unrelated things,
which could perhaps be better given alternative names. For example,
"sequence homology" is often used as a synonym for "sequence
similarity" (i.e. the number of nucleotides or amino acids that are
inferred to be held in common between two sequences). These are not
necessarily the same thing, since similarity can be the result either
of common ancestry or of chance convergence, parallelism or reversal;
and "isology" may be a better term to use (Wegnez 1987). Furthermore,
paired chromosomes are often referred to as homologous, whereas
"homonomy" would be a clearer name (Patterson 1988); and the use of
molecular probes should be referred to as "homospecific" or
"heterospecific" rather than homologous or heterologous (Hillis 1994). 
The alignment of molecular sequences is the direct equivalent of
homology assessments for morphological characters (Mindell 1991; Brower
and Schawaroch 1996), that is, "positional homology" for the components
of the sequences. The concepts of homology in molecular and
morphological studies are thus fundamentally the same (Patterson 1988;
de Pinna 1991; Williams 1993). So, alignment of molecular sequences
involves a series of hypotheses of homology among the taxa, with one
hypothesis of homology for each position (nucleotide or amino acid) in
the sequence. It is important to recognise this, because it is clear
that often very little attention is paid by molecular biologists to
this point -- to a morphologist the assessment of homology is an
important (and time-consuming) component of phylogenetic analysis, but
to a molecular biologist the assessment of homology is apparently often
an after-thought. The correct formulation of hypotheses of homology is
just as important for molecular data as it is for morphological data;
and it is clear from the literature that many of the so-called
controversies about the phylogeny of particular groups are based as
much on disagreements about sequence alignment as on disagreements
about actual evolutionary events (e.g. Morrison and Ellis 1997). 
Unfortunately, for molecular data there is little possibility of
further investigations (such as ontogeny) to assess homology, and so in
practice homology assessment is very different for molecular studies
(Mindell 1991). Positional homology can be represented by either
identical character states (nucleotides or amino acids) in all
sequences, substitutions in one or more sequences (representing point
mutations), or insertions/deletions (indels) in one or more sequences.
The most problematic aspect of sequence alignment is the positioning of
indels, and this problem becomes more acute for more divergent
sequences. It is worthwhile in this context to distinguish between
gaps, which are introduced into the sequences by the alignment
procedure, and indels, which are actual mutation events (Olsen 1988);
clearly, the objective is to introduce into the sequences only gaps
that truly represent indels. 
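The three ways in which positional homology can appear in an alignment may be illustrated with a short Python sketch. The three aligned sequences are invented for the example, and the function name is hypothetical. A column labelled "gap" is only a hypothesised indel, since the gap was placed there by the alignment procedure.

```python
def classify_columns(alignment, gap="-"):
    """Label each aligned position as 'identical', 'substitution' or 'gap'."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    labels = []
    for column in zip(*alignment):
        if gap in column:
            labels.append("gap")           # a hypothesised indel
        elif len(set(column)) == 1:
            labels.append("identical")     # same state in every sequence
        else:
            labels.append("substitution")  # hypothesised point mutation(s)
    return labels

# Invented example alignment of three short sequences.
ALIGNMENT = ["ACGT-A", "ACCTTA", "ACGTTA"]
LABELS = classify_columns(ALIGNMENT)
print(LABELS)
```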
There are three possible scenarios for the degree of difficulty of
sequence alignment. Firstly, there may be relatively few indels, in
which case a robust sequence alignment can usually be produced by hand.
Such a situation is shown, for example, by protein-coding parts of
mtDNA (Miyamoto & Cracraft 1991) and the plant rbcL gene (Chase et al.
1993). Secondly, the sequence may represent a molecule for which there
is an a priori biological model of secondary structure in which certain
active sites must be maintained; the alignment is then constrained by
the base-pairing of the model (Olsen 1988). Such a situation is shown,
for example, by the 5S (Specht et al. 1990), 12S (Gutell et al. 1985)
and 18S (Van de Peer et al. 1994) rRNA genes. 
Thirdly, there may be many indels and no a priori structure model.
Under these circumstances it is usual to use a mathematical algorithm
to produce the alignment. These algorithms all attempt to produce a
sequence alignment that optimises some chosen criterion of match
between the individual sequences (cost). That is, the sequences are
compared using a pattern-matching process that searches for
correspondence between the elements of the sequences, introducing gaps
into the sequences as required to maximise some criterion for
optimality of the correspondence. There are many algorithms currently
available (see Waterman 1989; Doolittle 1990; Chan et al. 1992; McClure
et al. 1994; and Morrison and Ellis 1997), which optimise a variety of
mathematical functions measuring the overall alignment cost. When there
are more than two sequences, most of these algorithms use exact
procedures (see below for a definition) to align the sequences
pair-wise but then use heuristic procedures (see below for a
definition) to braid these alignments into a multiple alignment. Thus,
these procedures do not guarantee to produce the globally optimal
alignment; nor do they guarantee that the optimal alignment (even if
they could find it) represents the true alignment (Thorne & Kishino
1992). More to the point, most of these alignment methods use phenetic
pattern-matching algorithms, and their procedures are thus based on
maximising sequence similarity (i.e. isology) rather than homology. 
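As a concrete illustration of an exact pair-wise procedure, here is a minimal Python sketch of Needleman-Wunsch global alignment with a single linear gap penalty. The text above does not single out any particular algorithm, so this choice, together with the score values (match = 1, mismatch = -1, gap = -2), is an assumption made purely for the example. Note that the procedure maximises a similarity score, which is exactly the phenetic criterion discussed above.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap              # prefix of a aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap              # prefix of b aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # pair the residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

RESULT = global_align("ACGT", "AGT")
print(RESULT)
```

The sketch aligns only two sequences; braiding several pair-wise alignments into a multiple alignment is the heuristic step that it does not attempt.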
I am not going to discuss these various algorithms here, because this
is an area of active research and very little is known about their
theoretical or practical limitations (Thorne & Kishino 1992; McClure et
al. 1994). However, one important generalisation may be made:- the
differences between the alignments produced by the various algorithms
are often smaller than the differences produced by varying the "gap
weights". These weights refer to the relative cost of inserting a new
gap into a sequence or extending an already-existing gap, and there is
no way of determining analytically what these weights should be
(Rinsma-Melchert 1993). Most of the computer programs that implement
the alignment algorithms have default values for the weights that are
designed to produce "biologically interesting" results, and very few
molecular biologists seem to be willing to deviate from these default
choices. It is, however, clear that to simply report that a particular
computer program was used to align the sequences is meaningless (since
the work cannot be verified) unless the weight values are also reported
(Wheeler 1995). 
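The influence of the gap weights can be seen in a small Python sketch that costs two candidate alignments of the same pair of sequences under two sets of weights. The sequences, the weights and the function name are all invented for illustration, and the costing convention (the first position of a gap is charged the opening weight, each further position the extension weight) is one common convention assumed here rather than taken from the text.

```python
def alignment_cost(row_a, row_b, mismatch, gap_open, gap_extend, gap="-"):
    """Cost of a pair-wise alignment under affine gap weights (lower is better)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cost = 0.0
    previous_gap_row = None  # which row carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gap_row = "a" if x == gap else "b"
            # Extending an existing gap is charged differently from opening one.
            cost += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            if x != y:
                cost += mismatch
            previous_gap_row = None
    return cost

# Two hypothetical alignments of AACCGG with ACCG.
ONE_LONG_GAP = ("AACCGG", "ACCG--")    # 2 mismatches, one gap of length 2
TWO_SHORT_GAPS = ("AACCGG", "A-CCG-")  # no mismatches, two gaps of length 1

for gap_open in (1, 5):
    long_cost = alignment_cost(*ONE_LONG_GAP, mismatch=1, gap_open=gap_open, gap_extend=0.5)
    short_cost = alignment_cost(*TWO_SHORT_GAPS, mismatch=1, gap_open=gap_open, gap_extend=0.5)
    print(gap_open, long_cost, short_cost)
```

With a cheap gap-opening weight the alignment with two short gaps costs less, and with an expensive one the alignment with a single long gap costs less, so the preferred hypothesis of homology changes with the weights alone.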
In dealing with the problematic nature of sequence alignment, molecular
biologists often delete parts of their sequences from the cladistic
analysis. The rationale for this is that those parts of the sequences
that cannot be reliably aligned should be excluded from the estimation
of the phylogeny (Olsen 1988). This is probably a very sensible thing
to do (Smith 1994), but unfortunately there is often no objective
criterion given for deciding which parts of the alignment are
ambiguous, the decisions usually being made by "visual inspection".
Gatesy et al.
(1993) have suggested that those parts of the alignment that are
sensitive to the gap weights (i.e. where the alignment varies
significantly when the gap weights are changed) may constitute
unreliable hypotheses of homology and may therefore be candidates for
exclusion; and Morrison & Ellis (1997) have shown that for some
organisms it is the double-stranded parts of rRNA that may contain most
of the phylogenetic information when the sequences are aligned
according to secondary structure. Thus, there is considerable room for
further research into the problems of sequence alignment. 
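One simple way to make "sensitive to the gap weights" operational is to compare the residue pairings implied by alignments produced under different weights. The Python sketch below is loosely in the spirit of that suggestion and is not a reproduction of any published procedure; the two alignments and the function name are hypothetical.

```python
def paired_positions(row_a, row_b, gap="-"):
    """Return the set of (index in a, index in b) pairs hypothesised as homologous."""
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs

# The same two sequences aligned under two hypothetical sets of gap weights.
UNDER_WEIGHTS_1 = paired_positions("AACCGG", "ACCG--")
UNDER_WEIGHTS_2 = paired_positions("AACCGG", "A-CCG-")

STABLE = UNDER_WEIGHTS_1 & UNDER_WEIGHTS_2     # pairings found under both weightings
SENSITIVE = UNDER_WEIGHTS_1 ^ UNDER_WEIGHTS_2  # pairings that depend on the weights
print(sorted(STABLE))
print(sorted(SENSITIVE))
```

Pairings in the second set would be the candidates for exclusion; in this toy case only the first residue of each sequence is paired consistently.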
It is also important to recognise that there is a more general level of
homology assessment for molecular data as well. The sequences being
compared must themselves be homologous rather than analogous (Fitch
1970). Thus, only orthologous sequences will reflect the historical
relationships of species, while paralogous sequences (e.g. the two
sequences that result from a gene duplication) will reflect only gene
history, xenologous sequences (e.g. recently-incorporated sequences
such as result from horizontal gene transfer) will only partly reflect
gene history, and plerology (e.g. the inter-mixture of exons and
introns) will only reconstruct a composite gene history (Patterson
1988; Williams 1993; Hillis 1994). All molecular studies thus rely on
the assumption that the sequences from the taxa being compared are
orthologous, and this may be a dubious assumption for distantly related
taxa (Sneath 1989). Any sequence for which orthology has not been
established should be omitted from the analysis (Olsen 1988). 
Furthermore, for orthologous sequences there is the implicit assumption
that the sequences being compared are actually from the organisms being
studied, rather than from some other co-habiting organism. The biggest
problems here are parasites and symbionts (including fungi and algae in
plants), as well as gut contents. An example of this problem is
discussed by Zhang et al. (1997), who cite other literature references
as well. It is perhaps also worth mentioning here the potential
problems of sequence errors, such as those introduced during genomic
cloning, or large-scale rearrangements during molecular cloning,
mis-incorporation of bases in PCR amplification, compressions in
sequencing ladders, mis-reading of autoradiographs, mis-typing of
results, and difficulties during sequence assembly (Clark & Whittam
1992; States 1992) -- the data analyses assume that such errors do not
exist. 
As a final observation on molecular data, it is worth emphasising the
distinction between species trees and gene trees. The result of
cladistic analysis of molecular data is a gene tree (provided that the
entire gene has been sequenced), hypothesising relationships among the
genes or genomes that have been sampled, whereas a species tree
reflects the actual evolutionary pathways (Pamilo & Nei 1988). The gene
tree may be fundamentally incongruent with the true species phylogeny,
with the genome tree, and with other gene trees (e.g. Cao et al. 1994a;
Cummings et al. 1995), due to various phenomena such as allelic
polymorphism, introgression, lineage sorting, unequal rates of
speciation and gene mutation, lateral transfer, hybridization, or
mistaken orthology (Pamilo & Nei 1988; Penny et al. 1990; Doyle 1992;
de Queiroz et al. 1995). Thus, for the reconstruction of phylogenetic
history, a single gene may be, in practice, no more useful than a
single morphological character (Doyle 1992). It may therefore be unwise
to tacitly assume in molecular studies that the number of characters
being used to reconstruct a phylogeny is equivalent to the number of
positions in the sequence. 
Furthermore, a living organism is an integrated functioning whole, not
just a collection of unrelated genetic attributes. Thus, an organism is
a collection of interactions between genes, and between genes and their
environment (i.e. a phenotypic whole), and it is the organism as a
whole that takes part in the evolutionary process. Consequently, there
is no more reason for genetics to reflect phylogeny than for anything
else to do so (de Queiroz et al. 1995). In fact, morphological
characters may be a better reflection, because they integrate many
genetic and phenotypic characters. It is thus clear that there is still
a long way to go in the incorporation of molecular data into
evolutionary ideas. Anyone who doubts this proposition could usefully
examine the continuing saga of the position of the guinea pig within
(or without) the Rodentia:- Graur et al. (1991), Allard et al. (1991),
Graur et al. (1992), Hasegawa et al. (1992), Li et al. (1992), Luckett
& Hartenberger (1993), Martignetti & Brosius (1993), Wolf et al.
(1993), Cao et al. (1994b), Noguchi et al. (1994), Frye & Hedges
(1995), D'Erchia et al. (1996), Cao et al. (1997) and Philippe (1997). 
Polarity
Having determined the homology of the character states, the key to
cladistic analysis is the distinction between derived character states
and ancestral character states (character polarity). It is important to
note that this is a local concept that applies only to a particular set
of taxa. By this I mean that a character state is only considered to be
derived relative to a specified ancestral state, and it may well be the
ancestral state for a further derived state. As an example, for
Character 16 in Table 1 possession of epidermal scales is a derived
state relative to possession of dermal scales, but it is an ancestral
state relative to possession of either feathers or hair. Thus,
possession of epidermal scales is a synapomorphy (a shared derived
character state) for Turtles + Lizards + Snakes + Crocodiles + Birds +
Mammals, while possession of feathers is a synapomorphy for Birds, and
possession of hair is a synapomorphy for Mammals; this character does
not supply a synapomorphy for the grouping only of Turtles + Lizards +
Snakes + Crocodiles. Possession of epidermal scales is thus a
synapomorphy at a more general level than is possession of either
feathers or hair. There is thus a hierarchical relationship between
ancestral and derived character states, and recognising synapomorphies
therefore involves determining the correct level of generality of the
homologies. 
Clearly then, the success of a cladistic analysis rests on the correct
determination of the relative apomorphy of the character states, and
numerous criteria have been proposed for doing this (Crisci & Stuessy
1980; Stevens 1980; Bryant 1992; Nixon & Carpenter 1993). Most of these
criteria are recognised to rely on illogical arguments or on
assumptions that are either false or untestable; and now it is
considered worthwhile to recognise only two objective possibilities:-
the direct method; and outgroup analysis. These are complementary
methods, and both may thus be used in any one data set. For the direct
method, the hierarchical relationship between ancestral and derived
character states is directly observed, and it does not require any
pre-existing hypotheses of character polarity. On the other hand, the
outgroup analysis method does not directly observe character polarity,
and it relies upon an hypothesis of the relationship of the taxa under
study to their near relatives. 
The direct method (Weston 1988) states that:- if one character state is
possessed by all of the taxa that also possess the alternative state,
and in addition it is possessed by some taxa that don't possess the
other state, then it is postulated to be the ancestral state. For
example, open gill slits are possessed by all chordates in at least the
embryonic stage, but in tetrapod chordates these close early in
development; thus, all chordates possess open gill slits but only some
possess both open gill slits (early in development) and closed gill
slits (later in development). Consequently, possession of closed gill
slits is hypothesised to be derived relative to possession of open gill
slits. This type of argumentation can be applied to many types of
characters (Nelson 1978; Weston 1988, 1994), but it is probably of
limited utility for molecular data. However, Weston (1994) has pointed
out that the direct method can be applied to gene duplications
(paralogy) to polarise the alpha and beta subunits of ATPase based on taxa
from the archaebacteria, eubacteria and eukaryotes, thus providing a
"root" for the tree of life. 
The outgroup analysis (or indirect) method (Watrous & Wheeler 1981)
states that:- if a character state is found in both the ingroup (the
group of taxa under study) and also in the outgroup (the sister group
of taxa), then it is postulated to be the ancestral state. For this
method to provide unequivocal evidence of character polarity, the
outgroup should consist of at least two sequential sister groups
(Maddison et al. 1984). For example, in Figure 1, if we were interested
in the phylogeny of the tetrapods (the ingroup thus consisting of Frogs
+ Salamanders + Turtles + Lizards + Snakes + Crocodiles + Birds +
Mammals) then the relevant sister groups (the outgroup) would be at
least the Lungfish and the Ray-finned fish. Thus for Character 8 in
Table 1, possession of an amniotic egg is hypothesised to be
apomorphous relative to the lack of such an egg because all members of
the outgroup and some members of the ingroup lack it while only some
members of the ingroup possess it. This type of argumentation relies on
the existence of a corroborated higher level phylogeny for the taxa
being studied, because we need a priori knowledge of the sister groups
of the ingroup. Such higher level phylogenies in turn may rely on other
outgroup comparisons, and so on in a regress back to the origin of
life. Ultimately, we must rely on the direct method for at least some
of the characters in some of the analyses. 
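The outgroup rule can likewise be sketched in Python, using the amniotic egg example. As a simplifying assumption of this sketch, a polarity is returned only when all of the outgroup taxa agree on a single state that also occurs in the ingroup; anything else is reported as equivocal. The function name is invented here.

```python
def polarise_by_outgroup(ingroup_states, outgroup_states):
    """Return the state postulated as ancestral, or None if equivocal."""
    outgroup_seen = set(outgroup_states.values())
    ingroup_seen = set(ingroup_states.values())
    # Assumption of this sketch: the outgroups must agree with one another.
    if len(outgroup_seen) == 1 and outgroup_seen <= ingroup_seen:
        return next(iter(outgroup_seen))
    return None

# Amniotic egg, coded as described in the text for the tetrapod example.
OUTGROUP = {"Lungfish": "absent", "Ray-finned fish": "absent"}
INGROUP = {
    "Frogs": "absent", "Salamanders": "absent",
    "Turtles": "present", "Lizards": "present", "Snakes": "present",
    "Crocodiles": "present", "Birds": "present", "Mammals": "present",
}

ANCESTRAL_STATE = polarise_by_outgroup(INGROUP, OUTGROUP)
print(ANCESTRAL_STATE)  # 'absent', so possession of the egg is derived
```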
Outgroup analysis is the most common type of argumentation in cladistic
analysis, but it is now usually implemented in a different way from its
original formulation, particularly when analysing molecular data (Smith
1994). Instead of first determining character state polarity and then
producing the cladogram, it is now more common to produce an unrooted
cladogram (a network) based on simultaneous analysis of all of the
characters of both the ingroup and the outgroup and then to determine
the root of the tree using the position where the outgroup joins the
ingroup (see below). This type of method was first described by Farris
(1972), and Nixon & Carpenter (1993) provide a recent detailed summary.
In this method, there are no guarantees that any particular number of
outgroups or any particular choice of outgroups will ensure that the
cladogram accurately reflects the evolutionary history (Li et al. 1987;
Smith 1994; Adachi & Hasegawa 1995). In practice, the phylogenetic
inferences will be more robust if more outgroup taxa are chosen rather
than fewer, and if these taxa are closely related to the ingroup
(Hendy & Penny 1989; Wheeler 1990; Nixon & Carpenter 1993). For
molecular data, a distantly-related outgroup may have surpassed the
point of saturation of base substitutions, and there will thus be a
loss of phylogenetic signal through evolutionary time as a result of
random sequence noise (Smith 1994). The outgroup may thus, in effect,
be random, and as such, it will simply join the rest of the cladogram
on the longest internal branch in the ingroup and will itself have a
long terminal branch (Wheeler 1990). The best strategy may be for the
outgroup to consist of several taxa from the sister group to the
ingroup (Smith 1994). 
Conclusion
Ideally, all hypotheses of homology and relative apomorphy will be
congruent with one another. That is, the sets of synapomorphies for a
group of taxa will form a perfect nested hierarchy, and the
polarisation of the character states provides a root for the tree. The
construction of a cladogram from the data would then be unproblematic
(Figure 1); and it was indeed under these circumstances that Hennig
(1966) introduced his phylogenetic method. In the real world,
however, there are almost always characters that are incongruent with one
another (homoplasies). These incongruent characters are postulated to
result from either reversals, where a derived character state reverts
to the ancestral state, parallelisms, where the same derived character
state arises in separate evolutionary lineages, or convergences, where
superficially similar character states have arisen in separate
lineages. Homoplasies are thus mistaken hypotheses of homology. 
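Incongruence between polarised characters can be detected with a simple pair-wise test: two characters are congruent if the sets of taxa showing their derived states are nested or disjoint, and incongruent if the sets overlap only partially. The Python sketch below applies this test to hypothetical codings (the character labels and distributions are invented, and the function name is an assumption of the sketch).

```python
from itertools import combinations

def conflicting_pairs(derived_sets):
    """Return pairs of characters whose derived-state groups partially overlap."""
    conflicts = []
    for (name_a, a), (name_b, b) in combinations(derived_sets.items(), 2):
        nested_or_disjoint = a <= b or b <= a or a.isdisjoint(b)
        if not nested_or_disjoint:
            conflicts.append((name_a, name_b))
    return conflicts

# Hypothetical distributions of derived states among four taxa.
DERIVED = {
    "c1": {"Lizard", "Crocodile", "Bird", "Mammal"},
    "c2": {"Crocodile", "Bird"},
    "c3": {"Bird", "Mammal"},  # overlaps c2 without nesting
}

CONFLICTS = conflicting_pairs(DERIVED)
print(CONFLICTS)
```

The test shows only that c2 and c3 cannot both be true synapomorphies on a single tree; it does not say which of the two is the homoplasy.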
In order to deal with the common existence of homoplasies, a range of
tree-building techniques has been developed in cladistics. Each
technique implements a different stratagem for how homoplasies are to
be treated, and each technique may thus produce a different hypothesis
of phylogenetic history for the taxa being analysed. If a data set is
perfectly congruent (i.e. all of the characters reflect the same
speciation and phylesis events), then all of the techniques will
produce the same cladogram, which will be the cladogram produced by the
original method of Hennig (1966). However, apparent incongruences
almost always exist in the real world, and therefore a range of methods
has been developed to try to detect the phylogenetic pattern that
underlies the apparent contradictions. 
Cladistics is basically axiomatic, in the sense that if the assumptions
(the axioms) are accepted then the rest (the corollaries) follows
directly from them. Thus, like all axiomatic propositions, if one or
more of the assumptions are invalid then the rest of the edifice falls
with them. The key assumption in cladistics is that evolution is mainly
a divergent process. Homoplasies are thus assumed to represent
"errors", rather than evidence in favour of some form of alternative
evolutionary process (i.e. a process that does not produce a tree-like
set of relationships), such as hybridization, endosymbiosis,
recombination, gene duplication (producing pseudogenes), or lateral
transfer, all of which are best represented by an anastomosing network
rather than a tree. You might like to keep this in mind when you next
attempt to interpret an apparently fully-bifurcating cladogram, as such
alternative evolutionary processes are increasingly being recognised as
relatively common. 
from http://www.sasb.org.au/TreeBuild/TreeBuilding2.html 
© Copyright 1998 Society of Australian Systematic Biologists



                