Let’s think about bacteria
If an organism X evolves into two new organisms A and B, both new organisms share something in common
For example
X: TGGGGCAAGTCGGATCCAGATGGGCGCTAC
A: TGGGGCAAGTCGGATCCAGATGGGCGCTAT
B: TAGGGCAAGTCGGATCCAGATGGGCGCTAC
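As a quick check, the differences between these three sequences can be counted position by position. This is a minimal Python sketch; the function name `hamming` is only an illustrative choice.

```python
# Sequences copied from the example above.
X = "TGGGGCAAGTCGGATCCAGATGGGCGCTAC"
A = "TGGGGCAAGTCGGATCCAGATGGGCGCTAT"
B = "TAGGGCAAGTCGGATCCAGATGGGCGCTAC"

def hamming(s, t):
    """Count the positions where two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

print(hamming(X, A))  # 1 (last base differs)
print(hamming(X, B))  # 1 (second base differs)
print(hamming(A, B))  # 2 (A and B each carry one change from X)
```

Each descendant is one change away from X, and the two descendants are two changes away from each other.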
We would see evolution like this
So we only see the modern organisms
How can we reconstruct the original tree, given only the modern sequences?
DNA replication is not 100% perfect
Mutations can be
Not all mutations are “accepted”
Probably most mutations are lethal
We only see mutations that keep the organism alive
Some mutations can give an advantage
Other mutations are neutral
In the short term, all viable organisms are alive
In the long term, and when resources are scarce, some organisms do not survive
For example, some organisms may be more efficient in capturing food or using energy
If the environment changes, the “fitness” changes
Evolution is more complex for sexual organisms
Some individuals do not pass their genes to the next generation, due to mate-selection
Mate-selection also evolves
We say that phenotype and mate-selection co-evolve
“Every morning in Africa, a gazelle wakes up, it knows it must run faster than the fastest lion or it will be killed.
“Every morning in Africa, a lion wakes up, it knows it must run faster than the slowest gazelle, or it will starve.
“It doesn’t matter whether you’re the lion or a gazelle: when the sun comes up, you’d better be running.”
For this class we will consider the 16S gene in bacteria
Looking only at the modern data, we cannot know which sequence existed before
That is, we cannot put an arrow between two nodes
We put a link, undirected, between nodes
These trees are called unrooted
Since we only see leaves, we cannot put arrows
So we cannot tell which internal node is the root
But if we include a leaf that we know is very distant from all the others (an outgroup), then we can find the root.
The same tree can be drawn in several ways
The drawing is not important
The only important things are
The tree topology. That is, which nodes are connected to which
The length of each arc (or edge)
There are basically three approaches
In all cases the input is a multiple alignment of all sequences
If we know the tree topology, we can count how many mutations are needed to match our data
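One standard way to do this counting for a fixed topology is Fitch's small-parsimony algorithm. The slides do not name a method, so treat it as an illustrative choice. The sketch below assumes a rooted binary tree written as nested tuples with leaf names as strings. The four short sequences and both tree shapes are made up for the example.

```python
def fitch_site(tree, leaf_base):
    """Return (candidate bases, minimum mutation count) for one alignment column."""
    if isinstance(tree, str):                 # a leaf: its base is known
        return {leaf_base[tree]}, 0
    left, right = tree
    left_set, left_count = fitch_site(left, leaf_base)
    right_set, right_count = fitch_site(right, leaf_base)
    common = left_set & right_set
    if common:                                # children agree: no extra mutation
        return common, left_count + right_count
    # children disagree: one mutation is needed on this part of the tree
    return left_set | right_set, left_count + right_count + 1

def parsimony_score(tree, sequences):
    """Sum the per-column mutation counts over an alignment."""
    length = len(next(iter(sequences.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total

# Hypothetical aligned sequences and two candidate topologies.
sequences = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA", "s4": "TCGT"}
tree_a = (("s1", "s2"), ("s3", "s4"))
tree_b = (("s1", "s3"), ("s2", "s4"))

print(parsimony_score(tree_a, sequences))  # 3
print(parsimony_score(tree_b, sequences))  # 4
```

Under the parsimony criterion, `tree_a` would be preferred here because it explains the same data with fewer mutations.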
But the number of trees is HUGE
\[n^{n-2}\]
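To get a feeling for how fast this grows, the formula above can be evaluated for a few values of \(n\). This is a small Python sketch; the function name `tree_count` is an illustrative choice.

```python
def tree_count(n):
    """Evaluate the n^(n-2) formula shown above."""
    return n ** (n - 2)

for n in (4, 10, 20):
    print(n, tree_count(n))
# 4 16
# 10 100000000
# 20 262144000000000000000000
```

Even with 20 sequences there are far too many trees to test one by one.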
So the search has to be done with heuristics
In some simulations the predicted tree may be very different from the real one
It can be statistically inconsistent
An alternative is to find the most probable tree, given the available data
This method needs:
So, again, we need a heuristic
We already discussed them
UPGMA
Neighbor Joining
Here we use the Hamming or Levenshtein distance between sequences after multiple sequence alignment
Mutation rate is not proportional to time
Multiple substitutions of the same base cannot be observed
TATCGACTTCGGCAT
TATCGACGTCGGCAT
TATCGACTTCGGCAT
TATCGACTACGGCAT
TATCGACTTCGGCAT
So we underestimate the divergence time
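The five sequences above can be read as successive steps of one lineage. The sketch below, in Python, counts the mutations that happened along the way and compares them with what we observe when looking only at the first and last sequence. The variable names are illustrative.

```python
# The five sequences from the example above, in order.
history = [
    "TATCGACTTCGGCAT",
    "TATCGACGTCGGCAT",
    "TATCGACTTCGGCAT",
    "TATCGACTACGGCAT",
    "TATCGACTTCGGCAT",
]

def hamming(s, t):
    """Count the positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))

# Mutations that really happened: one comparison per consecutive pair.
actual = sum(hamming(a, b) for a, b in zip(history, history[1:]))
# What we can observe today: only the endpoints.
observed = hamming(history[0], history[-1])

print(actual)    # 4
print(observed)  # 0
```

Four substitutions took place, but the first and last sequences are identical, so the observed distance is zero.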
There are different models to find time given distance
According to the Jukes-Cantor model
\[R = -\frac{3}{4}\ln\left(1-\frac{4}{3}D/L\right)\]
Here \(D/L\) is the fraction of sites with different nucleotides (Hamming distance over length)
\(R\) is the expected number of mutations per site that really happened
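The formula is easy to evaluate directly. The following Python sketch assumes `d` is the Hamming distance and `length` is the alignment length; the function name `jukes_cantor` and the example values are illustrative.

```python
import math

def jukes_cantor(d, length):
    """Corrected distance R from the Jukes-Cantor formula above."""
    p = d / length                      # fraction of differing sites, D/L
    if p >= 0.75:
        # The logarithm is undefined when 1 - (4/3) * p <= 0.
        raise ValueError("D/L must be below 0.75 for the correction to exist")
    return -0.75 * math.log(1 - (4 / 3) * p)

# Example: 3 differing sites in an alignment of length 30 (D/L = 0.1).
print(jukes_cantor(3, 30))
```

For small \(D/L\) the corrected value is only slightly larger than \(D/L\). The correction grows quickly as \(D/L\) approaches 0.75.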
It is hard to build time machines, and we only get an approximate answer