We know the distance between elements
Distances are in a matrix that we call \(D_1\)
| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |
Example data from Wikipedia
What are its properties?
Which matrices can be representations of distances?
The length of each edge \((i,j)\) is \(\text{len}(i,j)\)
We can calculate the distance between any pair of nodes
\[ D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise } \end{cases} \]
The minimum is taken over all neighbors \(j\) of \(i\)
| | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 11 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 11 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |
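The recursive definition above can be sketched in Python. The edge lengths below are an assumption: they are hypothetical values chosen for illustration, and under that assumption the path lengths reproduce the table above. The function name `tree_distances` is also illustrative.

```python
from collections import defaultdict

# Hypothetical edge lengths, chosen for illustration only.
EDGES = {("a", "c"): 11, ("b", "c"): 2, ("c", "d"): 4, ("d", "e"): 6, ("d", "f"): 7}


def tree_distances(edges):
    """Return dist[i][u], the path length between every pair of nodes."""
    neighbors = defaultdict(dict)
    for (i, j), length in edges.items():
        neighbors[i][j] = length
        neighbors[j][i] = length

    dist = {}
    for start in neighbors:
        dist[start] = {start: 0}
        stack = [start]
        while stack:
            i = stack.pop()
            for j, length in neighbors[i].items():
                if j not in dist[start]:  # in a tree, each node is reached once
                    dist[start][j] = dist[start][i] + length
                    stack.append(j)
    return dist


D = tree_distances(EDGES)
print(D["a"]["f"])  # path a-c-d-f: 11 + 4 + 7
```

Because a tree has exactly one path between two nodes, the minimum in the formula never has to choose between alternatives; a simple traversal that accumulates lengths is enough.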
Let’s change only one value
| | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 9 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 9 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |
The matrix is still symmetric with a zero diagonal, but it cannot be drawn nicely as a tree: for instance, \(D(a,b)=13\) now exceeds \(D(a,c)+D(c,b)=9+2=11\)
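One way to test whether a matrix can be drawn as a tree is the four-point condition: for any four elements, the two largest of the three pairwise sums must be equal. The sketch below allows repeated elements, so it also covers the triangle inequality. The function name and the tolerance are illustrative assumptions.

```python
from itertools import combinations_with_replacement

NAMES = "abcdef"
TREE = [
    [0, 13, 11, 15, 21, 22],
    [13, 0, 2, 6, 12, 13],
    [11, 2, 0, 4, 10, 11],
    [15, 6, 4, 0, 6, 7],
    [21, 12, 10, 6, 0, 13],
    [22, 13, 11, 7, 13, 0],
]
# Same matrix with D(a,c) changed from 11 to 9.
CHANGED = [row[:] for row in TREE]
CHANGED[0][2] = CHANGED[2][0] = 9


def four_point_violation(D, tol=1e-9):
    """Return the first group of four labels breaking the condition, or None."""
    n = len(D)
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if sums[2] - sums[1] > tol:  # the two largest sums differ
            return tuple(NAMES[x] for x in (i, j, k, l))
    return None


print(four_point_violation(TREE))     # no violation expected for a tree
print(four_point_violation(CHANGED))  # reports a group that breaks the condition
```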
The idea is to group similar nodes in the same branch
The smallest distance in \(D_1\) is \(D_1 (a,b)=17\)
| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |
So, \(a\) and \(b\) are the closest elements
Before joining, we have
We create a new node \((a,b)\)
We connect \(a\) and \(b\) to \((a,b)\), splitting their distance
We build a new matrix \(D_2\) with the average distance of each element to \((a,b)\)
\[ \begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]
The matrix \(D_2\) is
| | (a,b) | c | d | e |
|---|---|---|---|---|
| (a,b) | 0 | **25.5** | **32.5** | **22** |
| c | **25.5** | 0 | *28* | *39* |
| d | **32.5** | *28* | 0 | *43* |
| e | **22** | *39* | *43* | 0 |
(values in bold are new; the ones in italics did not change)
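The joining step above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the helper names `closest_pair` and `wpgma_join` are hypothetical, and the matrix is stored as a dictionary of dictionaries.

```python
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}


def closest_pair(D):
    """Return the pair of distinct labels with the smallest distance."""
    labels = list(D)
    pairs = [(x, y) for i, x in enumerate(labels) for y in labels[i + 1:]]
    return min(pairs, key=lambda p: D[p[0]][p[1]])


def wpgma_join(D, x, y):
    """Merge x and y; distances to the new node are simple averages."""
    new = f"({x},{y})"
    others = [k for k in D if k not in (x, y)]
    merged = {new: {new: 0}}
    for k in others:
        avg = (D[x][k] + D[y][k]) / 2
        merged[new][k] = avg
        merged[k] = {new: avg}
    for k in others:
        for m in others:
            merged[k][m] = D[k][m]  # distances between untouched elements
    return merged


x, y = closest_pair(D1)
D2 = wpgma_join(D1, x, y)
print(x, y, D2["(a,b)"])
```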
Now the smallest distance is \(D_2((a,b),e)=22\).
We must join \((a,b)\) and \(e\)
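The last slide shows that the average of the distances \(D_1(a,e)\) and \(D_1(b,e)\) is 22, so \(e\) is at distance 22/2 from the new node \(((a,b),e)\)
But the node \((a,b)\) is already at distance 17/2 from the leaves \(a\) and \(b\)
So the distance between \((a,b)\) and the new node \(((a,b),e)\) is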
\[ \frac{D_2((a,b),e)}{2} - \frac{D_1(a,b)}{2} =\frac{22-17}{2} = 2.5 \]
Notice that this distance is used in the drawing but not in the matrix
\[ \begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}\\ = & \frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}\\ = & \frac{32.5 + 43}{2}=37.75 \end{aligned} \]
The matrix \(D_3\) is
| | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 32.25 | 37.75 |
| c | 32.25 | 0 | 28 |
| d | 37.75 | 28 | 0 |
Now the closest elements are \(c\) and \(d\)
The distance from \(c\) and \(d\) to the new node \((c,d)\) is 28/2
No correction is necessary, since there are no nodes below them
We calculate the only remaining distance
\[ \begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2} =35 \end{aligned} \]
The new matrix is
| | ((a,b),e) | (c,d) |
|---|---|---|
| ((a,b),e) | 0 | 35 |
| (c,d) | 35 | 0 |
We can represent the complete tree by \[(((a,b),e),(c,d))\]
The parentheses show how the elements are connected
But the branch lengths are missing
We can write the distance to the parent after each node label
\[(((a\colon D_a, b\colon D_b)\colon D_{ab},e\colon D_e)\colon D_{abe},(c\colon D_c,d\colon D_d)\colon D_{cd});\]
The resulting tree
can be written (including labels of internal nodes) as \[(((a\colon 8.5, b\colon 8.5)w\colon 2.5,e\colon 11)v\colon 6.5,(c\colon 14,d\colon 14)u\colon 3.5)r;\]
This is called Weighted Pair Group Method with Arithmetic Mean (WPGMA)
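The whole procedure can be sketched as a short loop. This is an illustrative sketch under a few assumptions: the function name `wpgma` is hypothetical, internal nodes are left unlabeled, and the order of the two children inside each pair of parentheses may differ from the slides (the tree is the same).

```python
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}


def wpgma(D):
    """Return a parenthesized tree with branch lengths for distance dict D."""
    D = {x: dict(row) for x, row in D.items()}  # work on a copy
    height = {x: 0.0 for x in D}  # height of each node above the leaves
    text = {x: x for x in D}      # text of each subtree
    while len(D) > 1:
        labels = list(D)
        pairs = [(x, y) for i, x in enumerate(labels) for y in labels[i + 1:]]
        x, y = min(pairs, key=lambda p: D[p[0]][p[1]])
        new = f"({x},{y})"
        height[new] = D[x][y] / 2
        # Branch length = height of the parent minus height of the child.
        text[new] = (f"({text[x]}:{height[new] - height[x]:g},"
                     f"{text[y]}:{height[new] - height[y]:g})")
        # Simple average of the two old rows, ignoring group sizes.
        row = {k: (D[x][k] + D[y][k]) / 2 for k in labels if k not in (x, y)}
        del D[x], D[y]
        for k in D:
            del D[k][x], D[k][y]
            D[k][new] = row[k]
        D[new] = dict(row)
        D[new][new] = 0
    return text[new] + ";"


tree = wpgma(D1)
print(tree)
```

The correction of 2.5 for the node \((a,b)\) appears here as the difference between the height of the parent (11) and the height of \((a,b)\) (8.5).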
There are other hierarchical clustering methods, depending on how we evaluate the distance between groups
In WPGMA we mix groups of different sizes
Node ((a,b),e) has three sequences, and (c,d) has two
“bigger nodes” should have more weight
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
The distance between two branches is the average of all distances \(D(x,y)\) between objects \(x\) in one branch and \(y\) in the other. When branches \(A\) and \(B\), of sizes \({N_A}\) and \({N_B}\), are joined, the distance to any other branch \(X\) is
\[ D((A,B),X) = \frac{N_A \cdot D(A,X) + N_B \cdot D(B,X)}{N_A + N_B} \]
\[ \begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}\\ & =\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]
The first step is the same as before
\[ \begin{aligned} D_3(((a,b),e),c)&=\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}\\ & =\frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)&=\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}\\ & =\frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned} \]
| | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 30 | 36 |
| c | 30 | 0 | 28 |
| d | 36 | 28 | 0 |
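The size-weighted update can be written as a one-line function. The sketch below is only illustrative; the name `upgma_distance` and its argument names are hypothetical.

```python
def upgma_distance(d_ax, d_bx, n_a, n_b):
    """Size-weighted average distance from the merged branch (A,B) to X."""
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)


# Joining (a,b), which has two leaves, with e, which has one (values from D_2).
to_c = upgma_distance(25.5, 39, 2, 1)
to_d = upgma_distance(32.5, 43, 2, 1)
print(to_c, to_d)

# With two single leaves the weights are equal, as in the first step.
first_step = upgma_distance(21, 30, 1, 1)
print(first_step)
```

When both branches have the same size the result is the plain average, which is why the first step gives the same values as WPGMA.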
In practice UPGMA is more realistic than WPGMA
But both have a problem:
The distances between leaves do not match the original distances
Moreover, the mutation rate may be different for different branches
If we know the tree topology, we can find the branches’ lengths
We minimize the sum of squared differences between the observed distances \(D(i,j)\) and the tree distances \(d_{ij}\)
\[\min_{d_{ij}} \sum_{i,j}(D(i,j)-d_{ij})^2\]
But we still need to find the tree topology, and that is a hard problem.
Let \(a\) and \(b\) be two siblings in a nice tree, \(c\) their parent node, and \(e\) any other leaf
\[ \begin{aligned} D(a,b) =& D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) =& D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) =& D(b,c) + D(c,e)\qquad(\text{eq. }3)\\ \end{aligned} \]
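As a worked step, adding eq. 2 and eq. 3 and then subtracting eq. 1 cancels the terms \(D(a,c)\) and \(D(b,c)\), assuming the distances are symmetric, that is \(D(c,b)=D(b,c)\):
\[ \begin{aligned} D(a,e)+D(b,e)-D(a,b) =& \big(D(a,c) + D(c,e)\big) + \big(D(b,c) + D(c,e)\big) - \big(D(a,c) + D(c,b)\big)\\ =& 2\,D(c,e) \end{aligned} \]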
\[D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}\]
So if we only know the distances between leaves \(a, b\) and \(e,\) and we add internal node \(c,\) this is how we find the distance \(D(c,e)\)
Neighbor Joining tries to build a nice tree
It is a heuristic to solve the minimization problem
Instead of joining the nearest nodes in the distance matrix, we look into a new matrix \(Q\)
\[Q(i,j) = (n-2) D(i,j) -\sum_k D(i,k) -\sum_k D(k,j)\]
This “neighbor-joining” distance can be negative
| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |
For each \(i,j \in \{a,b,c,d,e\}\), \(i \neq j\), we have \[Q(i,j) = (n-2) D(i,j) - R_i - R_j\] where \(R_i = \sum_k D(i,k)\) is the sum of row \(i\) and \(n=5\)
| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | -143 | -147 | -135 | -149 |
| b | -143 | 0 | -130 | -136 | -165 |
| c | -147 | -130 | 0 | -170 | -127 |
| d | -135 | -136 | -170 | 0 | -133 |
| e | -149 | -165 | -127 | -133 | 0 |
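The matrix above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `q_matrix` is hypothetical and the diagonal is set to 0 by convention.

```python
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}


def q_matrix(D):
    """Q(i,j) = (n-2) D(i,j) - R_i - R_j, where R_i is the sum of row i."""
    n = len(D)
    R = {i: sum(D[i].values()) for i in D}
    return {
        i: {j: 0 if i == j else (n - 2) * D[i][j] - R[i] - R[j] for j in D}
        for i in D
    }


Q = q_matrix(D)
pairs = [(i, j) for i in D for j in D if i < j]
best = min(pairs, key=lambda p: Q[p[0]][p[1]])
print(best, Q[best[0]][best[1]])
```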
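The smallest value of \(Q\) is \(Q(c,d)=-170\), so we join \(c\) and \(d\) in a new node \(u\)
\[ \begin{aligned} D(c, u) =& \frac{D(c,d)}{2} + \frac{R_c -R_d}{2(5-2)}\\ =& \frac{28}{2} + \frac{118-136}{6} = 11\\ \end{aligned} \]
where \(R_c=118\) and \(R_d=136\) are the sums of the rows of \(c\) and \(d\)
\[ \begin{aligned} D(d, u) = & D(c,d) - D(c,u)\\ =& 28 - 11 = 17\\ \end{aligned} \]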
For each \(k \in \{a,b,e\}\) we have \[D(u,k) = \frac{1}{2}(D(c,k) + D(d,k) - D(c,d))\]
| | a | b | u | e |
|---|---|---|---|---|
| a | 0 | 17 | 12 | 23 |
| b | 17 | 0 | 18 | 21 |
| u | 12 | 18 | 0 | 27 |
| e | 23 | 21 | 27 | 0 |
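One joining step, with the two branch lengths and the reduced matrix, can be sketched as follows. The function name `nj_join` and its return format are assumptions made for illustration.

```python
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}


def nj_join(D, i, j, new):
    """Join i and j into `new`; return both branch lengths and the new matrix."""
    n = len(D)
    R = {k: sum(D[k].values()) for k in D}
    len_i = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    len_j = D[i][j] - len_i
    others = [k for k in D if k not in (i, j)]
    # Distance from the new node to every remaining element.
    to_new = {k: (D[i][k] + D[j][k] - D[i][j]) / 2 for k in others}
    reduced = {k: {m: D[k][m] for m in others} for k in others}
    for k in others:
        reduced[k][new] = to_new[k]
    reduced[new] = {**to_new, new: 0}
    return len_i, len_j, reduced


len_c, len_d, D_next = nj_join(D, "c", "d", "u")
print(len_c, len_d, D_next["u"])
```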
| | a | b | u | e |
|---|---|---|---|---|
| a | 0 | -74 | -85 | -77 |
| b | -74 | 0 | -77 | -85 |
| u | -85 | -77 | 0 | -74 |
| e | -77 | -85 | -74 | 0 |
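The smallest value of \(Q\) is \(-85\), shared by the pairs \((a,u)\) and \((b,e)\); we join \(a\) and \(u\) in a new node \(v\)
\[ \begin{aligned} D(a, v) =& \frac{D(a,u)}{2} + \frac{R_a - R_u}{2(4-2)} = \frac{12}{2} + \frac{52-57}{4} = 4.75\\ D(u, v) =& D(a,u) - D(a,v) = 12 - 4.75 = 7.25\\ \end{aligned} \]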
| | v | b | e |
|---|---|---|---|
| v | 0 | 11.5 | 19 |
| b | 11.5 | 0 | 21 |
| e | 19 | 21 | 0 |
| | v | b | e |
|---|---|---|---|
| v | 0 | -51.5 | -51.5 |
| b | -51.5 | 0 | -51.5 |
| e | -51.5 | -51.5 | 0 |
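All values of \(Q\) are equal, so any pair can be joined; we join \(v\) and \(b\) in a new node \(w\)
\[ \begin{aligned} D(v, w) = & \frac{D(v,b)}{2} + \frac{R_v - R_b}{2(3-2)} = \frac{11.5}{2} + \frac{30.5-32.5}{2} = 4.75\\ D(b, w) = & D(v,b) - D(v,w) = 11.5 - 4.75 = 6.75\\ \end{aligned} \]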
| | w | e |
|---|---|---|
| w | 0 | 14.25 |
| e | 14.25 | 0 |
thus \(D(w, e) = 14.25\)
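The full sequence of joins can be sketched as a loop. This is an illustrative sketch, not a reference implementation: the function name `neighbor_joining` is hypothetical, ties in \(Q\) are broken by taking the first pair in matrix order, and the new node takes the position of the first joined element. These choices are assumptions made so that the order of joins follows the slides.

```python
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}


def neighbor_joining(D, names):
    """Return {(node, other_end): branch length} for the joined tree."""
    D = {i: dict(row) for i, row in D.items()}  # work on a copy
    names = iter(names)
    branches = {}
    while len(D) > 2:
        n = len(D)
        labels = list(D)
        R = {k: sum(D[k].values()) for k in labels}
        pairs = [(i, j) for a, i in enumerate(labels) for j in labels[a + 1:]]
        # min() returns the first pair when several share the smallest Q.
        i, j = min(pairs, key=lambda p: (n - 2) * D[p[0]][p[1]] - R[p[0]] - R[p[1]])
        new = next(names)
        branches[(i, new)] = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
        branches[(j, new)] = D[i][j] - branches[(i, new)]
        # The new node takes the position of i; j is removed.
        order = [new if k == i else k for k in labels if k != j]

        def dist(p, q):
            if p == q:
                return 0
            if p == new:
                return (D[i][q] + D[j][q] - D[i][j]) / 2
            if q == new:
                return dist(q, p)
            return D[p][q]

        D = {p: {q: dist(p, q) for q in order} for p in order}
    x, y = D  # the last two nodes are connected directly
    branches[(x, y)] = D[x][y]
    return branches


branches = neighbor_joining(D1, "uvw")
for edge, length in branches.items():
    print(edge, length)
```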
Redo all the calculations for these trees