Clustal was the first popular multiple sequence aligner
Versions: Clustal 1, Clustal 2, Clustal 3, Clustal 4, Clustal V, Clustal W, Clustal X, Clustal Ω
Only the last one, Clustal Ω (Clustal Omega), is the current, recommended version
First, we need a scoring function
Clustal uses a simple one: the sum of the pairwise scores over all pairs of sequences (“sum of pairs”) \[\begin{aligned} \text{Score}_k(s_{1}[i_1],\ldots,s_{k}[i_k])= & \text{Score}_2(s_{1}[i_1],s_{2}[i_2]) + \\ & \text{Score}_2(s_{1}[i_1],s_{3}[i_3]) + \cdots+ \\ & \text{Score}_2(s_{k-1}[i_{k-1}],s_{k}[i_k]) \end{aligned}\] where \(\text{Score}_2(s_{a}[i],s_{b}[j])\) is taken from PAM, BLOSUM, or a similar substitution scoring matrix
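As a minimal sketch of the sum-of-pairs idea, the following scores one alignment column by adding up the pairwise scores of all residue pairs in it. The function `score2` is a toy stand-in (+1 for a match, -1 for a mismatch) and is an assumption made for illustration; a real aligner would look the value up in a PAM or BLOSUM matrix and would also handle gaps, which this sketch ignores.

```python
from itertools import combinations


def score2(a: str, b: str) -> int:
    """Toy pairwise score (assumption): +1 if residues match, -1 otherwise.

    In practice this would be a lookup in a PAM or BLOSUM matrix.
    """
    return 1 if a == b else -1


def score_k(column: str) -> int:
    """Sum-of-pairs score of one column, with one residue per sequence."""
    return sum(score2(a, b) for a, b in combinations(column, 2))


# Three sequences, all with the same residue: 3 pairs, each scoring +1.
print(score_k("AAA"))
# One sequence differs: one matching pair and two mismatching pairs.
print(score_k("AAC"))
```

With \(k\) sequences the column contains \(k(k-1)/2\) pairs, so the cost of scoring a column grows quadratically with the number of sequences.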
We start by comparing sequences all-to-all
That is, comparing all pairs of sequences
We store the pairwise scores in a distance matrix
How many pairs are there for \(N\) sequences?
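Each unordered pair of sequences is compared once, so \(N\) sequences give \(N(N-1)/2\) pairs. A short sketch that checks this count by listing the pairs explicitly (the sequence names are placeholders chosen for illustration):

```python
from itertools import combinations


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Placeholder sequence names (assumption, for illustration only).
names = ["s1", "s2", "s3", "s4"]
pairs = list(combinations(names, 2))

print(pairs)
print(len(pairs), n_pairs(len(names)))
```

For 100 sequences this already means 4950 pairwise comparisons, which is why the all-to-all step dominates the running time for large inputs.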
Once we have all pairwise “distances” (that is, scores), we build a guide tree by hierarchical clustering
Start with one leaf node for each sequence, and no branches
Work bottom-up: join the two closest nodes, one pair at a time, until a single tree remains
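The bottom-up joining can be sketched as follows. This is an illustration using average linkage (a UPGMA-style rule), which is an assumption made here; the notes do not say which clustering rule a given Clustal version applies. The function name `guide_tree` and the input format (a dictionary keyed by `frozenset` pairs) are also hypothetical choices for this sketch.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Bottom-up clustering with average linkage.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree as nested tuples of leaf names.
    """
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]

    def avg_dist(c1, c2):
        # Average leaf-to-leaf distance between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        # Delete the larger index first so the smaller one stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]


# Made-up distances: A and B are close, C and D are close.
names = ["A", "B", "C", "D"]
dist = {frozenset(p): 8.0 for p in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 1.0
dist[frozenset(("C", "D"))] = 2.0

tree = guide_tree(names, dist)
print(tree)
```

Each pass through the loop removes two clusters and adds one, so \(N\) leaves are joined in exactly \(N-1\) steps.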
The guide tree is built from pairwise comparisons only, without seeing the big picture
So it is not safe to assign any biological meaning to it
We will talk more about trees and build phylogenetic trees later
Clustal aligns the sequences following the guide tree
First, it aligns the two most similar sequences
Then it adds the next nearest sequence (or group of sequences), and so on
These are semi-global alignments
Each step uses \(\text{Score}_k()\) when \(k\) sequences are being aligned
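To make "following the guide tree" concrete, here is a small sketch that walks a guide tree and lists the alignment steps in the order they would be carried out. The nested-tuple tree format and the name `alignment_order` are assumptions for illustration; the sketch only derives the order of the merges and does not perform the alignments themselves.

```python
def alignment_order(tree, steps=None):
    """Post-order walk of a guide tree given as nested tuples of names.

    Returns (leaves under this node, list of merge steps so far).
    Each step is a pair (left group, right group) to be aligned together.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        # A leaf is a single sequence: nothing to align yet.
        return [tree], steps
    left, right = tree
    left_leaves, _ = alignment_order(left, steps)
    right_leaves, _ = alignment_order(right, steps)
    # Both subtrees are already aligned internally; now align them together.
    steps.append((left_leaves, right_leaves))
    return left_leaves + right_leaves, steps


# Hypothetical guide tree: A and B are closest, then C joins, then D.
tree = ((("A", "B"), "C"), "D")
leaves, steps = alignment_order(tree)

for left_group, right_group in steps:
    print(left_group, "+", right_group)
```

A subtree is always aligned before it is merged with its sibling, so the groups grow from pairs of similar sequences up to the full set.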