“Distance” means that smaller numbers are better
If the distance is 0, then the two sequences are identical
If the distance is small, the two sequences are similar
If the distance is big, the sequences are different
Wikipedia
ABRACADABRA
CHUPACABRA
Extrernal gaps do not count
ABRACADABRA
||||
CHUPACABRA
Since external gaps do not count, we can always use them
In this case the smallest distance is always 0
The smallest cost is achieved taking one letter
ABRACADABRA
|
CHUPACABRA
We cannot solve the minimization problem
Instead of finding a minimum, we look for a maximum
The philosophy is the same, but we look for big numbers instead of small numbers
Initially we placed 1 or 0 on each cell depending on match or mismatch
Now we put positive or negative numbers, depending of single-letter scores
Once the “dot plot matrix” is complete, it is easy to find the optimal score
Global alignment: find the largest sum from corner to corner
Semi-global alignment: find the largest sum from side to side
Local alignment: find the largest sum in any diagonal
The most common tool for local alignment is BLAST
Basic Local Alignment Search Tool
BLAST is not Global Alignment
If we pair two letters \(x\) and \(y\), we need to know how they contribute to the score value. For example,
-
(a gap), the score reduces by 2In general, we have a substitution matrix \(C_{x, y}\) for each \(x\) and \(y\)
In one equation
\[C_{x, y} = \begin{cases} +1 & \quad x=y\\ -1 & \quad x≠y\\ -2 & \quad x=\texttt{"-"} \\ -2 & \quad y=\texttt{"-"} \end{cases}\]
The scoring matrix (also known as a substitution matrix) has one row and one column for each letter in our alphabet
For example 4 rows and 4 columns for DNA sequences
In DNA it is common to have +1 for match and -2 for mismatch
Sometimes we can use (+1,-3), (+1,-1) or even (4,-5)
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
For proteins the alphabet has 20 letters, so the scoring matrix is 20×20
The scores of these matrices are based on the relative frequency of each mutation in nature
Obviously, we only care about non-lethal mutations
More specifically, we care about the Point Mutations that are Accepted
These are called PAM matrices (Margaret Dayhoff, 1978)
Based on 1572 mutations in 71 families of closely related proteins
Dayhoff only used alignments having at least 85% identity
She assumed that any aligned mismatches were the result of a single mutation event, rather than several at the same location
PAM1 is the scoring matrix for proteins in a “short” period
With some easy mathematics we can calculate PAM2, PAM30, PAM250, etc
They represent mutations in 2, 30, and 250 short periods
A R N D C Q E G H I L K M F P S T W Y V B J Z X *
A 6 -7 -4 -3 -6 -4 -2 -2 -7 -5 -6 -7 -5 -8 -2 0 -1 -13 -8 -2 -3 -6 -3 -1 -17
R -7 8 -6 -10 -8 -2 -9 -9 -2 -5 -8 0 -4 -9 -4 -3 -6 -2 -10 -8 -7 -7 -4 -1 -17
N -4 -6 8 2 -11 -3 -2 -3 0 -5 -7 -1 -9 -9 -6 0 -2 -8 -4 -8 6 -6 -3 -1 -17
D -3 -10 2 8 -14 -2 2 -3 -4 -7 -12 -4 -11 -15 -8 -4 -5 -15 -11 -8 6 -10 1 -1 -17
C -6 -8 -11 -14 10 -14 -14 -9 -7 -6 -15 -14 -13 -13 -8 -3 -8 -15 -4 -6 -12 -9 -14 -1 -17
Q -4 -2 -3 -2 -14 8 1 -7 1 -8 -5 -3 -4 -13 -3 -5 -5 -13 -12 -7 -3 -5 6 -1 -17
E -2 -9 -2 2 -14 1 8 -4 -5 -5 -9 -4 -7 -14 -5 -4 -6 -17 -8 -6 1 -7 6 -1 -17
G -2 -9 -3 -3 -9 -7 -4 6 -9 -11 -10 -7 -8 -9 -6 -2 -6 -15 -14 -5 -3 -10 -5 -1 -17
H -7 -2 0 -4 -7 1 -5 -9 9 -9 -6 -6 -10 -6 -4 -6 -7 -7 -3 -6 -1 -7 -1 -1 -17
I -5 -5 -5 -7 -6 -8 -5 -11 -9 8 -1 -6 -1 -2 -8 -7 -2 -14 -6 2 -6 5 -6 -1 -17
L -6 -8 -7 -12 -15 -5 -9 -10 -6 -1 7 -8 1 -3 -7 -8 -7 -6 -7 -2 -9 6 -7 -1 -17
K -7 0 -1 -4 -14 -3 -4 -7 -6 -6 -8 7 -2 -14 -6 -4 -3 -12 -9 -9 -2 -7 -4 -1 -17
M -5 -4 -9 -11 -13 -4 -7 -8 -10 -1 1 -2 11 -4 -8 -5 -4 -13 -11 -1 -10 0 -5 -1 -17
F -8 -9 -9 -15 -13 -13 -14 -9 -6 -2 -3 -14 -4 9 -10 -6 -9 -4 2 -8 -10 -2 -13 -1 -17
P -2 -4 -6 -8 -8 -3 -5 -6 -4 -8 -7 -6 -8 -10 8 -2 -4 -14 -13 -6 -7 -7 -4 -1 -17
S 0 -3 0 -4 -3 -5 -4 -2 -6 -7 -8 -4 -5 -6 -2 6 0 -5 -7 -6 -1 -8 -5 -1 -17
T -1 -6 -2 -5 -8 -5 -6 -6 -7 -2 -7 -3 -4 -9 -4 0 7 -13 -6 -3 -3 -5 -6 -1 -17
W -13 -2 -8 -15 -15 -13 -17 -15 -7 -14 -6 -12 -13 -4 -14 -5 -13 13 -5 -15 -10 -7 -14 -1 -17
Y -8 -10 -4 -11 -4 -12 -8 -14 -3 -6 -7 -9 -11 2 -13 -7 -6 -5 10 -7 -6 -7 -9 -1 -17
V -2 -8 -8 -8 -6 -7 -6 -5 -6 2 -2 -9 -1 -8 -6 -6 -3 -15 -7 7 -8 0 -6 -1 -17
B -3 -7 6 6 -12 -3 1 -3 -1 -6 -9 -2 -10 -10 -7 -1 -3 -10 -6 -8 6 -8 0 -1 -17
J -6 -7 -6 -10 -9 -5 -7 -10 -7 5 6 -7 0 -2 -7 -8 -5 -7 -7 0 -8 6 -6 -1 -17
Z -3 -4 -3 1 -14 6 6 -5 -1 -6 -7 -4 -5 -13 -4 -5 -6 -14 -9 -6 0 -6 6 -1 -17
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -17
* -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 1
Using local alignment we can identify conserved regions
In 1992 Steven Henikoff and Jorja Henikoff created new substitution matrices based on local alignment of blocks
BLOcks SUbstitution Matrix
Idea: each protein domain can evolve at different speeds
A R N D C Q E G H I L K M F P S T W Y V B J Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 -4
B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4
Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
We prefer this alignment
GGGTAACCTACCTC
||| ||||| ||||
GGGCAACCTGCCTC
instead of this other alignment
GGGT-AACCTA-CCTC
||| ||||| ||||
GGG-CAACCT-GCCTC
Thus, gap penalty must be greater than mismatch penalty
We prefer this alignment
TCAAAGAG---GATA
||| ||| ||||
TCA--GAGGGGGATA
instead of this other alignment
TCAAAGA-G-G-ATA
|| | || | | |||
TC-A-GAGGGGGATA
We want few long gaps instead of many short gaps
Gap values must reflect how real insertions and deletions occur in nature
We observe that, once an indel event starts, it can easily grow
If the polymerase jumps, then it can jump a long distance
To represent this, we use affine gaps
So far we considered only linear gaps
The penalty of \(n\) consecutive gaps is \(n\cdot G\)
(\(G\) is the gap penalty)
Now we consider affine gaps, where the first gap is expensive, but the consecutive are cheap
The penalty of \(n\) consecutive gaps is \(I + n\cdot E\)
\(I\) is the initial gap penalty, \(E\) is the gap extension penalty
After we built the matrix, we must go back from the “optimal score” finding which was the path
There may be more than one solution
Some programs build the alignment at the same time they build the matrix, but that requires more memory
GCAT-GCU
G-ATTACA
GCA-TGCU
G-ATTACA
GCATG-CU
G-ATTACA
If mismatches and gaps have different cost, the score will change
Sometimes the optimal alignment changes
Therefore alignments are meaningless without knowing the scoring matrices
Later we will discuss how to choose the “best” scoring matrix for each case
We want big scores
How big is big enough?
We need to make several hypothesis
The most common hypothesis is statistical
A hit is a subject with score over a threshold
Larger score thresholds give less hits
We can estimate the number of hits in a given database, assuming randomness
That is called Expected value
In practice, we choose a small Expected value
(usually called E-value)
Something like 10-5 or 10-20
What we find is not random
and maybe it is biologically meaningful
The formula for E-value depends on
Same alignments in different databases have different E-value
but the same score