Wikipedia
ABRACADABRA
CHUPACABRA
We have two alignments
ABRACADABRA ABRACADABRA
|||| ||||
CHUPACABRA CHUPACABRA
External gaps do not count
“Distance” means that smaller numbers are better
If the distance is 0, then the two sequences are identical
If the distance is small, the two sequences are similar
If the distance is big, the sequences are different
Since external gaps do not count, we can always use them to shrink the aligned region
In this case the smallest distance is always 0
The smallest cost is achieved by aligning a single matching letter
ABRACADABRA
|
CHUPACABRA
We cannot find local alignments by looking for small distances
(it is not a minimization problem)
Instead of finding a minimum, we look for a maximum
The philosophy is the same, but we look for big numbers instead of small numbers
Initially we placed 1 or 0 in each cell, depending on match or mismatch
Now we put positive or negative numbers, depending on single-letter scores
\[D_{i,j} = \text{Score}(q_i,s_j)\]
Once the “dot plot matrix” is complete, it is easy to find the optimal score
Global alignment: find the largest sum from corner to corner
Semi-global alignment: find the largest sum from side to side
Local alignment: find the largest sum in any diagonal
Check Google Sheets
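A minimal sketch of the scored dot plot in Python, assuming a match score of +1 and a mismatch score of -1 (illustrative values, not taken from the slides). The helper names `score_matrix` and `best_diagonal_run` are hypothetical, and "largest sum in any diagonal" is read here as the best contiguous stretch along one diagonal, without gaps.

```python
def score_matrix(q, s, match=1, mismatch=-1):
    """D[i][j] = Score(q_i, s_j); the scores are assumed values."""
    return [[match if a == b else mismatch for b in s] for a in q]


def best_diagonal_run(D):
    """Largest sum of a contiguous stretch along any diagonal of D."""
    rows, cols = len(D), len(D[0])
    best = 0
    for offset in range(-(rows - 1), cols):  # offset = j - i
        i = max(0, -offset)
        j = i + offset
        running = 0
        while i < rows and j < cols:
            # Drop the stretch as soon as its sum becomes negative.
            running = max(0, running + D[i][j])
            best = max(best, running)
            i += 1
            j += 1
    return best


D = score_matrix("ABRACADABRA", "CHUPACABRA")
print(best_diagonal_run(D))  # 4, the shared stretch "ABRA"
```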
We have three options as before, but we never keep negative values: a cell that would be negative is set to 0
\[M_{i,j} = \max\begin{cases} M_{i-1,j-1} + \text{Score}(q_i,s_j)\\ M_{i-1,j} + \text{gap} \\ M_{i,j-1} + \text{gap}\\ 0\end{cases}\]
Notice we use \(\max\) instead of \(\min\)
Check Google Sheets
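The local-alignment recurrence can be sketched in Python as follows. The scores (match +1, mismatch -1, gap -2) and the function name `local_alignment_score` are assumptions chosen for illustration. The diagonal option takes its value from the cell one row up and one column to the left.

```python
def local_alignment_score(q, s, match=1, mismatch=-1, gap=-2):
    """Fill the local-alignment matrix; return (best score, matrix)."""
    rows, cols = len(q) + 1, len(s) + 1
    # Row 0 and column 0 stay at 0: an alignment may start anywhere.
    M = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if q[i - 1] == s[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + score,  # align q_i with s_j
                M[i - 1][j] + gap,        # q_i against a gap
                M[i][j - 1] + gap,        # s_j against a gap
                0,                        # start a new alignment here
            )
            best = max(best, M[i][j])
    return best, M


best, M = local_alignment_score("ABRACADABRA", "CHUPACABRA")
print(best)  # 4
```

The optimal local score is the largest value anywhere in the matrix, not the value in the last corner.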
We prefer this alignment
GGGTAACCTACCTC
||| ||||| ||||
GGGCAACCTGCCTC
instead of this other alignment
GGGT-AACCTA-CCTC
|||  |||||  ||||
GGG-CAACCT-GCCTC
Thus, the gap penalty should be greater than the mismatch penalty (strictly, two gaps must cost more than one mismatch)
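To make the comparison concrete, here is a small Python sketch that scores both alignments column by column. The scores (match +1, mismatch -1, gap -2) are assumed for illustration, and `alignment_score` is a hypothetical helper name.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length, one column at a time."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


with_mismatches = alignment_score("GGGTAACCTACCTC", "GGGCAACCTGCCTC")
with_gaps = alignment_score("GGGT-AACCTA-CCTC", "GGG-CAACCT-GCCTC")
print(with_mismatches, with_gaps)  # 10 4
```

Both alignments have 12 matches. The first pays for 2 mismatches and the second pays for 4 gaps, so the first one wins whenever two gaps cost more than one mismatch.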
We prefer this alignment
TCAAAGAG---GATA
|||  |||   ||||
TCA--GAGGGGGATA
instead of this other alignment
TCAAAGA-G-G-ATA
|| | || | | |||
TC-A-GAGGGGGATA
We want few long gaps instead of many short gaps
Gap values must reflect how real insertions and deletions occur in nature
We observe that, once an indel event starts, it can easily grow
If the polymerase jumps, then it can jump a long distance
To represent this, we use affine gaps
So far we considered only linear gaps
The penalty of \(n\) consecutive gaps is \(n\cdot G\) (\(G\) is the gap penalty)
Now we consider affine gaps, where opening a gap is expensive, but extending it is cheap
The penalty of \(n\) consecutive gaps is \(I + n\cdot E\)
\(I\) is the initial gap penalty, \(E\) is the gap extension penalty
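The difference between linear and affine gaps can be checked on the two alignments of TCAAAGAGGATA and TCAGAGGGGGATA. This is a minimal Python sketch in which penalties are written as positive costs. The values \(G=2\), \(I=4\), \(E=1\) and the helper names are assumptions for illustration.

```python
import re


def gap_runs(aligned_row):
    """Lengths of each run of consecutive '-' in one aligned row."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def linear_penalty(rows, G=2):
    """n consecutive gaps cost n * G."""
    return sum(n * G for row in rows for n in gap_runs(row))


def affine_penalty(rows, I=4, E=1):
    """n consecutive gaps cost I + n * E."""
    return sum(I + n * E for row in rows for n in gap_runs(row))


few_long = ("TCAAAGAG---GATA", "TCA--GAGGGGGATA")
many_short = ("TCAAAGA-G-G-ATA", "TC-A-GAGGGGGATA")

print(linear_penalty(few_long), linear_penalty(many_short))  # 10 10
print(affine_penalty(few_long), affine_penalty(many_short))  # 13 25
```

Both alignments contain five gap symbols, so linear gaps cannot tell them apart. With affine gaps, the alignment with two long gaps is cheaper than the one with five short gaps.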
After building the matrix, we must trace back from the “optimal score” to find the path that produced it
There may be more than one solution
Some programs build the alignment at the same time they build the matrix, but that requires more memory
GCAT-GCU
G-ATTACA
GCA-TGCU
G-ATTACA
GCATG-CU
G-ATTACA
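The traceback step can be sketched in Python as follows, here for a global alignment with linear gaps. The scores (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions. When several moves explain a cell, this sketch prefers the diagonal, so it reports only one of the possibly several optimal alignments.

```python
def global_align(q, s, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned q, aligned s) for one optimal global alignment."""
    n, m = len(q), len(s)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # Traceback: walk from the last corner back to the first one,
    # at each cell picking a move that explains its value.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if q[i - 1] == s[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub:
            top.append(q[i - 1])
            bottom.append(s[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(q[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(s[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("GCATGCU", "GATTACA")
print(score)  # 0
print(top)
print(bottom)
```

To list every optimal alignment, the traceback would have to follow all the moves that explain a cell instead of only the first one.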
If mismatches and gaps have different costs, the score will change
Sometimes the optimal alignment changes too
Therefore alignments are meaningless unless we know the scoring matrix and gap penalties used
Later we will discuss how to choose the “best” scoring matrix for each case
We want big scores
How big is big enough?
We need to make several hypotheses
The most common hypothesis is statistical
A hit is a subject with score over a threshold
Larger score thresholds give fewer hits
We can estimate the number of hits in a given database, assuming randomness
That number is called the Expected value (usually called the E-value)
In practice, we choose a small E-value threshold
Something like \(10^{-5}\) or \(10^{-20}\)
A hit below that threshold is unlikely to be random, so maybe it is biologically meaningful
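The formula for the E-value depends on the score \(S\), the query length \(m\), the database size \(n\), and two constants \(k\) and \(\lambda\) that depend on the scoring system
\[E=kmn\exp(-\lambda S)\]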
The same alignment in different databases has a different E-value but the same score
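A short Python sketch of the E-value formula shows how the database size matters. The values of `k` and `lam` below are placeholders chosen only for illustration (real values depend on the scoring system), and `expected_hits` is a hypothetical function name.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = k * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


small_db = expected_hits(score=80, query_len=300, db_len=1_000_000)
large_db = expected_hits(score=80, query_len=300, db_len=2_000_000)

# Same score, same query, but the database is twice as large.
print(large_db / small_db)
```

Doubling the database doubles the expected number of random hits, while each extra unit of score shrinks it by a factor of \(\exp(-\lambda)\).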