Source: Uçarlı, Cüneyt, Liam J. McGuffin, Süleyman Çaputlu, Andres Aravena, and Filiz Gürel. “Genetic Diversity at the Dhn3 Locus in Turkish Hordeum Spontaneum Populations with Comparative Structural Analyses.” Scientific Reports (2016) https://doi.org/10.1038/srep20966.
We discussed pairwise alignment
We build a matrix with \(m_1\) rows and \(m_2\) columns
We write the sequence \(s_1\) in the rows,
and the sequence \(s_2\) in the columns
The computational cost is \(O(m_1 m_2)\)
(\(m_1\) is the length of \(s_1\), \(m_2\) is the length of \(s_2\))
To find the optimal alignment, we look for diagonals in the matrix that maximize the total score
Every cell \(M_{ij}\) in the matrix initially has the value \[M_{ij}=\text{Score}_2(s_{1}[i],s_{2}[j])\] where \(s_{1}[i]\) is the letter in position \(i\) of sequence \(s_1\),
and \(s_{2}[j]\) is the letter in position \(j\) of \(s_2\)
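As a minimal sketch, the initial matrix could be filled in Python as follows. The function `score2` is a hypothetical stand-in for \(\text{Score}_2\) (here +1 for a match and -1 for a mismatch, which is only an assumption), and the two short sequences are invented for illustration. Note that Python indices start at 0.

```python
def score2(a, b):
    """Hypothetical Score_2: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def initial_matrix(s1, s2):
    """Return the m1 x m2 matrix with M[i][j] = score2(s1[i], s2[j])."""
    return [[score2(a, b) for b in s2] for a in s1]


s1 = "GATTA"  # written in the rows (m1 = 5)
s2 = "GATA"   # written in the columns (m2 = 4)
M = initial_matrix(s1, s2)
for row in M:
    print(row)
```

Filling the matrix visits each of the \(m_1 m_2\) cells once, which is where the \(O(m_1 m_2)\) cost comes from.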
To align two sequences, we build a dot-plot matrix.
That is, a rectangle.
To align three sequences, we need a three-dimensional array.
That is, a cube.
Each cell \(M_{ijk}\) has value \[M_{ijk}=\text{Score}_3(s_{1}[i],s_{2}[j], s_{3}[k])\]
Usually, external gaps do not count, but internal gaps count
That is, these are semi-global alignments
Any path from one border to another border is an alignment
We look for the optimal alignment
If the three sequences have length \(m_1, m_2,\) and \(m_3,\) then building the cube has cost \[O(m_1\cdot m_2\cdot m_3)\]
To simplify, we assume that all sequences have length \(m\)
Then the cost of three-wise alignment is \[O(m^3)\]
Following the same idea…
To align \(N\) sequences, we need a dot-plot in \(N\) dimensions
\[M_{i_1,\ldots,i_N}=\text{Score}_N(s_{1}[i_1],s_{2}[i_2],\ldots,s_{N}[i_N])\]
Therefore, if the average sequence length is \(m,\) then the cost is \[O(m^N)\]
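The growth of the \(N\)-dimensional array can be seen in a small sketch. The function `score_n` is a hypothetical choice for \(\text{Score}_N\) (a sum of +1/-1 over all pairs of letters, which is an assumption and not necessarily the scoring used in practice), and the array is stored as a dictionary keyed by the index tuple \((i_1,\ldots,i_N)\).

```python
from itertools import combinations, product


def score_n(letters):
    """Hypothetical Score_N: sum of +1 (match) / -1 (mismatch) over all pairs."""
    return sum(1 if a == b else -1 for a, b in combinations(letters, 2))


def initial_array(seqs):
    """Map every index tuple (i_1, ..., i_N) to its Score_N value."""
    index_ranges = [range(len(s)) for s in seqs]
    return {
        idx: score_n([s[i] for s, i in zip(seqs, idx)])
        for idx in product(*index_ranges)
    }


seqs = ["GAT", "GAT", "GAT", "GAT"]  # toy example: N = 4 sequences of length m = 3
M = initial_array(seqs)
print(len(M))  # number of cells, equal to m ** N
```

Even for this toy case the array already has \(3^4\) cells; every extra sequence multiplies the number of cells by \(m\).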
To fix ideas, assume that \(m=1000\)
(That is a typical size for a bacterial gene)
The computational cost is \(O(1000^N)\)
In other words, the cost is \(O(10^{3N})\)
Now assume that the computer can do one million comparisons each second
The number of seconds is then \[O(10^{3N-6})\]
Exercise: How many seconds will it take for 2, 4, 8, and 12 sequences?
Under these hypotheses we have this table
| \(N\) | Seconds | In words |
|---|---|---|
| 2 | \(10^0\) | 1 second |
| 4 | \(10^6\) | 1 million seconds |
| 8 | \(10^{18}\) | 1 quintillion seconds (1 trillion in the long scale) |
| 12 | \(10^{30}\) | a lot of time |
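The seconds column of the table can be reproduced with a short sketch. The function name `alignment_seconds` and its parameters are hypothetical; the defaults are the assumptions made above (\(m=1000\) and one million comparisons per second), and the constant factor hidden in the \(O(\cdot)\) notation is ignored.

```python
def alignment_seconds(n, m=1000, comparisons_per_second=10**6):
    """Estimated seconds to fill an n-dimensional array: m**n / rate."""
    return m ** n / comparisons_per_second


for n in (2, 4, 8, 12):
    print(n, f"{alignment_seconds(n):.0e}")
```

Changing `m` or `comparisons_per_second` in the call gives a quick way to explore other scenarios.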
Translate these numbers to days, years, etc.
(Approximate answers are OK. We only need one significant figure.)
How do these numbers change if \(m\) changes?
What happens if the computers are 1000 times faster?
What is the largest multiple alignment that you can do in your life?
What is the largest number of sequences that can be aligned?
What can we do to align more sequences?
This is clearly too expensive, so we need heuristics
(i.e. solving a similar but simpler problem)
One common idea is to do a progressive alignment
There are several ways to simplify the original problem
Thus, there are many approximate solutions
The main differences are: