Class 10: Multiple Sequence Analysis

Bioinformatics

Andrés Aravena

November 1st, 2021

Multiple Sequence alignment

Source: Uçarlı, Cüneyt, Liam J. McGuffin, Süleyman Çaputlu, Andres Aravena, and Filiz Gürel. “Genetic Diversity at the Dhn3 Locus in Turkish Hordeum Spontaneum Populations with Comparative Structural Analyses.” Scientific Reports (2016) https://doi.org/10.1038/srep20966.

Why Multiple Sequence Alignment?

To study closely related genes or proteins
- to find conserved domains
To find the evolutionary relationships
- the base of phylogenetic trees
To identify shared patterns among related genes
- conserved sequences can be binding sites

In the previous chapter…

We discussed pairwise alignment

Global, using Needleman & Wunsch algorithm
- also for semi-global alignment
Local, using Smith & Waterman algorithm

Both methods build a dot-plot matrix

We build a matrix with \(m_1\) rows and \(m_2\) columns

We write the sequence \(s_1\) in the rows,
and the sequence \(s_2\) in the columns

The computational cost is \(O(m_1 m_2)\)

(\(m_1\) is the length of \(s_1\), \(m_2\) is the length of \(s_2\))

Filling the dot-plot matrix

To find the optimal alignment, we look for diagonals in the matrix that maximize the total score

Every cell \(M_{ij}\) in the matrix has initially the value \[M_{ij}=\text{Score}_2(s_{1}[i],s_{2}[j])\] where \(s_{1}[i]\) is the letter in position \(i\) of sequence \(s_1\),
and \(s_{2}[j]\) is the letter in position \(j\) of \(s_2\)

Pairwise to three-wise alignment

To aligning two sequences, we build a dot-plot matrix.
That is, a rectangle.

To align three sequences, we need a three-dimensional array.
That is, a cube.

Each cell \(M_{ijk}\) has value \[M_{ijk}=\text{Score}_3(s_{1}[i],s_{2}[j], s_{3}[k])\]

Then we find the diagonals

Usually, external gaps do not count, but internal gaps count

That is, these are semi-global alignments

Any path from a border to another border will be an alignment

We look for the optimal alignment

Cost of three-wise alignment

If the three sequences have length \(m_1, m_2,\) and \(m_3,\) then building the cube has cost \[O(m_1\cdot m_2\cdot m_3)\]

Always look for the big picture

To simplify, we assume that all sequences have length \(m\)

Then the cost of three-wise alignment is \[O(m^3)\]

Multiple alignment

Following the same idea…

To align \(N\) sequences, we need a dot-plot in \(N\) dimensions

\[M_{i_1,\ldots,i_N}=\text{Score}_N(s_{1}[i_1],s_{2}[i_2],…,s_{N}[i_N])\]

Therefore, if the average sequence length is \(m,\) then the cost is \[O(m^N)\]

How much is that?

To fix ideas, assume that \(m=1000\)

(That is a typical size for a bacterial gene)

The computational cost is \(O(1000^N)\)

In other words, the cost is \(O(10^{3N})\)

How many seconds

Now assume that the computer can do one million comparisons each second

The number of seconds is then \[O(10^{3N-6})\]

Exercise: How many seconds will it take for 2, 4, 8, and 12 sequences?

How much is that?

Under these hypothesis we have this table

\(N\)	Seconds	In words
2	\(10^0\)	1 sec
4	\(10^6\)	1 million seconds
8	\(10^{18}\)	1 trillion/quintillion seconds
12	\(10^{30}\)	a lot of time

See https://en.wikipedia.org/wiki/Trillion

Exercise 1

Translate these numbers to days, years, etc.

(Approximate answer are OK. We only need one significant figure)

Exercise 2

How do these numbers change if \(m\) changes?

Exercise 3

What happens if the computers are 1000 times faster?

Exercise 4

What is the largest multiple alignment that you can do in your life?

Exercise 5

What is the largest number of sequences that can be aligned?

Exercise 6

What can we do to align more sequences?

Heuristic

This is clearly too expensive, so we need heuristics

(i.e. solving a similar but simpler problem)

One common idea is to do a progressive alignment

start with a pairwise alignment,
and then align the rest one by one

There are too many heuristics

There are several ways to simplify the original problem

Thus, there are many approximate solutions

The main differences are:

How to decide what to align first?
Can new sequences change the previous alignment?
- Can \(s_k\) change the alignment of \((s_1,…,s_{k-1})\)?
How to use additional information (like 3D structure)?
What is the formula for \(\text{Score}_k(s_{1}[i_1],…,s_{k}[i_k])\)?

Some common Multiple-Alignment tools

Clustal
- ClustalW, ClustalX, Clustal Omega
T-Coffee
- 3D-Coffee
MUSCLE
MAFFT