Class 10: Working with multiple sequences

December 4th, 2017

Graphical representation of multiple alignment

A Sequence Logo shows the frequency of each residue and the level of conservation This is the logo for the multiple alignment of a transcription factor binding site. See www.prodoric.de

Alignment shows what is shared among sequences

We align sequences of some class. The alignment can show what characterizes the class.

From the empirical (position specific) frequencies we can estimate the probability \[\Pr(s_k=\text{"A"}\vert s\text{ is a Binding Site})\] that position \(k\) in sequence \(s\) is an ‘A’, given that \(s\) is a binding site

Conservation is seen on probability

Not all probability distributions are equal. If \[\begin{aligned}\Pr(s_k=\text{"A"}\vert s\in BS)&=\Pr(s_k=\text{"C"}\vert s\in BS)\\ &=\Pr(s_k=\text{"G"}\vert s\in BS)\\&=\Pr(s_k=\text{"T"}\vert s\in BS)\end{aligned}\] then the residue is completely random and we do not know anything. On the other side, if \[\Pr(s_k=\text{"A"}\vert s\in BS)=1 \qquad \Pr(s_k\not=\text{"A"}\vert s\in BS)=0\] then we know everything

Conservation level is measured in bits

Some probability distributions give more information. How much?

We measure information in bits

Since DNA alphabet \(\cal A\) has four symbols we have at most 2 bits per position. The formula is

\[2+\sum_{X\in\cal A}\Pr(s_k=X\vert s\in BS)\log_2 \Pr(s_k=X\vert s\in BS)\]

Using this result to look for more sequences

Basic probability theory says that if \(X\) is any given sequence, then the probability that it is a binding site is \[\Pr(BS\vert X)=\frac{\Pr(X\vert BS)\Pr(BS)}{\Pr(X)}\] Under some hypothesis we also have \[\Pr(X\vert BS)=\Pr(s_1=X_1\vert BS)\Pr(s_2=X_2\vert BS)\cdots\Pr(s_n=X_n\vert BS)\] This is called naive bayes model

Scoring

Using logarithms we will have \[\begin{aligned} \log\Pr(BS\vert X)=\log\frac{\Pr(X\vert BS)}{\Pr(X)}+\log\Pr(BS)\\ \log\Pr(BS\vert X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}+\text{Const.} \end{aligned}\] Thus the sequences that most probably are BS will also have a big value for this formula

Position Specific Scoring Matrices

The score of any sequence \(X\) will be \[\text{Score}(X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}\] So the total score is the sum of the score of each column.

We can write the score function as a \(4\times n\) matrix

These are used to find new binding sites

Exercise

Go To http://meme-suite.org/
Use MAST to find all instances of Fur motifs on S.pombe
What are the differences between FIMO, MAST, MCAST and GLAM2Scan?
What are the differences between MEME, DREME, MEME-ChIP, GLAM2 and MoMo?