A Sequence Logo shows the frequency of each residue and the level of conservation This is the logo for the multiple alignment of a transcription factor binding site. See www.prodoric.de
December 4th, 2017
A Sequence Logo shows the frequency of each residue and the level of conservation This is the logo for the multiple alignment of a transcription factor binding site. See www.prodoric.de
We align sequences of some class. The alignment can show what characterizes the class.
From the empirical (position specific) frequencies we can estimate the probability \[\Pr(s_k=\text{"A"}\vert s\text{ is a Binding Site})\] that position \(k\) in sequence \(s\) is an ‘A’, given that \(s\) is a binding site
Not all probability distributions are equal. If \[\begin{aligned}\Pr(s_k=\text{"A"}\vert s\in BS)&=\Pr(s_k=\text{"C"}\vert s\in BS)\\ &=\Pr(s_k=\text{"G"}\vert s\in BS)\\&=\Pr(s_k=\text{"T"}\vert s\in BS)\end{aligned}\] then the residue is completely random and we do not know anything. On the other side, if \[\Pr(s_k=\text{"A"}\vert s\in BS)=1 \qquad \Pr(s_k\not=\text{"A"}\vert s\in BS)=0\] then we know everything
Some probability distributions give more information. How much?
We measure information in bits
Since DNA alphabet \(\cal A\) has four symbols we have at most 2 bits per position. The formula is
\[2+\sum_{X\in\cal A}\Pr(s_k=X\vert s\in BS)\log_2 \Pr(s_k=X\vert s\in BS)\]
Basic probability theory says that if \(X\) is any given sequence, then the probability that it is a binding site is \[\Pr(BS\vert X)=\frac{\Pr(X\vert BS)\Pr(BS)}{\Pr(X)}\] Under some hypothesis we also have \[\Pr(X\vert BS)=\Pr(s_1=X_1\vert BS)\Pr(s_2=X_2\vert BS)\cdots\Pr(s_n=X_n\vert BS)\] This is called naive bayes model
Using logarithms we will have \[\begin{aligned} \log\Pr(BS\vert X)=\log\frac{\Pr(X\vert BS)}{\Pr(X)}+\log\Pr(BS)\\ \log\Pr(BS\vert X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}+\text{Const.} \end{aligned}\] Thus the sequences that most probably are BS will also have a big value for this formula
The score of any sequence \(X\) will be \[\text{Score}(X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}\] So the total score is the sum of the score of each column.
We can write the score function as a \(4\times n\) matrix
These are used to find new binding sites