- DNA sequencing
- Pairwise Alignment
- Genome Assembly
- Multiple Alignment
- Finding Binding Sites
December 17, 2019
A Sequence Logo shows the frequency of each residue and the level of conservation
This is the logo for the multiple alignment of a transcription factor binding site. See www.prodoric.de
We align sequences of some class. The alignment can show what characterizes the class.
From the empirical (position specific) frequencies we can estimate the probability \[\Pr(s_k=\text{"A"}\vert s\text{ is a Binding Site})\] that position \(k\) in sequence \(s\) is an ‘A’, given that \(s\) is a binding site
Not all probability distributions are equal. If \[\begin{aligned}\Pr(s_k=\text{"A"}\vert s\in BS)&=\Pr(s_k=\text{"C"}\vert s\in BS)\\ &=\Pr(s_k=\text{"G"}\vert s\in BS)\\&=\Pr(s_k=\text{"T"}\vert s\in BS)\end{aligned}\] then the residue is completely random and we do not know anything. On the other side, if \[\Pr(s_k=\text{"A"}\vert s\in BS)=1 \qquad \Pr(s_k\not=\text{"A"}\vert s\in BS)=0\] then we know everything
Some probability distributions give more information. How much?
We measure information in bits
Since DNA alphabet \(\cal A\) has four symbols we have at most 2 bits per position. The formula is
\[2+\sum_{X\in\cal A}\Pr(s_k=X\vert s\in BS)\log_2 \Pr(s_k=X\vert s\in BS)\]
Basic probability theory says that if \(X\) is any given sequence, then the probability that it is a binding site is \[\Pr(BS\vert X)=\frac{\Pr(X\vert BS)\Pr(BS)}{\Pr(X)}\] Under some hypothesis we also have \[\Pr(X\vert BS)=\Pr(s_1=X_1\vert BS)\Pr(s_2=X_2\vert BS)\cdots\Pr(s_n=X_n\vert BS)\] This is called naive bayes model
Using logarithms we will have \[\begin{aligned} \log\Pr(BS\vert X)=\log\frac{\Pr(X\vert BS)}{\Pr(X)}+\log\Pr(BS)\\ \log\Pr(BS\vert X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}+\text{Const.} \end{aligned}\] Thus the sequences that most probably are BS will also have a big value for this formula
The score of any sequence \(X\) will be \[\text{Score}(X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}\] So the total score is the sum of the score of each column.
We can write the score function as a \(4\times n\) matrix
These are used to find new binding sites
A Sequence Logo shows the frequency of each residue and the level of conservation
For any column we have two scenarios. Either
Each scenario has an information content, which we measure in bits \[H=-\sum_{n\in\mathcal A} \Pr(n)\log_2 \Pr(n)\] where \(\mathcal A=\{A,C,T,G\}\). For example \(H_\text{random}=2\)
The size of the letter corresponds to the information gain \[I = H_\text{random} - H_\text{motif}\] For example, if I know the (exact) result of tossing a coin and you don’t, then I know one bit.
If I do not know the exact result but I know that head has probability 88.99721%, then I know 0.5 bits.
The information gain in each column is \[H_\text{random} - H_\text{motif} = 2 + \sum_{n\in\mathcal A} \Pr(n)\log_2 \Pr(n)\]
If we know the empirical frequencies \(f_{i}(n)\) of the nucleotide \(n\) on each column \(i,\) we can build a position specific score matrix
The score of nucleotide \(n\) at position \(i\) is proportional to \[\log(f_{i}(n)) - \log(F(n))\] where \(F(n)\) is the background frequency of nucleotide \(n\).
But finding the empirical frequencies requires a multiple alignment
It is easy to find motifs in a multiple alignment
But we have seen that multiple alignment can be expensive
When we are looking for short patterns (a.k.a motifs) we can use other strategies
Nucleic Acids Research, Volume 34, Issue 11, 1 January 2006, Pages 3267–3278, https://doi.org/10.1093/nar/gkl429
Given \(N\) sequences of length \(L\): \(seq_1, seq_2, ..., seq_N\), and a fixed motif width \(k,\) we want
Testing all sets of \(k\)-mers is computationally intensive, so several heuristics have been proposed
Given \(N\) sequences of length \(L\) and desired motif width \(k\):
Choose a position \(a_i\) in each sequence at random
Choose a sequence at random, let’s say \(seq_1.\) Make a weight matrix model from the sites in all sequences except \(seq_1\)
Assign a probability to each position in using the weight matrix
Choose a new position in \(seq_1\) based on this probability
Build a new matrix with the new sequence
Choose randomly a new sequence from the set and repeat
We repeat the steps until the matrix no longer changes
Alternative: we stop when the positions \(a_i\) do not change
The methods probably gives us a over-represented motif
Let’s say you get a complex sample:
You extract total DNA, you sequence all.
You got millions of reads
What is on these reads?
Now you ask:
Most of the reads do not align to any reference genome
(most of Prokaryotes have not been isolated)
Given a DNA read \(r\) we want to find which species \(S\) is the most probable origin of the read
We want to find an \(S\) that maximizes \(\Pr(\text{species is }S\vert \text{we saw }r)\)
Thus, the question is how to evaluate this probability. Some approaches:
Some properties of DNA composition tend to be conserved through evolution
For example, two phylogenetically close species usually have similar GC content
Generalizing the idea, we can consider the relative abundance of
An oligomer of size \(k\) is called \(k\)-mer
Given a sequence \(s\) of length \(L\), we characterize it by a vector counting all subwords of size \(k\)
Given a DNA “word” \(w\in {\cal A}^k\) of length \(k\), we define
Remember that \(\mathcal A = \{A,C,G,T\}\) and that \([Q]=1\) iff \(Q\) is true.
The classification must be invariant to reverse complement
The representation should not change if we use either strand
Given a read \(r\) we want to identify its origin in a set \(\{C_j\}\) of taxas (species, genus, etc.)
We want to compare the probabilities \(\Pr(C_i\vert r)\) for each taxa, and assign \(r\) to the taxa with highest probability
Using Bayes’ Theorem we can rewrite the unknown in terms of things that we can evaluate
\[\Pr(C_i\vert r)=\frac{\Pr(r\vert C_i)\Pr(C_i)}{\Pr(r)}\]
Therefore, to maximize \(\Pr(C_i\vert r)\) we need to maximize \(\Pr(r\vert C_i)\)
The read \(r\) can be seen as a series of overlapping \(k\)-mers (oligomers of size \(k\))
We naively assume that they are independent, thus \[\Pr(r\vert S)=\Pr(r_1\ldots r_k\vert S)\Pr(r_2\ldots r_{k+1}\vert S)\cdots\Pr(r_{L-k}\ldots r_L\vert S)\]
Now we need to calculate each \(\Pr(r_{i+1}\ldots r_{i+k}\vert S)\)
We have therefore \[\Pr(C_j\vert r)=\frac{\Pr(r\vert C_j)\Pr(C_j)}{\Pr(r)} =\prod_i{\Pr(r_i\ldots r_{j+k-1}\vert C_j)}\frac{\Pr(C_j)}{\Pr(r)}\]
where \(\Pr(r)\) is independent of the taxa, and \(\Pr(C_j)\) is usually assumed to be uniform. We can define a score using logarithms
\[\log\Pr(C_j\vert r)=\sum_i{\log\Pr(r_i\ldots r_{j+k-1}\vert C_j)}+\log{\Pr(C_j)}-\log{\Pr(r)}\]
If we assume that all \(C_i\) are in the same proportion, then \[\text{Score}(C_i|r)=\sum_i \log f_{(k)}(r_i\ldots r_{j+k-1};C_i)\] Therefore, we just need to calculate, for each \(C_i,\) the frequencies \[f_{(k)}(w;C_i)\] for each \(w\in\cal A^k\)