Broadly speaking, there are three kinds of proteins
Transcription Factors (TF) are proteins that can bind to DNA
The binding sites of a TF are the places in the genome where the TF binds
They are short regions (about 20bp long)
One protein can bind in several sites
We want to see what do these site have in common
Experiments can give us the sequence where a TF binds
For example, the E.coli protein Fnr binds in sites like this:
TTGATCGTTATCAA
TTGATTTACATCAA
ATGATAAATATCAA
TTGAGGTAGGTCAA
TTGATTTGGATCAC
TTGATCTCACCCGG
TTGATTAACATCAA
See http://http://www.prodoric.de/matrix.php?matrix_acc=MX000004
There are tens of binding sites for each TF
We can align them to see what is conserved
We can summarize all of them counting each nucleotide on every column
That is, calculating the frequency of each nucleotide
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 3 | 0 | 1 | 24 | 2 | 9 | 6 | 15 | 7 | 23 | 5 | 2 | 27 | 25 |
C | 1 | 0 | 1 | 3 | 0 | 7 | 4 | 5 | 8 | 4 | 3 | 26 | 1 | 4 |
G | 1 | 2 | 30 | 2 | 4 | 3 | 3 | 3 | 5 | 4 | 1 | 3 | 1 | 2 |
T | 27 | 30 | 0 | 3 | 26 | 13 | 19 | 9 | 12 | 1 | 23 | 1 | 3 | 1 |
Let’s give a name to this matrix. It is \(F.\) \[\begin{aligned} F(“A”, 1) & =3\\ F(“T”, 1) & =27\\ F(“C”, 6) & =7 \end{aligned} \]
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 3 | 0 | 1 | 24 | 2 | 9 | 6 | 15 | 7 | 23 | 5 | 2 | 27 | 25 |
C | 1 | 0 | 1 | 3 | 0 | 7 | 4 | 5 | 8 | 4 | 3 | 26 | 1 | 4 |
G | 1 | 2 | 30 | 2 | 4 | 3 | 3 | 3 | 5 | 4 | 1 | 3 | 1 | 2 |
T | 27 | 30 | 0 | 3 | 26 | 13 | 19 | 9 | 12 | 1 | 23 | 1 | 3 | 1 |
This matrix was built with 32 sequences
There are no gaps in the alignment
The sum of each column is 32. That is, \[\forall j\qquad \sum_{X\in \{A,C,G,T\}} F(X, j)=32\]
Other TF may have more or less sequences
To compare equal to equal, we divide everything by 32
This matrix was built with 32 sequences
There are no gaps in the alignment
The sum of each column is 32. That is, for each \(j\) we have \[F(“A”,j)+ F(“C”,j)+ F(“G”,j)+ F(“T”,j)=32\]
Other TF may have more or less sequences
To compare equal to equal, we divide everything by 32
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 0.09 | 0.00 | 0.03 | 0.75 | 0.06 | 0.28 | 0.19 | 0.47 | 0.22 | 0.72 | 0.16 | 0.06 | 0.84 | 0.78 |
C | 0.03 | 0.00 | 0.03 | 0.09 | 0.00 | 0.22 | 0.12 | 0.16 | 0.25 | 0.12 | 0.09 | 0.81 | 0.03 | 0.12 |
G | 0.03 | 0.06 | 0.94 | 0.06 | 0.12 | 0.09 | 0.09 | 0.09 | 0.16 | 0.12 | 0.03 | 0.09 | 0.03 | 0.06 |
T | 0.84 | 0.94 | 0.00 | 0.09 | 0.81 | 0.41 | 0.59 | 0.28 | 0.38 | 0.03 | 0.72 | 0.03 | 0.09 | 0.03 |
Let’s call this matrix with the name \(f.\) We have \[\forall X, \forall j\qquad f(X, j)=\frac{F(X, j)}{\sum_Y F(Y, j)}\]
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 0.09 | 0.00 | 0.03 | 0.75 | 0.06 | 0.28 | 0.19 | 0.47 | 0.22 | 0.72 | 0.16 | 0.06 | 0.84 | 0.78 |
C | 0.03 | 0.00 | 0.03 | 0.09 | 0.00 | 0.22 | 0.12 | 0.16 | 0.25 | 0.12 | 0.09 | 0.81 | 0.03 | 0.12 |
G | 0.03 | 0.06 | 0.94 | 0.06 | 0.12 | 0.09 | 0.09 | 0.09 | 0.16 | 0.12 | 0.03 | 0.09 | 0.03 | 0.06 |
T | 0.84 | 0.94 | 0.00 | 0.09 | 0.81 | 0.41 | 0.59 | 0.28 | 0.38 | 0.03 | 0.72 | 0.03 | 0.09 | 0.03 |
Now the sum of each column is 1, that is \[\forall j\qquad \sum_{X} f(X, j)=1\] (there may be some rounding error)
Now we build another matrix, called \(M\) \[\forall X, \forall j\qquad M(X, j)=2 + \log_2 f(X, j)\]
The result is this
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | -1.4 | -Inf | -3.0 | 1.6 | -2.0 | 0.17 | -0.42 | 0.91 | -0.19 | 1.5 | -0.68 | -2.0 | 1.8 | 1.6 |
C | -3.0 | -Inf | -3.0 | -1.4 | -Inf | -0.19 | -1.00 | -0.68 | 0.00 | -1.0 | -1.42 | 1.7 | -3.0 | -1.0 |
G | -3.0 | -2.0 | 1.9 | -2.0 | -1.0 | -1.42 | -1.42 | -1.42 | -0.68 | -1.0 | -3.00 | -1.4 | -3.0 | -2.0 |
T | 1.8 | 1.9 | -Inf | -1.4 | 1.7 | 0.70 | 1.25 | 0.17 | 0.58 | -3.0 | 1.52 | -3.0 | -1.4 | -3.0 |
Since \(f(X, j)\) is a relative frequency, it is always less than or equal to 1 \[f(X, j)≤1\] Therefore its logarithm is always negative or zero \[\log_2 f(X, j)≤0\]
A priori, each nucleotide \(X\) has 25% relative frequency, so \[\log_2 f(X, j)=\log_2 0.25 = -2\quad\text{if }X\text{ has 25% frequency}\] thus \[\log_2 f(X, j) + 2 = 0\quad\text{if }X\text{ has 25% frequency}\]
If \(f(X, j)>0.25,\) that is, if the nucleotide is more frequent than expected, then \[\log_2 f(X, j) + 2 > 0\quad\text{if }X\text{ has over 25% frequency}\]
If \(f(X, j)<0.25,\) that is, if the nucleotide is less frequent than expected, then \[\log_2 f(X, j) + 2 < 0\quad\text{if }X\text{ has over 25% frequency}\]
and negative for less frequent nucleotides
We assumed a background frequency of 25% for each nucleotide
This may not be true in some cases
(depends on the GC content)
The formula can be adapted for that case
When the frequency is 0, the score is -Inf
This is too pessimistic
Maybe we have not seen all relevant binding sites
One solution is to change \(f\) a little bit \[\forall X, \forall j\qquad f(X, j)=\frac{F(X, j)+0.1}{\sum_Y (F(Y, j)+0.1)}\]
Doing math with integers is much faster than using decimals
(at least in some old computers)
In some cases people works with the matrix \(M'=\lfloor 10M\rfloor\)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | -15 | -1000 | -30 | 15 | -20 | 1 | -5 | 9 | -2 | 15 | -7 | -20 | 17 | 16 |
C | -30 | -1000 | -30 | -15 | -1000 | -2 | -10 | -7 | 0 | -10 | -15 | 17 | -30 | -10 |
G | -30 | -20 | 19 | -20 | -10 | -15 | -15 | -15 | -7 | -10 | -30 | -15 | -30 | -20 |
T | 17 | 19 | -1000 | -15 | 17 | 7 | 12 | 1 | 5 | -30 | 15 | -30 | -15 | -30 |
It is not 100% exact, but the results are the same almost always
Now we can calculate the score for any subsequence of the same size as the motif
Let’s call \(w\) to this subsequence. It has \(L\) nucleotides \[w=(w_1, …, w_L)\]
Then \[\text{Score}(w)=\sum_{j=1}^L M(w_j, j)\]
In some columns the scores can be very high
Since we know the frequency of each letter, we can calculate the average score for each column
\[\forall j \quad I(j) = \sum_X f(X, j) M(X, j)\]
A Sequence Logo shows the frequency of each residue and the level of conservation
Why some letters are taller than others?
Looking for new Binding Sites
Now we will look in all the genome \(G\)
For each position \(i\) in the genome, we calculate \[\text{Score}(G,i)=\sum_{j=1}^n M(G_{i+j-1}, j)\]
This will give us a vector of scores. Higher values are better
We keep all position with score higher than a threshold
The threshold depends on the statistical significance of the score
If we know the PSSM, it is easy to find new sites matching the motif
We just use the PSSM to give a score to each symbol, and we get a total score.
High scores will be new motif instances
It is easy to prepare PSSM in a multiple alignment
But we have seen that multiple alignment can be expensive
When we are looking for short patterns (a.k.a motifs) we can use other strategies
Nucleic Acids Research, Volume 34, Issue 11, 1 January 2006, Pages 3267–3278, https://doi.org/10.1093/nar/gkl429
Given \(N\) sequences of length \(m\): \(s_1, s_2, ..., s_N\), and a fixed motif width \(k,\) we want
Testing all sets of \(k\)-mers is computationally intensive, so several heuristics have been proposed
Given \(N\) sequences of length \(m\) and desired motif width \(k\):
Step 1: Choose a position in each sequence at random: \[a_1 \in s_1, a_2 \in s_2, ..., a_N \in s_N\]
Step 2: Choose a sequence at random from the set (let’s say, \(s_j\)).
Step 3: Make a PSSM of width \(k\) from the sites in all sequences except \(s_j.\)
Step 4: Assign a score to each position in \(s_j\) using the PSSM constructed in step 3: \[p_1, p_2, p_3, ..., p_{m-k+1}\]
Step 5: Choose a starting position in \(s_j\) (higher score have higher probability) and set \(a_j\) to this new position.
Step 6: Choose a sequence at random from the set (say, \(s_{j'}\)).
Step 7: Make a PSSM of width \(k\) from the sites in all sequences except \(s_{j'}\), but including \(a_j\) in \(s_j\)
Step 8: Repeat until convergence (of positions or motif model)
Choose a position \(a_i\) in each sequence at random
Choose a sequence at random, let’s say \(s_1.\) Make a PSSM from the sites in all sequences except \(s_1\)
Assign a score to each position in using the PSSM
Choose a new position in \(s_1\) based on this score
Build a new PSSM with the new sequence
Choose randomly a new sequence from the set and repeat
We repeat the steps until the PSSM no longer changes
Alternative: we stop when the positions \(a_i\) do not change
The methods probably gives us a over-represented motif