This class is based on old material and is not completely updated.
It will be updated later.
My apologies for any confusing material.
May 18, 2018
We throw two dice at the same time. What will be the sum of both numbers?
A ← 🎲,  B ← 🎲,  C ← A + B
What is
\[\Pr(C\,|\, (B>3)\wedge Z)\,?\]
What is
\[\Pr(C\,|\, (B=3)\wedge Z)\,?\]
If \(Z\) does not say anything about \(B\), the probability \(\Pr(C\vert Z)\) is
\(c\) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
\(\Pr(C=c\vert Z)\) (%) | 2.8 | 5.6 | 8.3 | 11 | 14 | 17 | 14 | 11 | 8.3 | 5.6 | 2.8 |
If we know that \(B=3\) then \(\Pr(C\vert B=3\wedge Z)\) is
\(c\) | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|
\(\Pr(C=c\vert B=3\wedge Z)\) (%) | 17 | 17 | 17 | 17 | 17 | 17 |
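Both tables can be checked by enumerating the 36 equally likely outcomes; here is a minimal Python sketch (the tables above round 16.7 to 17):

```python
from collections import Counter

# All 36 equally likely outcomes of the two dice A and B
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

# Pr(C | Z): distribution of the sum C = A + B, in percent
unconditional = Counter(a + b for a, b in outcomes)
for c in sorted(unconditional):
    print(c, round(100 * unconditional[c] / 36, 1))  # 2.8, 5.6, ..., 16.7, ...

# Pr(C | B=3 and Z): restrict the sample space to outcomes where B = 3
conditional = Counter(a + b for a, b in outcomes if b == 3)
for c in sorted(conditional):
    print(c, round(100 * conditional[c] / 6, 1))     # 16.7 for each sum 4..9
```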
A probability associates each event with a number between 0 and 1
In the case of simple events we have \[\Pr(X=x)=\Pr(x)\]
More complex events can be evaluated by decomposing them into simpler ones \[\Pr(X\text{ is purine})=\Pr(X=\text{'A'})+\Pr(X=\text{'G'})\]
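As a tiny illustration of this decomposition, with hypothetical single-nucleotide probabilities (the values below are made up):

```python
# Hypothetical nucleotide probabilities; they must sum to 1
pr = {'A': 0.3, 'C': 0.2, 'T': 0.3, 'G': 0.2}

# Pr(X is purine) = Pr(X='A') + Pr(X='G'), a sum of simple events
pr_purine = pr['A'] + pr['G']
print(pr_purine)  # 0.5
```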
Back to our two dice: the probability of an event depends on our knowledge
Information about an event involving \(C\) may change the probabilities of \(B\)
\(\Pr(B\vert C\wedge Z)\) may not be the same as \(\Pr(B\vert Z)\)
That is the general case; we should not expect two events to be a priori independent
The variables \(B\) and \(C\) are independent if knowing any event of \(C\) does not change the probabilities of \(B\) \[\Pr(B|C\wedge Z)=\Pr(B\vert Z)\] By symmetry, knowing events about \(B\) does not change the probabilities for \(C\) \[\Pr(C|B\wedge Z)=\Pr(C\vert Z)\] We can write \(B\perp C\)
If two experiments \(B\) and \(C\) are performed, we can study the probability of events on \(B\) and events on \(C\)
The probability for both events is then \[\Pr(B=a, C=b\vert Z) = \Pr(B=a|C=b\wedge Z)\cdot\Pr(C=b\vert Z)\] or, in short, \[\Pr(B, C\vert Z) = \Pr(B|C\wedge Z)\cdot\Pr(C\vert Z)\]
The probability of an event on \(B\) and on \(C\) can be seen in two parts
The joint probability is always \[\Pr(B, C\vert Z) = \Pr(B|C\wedge Z)\cdot\Pr(C\vert Z)\] If \(B\) and \(C\) are independent, then \[\Pr(B|C\wedge Z)=\Pr(B\vert Z)\] Replacing the second equation into the first, we have \[\Pr(B, C\vert Z) = \Pr(B\vert Z)\cdot\Pr(C\vert Z)\quad\text{ if }B\perp C\]
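On the dice example we can check all three claims exactly: A and B are independent, B and C = A + B are not, and the product rule holds in either case. A small sketch:

```python
from fractions import Fraction

# The 36 equally likely outcomes of the two dice
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def pr(event):
    """Exact probability of an event, given as a predicate on (a, b)."""
    return Fraction(sum(1 for a, b in outcomes if event(a, b)), len(outcomes))

# A and B are independent: the joint probability factorizes
assert pr(lambda a, b: a == 2 and b == 3) \
    == pr(lambda a, b: a == 2) * pr(lambda a, b: b == 3)

# B and C = A + B are not independent: the joint does not factorize
p_joint = pr(lambda a, b: b == 3 and a + b == 4)                      # 1/36
assert p_joint != pr(lambda a, b: b == 3) * pr(lambda a, b: a + b == 4)

# The product rule always holds: Pr(B=3, C=4) = Pr(C=4 | B=3) * Pr(B=3)
b3 = [(a, b) for a, b in outcomes if b == 3]
p_c_given_b = Fraction(sum(1 for a, b in b3 if a + b == 4), len(b3))  # 1/6
assert p_joint == p_c_given_b * pr(lambda a, b: b == 3)
```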
Imagine we have a test to determine if someone has HIV.
Let’s assume that:

- 1 in 1000 people carries the virus: \(\Pr(\text{HIV}_+)=0.001\)
- the test detects 99% of infections: \(\Pr(\text{test}_+\vert\text{HIV}_+)=0.99\)
- the test is negative for 99% of healthy people: \(\Pr(\text{test}_-\vert\text{HIV}_-)=0.99\)

We fill a contingency table for an imaginary population of 100000 people, one cell at a time.
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | . | . | . |
HIV+ | . | . | . |
Total | . | . | . |
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | . | . | . |
HIV+ | . | . | . |
Total | . | . | 100000 |
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | . | . | 99900 |
HIV+ | . | . | 100 |
Total | . | . | 100000 |
\[\Pr(\text{HIV}_+)=0.001\]
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | . | . | 99900 |
HIV+ | . | 99 | 100 |
Total | . | . | 100000 |
\[\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{test}_+ \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)\] \[\Pr(\text{test}_+ \vert \text{HIV}_+)=0.99\]
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | 98901 | . | 99900 |
HIV+ | . | 99 | 100 |
Total | . | . | 100000 |
\[\Pr(\text{test}_-, \text{HIV}_-)=\Pr(\text{test}_- \vert \text{HIV}_-)\cdot\Pr(\text{HIV}_-)\] \[\Pr(\text{test}_- \vert \text{HIV}_-)=0.99\]
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | 98901 | 999 | 99900 |
HIV+ | 1 | 99 | 100 |
Total | . | . | 100000 |
\[\Pr(\text{test}_-, \text{HIV}_+)=\Pr(\text{test}_- \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)\] \[\Pr(\text{test}_- \vert \text{HIV}_+)=0.01\]
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | 98901 | 999 | 99900 |
HIV+ | 1 | 99 | 100 |
Total | 98902 | 1098 | 100000 |
\[\Pr(\text{test}_+)= \Pr(\text{test}_+, \text{HIV}_+)+ \Pr(\text{test}_+, \text{HIV}_-)\]
What is the probability of being sick given that the test is positive?
  | Test- | Test+ | Total |
---|---|---|---|
HIV- | 98901 | 999 | 99900 |
HIV+ | 1 | 99 | 100 |
Total | 98902 | 1098 | 100000 |
\[\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{HIV}_+ \vert \text{test}_+)\cdot\Pr(\text{test}_+)\] \[\Pr(\text{HIV}_+ \vert \text{test}_+)=\frac{\Pr(\text{test}_+, \text{HIV}_+)}{\Pr(\text{test}_+)}=\frac{99}{1098} \approx 9\%\]
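The whole table, and the surprising 9% result, can be reproduced from the three assumed numbers; a minimal sketch:

```python
N = 100_000            # imaginary population size
prevalence  = 0.001    # Pr(HIV+)
sensitivity = 0.99     # Pr(test+ | HIV+)
specificity = 0.99     # Pr(test- | HIV-)

hiv_pos = N * prevalence             # 100 infected people
hiv_neg = N - hiv_pos                # 99900 healthy people

true_pos  = sensitivity * hiv_pos    # 99    (test+, HIV+)
false_neg = hiv_pos - true_pos       # 1     (test-, HIV+)
true_neg  = specificity * hiv_neg    # 98901 (test-, HIV-)
false_pos = hiv_neg - true_neg       # 999   (test+, HIV-)

test_pos = true_pos + false_pos      # 1098 positive tests in total
print(true_pos / test_pos)           # Pr(HIV+ | test+) ≈ 0.09
```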
“An Essay towards solving a Problem in the Doctrine of Chances” is a work on the mathematical theory of probability by the Reverend Thomas Bayes, published in 1763, two years after its author’s death
Bayes’ theorem is now used throughout science and in many other fields
Since \[\Pr(B, C\vert Z) = \Pr(B|C\wedge Z)\cdot\Pr(C\vert Z)\] and, by symmetry, \[\Pr(B, C\vert Z) = \Pr(C|B\wedge Z)\cdot\Pr(B\vert Z)\] then \[\Pr(B|C\wedge Z) = \frac{\Pr(C|B\wedge Z)\cdot\Pr(B\vert Z)}{\Pr(C\vert Z)}\]
It can be understood as \[\Pr(B|C\wedge Z) = \frac{\Pr(C|B\wedge Z)}{\Pr(C\vert Z)}\cdot\Pr(B\vert Z)\] which is a rule to update our opinions
Bayes’ theorem tells us how to update \(\Pr(B\vert Z)\) when we learn \(C\)
“When the facts change, I change my mind. What do you do, sir?”
John Maynard Keynes (1883 – 1946), English economist, “father” of macroeconomics
Another point of view is \[\Pr(B|C\wedge Z) = \Pr(C|B\wedge Z)\cdot\frac{\Pr(B\vert Z)}{\Pr(C\vert Z)}\] which is a rule to invert the conditional probability
This is the view we will use now
We have two variables: \(B\), which says whether a sequence is a binding site (\(B_+\)) or not, and \(\mathbf{X}=(X_1,\ldots,X_m)\), the random DNA sequence itself
We do an “experiment” and get a short DNA sequence \(\mathbf{x}=(s_1,\ldots,s_m)\)
We want \(\Pr(B_+|\mathbf{X}=\mathbf{x})\)
Applying Bayes’ theorem we have \[\Pr(B_+|\mathbf{X}=\mathbf{x})= \frac{\Pr(\mathbf{X}=\mathbf{x}|B_+)\cdot\Pr(B_+)}{\Pr(\mathbf{X}=\mathbf{x})}\] so we need to find each of the three factors on the right-hand side
We have a matrix \(\mathbf{M}\) with the empirical frequencies of nucleotides in \(n\) sequences
\(\mathbf{M}\) has 4 rows (A, C, T, G) and \(m\) columns
\(M_{ij}=\) number of times nucleotide \(i\) is at position \(j\)
The sum of each column of \(\mathbf{M}\) is \(n\)
We assume that these sequences are outcomes of a probabilistic process
That is, the sequences follow some probability distribution
We don’t know the exact distribution
But we can approximate \[\Pr(X_j=i|B_+)=M_{ij}/n\] for \(i\in\{A,C,T,G\}\)
We also assume that the probabilities of each \(X_j\) are independent
In such case we have \[\Pr(\mathbf{X}=\mathbf{x}|B_+)= \Pr(X_1=s_1|B_+) \cdots \Pr(X_m=s_m|B_+)\] or, in short \[\Pr(\mathbf{X}=\mathbf{x}|B_+)= \prod_{j=1}^m\Pr(X_j=s_j|B_+)\]
Using the same hypothesis of independence, we have \[\Pr(\mathbf{X}=\mathbf{x})= \Pr(X_1=s_1) \cdots \Pr(X_m=s_m)\] or, in short \[\Pr(\mathbf{X}=\mathbf{x})= \prod_{j=1}^m\Pr(X_j=s_j)\] Usually \(\Pr(X_j=i)\) is approximated by the frequency of each nucleotide in the complete genome \[\Pr(X_j=i)=\frac{N_i}{L}\]
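To make the two estimates concrete, here is a minimal sketch with a made-up count matrix \(\mathbf{M}\) (here \(m=3\) positions, \(n=10\) training sequences) and made-up genome counts \(N_i\); none of these numbers come from real data:

```python
# Made-up counts: M[i][j] = times nucleotide i appears at position j
# in the n training sequences (each column sums to n)
M = {'A': [8, 1, 0], 'C': [1, 0, 9], 'T': [1, 2, 1], 'G': [0, 7, 0]}
n = 10

# Made-up genome composition: N[i] occurrences in a genome of length L
N = {'A': 30, 'C': 20, 'T': 30, 'G': 20}
L = sum(N.values())

def pr_given_site(x):
    """Pr(X = x | B+) = prod_j M[s_j][j] / n, assuming independent positions."""
    p = 1.0
    for j, s in enumerate(x):
        p *= M[s][j] / n
    return p

def pr_background(x):
    """Pr(X = x) = prod_j N[s_j] / L, the genome-wide background model."""
    p = 1.0
    for s in x:
        p *= N[s] / L
    return p

x = "AGC"
print(pr_given_site(x))                     # 0.504
print(pr_background(x))                     # 0.012
print(pr_given_site(x) / pr_background(x))  # the ratio entering Bayes' theorem
```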
We got “good” guesses of \(\Pr(\mathbf{X}=\mathbf{x}|B_+)\) and \(\Pr(\mathbf{X}=\mathbf{x})\)
We need \(\Pr(B_+)\)
How do we get it?
There is no easy answer for that
For now, let’s write \(\Pr(B_+)=K\) and come back to it later
Applying Bayes’ theorem we have \[\Pr(B_+|\mathbf{X}=\mathbf{x})=\prod_{j=1}^m \frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)}\cdot K\]
Can it be simpler?
Logarithms turn multiplications into sums
\[\log\Pr(B_+|\mathbf{X}=\mathbf{x})=\sum_{j=1}^m \log\frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)} + \log K\]
For each sequence \(\mathbf{x}\) we calculate the score
\[\mathrm{Score}(\mathbf{x}) =\sum_{j=1}^m Q_{s_j,j} =\sum_{j=1}^m\log\frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)}+Const\] where \(Const=m\log(n/L)\) does not depend on \(\mathbf{x}\), so it does not change the ranking of candidate sites
We prepare a matrix \(\mathbf{Q}\) for each type of binding site \[Q_{i,j}=\log\frac{M_{ij}}{N_i}\]
Write a program (in any computer language) to calculate the score of each position of a genome
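One possible solution sketch, reusing the made-up \(\mathbf{M}\) and \(N_i\) from the previous sketch; it adds a pseudocount of 1 before taking logs to avoid \(\log 0\) on zero counts, a practical detail the formula above omits:

```python
import math

# Made-up inputs (see the previous sketch)
M = {'A': [8, 1, 0], 'C': [1, 0, 9], 'T': [1, 2, 1], 'G': [0, 7, 0]}
N = {'A': 30, 'C': 20, 'T': 30, 'G': 20}
m = 3  # motif length

# Q[i][j] = log(M[i][j] / N[i]); the +1 pseudocount avoids log(0)
Q = {i: [math.log((M[i][j] + 1) / N[i]) for j in range(m)] for i in M}

def score(window):
    """Score(x) = sum_j Q[s_j][j]."""
    return sum(Q[s][j] for j, s in enumerate(window))

genome = "TTACAGCATTAGC"  # toy genome; replace with a real sequence
scores = [(pos, score(genome[pos:pos + m]))
          for pos in range(len(genome) - m + 1)]

best_pos, best_score = max(scores, key=lambda t: t[1])
print(best_pos, best_score)  # position with the highest binding-site score
```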