Andrés Aravena - Department of Molecular Biology and Genetics
December 4, 2018
Last week we saw that the score of an alignment can be rescaled if we multiply the substitution matrix by a constant λ such that
\lambda S_{a,b}=\log\frac{q_{a,b}}{p_a p_b}
The value of λ depends on the base of log()
It is like changing between meters and inches
The score is a sum of logarithms of probabilities. Thus, it has dimensions of information
In practice there are two options for the logarithm, which result in two units for the score: the natural logarithm gives nats, the base-2 logarithm gives bits
NCBI uses a mix: the score is reported in bits (more details later), but λ is calculated for the natural logarithm
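As an illustration, here is a minimal Python sketch of the change of units; the probabilities q_{a,b}, p_a, p_b below are made-up numbers, chosen only to show that the natural-log and base-2 scores differ exactly by a factor of \ln 2.

```python
import math

# Illustration only: the same log-odds ratio expressed in two units.
# q_ab, p_a and p_b are invented numbers.
q_ab = 0.06          # probability of seeing a aligned with b in related sequences
p_a, p_b = 0.2, 0.2  # background probabilities of a and b

score_nats = math.log(q_ab / (p_a * p_b))    # natural logarithm -> nats
score_bits = math.log2(q_ab / (p_a * p_b))   # base-2 logarithm  -> bits

# Both numbers describe the same quantity; they differ by the factor ln(2),
# exactly like converting between meters and inches.
print(score_nats, score_bits, score_nats / math.log(2))
```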
“Recall that mathematics is a tool to meditate, not compute.”
Nassim Nicholas Taleb (born 1960)
Lebanese-American essayist, scholar, statistician, former trader, and risk analyst
We know that for a small numeric sample x=(x_1,\ldots,x_n) the average is \bar{x}=\frac{1}{n}\sum_i x_i=\sum_x x\,\frac{N(x)}{n}=\sum_x x\,p(x) where N(x) is the number of times the value x appears in the sample x
The number p(x)=N(x)/n is the proportion of values x in the sample
When we work with “infinite populations”, i.e. when we use probabilities, we can generalize the idea of average.
In probability, instead of “average” we say expected value, and we write \mathbb E(X)=\sum_x x\Pr(X=x)
As we see, the expected value depends on the probabilities
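A small Python sketch of the two views of the average described above; the sample and the die probabilities are only illustrative.

```python
from collections import Counter

# Illustration only: the average of a sample written in two equivalent forms.
x = [2, 3, 3, 5, 5, 5, 7]
n = len(x)

avg_direct = sum(x) / n                                   # (1/n) * sum_i x_i
counts = Counter(x)                                       # N(x) for each value
avg_weighted = sum(value * count / n for value, count in counts.items())

print(avg_direct, avg_weighted)   # both give the same number

# With an "infinite population" we replace the proportions N(x)/n by
# probabilities Pr(X=x) and call the result the expected value E(X).
probabilities = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}  # fair die
expected_value = sum(value * p for value, p in probabilities.items())
print(expected_value)             # 3.5 for a fair die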
We are always comparing two scenarios, with different probabilities
We want to see which scenario is more plausible, given the sequences
The Turkish word for “dice” is zar
The Spanish word for “random” is azar
Do these two words derive from a common ancestor?
Or is it just a coincidence that they are similar?
To be valid, the matrix S_{a,b} must have at least one positive value, and its expected score under the null hypothesis must be negative: \sum_{a,b} p_a p_b S_{a,b}<0
These conditions guarantee that we can find a good \lambda
The value H= \mathbb E(\lambda S_{a,b}|\mathcal H_1) =\sum q_{a,b}\lambda S_{a,b} =\sum q_{a,b}\ln\frac{q_{a,b}}{p_a p_b} >0 is called Information Gain of S_{a,b}
This is a property that characterizes the matrix S_{a,b}. It can be used to compare different matrices
This is the average score per position in the correct alignment of two related sequences
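As a rough illustration, the following Python sketch computes H for a toy two-letter alphabet; all frequencies are invented, the point is only the formula \sum q_{a,b}\ln\frac{q_{a,b}}{p_a p_b}.

```python
import math

# Toy sketch with a two-letter alphabet (all numbers invented):
# H = sum_ab q_ab * ln(q_ab / (p_a * p_b))
letters = "AB"
p = {"A": 0.6, "B": 0.4}                      # background frequencies
q = {("A", "A"): 0.5, ("A", "B"): 0.1,        # pair frequencies in true
     ("B", "A"): 0.1, ("B", "B"): 0.3}        # alignments (they sum to 1)

H = sum(q[a, b] * math.log(q[a, b] / (p[a] * p[b]))
        for a in letters for b in letters)

print(H)   # positive: the matrix gains information about true alignments
```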
High positive scores are good
But how large is large enough?
For a few applications, such as primer design, we can define a threshold score based on theory
In most cases, we need to do the statistical analysis
If we have two sequences q and s, we can look for local alignments between them
A local alignment A is a pair of subsequences a\subset q and b\subset s, with gaps in some positions
The score of the alignment A is \mathrm{Score}(A)=\sum_i S_{a_i, b_i}
Given a threshold s we say that A is a High-scoring Segment Pair (HSP) if \mathrm{Score}(A)>s
Under the null hypothesis the sequences q and s are random
Therefore the scores of their alignments are random variables
The same happens with the number of high scoring pairs
Let’s call N_s the number of alignments with score greater than s, that is, the cardinality of the set of such alignments: N_s=\vert\{A\text{ alignment}: \mathrm{Score}(A)>s\}\vert
Since N_s is a random variable, we can ask: what is its expected value \mathbb E(N_s)? This number is called the E-value of the score s
We can also ask: what is the probability distribution \Pr(N_s=x)? It is easier to answer the first question (the E-value) and then derive the second from it
The basic idea is to decompose the E-value into two parts: \mathbb E(N_s)=\vert\{\text{positions where an alignment can start}\}\vert\cdot\Pr(\mathrm{Score}(A)>s) To simplify notation, let’s call \mathcal P the set of all positions where an alignment can start: \mathbb E(N_s)=\vert\mathcal P\vert\cdot\Pr(\mathrm{Score}(A)>s)
Remember that, if we draw a matrix with one sequence on the rows and the other in the columns, then alignments are diagonals
As a first approximation, \vert\mathcal P\vert\approx m\,n, where m and n are the lengths of the two sequences
Alignments are diagonals in the matrix
If there is a high-scoring diagonal, then other HSPs cannot start at any of the positions of that diagonal
Therefore, there is a factor K<1 that limits the number of starting points
This number K depends on the scoring matrix S_{a,b}
The formula is very complicated. The computer calculates it for us
With this adjustment, we have \vert \mathcal P\vert= K mn
Therefore the E-value is \mathbb E(N_s)=Kmn\cdot\Pr(\mathrm{Score}(A)>s)
Even before finding the probability, we see that the E-value depends on the constant K and on the lengths m and n of the two sequences
Therefore the E-value is not a property of the sequence.
It depends on the context.
In general the probability distribution of the score, \Pr(\mathrm{Score}(A)=s), is not easy to calculate, but the tail probability \Pr(\mathrm{Score}(A)>s) is easier
First, let’s assume that the score is 1 for a match and -\infty for a mismatch. Each match happens with probability p
To have \mathrm{Score}(A)>s we need at least s independent matches, each one with probability p: \Pr(\mathrm{Score}(A)>s)=p^s=e^{\ln(p)\cdot s} We rewrite it noticing that -\ln(p) is the surprise of each match
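Before generalizing, a quick Monte Carlo sketch of this simplified model: with match probability p, the chance that s given positions all match should be close to p^s (the values of p and s below are arbitrary).

```python
import random

# Quick check of the simplified model: probability that s positions all match.
random.seed(1)
p, s = 0.25, 5          # e.g. DNA letters match by chance with probability 1/4
trials = 200_000

hits = sum(all(random.random() < p for _ in range(s)) for _ in range(trials))

print(hits / trials, p**s)   # the two numbers should be close
```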
Since \lambda S_{a,b} is the relative surprise of each pairing, the sum \lambda\,\mathrm{Score}(A) plays the role of -\ln(p)\cdot s, and we get \Pr(\mathrm{Score}(A)>s)=e^{-\lambda s}
The Karlin-Altschul formula for the E-value of a score s is \mathbb E(N_s) =\vert\mathcal P\vert\cdot\Pr(\mathrm{Score}(A)>s) =Kmn\,e^{-\lambda s} If we define the bit-score as s'=(\lambda s - \ln K)/\ln(2) then the formula can be rewritten as \mathbb E(N_s)=mn\cdot 2^{-s'}
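The following Python sketch evaluates both forms of the formula; the values of \lambda and K are placeholders (in practice BLAST reports them for the chosen scoring matrix), and m, n, s are arbitrary.

```python
import math

# Two equivalent forms of the E-value formula.
# lam and K are placeholder Karlin-Altschul parameters (illustrative only).
lam, K = 0.318, 0.13      # lambda and K for some scoring matrix
m, n = 300, 1_000_000     # query length and database length
s = 60                    # raw alignment score

e_value = K * m * n * math.exp(-lam * s)

bit_score = (lam * s - math.log(K)) / math.log(2)
e_value_from_bits = m * n * 2 ** (-bit_score)

print(e_value, e_value_from_bits)   # the two expressions coincide
```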
The probability of an event depends on our knowledge
Information about an event of Z may change the probabilities of Y
\Pr(Y|Z) may not be the same as \Pr(Y)
That is the general case. We should not expect that two events are a priori independent
The variables Y and Z are independent if knowing any event of Z does not change the probabilities of Y \Pr(Y|Z)=\Pr(Y) By symmetry, knowing events about Y does not change the probabilities for Z \Pr(Z|Y)=\Pr(Z) We can write Y\perp Z
If two experiments Y and Z are performed, we can study the probability of events on Y and events on Z
The probability for both events is then \Pr(Y=a, Z=b) = \Pr(Y=a|Z=b)\cdot\Pr(Z=b) or in short \Pr(Y, Z) = \Pr(Y|Z)\cdot\Pr(Z)
The probability of an event on Y and on Z can be seen in two parts
The joint probability is always \Pr(Y, Z) = \Pr(Y|Z)\cdot\Pr(Z) If Y and Z are independent, then \Pr(Y|Z)=\Pr(Y) Replacing the second equation into the first, we have \Pr(Y, Z) = \Pr(Y)\cdot\Pr(Z)\quad\text{ if }Y\perp Z
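A small sketch with two fair dice: enumerating all 36 outcomes shows that the joint probability of an event on Y and an event on Z factors into the product of the individual probabilities (the two events chosen below are arbitrary).

```python
from itertools import product
from fractions import Fraction

# Two independent fair dice: every pair (y, z) is equally likely.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a test over the outcomes (y, z)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

Y_even = lambda o: o[0] % 2 == 0      # event on the first die
Z_large = lambda o: o[1] >= 5         # event on the second die
both = lambda o: Y_even(o) and Z_large(o)

print(prob(both), prob(Y_even) * prob(Z_large))   # both are 1/6
```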
When two variables are not independent, i.e. \Pr(Y\vert X)\neq\Pr(Y), then they share some information.
Knowing something about X tells us something about Y.
How much does it tell us?
Can we put a value on this information?
I know something that you do not know: the result of a binary fair choice
If I tell you the result, I am giving you some information. We call this one bit of information.
If I toss two coins, now I know twice as much information as before: 2 bits
In general, tossing N coins will give N bits
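A one-line check of this claim: N fair coins have 2^N equally likely outcomes, and revealing the result gives \log_2(2^N)=N bits.

```python
import math

# N fair coins have 2**N equally likely outcomes; revealing the result
# gives log2(2**N) = N bits.
for n_coins in range(1, 6):
    print(n_coins, math.log2(2 ** n_coins))   # prints 1.0, 2.0, 3.0, ...
```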
This time the coin is really biased, and you know it.
It always falls on Heads
I see the result, but you already know.
Therefore now the information is 0 bits
The realization of a random variable X will give us some information
This information depends only on the probability distribution of X
We will represent the information of the random variable X by the function H(X).
Claude Shannon defined the information of a message x as - \log_2 (\Pr(X=x)) Notice that, since \Pr(X=x)\leq 1, this formula is always positive or zero.
The information of a random variable X is the average of all possible messages H(X) = - \sum_X \Pr(X) \log_2 (\Pr(X)) with the convention that 0\cdot\log_2 (0) = 0.
Notice that, since \Pr(X)\leq 1, this formula is always positive or zero.
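A direct transcription of this definition as a small Python function, checked on the two coins discussed above: the fair coin gives 1 bit and the coin that always falls on heads gives 0 bits.

```python
import math

def entropy(distribution):
    """H(X) = -sum_x Pr(X=x) * log2(Pr(X=x)), with the convention 0*log2(0)=0."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 1.0, "tails": 0.0}

print(entropy(fair_coin))     # 1 bit: the fair binary choice seen above
print(entropy(biased_coin))   # 0 bits: the result was already known
```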
Following the definition, we have H(X,Y) = -\sum_X \sum_Y \Pr(X,Y) \log_2(\Pr(X,Y)) If X and Y are independent, then \Pr(X,Y)=\Pr(X)\Pr(Y). So H(X,Y) = -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(X)\Pr(Y))
The logarithm of a product is the sum of the logarithms, so H(X,Y) = -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(X)) -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(Y)) The sums can be reorganized as H(X,Y) = -\sum_Y \Pr(Y) \sum_X \Pr(X) \log_2(\Pr(X)) -\sum_X \Pr(X)\sum_Y \Pr(Y) \log_2(\Pr(Y)) (Just using the associative and distributive laws)
Since we know that \sum_X \Pr(X)=\sum_Y \Pr(Y)=1, we have H(X,Y) = -\sum_X \Pr(X) \log_2(\Pr(X)) - \sum_Y \Pr(Y) \log_2(\Pr(Y)) therefore H(X,Y) = H(X) + H(Y)
It is easy to see that this result can be generalized to several independent random variables.
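A quick numerical check of the additivity result, reusing the same entropy function; the distribution of X is arbitrary and Y is a fair coin.

```python
import math

def entropy(distribution):
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

X = {"a": 0.5, "b": 0.25, "c": 0.25}       # illustrative distribution
Y = {"heads": 0.5, "tails": 0.5}           # a fair coin

# Joint distribution of two independent variables: Pr(X,Y) = Pr(X) * Pr(Y)
joint = {(x, y): px * py for x, px in X.items() for y, py in Y.items()}

print(entropy(joint), entropy(X) + entropy(Y))   # both give 2.5 bits
```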