Colin Wright, juggler,
inventor of the mathematical notation of juggling
In my opinion, all biologists should know something about
The rest depends on each case
(maybe calculus and linear algebra)
A combination of nodes and edges
Nodes are the elements of any set we choose
Edges are pairs of nodes \[E⊂N×N\]
Nodes are also called vertices
At least two groups of people have worked with networks: engineers and mathematicians
They use different words for the same objects
Network: graph
Node: vertex
Link: edge
Arrow: arc
In directed graphs edges have direction
Edge (a, b) is different from edge (b, a)
In undirected graphs the edges have no direction
Edge {a, b} is the same as edge {b, a}
Directed edges are also called arcs
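A minimal Python sketch of this difference, assuming nodes are plain strings: a directed edge can be stored as an ordered tuple and an undirected edge as a `frozenset`. The variable names are illustrative only.

```python
# Directed edges (arcs): ordered pairs, so (a, b) and (b, a) differ.
arc_ab = ("a", "b")
arc_ba = ("b", "a")
arcs_differ = arc_ab != arc_ba

# Undirected edges: unordered pairs, so {a, b} and {b, a} coincide.
edge_ab = frozenset({"a", "b"})
edge_ba = frozenset({"b", "a"})
edges_equal = edge_ab == edge_ba

print(arcs_differ, edges_equal)
```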
The degree of a node is the number of edges connected to it
In other words, it is the number of neighboring nodes
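As an illustration, the degree of every node can be counted directly from a list of undirected edges. This is a sketch that assumes a simple graph (no loops, no repeated edges), so that degree and number of neighbors coincide; the example graph is hypothetical.

```python
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node of an undirected graph."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return dict(count)

# Hypothetical example graph, not taken from the slides.
example_edges = [("a", "c"), ("b", "c"), ("c", "d")]
example_degrees = degrees(example_edges)
print(example_degrees)
```

Here node `c` touches three edges, and every other node touches one.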
If the graph is directed, we can also talk about
Depending on the problem, we may add other properties to the nodes and edges. For instance
Binary trees are connected graphs without cycles
In an unrooted tree all nodes have either degree
The length of each edge \((i,j)\) is \(\text{len}(i,j)\)
We can calculate the distance between any pair of nodes
\[D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is a neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise} \end{cases}\]
The minimum is taken considering all \(j\) neighbors of \(i\)
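The recursive definition can be sketched in Python for a tree. The edge lengths below are an assumption: they are chosen so that the resulting distances agree with the distance matrix that follows. The `came_from` argument is also an addition to the formula; it stops the recursion from walking back along the edge it just used.

```python
import math

# Hypothetical tree given as edge lengths len(i, j).
LENGTHS = {
    ("a", "c"): 11, ("b", "c"): 2, ("c", "d"): 4,
    ("d", "e"): 6, ("d", "f"): 7,
}

def neighbors(i):
    """Return a dict {neighbor: edge length} for node i."""
    result = {}
    for (u, v), length in LENGTHS.items():
        if u == i:
            result[v] = length
        elif v == i:
            result[u] = length
    return result

def distance(i, u, came_from=None):
    """Distance from i to u, following the recursive definition."""
    if i == u:
        return 0
    near = neighbors(i)
    if u in near:
        return near[u]
    best = math.inf
    for j, length in near.items():
        if j != came_from:  # do not go back where we came from
            best = min(best, length + distance(j, u, came_from=i))
    return best

print(distance("a", "f"), distance("e", "f"))
```

This only terminates because a tree has no cycles; a general graph would need a shortest-path algorithm instead.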
| | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 11 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 11 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |
Let’s change only one value
| | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 9 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 9 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |
It is still a symmetric matrix with a zero diagonal, but it now breaks the triangle inequality, so it cannot be drawn nicely as a tree
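One way to locate the problem is to search the changed matrix for triples where a detour through a third node is shorter than the direct distance. That cannot happen in a tree with non-negative edge lengths. The sketch below is an illustration, not part of the original slides; the function name is hypothetical.

```python
from itertools import permutations

NODES = "abcdef"
# The second matrix from the text, with D(a, c) changed from 11 to 9.
ROWS = [
    [0, 13, 9, 15, 21, 22],
    [13, 0, 2, 6, 12, 13],
    [9, 2, 0, 4, 10, 11],
    [15, 6, 4, 0, 6, 7],
    [21, 12, 10, 6, 0, 13],
    [22, 13, 11, 7, 13, 0],
]
D = {(p, q): ROWS[r][c]
     for r, p in enumerate(NODES)
     for c, q in enumerate(NODES)}

def triangle_violations(dist, nodes):
    """List triples (i, k, j) where going through k beats the direct distance."""
    return [
        (i, k, j)
        for i, k, j in permutations(nodes, 3)
        if dist[i, j] > dist[i, k] + dist[k, j]
    ]

violations = triangle_violations(D, NODES)
print(violations)
```

Every triple it reports passes through `c` on the way to or from `a`. For instance, \(D(a,b)=13\) is larger than \(D(a,c)+D(c,b)=9+2=11\).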
Let \(a\) and \(b\) be two siblings in a nice tree, let \(c\) be their parent, and let \(e\) be any other leaf
\[\begin{aligned} D(a,b) &= D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) &= D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) &= D(b,c) + D(c,e)\qquad(\text{eq. }3) \end{aligned}\]
\[D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}\]
So if we only know the distances between leaves \(a, b\) and \(e,\) and we add internal node \(c,\) this is how we find the distance \(D(c,e)\)
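The formula follows from adding eq. 2 and eq. 3 and subtracting eq. 1: the terms \(D(a,c)\) and \(D(b,c)\) cancel, leaving \(2\,D(c,e)\). A small numeric check, using values read from the first distance matrix in the text (the function name is illustrative):

```python
def distance_to_parent_of_siblings(d_ae, d_be, d_ab):
    """Return D(c, e), where c is the parent of siblings a and b."""
    return (d_ae + d_be - d_ab) / 2

# D(a, e) = 21, D(b, e) = 12, D(a, b) = 13 in the first matrix.
d_ce = distance_to_parent_of_siblings(d_ae=21, d_be=12, d_ab=13)
print(d_ce)  # 10.0, matching the entry D(c, e) = 10 in that matrix
```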
Neighbor Joining is trying to make a nice tree
The set of all possible outcomes is often called Ω
An event 𝐴 can be seen as the set of all outcomes that make the event true
For example,
Fever={Temperature>37.5°C}
An event will become either true or false after an experiment
For example, a die either shows a 4 or it does not
We want to give a value to our rational belief that the event will become true after the experiment
The numeric value is called Probability
It is useful to think that the probability of an event is the area in the drawing
The total area of Ω is 1
Usually we do not know the shape of 𝐴
Our rational beliefs depend on our knowledge
If we represent our knowledge (or hypothesis) by 𝑍, then the probability of an event 𝐴 is written as \[ℙ(A|Z)\] We read “the probability of event 𝐴, given that we know 𝑍”
For example, “the probability that we get a 4, given that the dice is symmetrical”
The order matters: in general \[ℙ(A|Z)≠ℙ(Z|A)\] There are two events, 𝐴 and 𝑍
The one written after | is what we assume to be true
The one written before | is what we are asking for
One we know, the other we do not
Now outcomes are limited only to the 𝑍 region
\(ℙ(A|Z)\) is the area of the part of 𝐴 that lies inside 𝑍, measured with respect to the area of 𝑍 instead of Ω
The shape of 𝑍 is often unknown
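When the outcomes are few and equally likely, the "area" picture reduces to counting. The sketch below assumes one roll of a symmetrical die and uses a hypothetical piece of knowledge 𝑍 ("the result is even") that does not appear in the slides. The helper `prob` is illustrative only.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # outcomes of one roll of a symmetrical die

def prob(event, given=OMEGA):
    """P(event | given) as a ratio of counts of equally likely outcomes."""
    return Fraction(len(event & given), len(given))

A = frozenset({4})          # "we get a 4"
Z = frozenset({2, 4, 6})    # hypothetical knowledge: "the result is even"

p_a = prob(A)               # 1/6, measured against all of OMEGA
p_a_given_z = prob(A, Z)    # 1/3, measured against Z only
p_z_given_a = prob(Z, A)    # 1, so the order of the two events matters
print(p_a, p_a_given_z, p_z_given_a)
```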
If, given our knowledge 𝑍, the event 𝐵 is more plausible than the event 𝐴, then \[ℙ(A|Z)≤ℙ(B|Z)\]
For example, the probability that we get either 4, 5 or 6 is greater than the probability that we get a 4, given that the dice is symmetrical \[ℙ(\{4\}|Z)≤ℙ(\{4,5,6\}|Z)\]
On the other hand, if we get new information, the probabilities may change
The same event 𝐴 may be more plausible under a new hypothesis 𝑌 than under the initial hypothesis 𝑍
Then \[ℙ(A|Z)≤ℙ(A|Y)\]
It has been proven that probabilities must be like this
A probability is a number between 0 and 1 inclusive \[ℙ(A) ≥ 0\textrm{ and } ℙ(A)≤1\]
The probability of a sure event is 1 \[ℙ(\textrm{True}) = 1\]
The probability of an impossible event is 0 \[ℙ(\textrm{False}) = 0\]
We are interested in non-trivial events, which are usually combinations of smaller events
For example, we may ask “what is the probability that, in a group of 𝑛 people, at least two people have the same birthday?”
Fortunately, any complex event can be decomposed into simpler events, combined with and, or and not connectors
Exercise: decompose the birthday event into simpler ones
If the event 𝐴 becomes more and more plausible, then the opposite event not 𝐴 becomes less and less plausible
It can be shown that we always have \[ℙ(\textrm{not } A) = 1-ℙ(A)\]
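The birthday question is a natural place to use this rule, because "at least two people share a birthday" is the negation of "all birthdays are different". The sketch below shows one possible decomposition. It assumes 365 equally likely birthdays and that people's birthdays do not influence each other; it also multiplies probabilities step by step, a rule the text introduces next.

```python
def prob_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), via the 'not' rule."""
    p_all_different = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different

p_23 = prob_shared_birthday(23)
print(round(p_23, 3))
```

Under these assumptions, a group of 23 people is already enough for the probability to exceed one half.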
The probability of 𝐴 and 𝐵 happening simultaneously must be connected to the probability of each one
It can be shown that there are only two ways to calculate it \[ℙ(A,B)=ℙ(A)⋅ℙ(B|A)\] \[ℙ(A,B)=ℙ(B)⋅ℙ(A|B)\]
It can be proven that the only way to combine \(ℙ(A)\) and \(ℙ(B|A)\) to get \(ℙ(A,B)\) is to multiply them.
Both are true, since \(ℙ(A,B)=ℙ(B,A).\) The order that we write them is irrelevant.
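A quick check that both orders give the same joint probability, assuming one roll of a symmetrical die. The events 𝐴 and 𝐵 below are hypothetical examples, and `prob` is an illustrative helper based on counting equally likely outcomes.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # one roll of a symmetrical die

def prob(event, given=OMEGA):
    """P(event | given) as a ratio of counts of equally likely outcomes."""
    return Fraction(len(event & given), len(given))

A = frozenset({4, 5, 6})   # hypothetical event: "the result is at least 4"
B = frozenset({2, 4, 6})   # hypothetical event: "the result is even"

joint = prob(A & B)              # P(A, B) computed directly
via_a = prob(A) * prob(B, A)     # P(A) * P(B | A)
via_b = prob(B) * prob(A, B)     # P(B) * P(A | B)
print(joint, via_a, via_b)
```

All three values are 1/3, since only the outcomes 4 and 6 satisfy both events.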
As part of the strategy to control COVID-19, many governments carry out random sampling of the population looking for asymptomatic cases.
Imagine that you are randomly chosen for a test of COVID-19. The test result is “positive”, that is, it says that you have the virus. You also know that the test sometimes fails, giving either a false positive or a false negative. The question is: what is the probability that you have COVID-19, given that the test said “positive”?
Let’s assume that:
Since this context will be the same in all cases, we will not write it explicitly
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | . | . | . |
| COVID+ | . | . | . |
| Total | . | . | . |
We show COVID reality in the rows and test results in the columns
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | . | . | . |
| COVID+ | . | . | . |
| Total | . | . | 100000 |
We will fill this matrix in the following slides
A large population size helps us to see small values
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | . | . | 99900 |
| COVID+ | . | . | 100 |
| Total | . | . | 100000 |
Prevalence is the percentage of the population that has COVID.
In other words, it is the probability of (COVID+) \[
\begin{aligned}
ℙ(\text{COVID+}) & =0.1\% = 0.001\\
ℙ(\text{COVID-}) & =99.9\%=0.999
\end{aligned}
\]
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | . | . | 99900 |
| COVID+ | . | 99 | 100 |
| Total | . | . | 100000 |
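Precision is the probability of a correct diagnosis for a person who has COVID (this quantity is more commonly called sensitivity) \[ℙ(\text{test+} \vert \text{COVID+})=0.99\] We fill the box corresponding to (test+,COVID+) \[ℙ(\text{test+}, \text{COVID+})=ℙ(\text{test+} \vert \text{COVID+})\cdot ℙ(\text{COVID+})=0.99\cdot 0.001=0.00099\] That is 99 people out of 100000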
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | 98901 | . | 99900 |
| COVID+ | . | 99 | 100 |
| Total | . | . | 100000 |
In this case the precision for negative cases (more commonly called specificity) is the same \[ℙ(\text{test-} | \text{COVID-})=0.99\] We fill the box corresponding to (test-,COVID-) \[ℙ(\text{test-}, \text{COVID-})=ℙ(\text{test-} | \text{COVID-})⋅ℙ(\text{COVID-})=0.99⋅0.999=0.98901\] That is 98901 people out of 100000
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | 98901 | 999 | 99900 |
| COVID+ | 1 | 99 | 100 |
| Total | . | . | 100000 |
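A misdiagnosis is the negation of a correct diagnosis \[ℙ(\text{test-} | \text{COVID+})=1-ℙ(\text{test+} | \text{COVID+})=0.01\] We combine them in the same way as before \[ℙ(\text{test-}, \text{COVID+})=ℙ(\text{test-} | \text{COVID+})⋅ ℙ(\text{COVID+})\] That is 1 person out of 100000. The same reasoning with \(ℙ(\text{test+} | \text{COVID-})=0.01\) gives the 999 false positives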
| | Test- | Test+ | Total |
|---|---|---|---|
| COVID- | 98901 | 999 | 99900 |
| COVID+ | 1 | 99 | 100 |
| Total | 98902 | 1098 | 100000 |
We sum each column to fill the remaining boxes
1098 people got a positive test, but only 99 of them have COVID \[ℙ(\text{COVID+} | \text{test+})=\frac{99}{1098} = 9.02\%\]
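The whole table can be rebuilt in a few lines, which makes it easy to try other prevalence or precision values. This is a sketch using the numbers from the slides (population 100000, prevalence 0.1%, 99% correct results in both groups); the function and parameter names are assumptions made for the example.

```python
def covid_table(population=100_000, prevalence=0.001,
                p_pos_given_covid=0.99, p_neg_given_healthy=0.99):
    """Build the 2x2 table of expected counts and P(COVID+ | test+)."""
    covid_pos = population * prevalence
    covid_neg = population - covid_pos
    true_pos = covid_pos * p_pos_given_covid
    false_neg = covid_pos - true_pos
    true_neg = covid_neg * p_neg_given_healthy
    false_pos = covid_neg - true_neg
    table = {
        ("COVID-", "Test-"): round(true_neg),
        ("COVID-", "Test+"): round(false_pos),
        ("COVID+", "Test-"): round(false_neg),
        ("COVID+", "Test+"): round(true_pos),
    }
    # Share of all positive tests that come from people who have COVID.
    p_covid_given_pos = true_pos / (true_pos + false_pos)
    return table, p_covid_given_pos

table, p_covid_given_pos = covid_table()
print(table)
print(f"{p_covid_given_pos:.2%}")
```

With these inputs, most positive results come from the much larger COVID- group, which is why the final probability is so low even with a 99% reliable test.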