Class 13: Essential Maths for Bioinformatics

Bioinformatics

Andrés Aravena

November 25, 2022

Math is not sums, calculations, and formulae.
It is pulling things apart to understand how things work

Colin Wright, juggler,
inventor of the mathematical notation of juggling

Essential Math for Biology

In my opinion, all biologists should know something about

  • Set theory
  • Logic
  • Probabilities
  • Graphs (Networks)

The rest depends on each case

(maybe calculus and linear algebra)

Graphs

Graphs

A combination of nodes and edges

Nodes are the elements of any set we choose

Edges are pairs of nodes \[E⊂N×N\]

Nodes are also called vertices

Mathematical Language

At least two groups of people have worked with networks: engineers and mathematicians

They use different words for the same objects

  • Network: graph

  • Node: vertex

  • Link: edge

  • Arrow: arc

There are two kinds of graphs

In directed graphs edges have direction

Edge (a, b) is different from edge (b, a)

In undirected graphs the edges have no direction

Edge {a, b} is the same as edge {b, a}

Directed edges are also called arcs
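A minimal Python sketch of the difference, assuming we store directed edges as ordered pairs (tuples) and undirected edges as unordered pairs (frozensets). The node names are only illustrative.

```python
# Nodes can be any values we choose; here, single letters.
nodes = {"a", "b", "c"}

# Directed graph: each edge is an ordered pair, so direction matters.
directed_edges = {("a", "b"), ("b", "c")}

# Undirected graph: each edge is an unordered pair.
undirected_edges = {frozenset({"a", "b"}), frozenset({"b", "c"})}

print(("a", "b") == ("b", "a"))                        # False
print(frozenset({"a", "b"}) == frozenset({"b", "a"}))  # True
print(("b", "a") in directed_edges)                    # False
print(frozenset({"b", "a"}) in undirected_edges)       # True
```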

Degree of a node

The degree of a node is the number of edges connected to it

In other words, it is the number of neighboring nodes

If the graph is directed, we can also talk about

  • in-degree: Number of arcs coming into a given node
  • out-degree: Number of arcs going out of a given node
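The three degrees can be counted directly from a list of arcs. This is only a sketch on made-up data; the variable names are not from the class.

```python
from collections import Counter

# A small directed graph given as a list of arcs (source, destination).
arcs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

out_degree = Counter(src for src, dst in arcs)
in_degree = Counter(dst for src, dst in arcs)

# Ignoring direction, the degree counts every arc touching the node.
degree = Counter()
for src, dst in arcs:
    degree[src] += 1
    degree[dst] += 1

print(out_degree["a"], in_degree["a"], degree["a"])   # 2 1 3
```

In this example the degree of each node is the sum of its in-degree and out-degree.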

Extra properties

Depending on the problem, we may add other properties to the nodes and edges. For instance

  • Nodes often have names. They may also have color or size
    • for a node \(i\) we write \[\text{name}(i)\quad\text{color}(i)\quad\text{size}(i)\]
  • Edges often have length or weight or cost or capacity
    • for an edge between nodes \(i\) and \(j\) we write \[\text{length}(i,j)\quad\text{weight}(i,j)\quad\text{cost}(i,j)\quad\text{capacity}(i,j)\]
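One simple way to store such properties in Python, assuming one dictionary per property. The gene names and weights below are invented for illustration.

```python
# Node properties: one dictionary per property, keyed by node.
name = {1: "gene A", 2: "gene B", 3: "gene C"}
color = {1: "red", 2: "blue", 3: "red"}

# Edge properties: keyed by the pair of nodes (i, j).
weight = {(1, 2): 0.8, (2, 3): 0.3}

def edge_weight(i, j):
    """Weight of the undirected edge between i and j, in either order."""
    return weight.get((i, j), weight.get((j, i)))

print(name[3])            # gene C
print(edge_weight(2, 1))  # 0.8
print(edge_weight(1, 3))  # None, there is no such edge
```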

We have seen graphs already

Binary trees are connected graphs without cycles

  • The leaves have degree 1
  • Internal nodes have degree 3
  • The root is the only node with degree 2

In an unrooted tree all nodes have either degree 1 or degree 3
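We can check these degrees on a small rooted binary tree. The tree below is a made-up example: root "r" has children "x" and "c", and "x" has children "a" and "b".

```python
from collections import Counter

# Each edge joins a parent with one of its children.
edges = [("r", "x"), ("r", "c"), ("x", "a"), ("x", "b")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

leaves = sorted(n for n, d in degree.items() if d == 1)
print(leaves)       # ['a', 'b', 'c']
print(degree["x"])  # 3, an internal node
print(degree["r"])  # 2, the root

# A connected graph without cycles has one edge fewer than nodes.
print(len(edges) == len(degree) - 1)   # True
```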

Distance between nodes

The length of each edge \((i,j)\) is \(\text{len}(i,j)\)

We can calculate the distance between any pair of nodes

\[D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise } \end{cases}\]

The minimum is taken over all neighbors \(j\) of \(i\)

Example

     a   b   c   d   e   f
a    0  13  11  15  21  22
b   13   0   2   6  12  13
c   11   2   0   4  10  11
d   15   6   4   0   6   7
e   21  12  10   6   0  13
f   22  13  11   7  13   0
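The distance rule can be sketched in Python. The edge lengths below are an assumption: they describe one tree that reproduces the matrix above, since the drawing is not included here. The recursion never walks back through a node it has already visited, which the formula leaves implicit.

```python
import math

# Assumed edge lengths of a tree consistent with the example matrix.
length = {("a", "c"): 11, ("b", "c"): 2, ("c", "d"): 4,
          ("d", "e"): 6, ("d", "f"): 7}

# Build the neighbor lists in both directions (undirected graph).
neighbors = {}
for (i, j), l in length.items():
    neighbors.setdefault(i, {})[j] = l
    neighbors.setdefault(j, {})[i] = l

def distance(i, u, visited=frozenset()):
    """D(i, u) in a tree, following the recursive rule."""
    if i == u:
        return 0
    if u in neighbors[i]:
        return neighbors[i][u]
    best = math.inf
    for j, l in neighbors[i].items():
        if j not in visited:
            best = min(best, l + distance(j, u, visited | {i}))
    return best

nodes = "abcdef"
for i in nodes:
    print(i, [distance(i, u) for u in nodes])
```

The first printed row is `a [0, 13, 11, 15, 21, 22]`, the same as the first row of the matrix.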

Not all distances correspond to a nice graph

Let’s change only one value

     a   b   c   d   e   f
a    0  13   9  15  21  22
b   13   0   2   6  12  13
c    9   2   0   4  10  11
d   15   6   4   0   6   7
e   21  12  10   6   0  13
f   22  13  11   7  13   0

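It is still symmetric, with zeros on the diagonal, but it cannot be drawn nicely: the direct distance \(D(a,b)=13\) is now longer than the path through \(c,\) which is \(9+2=11\)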
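A minimal sketch of one way to spot the problem, assuming that in a nice graph no direct distance can be longer than a detour through a third node. The helper names `to_dict` and `shortcuts` are illustrative.

```python
from itertools import permutations

nodes = "abcdef"
rows = [
    [0, 13, 11, 15, 21, 22],
    [13, 0, 2, 6, 12, 13],
    [11, 2, 0, 4, 10, 11],
    [15, 6, 4, 0, 6, 7],
    [21, 12, 10, 6, 0, 13],
    [22, 13, 11, 7, 13, 0],
]

def to_dict(rows):
    """Turn the table into a dictionary D[i, j]."""
    return {(i, j): rows[r][c]
            for r, i in enumerate(nodes) for c, j in enumerate(nodes)}

def shortcuts(D):
    """Triples (i, j, k) where going through k is shorter than D[i, j]."""
    return [(i, j, k) for i, j, k in permutations(nodes, 3)
            if i < j and D[i, k] + D[k, j] < D[i, j]]

original = to_dict(rows)
changed = dict(original)
changed["a", "c"] = changed["c", "a"] = 9   # the single changed value

print(shortcuts(original))  # []
print(shortcuts(changed))   # all the shortcuts go through c
```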

Why the neighbor joining formula works

Let \(a\) and \(b\) be two siblings in a nice tree, let \(c\) be their parent node, and let \(e\) be any other leaf

\[\begin{aligned} D(a,b) =& D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) =& D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) =& D(b,c) + D(c,e)\qquad(\text{eq. }3)\\ \end{aligned}\]

Result

\[D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}\]

So if we only know the distances between leaves \(a, b\) and \(e,\) and we add internal node \(c,\) this is how we find the distance \(D(c,e)\)
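The result follows from adding eq. 2 and eq. 3 and subtracting eq. 1. As a quick numeric check, here is a sketch using three leaf distances from the example matrix. The function name is hypothetical.

```python
def distance_to_new_node(D, a, b, e):
    """D(c, e) when c is the new internal node joining siblings a and b."""
    return (D[a, e] + D[b, e] - D[a, b]) / 2

# Distances between a, b and e, taken from the example matrix.
D = {("a", "b"): 13, ("a", "e"): 21, ("b", "e"): 12}

print(distance_to_new_node(D, "a", "b", "e"))   # 10.0
```

The value 10 agrees with the entry for \((c, e)\) in the first example matrix.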

Neighbor Joining is trying to make a nice tree

Probabilities

An event is a set of outcomes

The set of all possible outcomes is often called Ω

An event 𝐴 can be seen as the set of all outcomes that make the event true

For example,

Fever={Temperature>37.5°C}

Evaluating rational beliefs

An event will become either true or false after an experiment

For example, a die roll can be either 4 or not

We want to give a value to our rational belief that the event will become true after the experiment

The numeric value is called Probability

Probabilities as Areas

It is useful to think that the probability of an event is the area in the drawing

The total area of Ω is 1

Usually we do not know the shape of 𝐴

Probabilities depend on our knowledge

Our rational beliefs depend on our knowledge

If we represent our knowledge (or hypothesis) by 𝑍, then the probability of an event 𝐴 is written as \[ℙ(A|Z)\] We read “the probability of event 𝐴, given that we know 𝑍”

For example, “the probability that we get a 4, given that the die is symmetrical”

Important idea

The order is relevant \[ℙ(A|Z)≠ℙ(Z|A)\] There are two events, 𝐴 and 𝑍

The one written after | is what we assume to be true

The one written before | is what we are asking for

One we know, the other we do not

Visually

Now outcomes are limited only to the 𝑍 region

\(ℙ(A|Z)\) is the area of the part of 𝐴 that lies inside 𝑍, measured with respect to the area of 𝑍 instead of Ω

The shape of 𝑍 is often unknown
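When all outcomes are equally likely, areas become counts. The sketch below assumes a symmetrical six-sided die and, for illustration only, takes 𝑍 to be the event “the result is even”.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # all outcomes of one die roll

def prob(event, given=omega):
    """P(event | given): the share of `given` that lies inside `event`."""
    return Fraction(len(event & given), len(given))

A = {4}          # "we get a 4"
Z = {2, 4, 6}    # "the result is even"

print(prob(A))      # 1/6
print(prob(A, Z))   # 1/3
print(prob(Z, A))   # 1
```

The last two lines give different values, so swapping the two sides of | changes the question.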

Degrees of belief

If, given our knowledge 𝑍, the event 𝐵 is more plausible than the event 𝐴, then \[ℙ(A|Z)≤ℙ(B|Z)\]

For example, the probability that we get either 4, 5 or 6 is greater than the probability that we get a 4, given that the die is symmetrical \[ℙ(\{4\}|Z)≤ℙ(\{4,5,6\}|Z)\]

Degrees of belief

On the other hand, if we get new information, the probabilities may change

The same event 𝐴 may be more plausible under a new hypothesis 𝑌 than under the initial hypothesis 𝑍

Then \[ℙ(A|Z)≤ℙ(A|Y)\]

Probability rules based on these two ideas

It has been proven that probabilities must be like this

  1. A probability is a number between 0 and 1 inclusive \[ℙ(A) ≥ 0\textrm{ and } ℙ(A)≤1\]

  2. The probability of a sure event is 1 \[ℙ(\textrm{True}) = 1\]

  3. The probability of an impossible event is 0 \[ℙ(\textrm{False}) = 0\]

Complex events

We are interested in non-trivial events, which are usually combinations of smaller events

For example, we may ask “what is the probability that, in a group of 𝑛 people, at least two people have the same birthday?”

Fortunately, any complex event can be decomposed into simpler events, combined with and, or and not connectors

Exercise: decompose the birthday event into simpler ones

Probability of not 𝐴

If the event 𝐴 becomes more and more plausible, then the opposite event not 𝐴 becomes less and less plausible

It can be shown that we always have \[ℙ(\textrm{not } A) = 1-ℙ(A)\]
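The birthday question is a typical use of this rule: “at least two people share a birthday” is the negation of “all birthdays are different”. A rough sketch, assuming 365 equally likely days and independent birthdays:

```python
def prob_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday) = 1 - P(all differ)."""
    p_all_different = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different

print(round(prob_shared_birthday(23), 4))   # 0.5073
```

With only 23 people the probability is already slightly above one half.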

Joint Probability

The probability of 𝐴 and 𝐵 happening simultaneously must be connected to the probability of each one

It can be shown that there are only two ways to calculate it

  • Start with the prob. of \(A\) and then of \(B\) given that \(A\) is true \[ℙ(A,B)=ℙ(A)⋅ℙ(B|A)\]
  • Start with the prob. of \(B\) and then of \(A\) given that \(B\) is true \[ℙ(A,B)=ℙ(B)⋅ℙ(A|B)\]

It must be a multiplication

It can be proven that the only way to combine \(ℙ(A)\) and \(ℙ(B|A)\) to get \(ℙ(A,B)\) is to multiply them.

Both are true, since \(ℙ(A,B)=ℙ(B,A).\) The order in which we write them is irrelevant.
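Both formulas can be checked by counting outcomes. The sketch assumes a symmetrical six-sided die; the events 𝐴 and 𝐵 are chosen only for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event, given=omega):
    """P(event | given) for equally likely outcomes."""
    return Fraction(len(event & given), len(given))

A = {4, 5, 6}   # "the result is at least 4"
B = {2, 4, 6}   # "the result is even"

joint = prob(A & B)                    # P(A, B)
print(joint)                           # 1/3
print(joint == prob(A) * prob(B, A))   # True
print(joint == prob(B) * prob(A, B))   # True
```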

Example

Example: diagnosis

As part of the strategy to control COVID-19, many governments carry out random sampling of the population looking for asymptomatic cases.

Imagine that you are randomly chosen for a COVID-19 test. The test result is “positive”, that is, it says that you have the virus. You also know that the test sometimes fails, giving either a false positive or a false negative. Then the question is: what is the probability that you have COVID-19, given that the test said “positive”?

Context

Let’s assume that:

  • There are 100,000 people tested
  • The test has a precision of 99%
  • The prevalence of COVID in the population is 0.1%
  • The people to test are chosen randomly from the population

Since this context will be the same in all cases, we will not write it explicitly

Let’s fill this matrix

  Test- Test+ Total
COVID- . . .
COVID+ . . .
Total . . .

We show COVID reality in the rows and test results in the columns

We start with the total population

  Test- Test+ Total
COVID- . . .
COVID+ . . .
Total . . 1e+05

We will fill this matrix in the following slides

A large population size helps us to see small values

0.1% of them are COVID positive

  Test- Test+ Total
COVID- . . 99900
COVID+ . . 100
Total . . 1e+05

Prevalence is the percentage of the population that has COVID.
In other words, it is the probability of (COVID+) \[ \begin{aligned} ℙ(\text{COVID+}) & =0.1\% = 0.001\\ ℙ(\text{COVID-}) & =99.9\%=0.999 \end{aligned} \]

99% are correctly diagnosed

  Test- Test+ Total
COVID- . . 99900
COVID+ . 99 100
Total . . 1e+05

Precision is the probability of a correct diagnosis (for positive cases, this is usually called sensitivity) \[ℙ(\text{test+} \vert \text{COVID+})=0.99\] We fill the box corresponding to (test+,COVID+) \[ℙ(\text{test+}, \text{COVID+})=ℙ(\text{test+} \vert \text{COVID+})\cdotℙ(\text{COVID+})\]

99% are correctly diagnosed

  Test- Test+ Total
COVID- 98901 . 99900
COVID+ . 99 100
Total . . 1e+05

In this case the precision for negative cases (usually called specificity) is the same \[ℙ(\text{test-} | \text{COVID-})=0.99\] We fill the box corresponding to (test-,COVID-) \[ℙ(\text{test-}, \text{COVID-})=ℙ(\text{test-} | \text{COVID-})⋅ℙ(\text{COVID-})\]

1% are misdiagnosed

  Test- Test+ Total
COVID- 98901 999 99900
COVID+ 1 99 100
Total . . 1e+05

Misdiagnosis is the negation of a correct diagnosis \[ℙ(\text{test-} | \text{COVID+})=1-ℙ(\text{test+} | \text{COVID+})=0.01\] We combine them in the same way as before \[ℙ(\text{test-}, \text{COVID+})=ℙ(\text{test-} | \text{COVID+})⋅ ℙ(\text{COVID+})\]

Total people diagnosed

  Test- Test+ Total
COVID- 98901 999 99900
COVID+ 1 99 100
Total 98902 1098 1e+05

We sum and fill the empty boxes

1098 people got a positive test, but only 99 of them have COVID \[ℙ(\text{COVID+} | \text{test+})=\frac{99}{1098} = 9.02\%\]
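The whole table can be rebuilt in a few lines. This sketch uses the values of the example (100,000 people, prevalence 0.1%, 99% correct results for both positive and negative cases). The variable names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

population = 100_000
prevalence = Fraction(1, 1000)   # P(COVID+)
correct = Fraction(99, 100)      # P(test+ | COVID+) = P(test- | COVID-)

covid_pos = population * prevalence      # 100
covid_neg = population - covid_pos       # 99900

true_pos = covid_pos * correct           # 99   (test+, COVID+)
false_neg = covid_pos - true_pos         # 1    (test-, COVID+)
true_neg = covid_neg * correct           # 98901 (test-, COVID-)
false_pos = covid_neg - true_neg         # 999  (test+, COVID-)

test_pos = true_pos + false_pos          # 1098
posterior = true_pos / test_pos          # P(COVID+ | test+)

print(f"{float(posterior):.2%}")         # 9.02%
```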

Summary

  • The order matters: \(ℙ(A|Z)≠ℙ(Z|A)\)
  • To get the probability of \(A\) and \(B\) together we find the probability of \(A\) and then of \(B\) given that \(A\) is true \[ℙ(A,B)=ℙ(A)⋅ℙ(B|A)\]
  • Make sure that you ask the correct question. A test can be “precise” and still give many false positives