May 4th, 2018
Fragments are separated by electrophoresis
Fragment length tells us the position of the nucleotide
The result is called a chromatogram
From it we get the letters of the sequence
In 2001, the cost of sequencing the first human genome was USD \(10^8\)
Today you can have your own genome sequenced for USD 1000
The problem is no longer how to do the experiment
Instead, it is how to make sense of the results
The first computers were big and expensive
Only in a few universities, used by experts
Then there was one in every office… and home
Today everybody has one… in the pocket
A PlayStation has more power than the biggest computer of 1998
Can the same happen with DNA sequencing?
Today you can buy a DNA sequencer the size of an iPhone
… at the price of an iPhone
Next step: people will make apps for DNA sequencers
There is a phase transition: we changed from “solid” to “liquid”
For example, patents are obsolete
There is already a lot of public data
It does not depend on hands and wallets
It depends on brains and guts
But Data Science is not about Data
Science is about obtaining
Current technology allows us to read DNA in runs of ~100-600 letters. Imagine a book of 1000 pages:
The problem is to reconstruct the original book
A genome of length G, let’s say \(10^7\)
G <- 1E7
N reads, let’s say \(10^4\), each of length L
N <- 1E4
L <- 300
Each read has a start and an end position on the genome
L must be equal to end - start
We choose start randomly, then we calculate end
start <- sample.int(G, size=N, replace=TRUE)
end <- start + L
This is an event. That is, a function that returns TRUE or FALSE
If two reads overlap, the gap between them will be negative
Thus overlap_size is -gap_size
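The relation between gap_size and overlap_size can be checked on the simulated reads. The following is only a sketch: it assumes that reads are sorted by start position and that each gap is measured between a read and the next one. The fixed seed and the helper variable `ord` are additions for illustration.

```r
set.seed(1)   # assumption: fixed seed, only to make the run repeatable
G <- 1E7      # genome length
N <- 1E4      # number of reads
L <- 300      # read length

start <- sample.int(G, size = N, replace = TRUE)
end <- start + L

# Sort the reads so that neighbours in the vector are neighbours on the genome
ord <- order(start)
start <- start[ord]
end <- end[ord]

# Gap between the end of each read and the start of the next one;
# a negative gap means the two reads overlap
gap_size <- start[-1] - end[-N]
overlap_size <- -gap_size

# Fraction of consecutive pairs of reads that overlap
mean(overlap_size > 0)
```

With these values the reads cover only a small part of the genome, so most consecutive pairs are expected to be separated by a positive gap.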
If we do not know their position, how can we detect if two reads do overlap?
We say that two reads overlap when they share at least T letters
The assembler needs to compare all reads to all reads and see if they overlap.
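One simple way to picture the all-against-all comparison is to test whether the end of one read equals the beginning of another. The sketch below is an illustration, not how real assemblers work: the function `overlap_length`, its argument `min_overlap` (playing the role of the threshold T) and the three toy reads are all assumptions made for this example, and it only accepts exact matches.

```r
# Length of the longest suffix of `a` that is also a prefix of `b`.
# Returns 0 when no match of at least `min_overlap` letters exists.
overlap_length <- function(a, b, min_overlap = 20) {
  max_len <- min(nchar(a), nchar(b))
  if (max_len < min_overlap) return(0)
  for (k in max_len:min_overlap) {          # try the longest overlap first
    suffix <- substr(a, nchar(a) - k + 1, nchar(a))
    prefix <- substr(b, 1, k)
    if (suffix == prefix) return(k)
  }
  0
}

# Toy reads, made up for this example
reads <- c("ACGTACGTGG", "CGTGGTTTAC", "TTTACGGGCA")
n <- length(reads)

# Compare all reads to all reads: ov[i, j] is the overlap of read i followed by read j
ov <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) ov[i, j] <- overlap_length(reads[i], reads[j], min_overlap = 4)
  }
}
ov
```

The double loop makes the cost visible: with N reads there are N × (N - 1) comparisons, which is why this step dominates the work of a naive assembler.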
How do we choose T?
If we decide that two reads overlap when they share only 1 letter, we will put together 25% of them every time
The best T has to be big enough to guarantee that the reads do not overlap by chance
If the reads do not match by chance, then there is a biological reason
Bigger values of T reduce the probability of “overlap by chance”
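A rough calculation shows how fast this probability drops. The sketch assumes that the four letters are equally likely and independent, so T letters match by chance with probability \(0.25^T\); it also ignores that two reads can match at several different lengths. The names `threshold`, `p_chance` and `expected_by_chance` are introduced here only for illustration.

```r
N <- 1E4                                  # number of reads, as before
threshold <- c(1, 5, 10, 15, 20, 25)      # candidate values of T

# Probability that `threshold` letters coincide by chance
p_chance <- 0.25^threshold

# Expected number of chance overlaps among all ordered pairs of reads
expected_by_chance <- N * (N - 1) * p_chance

data.frame(threshold, p_chance, expected_by_chance)
```

Under these assumptions, T = 1 gives the 25% mentioned above, while for T = 20 the expected number of chance overlaps among all pairs falls below one.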
A group of contiguous sequences is called a contig
The goal of the assembly process is to find one contig.
The sequence of this contig will be the genome
Most of the time we get several contigs
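The simulation can give an idea of how many contigs to expect. This is a sketch under simplifying assumptions: it uses the true positions of the reads, which a real assembler does not know, and it joins two reads whenever they overlap by at least `min_overlap` letters. The function `count_contigs` and its arguments are hypothetical names for this example.

```r
# Count contigs given the true start and end positions of the reads.
# A new contig begins whenever a read overlaps the region covered so far
# by fewer than `min_overlap` letters.
count_contigs <- function(start, end, min_overlap = 1) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  n <- length(start)
  if (n < 2) return(n)
  covered_to <- cummax(end)[-n]        # furthest position reached by earlier reads
  overlap <- covered_to - start[-1]    # overlap of each read with what came before
  1 + sum(overlap < min_overlap)
}

set.seed(1)   # assumption: fixed seed, only to make the run repeatable
G <- 1E7
L <- 300

# Number of contigs for an increasing number of reads
n_reads <- c(1E4, 1E5, 1E6)
contigs <- sapply(n_reads, function(N) {
  start <- sample.int(G, size = N, replace = TRUE)
  count_contigs(start, start + L, min_overlap = 20)
})
data.frame(n_reads, contigs)
```

With few reads almost every read is its own contig; the number of contigs only starts to fall when the reads cover the genome several times.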