December 20, 2019
Let’s say you get a complex sample:
You extract the total DNA and sequence all of it.
You get millions of reads
Now you ask: what is on these reads?
Most of the reads do not align to any reference genome
(most prokaryotes have not been isolated)
Given a DNA read \(r\) we want to find which species \(S\) is the most probable origin of the read
We want to find an \(S\) that maximizes \(\Pr(\text{species is }S\vert \text{we saw }r)\)
Thus, the question is how to evaluate this probability. Some approaches:
Some properties of DNA composition tend to be conserved through evolution
For example, two phylogenetically close species usually have similar GC content
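As a small illustration of this property, GC content can be computed directly from a sequence. This is a minimal sketch; the function name `gc_content` and the choice to ignore symbols other than A, C, G, T are assumptions made for the example, not part of the original analysis.

```python
def gc_content(seq):
    """Fraction of the A/C/G/T bases in seq that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at                      # ambiguous symbols such as N are ignored
    return gc / total if total else 0.0  # empty or all-N input gives 0.0
```

Two reads from closely related genomes would be expected to give similar values, although a single short read is a noisy estimate.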
Generalizing the idea, we can consider the relative abundance of
An oligomer of size \(k\) is called a \(k\)-mer
Given a sequence \(s\) of length \(L\), we characterize it by a vector counting all subwords of size \(k\)
Given a DNA “word” \(w\in {\cal A}^k\) of length \(k\), we define
Remember that \(\mathcal A = \{A,C,G,T\}\) and that \([Q]=1\) iff \(Q\) is true.
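One natural reading of this definition, given the Iverson bracket, is that the count of \(w\) is the number of positions in \(s\) where the window of length \(k\) equals \(w\). The sketch below follows that reading; the names `ALPHABET` and `kmer_counts` are hypothetical, and skipping windows that contain other symbols is an assumption.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(s, k):
    """Count overlapping occurrences in s of every word of length k over ALPHABET."""
    # Start from zero for all 4**k words so the vector always has the same shape.
    counts = {"".join(w): 0 for w in product(ALPHABET, repeat=k)}
    for i in range(len(s) - k + 1):      # there are L - k + 1 windows
        w = s[i:i + k]
        if w in counts:                  # windows with symbols such as N are skipped
            counts[w] += 1
    return counts

counts = kmer_counts("ACGTACGT", 2)
```

For a sequence of length \(L\) with only A, C, G, T, the counts add up to \(L-k+1\), so dividing by that total turns the vector into relative frequencies.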
The classification must be invariant to reverse complement
The representation should not change if we use either strand
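One common way to obtain a strand-independent representation is to map each \(k\)-mer and its reverse complement to a single "canonical" word before counting. This is only a sketch of that idea, not necessarily the method used in this work; `reverse_complement`, `canonical` and `canonical_kmer_counts` are hypothetical names.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(w):
    """Complement every base, then reverse the word."""
    return w.translate(COMPLEMENT)[::-1]

def canonical(w):
    """Pick the lexicographically smaller of w and its reverse complement."""
    return min(w, reverse_complement(w))

def canonical_kmer_counts(s, k):
    """Count k-mers of s after merging each word with its reverse complement."""
    return Counter(canonical(s[i:i + k]) for i in range(len(s) - k + 1))

read = "AACGT"
same = canonical_kmer_counts(read, 2) == canonical_kmer_counts(reverse_complement(read), 2)
```

With this choice, a read and its reverse complement produce the same count vector, so the classification does not depend on which strand was sequenced.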
We built a platform that allows us to test these hypotheses:
Phylogenetically close species have similar distributions of \(k\)-mer frequencies
Phylogenetically distant species have very different \(k\)-mer distributions
For this analysis we need to know how long ago two species diverged
We used the values published in TimeTree.org: 2274 Studies, 50K Species
Linnaean taxonomic ranks have some temporal inconsistencies
TimeTree provides an estimate of divergence time
The current presentation uses 2015 data
There are only 570 species in both TimeTree and RefSeq
Enough to make 163K comparisons
Every species is represented by the frequency of each \(k\)-mer
(i.e. empirical probability distributions)
We can compare two probability distributions using the Total Variation distance: \[\mathrm{dist_{TV}}(p,q)=\frac{1}{2}\Vert p-q\Vert_1 = \frac{1}{2}\sum_i\vert p_i-q_i\vert\] (it happens to be half the Manhattan distance)
Since every probability distribution satisfies \(\sum_i\vert p_i\vert=\Vert p\Vert_1=1\), the Manhattan distance between two distributions is at most 2, so the Total Variation distance always lies between 0 and 1.
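The Total Variation distance is straightforward to compute from two frequency tables. The sketch below assumes each species is stored as a dictionary from \(k\)-mer to frequency; the helper names `normalize` and `total_variation` are made up for this example.

```python
def normalize(counts):
    """Turn a dict of k-mer counts into an empirical probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty count vector")
    return {w: c / total for w, c in counts.items()}

def total_variation(p, q):
    """Half the Manhattan (L1) distance between two distributions."""
    words = set(p) | set(q)              # a word missing from one side has frequency 0
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

p = normalize({"AA": 3, "AC": 1})
q = normalize({"AA": 1, "AC": 3})
d = total_variation(p, q)
```

Identical distributions give a distance of 0, and distributions with no \(k\)-mer in common give 1.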