Let’s say you get a complex sample:
You extract total DNA, you sequence all.
You got millions of reads
What is on these reads?
Now you ask:
We can use BLAST or other alignment tool to search our reads into a database of known organisms
If the organism is known, and has been sequenced, we can identify it
but…
Most of the reads do not align to any reference genome
(most of Prokaryotes have not been isolated)
Some properties of DNA composition tend to be conserved through evolution
For example, two phylogenetically close species usually have similar GC content
We can consider the relative abundance of
An oligomer of size \(k\) is called 𝒌-mer
Given a sequence \(s\) of length \(L\), we can count all subwords of size \(k\) (\(s\) can be a genome, or a DNA fragment)
For each \(k\)-mer \(w\) we calculate its absolute frequency \(F_{(k)}(w; s)\) as the number of times that \(w\) appears in \(s\)
In other word we count the number of values \(j\) such that \((s_j, \ldots, s_{j+k-1})=w\)
It is exactly as you imagine
\[f_{(k)}(w; s)=\frac{F_{(k)}(w; s)}{\sum_{v} F_{(k)}(v; s)}\]
(here \(v\) goes through all 𝑘-mers)
First we have \((s_1, \ldots, s_{k})\)
Then we have \((s_2, \ldots, s_{k+1})\)
Then we have \((s_3, \ldots, s_{k+2})\)
And so on
There are 4 possibilities for each letter
There are \(4^k\) different \(k\)-mers
but…
When we sequence DNA, we get one strand, randomly
If we see ATG
, we also see CAT
The result should not change if we use either strand
Each 𝑘-mer is paired with another 𝑘-mer, so only the first half counts
There are \(4^{k/2}\) distinct values
More precisely, when \(k\) is even, some 𝑘-mers do not have pair.
For example AATT
is paired with itself
So the number of \(k\)-mers is
For example, for \(k=1\) we have four 1-mers: \(A, C, G, T\)
But \(A\) pairs with \(T\), and \(C\) pairs with \(G\),
so there are only two values: \[f_{(1)}(“A”;s)+f_{(1)}(“T”,s)\text{ and }f_{(1)}(“G”;s)+f_{(1)}(“C”,s)\]
Since the sum of all frequencies is 1, the last value is determined by the rest \[f_{(1)}(“A”;s)+f_{(1)}(“T”,s)+f_{(1)}(“G”;s)+f_{(1)}(“C”,s)=1\]
So we have only one independent value: \[f_{(1)}(“G”;s)+f_{(1)}(“C”,s)\]
That is the GC content
Given a read \(r\) we want to identify its origin in a set \(\{T_j\}\) of taxas (species, genus, etc.)
We want to compare the probabilities \(ℙ(T_i\vert r)\) for each taxa, and assign \(r\) to the taxa with highest probability
Calculating probabilities is hard without good tools
Instead we calculate a Score for the taxa, given the read \[\text{Score}(T_i; r)\]
Taxonomic classification is done in two parts
First we train, by counting 𝑘-mers in known genomes
Second, we apply the classifier to a new sequence
We evaluate a score using the read’s 𝑘-mers
The taxa giving the highest score is the chosen taxa
When we train, we get \(f_{(k)}(w; T_i)\) for each 𝑘-mer \(w\) in taxa \(T_i\)
Then the score for \(T_i\) given the read \(r\) is \[ \text{Score}(T_i; r)=\sum_i \log f_{(k)}(r_i\ldots r_{j+k-1}; T_i) \]
which can also be written as \[ \text{Score}(T_i; r)=\sum_{w\in r} F_{(k)}(w; T_i)\log f_{(k)}(w; T_i) \]
Collaboration with Stockholm University
and nothing else
(in this model)
We built a platform that allow us to test these hypothesis:
Phylogenetically close species have similar distribution of \(k\)-mer frequencies
Phylogenetically distant species have very different \(k\)-mer distributions
For this analysis we need to know how long ago two species diverged
We used the values published in TimeTree.org: 2274 Studies, 50K Species
Hedges et al, Mol. Biol. and Evol. (2015)
For this analysis we need to know how long ago two species diverged
Linnaean taxonomic ranks have some temporal inconsistencies
The class angiosperm has lower average age than the order of fungus
Ranks for prokaryotes are all older than the corresponding ranks of eukaryotes
Hedges et al, Mol. Biol. and Evol. (2015)
We used the values published in TimeTree.org
Kumar, Sudhir, et al. “TimeTree: a resource for timelines, timetrees, and divergence times.” Molecular Biology and Evolution 34.7 (2017): 1812-1819.
Provides an estimation of divergence time
There are only 570 species in both TimeTree and RefSeq
Enough to make 163K comparisons
Every species is represented by the frequency of each \(k\)-mer
(i.e. empirical probability distributions)
We can compare two probability distributions using the Total Variation distance: \[\mathrm{dist_{TV}}(p, q)= \frac{1}{2}\sum_i\vert p_i-q_i\vert\]
Since all probability distributions follow \(\sum_i\vert p_i\vert=1\), it is easy to see that this measure makes sense