So you get the DNA sequence. What’s next?
What is this?
Imagine that you used PCR to cut the 16S gene
November 13th, 2018
So you get the DNA sequence. What’s next?
What is this?
Imagine that you used PCR to cut the 16S gene
We have today databases of known genes
For example, databases of 16S gene
16S gene (in bacteria) is
We want to compare our query sequence q
to each sequence in the database
The database’s sequence is called subject, we write s
If they are identical, we are ready
If they are not identical, then we define a distance
D(q,r)
is the number of “mismatches”
GGGTGAGTAACACGTGGGTAACCTACCTCAAAGAGGGGG GGGTGAGTAACACGTGGGCAACCTGCCTCAAAGTGGGGG
here we need that the sequences have the same length
D(q,s)==D(s,q)
(symmetry)D(q,s)>=0
(non-negative)D(q,s)==0
then q==s
(reflexivity)D(q,s1) + D(q,s2) <= D(s1,s2)
(triangular inequality)Hamming distance has all these properties
But it is not the best distance for biological sequences
When there is an insertion or deletion, one of the sequences is “shorter” than the other
Here we cannot use Hamming distance. Instead we use Edit distance, also called Levenshtein distance
Distance = mismatches + insertion + deletions
We really want to find the gaps that produce “the best” combination
The first part is easy. All letters are identical.
We need to decide only for the new letter. We have only three options
query A T A A T subject A T A A C
query A T A A - subject A T A A C
query A T A A T subject A T A A -
D(q[1:5], s[1:5]) = D(q[1:4], s[1:4]) + D(q[5], s[5])
(match or mismatch)D(q[1:4], s[1:5]) = D(q[1:4], s[1:4]) + gap(s[5])
(deletion)D(q[1:5], s[1:4]) = D(q[1:4], s[1:4]) + gap(q[5])
(insertion)Usually gaps are more “expensive”
In general we have
D(q[1:i], s[1:j]) = D(q[1:(i-1)], s[1:(j-1)]) + D(q[i], s[j])
(match or mismatch)D(q[1:(i-1)], s[1:j]) = D(q[1:(i-1)], s[1:(j-1)]) + gap(s[j])
(deletion)D(q[1:i], s[1:(j-1)]) = D(q[1:(i-1)], s[1:(j-1)]) + gap(q[i])
(insertion)Easy: the “cheapest” one
In other words, we find the minimum
For this, programs use a matrix
One sequence in the rows
the other in the columns
Initially we have C[q[i], s[j]]
, defined as
q[i] == s[j]
q[i] <> s[j]
cumulate the total cost
M[i, j] = M[i-1, j-1] + C[q[i], s[j]]
the last corner of the matrix M
has the distance between the two sequences
(but we have not yet used gaps)
moving vertical and horizontal
M[i, j] = min( M[i-1,j-1] + C[q[i], s[j]], M[i,j-1] + gap(s[j]), M[i-1,j] + gap(q[i]) )
i
without changing j
j
without changing i
If gaps are “cheap”, the method will use more gaps
If gaps are expensive, the alignment will have more mismatches