When you write a text in any good text editor
(or even in Microsoft Word®),
some words are automatically underlined in red
These are words that are not found in the dictionary
Sometimes the editor can suggest the correct word
How does the computer do that?
One option is to define a rule, like a program, that will tell if a word is correct or not
\[ \text{isCorrect}: \text{Words} \to \{\text{True}, \text{False}\} \]
That is hard, in general
Another option is to have a dictionary or database of known correct words, and a way to find the nearest word
\[ \text{Nearest}: (\text{Words}, \text{Database}) \to \{\text{Words} \in \text{Database}\} \]
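The second option can be sketched in a few lines of Python. This is only an illustration: the word list `known_words` is made up, and the standard-library function `difflib.get_close_matches` stands in for "a way to find the nearest word". Its similarity score is not one of the distances defined later in this lecture.

```python
from difflib import get_close_matches


def nearest(word, database):
    """Return the known word most similar to `word`, or None if nothing is close."""
    matches = get_close_matches(word, database, n=1)
    return matches[0] if matches else None


# Hypothetical database of known correct words
known_words = ["sequence", "database", "query", "subject"]

print(nearest("databse", known_words))  # database
print(nearest("zzzz", known_words))     # None
```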
Let’s say we have a sequence and we want to identify it
Our sequence is called the query
We look for it in a list of known sequences
The list is called the database
Each sequence is a subject
For this example, please download a sample database
FASTA format
http://www.dry-lab.org/static/bioinfo/short-protein.faa
Comma-separated values
http://www.dry-lab.org/static/bioinfo/short-protein.csv
query 1: ASELLKYLTT
query 2: ASELLKALTT
query 3: ASELLKYALTT
query 4: ASELLKLTT
Do the search on your computer
If you did it correctly, only the first search gave any result
The rest did not find anything
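One possible way to do the exact search with a program is sketched below. It assumes that "search" means looking for the query as a substring of each subject, as a text editor would. The records in `toy_fasta` are invented for the example and are not the contents of the sample database.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # A sequence can span several lines; join them
            records[name] += line
    return records


def exact_search(query, database):
    """Names of all subjects that contain the query exactly."""
    return [name for name, seq in database.items() if query in seq]


# Toy data in FASTA format (hypothetical, for illustration only)
toy_fasta = """>toy1
MKASELLKYL
TTGG
>toy2
MSTNPKPQRK
"""

database = parse_fasta(toy_fasta)
print(exact_search("ASELLKYLTT", database))  # ['toy1']
print(exact_search("ASELLKALTT", database))  # []
```

A single changed letter is enough to make the exact search fail, which is what happens with queries 2 to 4.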
But we can see that the queries are very similar
How can we find sequences that are similar but not identical?
When are two sequences similar?
First idea:
We count the number of mismatches
Both strings must have the same length
ASELLKYLTT
ASELLKALTT
Here distance(ASELLKYLTT, ASELLKALTT) is 1
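Counting mismatches takes only a few lines of code. The function name `mismatches` is chosen here for illustration. The sketch refuses strings of different lengths, following the rule stated above.

```python
def mismatches(a, b):
    """Count the positions where two strings of equal length differ."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(mismatches("ASELLKYLTT", "ASELLKALTT"))  # 1
```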
The Hamming distance is the number of substitutions needed to go from one sequence to the other
CAT and CAT have a Hamming distance of 0
CAT and BAT have a Hamming distance of 1
CAT and BAG have a Hamming distance of 2
We look for the smallest distance
We calculate the distance between our query and each subject in the database
If the distance is 0, we have found a perfect match
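The search by smallest distance can be sketched as follows. The database is represented as a dictionary from names to sequences, which is an assumption of this example, and the two subjects are invented. Subjects whose length differs from the query are skipped, because the Hamming distance is not defined for them.

```python
def hamming(a, b):
    """Number of substitutions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_subject(query, database):
    """Return (name, distance) of the closest subject, or None if none is comparable."""
    distances = [
        (hamming(query, seq), name)
        for name, seq in database.items()
        if len(seq) == len(query)
    ]
    if not distances:
        return None
    distance, name = min(distances)
    return name, distance


# Hypothetical toy database
toy_database = {"s1": "ASELLKYLTT", "s2": "MSELLKYAAA"}

print(nearest_subject("ASELLKYLTT", toy_database))  # ('s1', 0)
print(nearest_subject("ASELLKALTT", toy_database))  # ('s1', 1)
```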
Let’s pause a minute and do this exercise
When the sequences have different lengths, we need to insert “gaps”
MOUSE-
GROUSE
Hamming Distance=6
Gaps are represented by -
The distance depends on the gap positions
-MOUSE
GROUSE
Hamming Distance=2
MOUSE--
-GROUSE
Hamming Distance=7
In geometry we say that “the distance between a point X and a line L is the shortest length from X to any point in L”
If there are many “candidate distances”, we choose the smallest one
This is an important idea
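The idea of trying every gap position and keeping the smallest distance can be sketched as below. The example is limited to the simplest case, where one sequence is exactly one letter shorter than the other and a single gap is enough. The function name `best_single_gap` is made up for this sketch.

```python
def hamming(a, b):
    """Number of mismatching positions; a gap '-' never matches a letter."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_single_gap(short, long):
    """Try one gap at every position of `short`; return (distance, gapped sequence)."""
    if len(long) != len(short) + 1:
        raise ValueError("this sketch only handles a length difference of 1")
    candidates = []
    for pos in range(len(short) + 1):
        gapped = short[:pos] + "-" + short[pos:]
        candidates.append((hamming(gapped, long), gapped))
    # Among all candidate distances, keep the smallest one
    return min(candidates)


print(best_single_gap("MOUSE", "GROUSE"))  # (2, '-MOUSE')
```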
Hamming distance counts substitutions between sequences
Now we count substitutions, insertions and deletions
Hamming distance counts substitutions
ABCDEFGHIJ
BCDEFGHIJA
Hamming distance=10
Levenshtein distance counts substitutions, insertions and deletions
ABCDEFGHIJK-
-BCDEFGHIJKA
Levenshtein distance=2
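A minimal sketch of the Levenshtein distance is shown below, using the usual dynamic-programming table filled row by row. Every substitution, insertion and deletion costs 1 here. Real alignment tools use more elaborate scores, so take this only as an illustration of the counting rule.

```python
def levenshtein(a, b):
    """Smallest number of substitutions, insertions and deletions turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # i letters of a against nothing: i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j - 1] + (x != y),  # match or substitution
                previous[j] + 1,             # gap in b
                current[j - 1] + 1,          # gap in a
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABCDEFGHIJK", "BCDEFGHIJKA"))  # 2
print(levenshtein("MOUSE", "GROUSE"))             # 2
```

With this distance the queries ASELLKYALTT and ASELLKLTT are both at distance 1 from ASELLKYLTT, even though their lengths differ.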
If the sequence has m letters, there are \(2^{m+1}\) ways to insert gaps, with at most one gap at each of the m+1 positions
CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
We could also do
C--A-----T
We will see more details later
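The sixteen gapped versions of CAT listed above can be generated with a short program. The sketch below allows at most one gap in each of the m+1 positions (before the first letter, between letters, and after the last letter). The function name `gapped_versions` is made up for this example.

```python
from itertools import product


def gapped_versions(seq):
    """All ways to put at most one gap '-' in each of the len(seq)+1 positions."""
    versions = []
    # One True/False flag per position: True means "insert a gap here"
    for flags in product([False, True], repeat=len(seq) + 1):
        parts = []
        for i, letter in enumerate(seq):
            if flags[i]:
                parts.append("-")
            parts.append(letter)
        if flags[-1]:
            parts.append("-")  # gap after the last letter
        versions.append("".join(parts))
    return versions


versions = gapped_versions("CAT")
print(len(versions))  # 16
print(versions[:4])   # ['CAT', 'CAT-', 'CA-T', 'CA-T-']
```

The count doubles with every extra letter, and it grows even faster when several gaps per position are allowed, so trying every possibility quickly becomes impractical.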
The idea is to draw a rectangle
The goal is to move from one corner to the other
Jumping from a black cell to the next black cell is free
Horizontal and vertical moves are gaps
Prepare a DotPlot in Excel or Google Sheets
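For readers who prefer code to a spreadsheet, a dot plot can also be sketched in Python. Each cell is "black" when the letter of the row equals the letter of the column. The text rendering with `#` and `.` is a choice made for this example.

```python
def dot_plot(seq_a, seq_b):
    """Matrix with one row per letter of seq_a and one column per letter of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def show(matrix, seq_a, seq_b):
    """Render the matrix as text: '#' for a match (black), '.' otherwise."""
    lines = ["  " + " ".join(seq_b)]
    for letter, row in zip(seq_a, matrix):
        cells = " ".join("#" if cell else "." for cell in row)
        lines.append(letter + " " + cells)
    return "\n".join(lines)


plot = dot_plot("MOUSE", "GROUSE")
print(show(plot, "MOUSE", "GROUSE"))
```

In the printed plot, the shared letters OUSE appear as a diagonal run of `#`, shifted one column to the right because GROUSE has one extra letter at the start.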
This allows us to find repeats inside the sequence
A sequence is said to be palindromic when it repeats itself backwards
In RNA that results in a single-strand structure called a hairpin
They are usually transcription terminators
Useful to compare proteins
But not good when we look for a gene in a genome
or a domain in a protein
What shall we do when the query is much smaller than all subjects?
In this case we want to go from one side to the other
The query is much smaller than the subject
In this case we distinguish two kinds of gaps
Internal gaps, inside the sequences
External gaps, outside the query sequence
External gaps are caused by the experiment
For example, we use PCR to amplify only part of a genome
Therefore anything outside the sequence cannot be seen for technical reasons
Internal gaps are caused by nature
They are gained or lost during replication
They have biological meaning
Therefore, they must be counted
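The difference between the two kinds of gaps can be sketched by changing two details of the Levenshtein table: the first row starts at 0, so gaps before the query are free, and the result is the smallest value of the last row, so gaps after the query are free. Internal gaps still cost 1 each. The function name `semiglobal_distance` and the unit costs are assumptions of this sketch.

```python
def semiglobal_distance(query, subject):
    """Smallest edit distance between the query and any part of the subject."""
    # First row is all zeros: external gaps before the query are not counted
    previous = [0] * (len(subject) + 1)
    for i, q in enumerate(query, start=1):
        current = [i]  # query letters with no subject letter to pair with are counted
        for j, s in enumerate(subject, start=1):
            current.append(min(
                previous[j - 1] + (q != s),  # match or substitution
                previous[j] + 1,             # internal gap in the subject
                current[j - 1] + 1,          # internal gap in the query
            ))
        previous = current
    # Minimum over the last row: external gaps after the query are not counted
    return min(previous)


print(semiglobal_distance("CAT", "GGCATGG"))  # 0
print(semiglobal_distance("CAT", "GGCTGG"))   # 1
```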