When you write text in any good text editor
(or even in Microsoft Word®),
some words are automatically underlined in red
These are words that are not found in the dictionary
Sometimes the editor can suggest the correct word
How does the computer do that?
One option is to define a rule, like a program, that tells whether a word is correct or not
\[ \text{isCorrect}: \text{Words} \to \{\text{True}, \text{False}\} \]
That is hard, in general
Another option is to have a dictionary or database of known correct words, and a way to find the nearest word
\[ \text{Similar}: (\text{Words}, \text{Database}) \to \{\text{Words} \in \text{Database}\} \]
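A minimal sketch of the dictionary option, in Python. The word list `DATABASE` and the function names are hypothetical, chosen only for illustration. The "nearest word" step uses `difflib.get_close_matches` from the standard library as a stand-in; it ranks candidates by its own similarity ratio, not by the distances discussed in these slides.

```python
import difflib

# Hypothetical toy dictionary; a real editor would load a much larger word list.
DATABASE = {"hello", "help", "world", "word"}


def is_correct(word):
    """Dictionary-based check: a word is 'correct' if it is in the database."""
    return word in DATABASE


def similar(word, database, n=3):
    """Return up to n database words that look close to `word`, best first."""
    # get_close_matches uses difflib's own similarity ratio (a stand-in here).
    return difflib.get_close_matches(word, sorted(database), n=n)


print(is_correct("helo"))         # False: "helo" is not in the database
print(similar("helo", DATABASE))  # candidate corrections, best match first
```

The design question is what "close" means. The rest of the material builds a distance that can play that role.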
When are two sequences similar?
First idea:
We count the number of mismatches
Both strings must have the same length
ASELLKYLTT
ASELLKALTT
Here distance(ASELLKYLTT, ASELLKALTT) is 1
The Hamming distance is the number of substitutions we need to go from one sequence to the other
CAT and CAT have a Hamming distance of 0
CAT and BAT have a Hamming distance of 1
CAT and BAG have a Hamming distance of 2
Let’s pause a minute and do this exercise
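The mismatch count above can be sketched in a few lines of Python. The function name `hamming_distance` is our own choice. The sketch assumes both inputs are plain strings and rejects strings of different lengths.

```python
def hamming_distance(a, b):
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    # Compare the strings position by position and count the mismatches.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("ASELLKYLTT", "ASELLKALTT"))  # 1
print(hamming_distance("CAT", "BAG"))                # 2
```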
When the strings have different lengths, we need to insert “gaps”
MOUSE
GROUSE
The distance depends on the gap positions
-MOUSE
GROUSE
Hamming Distance=2
MOUSE--
-GROUSE
Hamming Distance=7
In geometry we say that “the distance between a point X and a line L is the shortest length from X to any point in L”
If there are many “candidate distances”, we choose the smallest one
This is an important idea
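A small Python sketch of "choose the smallest one". It is a simplification, not a full alignment method: it only inserts gaps into the shorter sequence, and it counts a gap against a letter as one mismatch. The names `mismatches`, `pad_with_gaps` and `best_padding` are hypothetical.

```python
from itertools import combinations


def mismatches(a, b):
    """Count differing positions; a gap against a letter counts as a mismatch."""
    return sum(x != y for x, y in zip(a, b))


def pad_with_gaps(seq, length):
    """Yield every way to stretch seq to `length` by inserting '-' characters."""
    for letter_positions in combinations(range(length), len(seq)):
        padded = ["-"] * length
        for pos, letter in zip(letter_positions, seq):
            padded[pos] = letter
        yield "".join(padded)


def best_padding(short, long):
    """Return (smallest distance, padded version of `short` that achieves it)."""
    return min((mismatches(p, long), p) for p in pad_with_gaps(short, len(long)))


print(mismatches("-MOUSE", "GROUSE"))    # 2
print(mismatches("MOUSE--", "-GROUSE"))  # 7
print(best_padding("MOUSE", "GROUSE"))   # (2, '-MOUSE')
```

Trying every placement is fine for short words, but the number of candidates grows quickly with the sequence length.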
Hamming distance counts substitutions between sequences
Now we count substitutions, insertions and deletions
Hamming distance counts substitutions
ABCDEFGHIJ
BCDEFGHIJA
Hamming distance=10
Levenshtein distance counts substitutions, insertions and deletions
ABCDEFGHIJ-
-BCDEFGHIJA
Levenshtein distance=2
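One common way to compute this distance is dynamic programming over prefixes of the two strings. The slides do not describe this procedure, so the Python sketch below is an outside illustration. The function name `levenshtein` is our own, and every substitution, insertion and deletion is assumed to cost 1.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABCDEFGHIJ", "BCDEFGHIJA"))  # 2
print(levenshtein("MOUSE", "GROUSE"))           # 2
```

The first result comes from deleting the leading A and inserting an A at the end, which is what the gapped alignment shows.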
If the sequence has m letters, there are m+1 places where a gap can go (before, between and after the letters), so there are \(2^{m+1}\) ways to insert gaps of length one
CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
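The list above can be generated with a short Python sketch. It assumes each slot (before, between and after the letters) holds either no gap or a single gap. The function name `gap_placements` is hypothetical.

```python
from itertools import product


def gap_placements(seq):
    """Yield seq with an optional '-' in each slot before, between and after letters."""
    slots = len(seq) + 1
    for choice in product([False, True], repeat=slots):
        pieces = []
        for i, gap in enumerate(choice):
            if gap:
                pieces.append("-")
            if i < len(seq):
                pieces.append(seq[i])
        yield "".join(pieces)


placements = list(gap_placements("CAT"))
print(len(placements))  # 16
print(placements[:4])   # ['CAT', 'CAT-', 'CA-T', 'CA-T-']
```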
We could also do
C--A-----T
or
----CA-T--
or
-C--A--T--
We will see more details later
The idea is to draw a rectangle, with one sequence along each side
The goal is to move from one corner to the opposite corner
Jumping from black to black is free
Horizontal and vertical moves are gaps
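A text version of such a rectangle can be sketched in Python. The sketch assumes that a black cell marks a position where the row letter equals the column letter, and it prints `#` for those cells. The names `dot_plot` and `show` are hypothetical.

```python
def dot_plot(rows, cols):
    """Return a grid with True where the row letter equals the column letter."""
    return [[r == c for c in cols] for r in rows]


def show(rows, cols):
    """Print the grid, using '#' for matching cells and '.' for the rest."""
    print("  " + " ".join(cols))
    for r, line in zip(rows, dot_plot(rows, cols)):
        print(r + " " + " ".join("#" if cell else "." for cell in line))


show("MOUSE", "GROUSE")
```

For MOUSE against GROUSE, the matching cells for O, U, S and E line up along one diagonal, which corresponds to the alignment with Hamming distance 2.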
Prepare a DotPlot in Excel or Google Sheets