A function is a rule connecting each element of one set to one element of another set
Sometimes they are written with formulas
Sometimes they are written case-by-case
There’s a function connecting PubMed ids to PubMed records
Each PubMed ID points to one, and only one, PubMed record
The same is true for Nucleotide database, and protein database
(Obviously with different sets on each case)
TTT F Phe TCT S Ser TAT Y Tyr TGT C Cys
TTC F Phe TCC S Ser TAC Y Tyr TGC C Cys
TTA L Leu TCA S Ser TAA * Ter TGA * Ter
TTG L Leu i TCG S Ser TAG * Ter TGG W Trp
CTT L Leu CCT P Pro CAT H His CGT R Arg
CTC L Leu CCC P Pro CAC H His CGC R Arg
CTA L Leu CCA P Pro CAA Q Gln CGA R Arg
CTG L Leu i CCG P Pro CAG Q Gln CGG R Arg
ATT I Ile ACT T Thr AAT N Asn AGT S Ser
ATC I Ile ACC T Thr AAC N Asn AGC S Ser
ATA I Ile ACA T Thr AAA K Lys AGA R Arg
ATG M Met i ACG T Thr AAG K Lys AGG R Arg
GTT V Val GCT A Ala GAT D Asp GGT G Gly
GTC V Val GCC A Ala GAC D Asp GGC G Gly
GTA V Val GCA A Ala GAA E Glu GGA G Gly
GTG V Val GCG A Ala GAG E Glu GGG G Gly
For our purposes each genetic code is a vector indexed by codons
AAs = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
The symbol ’*’ represents a stop codon
The line Starts =
shows codons encoding start
(we will ignore the start codon information for now)
01 FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
02 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG
03 FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG
04 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
05 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
06 FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
09 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG
10 FFLLSSSSYY**CCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
11 FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
12 FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
13 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG
14 FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG
16 FFLLSSSSYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
21 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG
22 FFLLSS*SYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
23 FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
24 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG
25 FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
26 FFLLSSSSYY**CC*WLLLAPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
27 FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
28 FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
29 FFLLSSSSYYYYCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
30 FFLLSSSSYYEECC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
31 FFLLSSSSYYEECCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
33 FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG
When are two sequences similar?
We compare them by counting how many times they are different
01 FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
02 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG
How many symbols are different?
First idea:
We count the number of mismatches
Both strings must have the same length
ASELLKYLTT
ASELLKALTT
Here
distance(ASELLKYLTT
,ASELLKALTT
)
is 1
How many substitutions we need to go from one sequence to the other
CAT
and CAT
have a Hamming distance of
0CAT
and BAT
have a Hamming distance of
1CAT
and BAG
have a Hamming distance of
2In that case we need to insert “gaps”
MOUSE
GROUSE
The distance depends on the gaps positions
-MOUSE
GROUSE
Hamming Distance = 2
MOUSE--
-GROUSE
Hamming Distance = 7
In geometry we say that “the distance between a point X and a line L is the length of the shortest path from X to any point in L”
If there many “candidate distances”, we choose the smallest one
This is an important idea
Hamming distance counts substitutions between sequences
Now we counts substitutions, insertions and deletions
Hamming distance counts substitutions
ABCDEFGHIJ
BCDEFGHIJA
Hamming distance=10
Levenstein counts substitutions, insertions and deletions
ABCDEFGHIJK-
-BCDEFGHIJKA
Levenstein distance=2
If the sequence has m letters, there are 2m+1 ways to insert a single gap
CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
We could also do
C--A-----T
or
----CA-T--
or
-C--A--T--
We will see more details later
The idea is to draw a rectangle
The goal is to move from one corner to the other
Jumping black to black is free
Horizontal and vertical moves are gaps
The three ideas of today
Calculate the Hamming distance between all genetic codes
This will be a matrix
What is the size of that matrix?