This week we have a few mandatory questions, some bonus questions, and some optional mathematics for your amusement. Answers to the bonus questions are optional and give extra points if they are right. You do not lose points if they are wrong, so there is nothing to lose and it is worth trying.
Homework
Write an Entrez query to get all 16S nucleotide sequences from E. coli with a length of at least 1400 base pairs.
Write an Entrez query to get all complete Globin protein sequences. The sequence length should be between 200 and 1000 amino acids. The title should contain neither “partial” nor “domain-containing”.
Make a Hamming distance calculator in Excel or Google Sheets.
How many comparisons do you need to calculate the Hamming distance between all genetic codes?
Prepare a DotPlot in Excel or Google Sheets. Use it to compare the following sequences.
ABCDEFGHIJKLMNOPQRSUTUVWXYZ
ABCDEAFGNIJKLOPQRSYTUVWXXZ
Use the previous answer to find the Levenshtein distance between the two sequences.
Mathematical definition of “distance”
(This part is optional. It is useful in life, but it is not necessary for this course.)
Distance is a function taking pairs of objects and returning a number. It is the length of the shortest path between two points.
Notice that the “shortest path” depends on which movements are allowed. For instance, what is the distance between our campus and Taksim Square?
If we can fly, like a bird or a drone, then the shortest path is a straight line. This is called Euclidean distance, because it is based on the classic geometry described 2300 years ago by Euclid.
If we have to walk, we must move through the streets. When the city is organized as a grid, like the island of Manhattan in New York, this is called Manhattan distance. (Most European cities are not laid out on such a grid.) This distance is often used in biology.
If we take the metro, we will probably count the number of stations we need to cross. The metro is a network, so this is a network distance.
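The three ways of moving described above can be sketched in Python. This is only an illustration under stated assumptions: the coordinates of the two places and the toy metro network are invented for the example, and the function names `euclidean`, `manhattan`, and `network_distance` are not part of the course material.

```python
import math
from collections import deque


def euclidean(p, q):
    """Straight-line ("flying") distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Distance when we can only move along the axes of a street grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


def network_distance(graph, start, goal):
    """Number of edges on the shortest path between two nodes.

    Uses breadth-first search; returns math.inf if there is no path.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return math.inf


# Hypothetical coordinates (in km) for the two places; not real map data.
campus = (0, 0)
square = (3, 4)

# Hypothetical toy metro network: each station lists its direct neighbours.
metro = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["D"],
}

print(euclidean(campus, square))          # 5.0
print(manhattan(campus, square))          # 7
print(network_distance(metro, "A", "E"))  # 3
```

The same pair of places gets a different distance under each rule, which is the point of the examples above: the number depends on the allowed movements.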
To be a distance, a function \(d\) needs to obey the following rules
- \(d(x,y)≥0\) for all \(x,y\)
- \(d(x,x)=0\) for all \(x\) (we say that \(d\) is reflexive)
- if \(d(x,y) = 0\) then \(x=y\) (this is called the identity of indiscernibles)
- \(d(x,y) = d(y,x)\) for all \(x,y\) (we say that \(d\) is symmetric)
- \(d(x,y) + d(y,z) ≥ d(x,z)\) for all \(x,y,z\) (this is called the triangle inequality)
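One way to get a feeling for these rules is to test them numerically on a small sample of objects. The sketch below is an assumption-laden illustration, not a proof: the helper `check_distance_rules` and the sample points are invented here, and passing the check on a finite sample only means that no counterexample was found.

```python
from itertools import product


def manhattan(p, q):
    """Manhattan distance between two points with integer coordinates."""
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_difference(p, q):
    """Squared Euclidean distance; used here as a function that breaks a rule."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def check_distance_rules(d, points):
    """Return the sorted names of the rules that d violates on the sample."""
    failed = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            failed.add("non-negative")
        if x == y and d(x, y) != 0:
            failed.add("d(x,x) = 0")
        if x != y and d(x, y) == 0:
            failed.add("d(x,y) = 0 implies x = y")
        if d(x, y) != d(y, x):
            failed.add("symmetric")
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z):
            failed.add("triangle inequality")
    return sorted(failed)


# Hypothetical sample points, chosen only for illustration.
sample_2d = [(0, 0), (1, 2), (3, 1), (-2, 4)]
sample_1d = [(0,), (1,), (2,)]

print(check_distance_rules(manhattan, sample_2d))           # []
print(check_distance_rules(squared_difference, sample_1d))  # ['triangle inequality']
```

The second function fails because going from 0 to 2 directly costs 4, while going through 1 costs only 1 + 1 = 2, so the "detour" is shorter than the direct path.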
Bonus questions
Can you see why the last property is called the triangle inequality?
Can you prove that “Hamming distance” is indeed a “distance”, according to the definition given above?
Write a hamming_dist(x,y) function in R, Python, or any other programming language.
Use the previous answer to calculate the distance between all genetic codes. The answer should be a matrix with one row and one column for each genetic code.
Write a script using the EDirect tools, or a library such as rentrez, to download the first 40 amino acids of each protein found in question (2).