We will explore some methods to find which parts of a text are similar to a pattern. For instance, the text can be a genome, and the pattern can be a gene or a motif, but the same ideas apply to any text and any fixed pattern.
For this, we need a dissimilarity function \[ d: \mathcal S \times \mathcal S \to \mathbb R\] such that, for all \(x, y, z \in \mathcal S,\) we have
- \(d(x, y)\geq 0\) (non-negativity)
- \(d(x, x)=0\) (reflexivity)
- \(d(x, y)=d(y, x)\) (symmetry)
- \(d(x, y) + d(y, z)\geq d(x,z)\) (triangular inequality)
that is, the dissimilarity is never a negative number (i.e. it is always positive or zero), the dissimilarity if anything to itself is 0 (i.e everything is similar to itself), the dissimilarity is independent of the order, and the direct comparison of two elements is always better (or the same) than comparing to a third element.
In this case the set \(\mathcal S\) will be the set of character vectors, or strings. That is, we will compare words and parts of texts, such as genes, genomes, motifs and proteins.
Write a function (in any formal language), taking two strings of the same size —named
x
andy
—, and returning the Hamming distance of the two sequences. A good name for this function isHamming
.Remember that the Hamming distance is just the number of letters that are different between the two strings. It is the number of mismatches. If the strings represent DNA, the Hamming distance represents the number of mutations.
Write another function that takes two strings of different size —named
pattern
andtext
—, and calculates the Hamming distance betweenpattern
and the corresponding substring oftext
at each location. For example in R language, ifpattern
length ism
, then the value at locationk
will beans[k] <- Hamming(pattern, text[k:(k+m-1)])
If you use Python or other similar language, the indices will be a little different. They key part of this question is to find the valid range for
k
.Write a third function, based on the previous one, that takes two strings of different size —named
pattern
andtext
— and a number namedthr
, and returns a list of all locations intext
whose distance topattern
is less or equal that the thresholdthr
.