Three answers to three different questions
Align sequence to sequence
Align sequence to subsequence
Align subsequence to subsequence
BLAST is local, not global
Each subject gets a score
The threshold depends on the chosen E-value
The score depends on the substitution matrix and gap costs
These depend on evolutionary hypotheses
Choose the matrix wisely
Score does not depend on the database
Databases may change with time
Therefore E-values may change
Choose your database wisely
Write down the date when you ran the search
Aligning two sequences gives a score
Searching a sequence in a database gives an E-value
It depends on the technology and the algorithm
Let’s ignore the technology
How many comparisons/multiplications are needed?
It depends on query and subject size, so \[Cost=f(m, n)\] where \(m\) and \(n\) are the query and subject lengths
So, what is the formula?
Small sizes take a short time
Larger sizes take a longer time
We do not care about specific numbers
For example \(100m^2 n^4\) is equivalent to \(m^2 n^4\)
We say “the cost is on the order of \(m^2 n^4\)”
We write \[O(m^2 n^4)\]
Size of dot plot matrix is \(m\cdot n\)
Building dot plot matrix takes time \(m\cdot n\)
Then we have to find “the best diagonals”
Thus, computational cost is \(O(m\cdot n)\)
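A minimal sketch of why building the dot plot costs \(m\cdot n\): one cell is filled for every pair of query and subject positions. The function name `dot_plot` and the two example sequences are illustrative assumptions, not part of the original material.

```python
def dot_plot(query, subject):
    """Return an m x n matrix with 1 where the two letters match, 0 elsewhere."""
    return [[1 if q == s else 0 for s in subject] for q in query]


matrix = dot_plot("GATTACA", "ATTAC")
cells = sum(len(row) for row in matrix)  # m * n cells were filled

# Print the dot plot: '*' marks a match, '.' a mismatch.
for row in matrix:
    print("".join("*" if cell else "." for cell in row))
```

Doubling either sequence length doubles the number of cells, which is what \(O(m\cdot n)\) expresses.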
A substitution matrix is 4×4 (nucleotides) or 20×20 (amino acids)
It is also known as a scoring matrix
It is symmetric
It tells us what to write in each cell of the dot plot matrix
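As a sketch, a 4×4 substitution matrix can be stored as a lookup table and used to fill the dot plot cells. The scores here (+1 for a match, -1 for a mismatch) are hypothetical values chosen for illustration; they are not a published matrix.

```python
ALPHABET = "ACGT"

# Hypothetical scores: +1 for a match, -1 for a mismatch (illustration only).
SCORES = {(a, b): (1 if a == b else -1) for a in ALPHABET for b in ALPHABET}


def score_matrix(query, subject):
    """Fill each dot plot cell with the substitution score of its two letters."""
    return [[SCORES[(q, s)] for s in subject] for q in query]


# Symmetric means the score of (a, b) equals the score of (b, a).
is_symmetric = all(
    SCORES[(a, b)] == SCORES[(b, a)] for a in ALPHABET for b in ALPHABET
)
cells = score_matrix("ACG", "AG")
print(is_symmetric, cells)
```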
For local alignment we have two cases
Global and semi-global alignments always have gaps
Search for “and”, “it”, and “in”. Without gaps, full word.
In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income. The rest of it
went in a doublet of fine cloth and velvet breeches and shoes to match
for holidays, while on week-days he made a brave figure in his best
homespun. He had in his house a housekeeper past forty, a niece under
twenty, and a lad for the field and market-place, who used to saddle the
hack as well as handle the bill-hook. The age of this gentleman of ours
was bordering on fifty; he was of a hardy habit, spare, gaunt-featured, a
very early riser and a great sportsman. They will have it his surname was
Quijada or Quesada (for here there is some difference of opinion among
the authors who write on the subject), although from reasonable
conjectures it seems plain that he was called Quejana. This, however, is
of but little importance to our tale; it will be enough not to stray a
hair's breadth from the truth in the telling of it.
You must know, then, that the above-named gentleman whenever he was at
leisure (which was mostly all the year round) gave himself up to reading
books of chivalry with such ardour and avidity that he almost entirely
neglected the pursuit of his field-sports, and even the management of his
property; and to such a pitch did his eagerness and infatuation go that
he sold many an acre of tillageland to buy books of chivalry to read, and
brought home as many of them as he could get. But of all there were none
he liked so well as those of the famous Feliciano de Silva's composition,
for their lucidity of style and complicated conceits were as pearls in
his sight, particularly when in his reading he came upon courtships and
cartels, where he often found passages like "the reason of the unreason
with which my reason is afflicted so weakens my reason that with reason I
murmur at your beauty;" or again, "the high heavens, that of your
divinity divinely fortify you with the stars, render you deserving of the
desert your greatness deserves." Over conceits of this sort the poor
gentleman lost his wits, and used to lie awake striving to understand
them and worm the meaning out of them; what Aristotle himself could not
have made out or extracted had he come to life again for that special
purpose. He was not at all easy about the wounds which Don Belianis gave
and took, because it seemed to him that, great as were the surgeons who
had cured him, he must have had his face and body covered all over with
seams and scars. He commended, however, the author's way of ending his
book with the promise of that interminable adventure, and many a time was
he tempted to take up his pen and finish it properly as is there
proposed, which no doubt he would have done, and made a successful piece
of work of it too, had not greater and more absorbing thoughts prevented
him.
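The search exercise can be sketched as a plain scan: read every line and compare every word with the query. The function name `find_word`, the tokenization rule (lowercase letters, apostrophes and hyphens), and the use of only the first three lines of the text are assumptions made for this example.

```python
import re


def find_word(lines, word):
    """Return (line, position) pairs, counted from 1, where `word` is a full word."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        words = re.findall(r"[a-z'-]+", line.lower())
        for position, candidate in enumerate(words, start=1):
            if candidate == word:  # exact match, no gaps
                hits.append((line_number, position))
    return hits


excerpt = [
    "In a village of La Mancha, the name of which I have no desire to call to",
    "mind, there lived not long since one of those gentlemen that keep a lance",
    "in the lance-rack, an old buckler, a lean hack, and a greyhound for",
]
for word in ("and", "it", "in"):
    print(word, find_word(excerpt, word))
```

Note that “in” is not reported inside “mind”, because only full words are compared. Every word of the text is visited once per query, so the cost grows with the size of the text.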
Let’s say that the query has length \(m\), each subject has length \(n\), and the database has \(d\) subjects
Then each query-subject comparison takes \(O(mn)\)
and searching the whole database takes \(O(mnd)\)
a 1,2 2,13 3,7 3,11 4,10 5,9 7,3 8,8 9,7 9,11 10,3 12,8 12,13 13,5 17,15 23,5 41,11 43,11
about 37,8
above-named 19,7
absorbing 44,11
acre 24,5
adventure 41,8
afflicted 30,6
again 31,6 36,11
age 11,9
all 20,5 25,13 37,6 39,13
almost 21,11
although 15,8
among 14,12
an 3,4 4,2 24,4
and 3,10 5,8 7,8 7,11 10,2 10,8 13,4 21,7 22,7 23,2 23,10 24,15 27,6 28,12 34,5 35,2 38,1 39,10 40,2 41,9 42,8 43,9 44,9
ardour 21,6
aristotle 35,10
as 11,2 11,4 25,3 25,7 26,5 27,10 38,10 42,12
at 19,12 31,2 37,5
author's 40,8
authors 15,2
avidity 21,8
awake 34,9
away 6,4
be 17,10
beauty 31,4
because 38,3
beef 4,7
belianis 37,13
best 8,13
bill-hook 11,7
body 39,11
book 41,1
books 21,1 24,10
bordering 12,2
brave 8,9
breadth 18,2
breeches 7,10
brought 25,1
buckler 3,6
but 17,2 25,11
buy 24,9
call 1,16
called 16,8
came 28,9
cartels 29,1
chivalry 21,3 24,12
cloth 7,7
come 36,8
commended 40,5
complicated 27,7
composition 26,13
conceits 27,8 33,6
conjectures 16,1
could 25,9 35,12
coursing 4,1
courtships 28,11
covered 39,12
cured 39,2
de 26,11
desert 33,1
deserves 33,4
deserving 32,10
desire 1,14
did 23,7
difference 14,9
divinely 32,2
divinity 32,1
don 37,12
done 43,8
doublet 7,4
doubt 43,4
eagerness 23,9
early 13,2
easy 37,7
ending 40,11
enough 17,11
entirely 21,12
even 22,8
extra 5,13
extracted 36,5
face 39,9
famous 26,9
feliciano 26,10
field 10,7
field-sports 22,6
fifty 12,4
figure 8,10
fine 7,6
finish 42,9
for 3,13 8,1 10,5 14,4 27,1 36,12
fortify 32,3
forty 9,10
found 29,5
fridays 5,7
from 15,9 18,3
gaunt-featured 12,12
gave 20,9 37,14
gentleman 11,12 19,8 34,1
gentlemen 2,10
get 25,10
go 23,12
great 13,6 38,9
greater 44,8
greatness 33,3
greyhound 3,12
habit 12,10
hack 3,9 11,1
had 9,3 36,6 39,1 39,7 44,6
hair's 18,1
handle 11,5
hardy 12,9
have 1,12 13,10 36,1 39,6 43,7
he 8,6 9,2 12,5 16,6 19,10 21,10 24,1 25,8 26,1 28,8 29,3 36,7 37,2 39,4 40,4 42,1 43,5
heavens 31,9
here 14,5
high 31,8
him 38,7 39,3 45,1
himself 20,10 35,11
his 6,8 8,12 9,5 13,12 22,5 22,12 23,8 28,1 28,6 34,3 39,8 40,12 42,6
holidays 8,2
home 25,2
homespun 9,1
house 9,6
housekeeper 9,8
however 16,11 40,6
i 1,11 30,14
importance 17,4
in 1,1 3,1 7,2 8,11 9,4 18,6 27,12 28,5
income 6,9
infatuation 23,11
interminable 41,7
is 14,7 16,12 30,5 42,13
it 6,13 13,11 16,2 17,8 18,10 38,4 42,10 44,4
keep 2,12
know 19,3
la 1,5
lad 10,4
lance 2,14
lance-rack 3,3
lean 3,8
leisure 20,1
lentils 5,5
lie 34,8
life 36,10
like 29,7
liked 26,2
little 17,3
lived 2,3
long 2,5
lost 34,2
lucidity 27,3
made 6,3 8,7 36,2 43,10
management 22,10
mancha 1,6
many 24,3 25,4 41,10
market-place 10,9
match 7,14
meaning 35,5
mind 2,1
more 4,6 44,10
most 4,13
mostly 20,4
murmur 31,1
must 19,2 39,5
mutton 4,9
my 30,3 30,9
name 1,8
neglected 22,1
niece 9,12
nights 5,1
no 1,13 43,3
none 25,16
not 2,4 17,12 35,13 37,4 44,7
of 1,4 1,9 2,8 4,4 6,7 6,12 7,5 11,10 11,13 12,7 14,10 17,1 18,9 21,2 22,4 22,11 24,6 24,11 25,5 25,12 26,7 27,4 29,10 31,11 32,11 33,7 35,7 40,10 41,5 44,1 44,3
often 29,4
old 3,5
olla 4,3
on 4,12 5,3 5,6 6,1 8,4 12,3 15,5
one 2,7
opinion 14,11
or 5,11 14,2 31,5 36,4
our 17,6
ours 11,14
out 35,6 36,3
over 33,5 39,14
particularly 28,3
passages 29,6
past 9,9
pearls 27,11
pen 42,7
piece 43,13
pigeon 5,10
pitch 23,6
plain 16,4
poor 33,11
prevented 44,13
promise 41,4
properly 42,11
property 23,1
proposed 43,1
purpose 37,1
pursuit 22,3
quejana 16,9
quesada 14,3
quijada 14,1
rather 4,5
read 24,14
reading 20,13 28,7
reason 29,9 30,4 30,10 30,13
reasonable 15,10
render 32,8
rest 6,11
riser 13,3
round 20,8
saddle 10,13
salad 4,11
saturdays 5,4
scars 40,3
scraps 5,2
seams 40,1
seemed 38,5
seems 16,3
shoes 7,12
sight 28,2
silva's 26,12
since 2,6
so 5,12 26,3 30,7
sold 24,2
some 14,8
sort 33,9
spare 12,11
special 36,14
sportsman 13,7
stars 32,7
stray 17,14
striving 34,10
style 27,5
subject 15,7
successful 43,12
such 21,5 23,4
sundays 6,2
surgeons 38,13
surname 13,13
take 42,4
tale 17,7
telling 18,8
tempted 42,2
than 4,8
that 2,11 16,5 19,5 21,9 23,13 30,11 31,10 36,13 38,8 41,6
the 1,7 3,2 6,10 10,6 10,14 11,6 11,8 15,1 15,6 18,4 18,7 19,6 20,6 22,2 22,9 26,8 29,8 29,11 31,7 32,6 32,12 33,10 35,4 37,9 38,12 40,7 41,3
their 27,2
them 25,6 35,1 35,8
then 19,4
there 2,2 14,6 25,14 42,14
they 13,8
this 11,11 16,10 33,8
those 2,9 26,6
thoughts 44,12
three-quarters 6,6
tillageland 24,7
time 41,12
to 1,15 1,17 7,13 10,12 17,5 17,13 20,12 23,3 24,8 24,13 34,7 34,11 36,9 38,6 42,3
too 44,5
took 38,2
truth 18,5
twenty 10,1
under 9,13
understand 34,12
unreason 29,12
up 20,11 42,5
upon 28,10
used 10,11 34,6
velvet 7,9
very 13,1
village 1,3
was 12,1 12,6 13,14 16,7 19,11 20,3 37,3 41,13
way 40,9
weakens 30,8
week-days 8,5
well 11,3 26,4
went 7,1
were 25,15 27,9 38,11
what 35,9
when 28,4
whenever 19,9
where 29,2
which 1,10 20,2 30,2 37,11 43,2
while 8,3
who 10,10 15,3 38,14
will 13,9 17,9
with 6,5 21,4 30,1 30,12 32,5 39,15 41,2
wits 34,4
work 44,2
worm 35,3
would 43,6
wounds 37,10
write 15,4
year 20,7
you 19,1 32,4 32,9
your 31,3 31,12 33,2
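An index like the one above, listing each word with its line and position, can be built in a single pass over the text. This is a minimal sketch: the function name `build_index` and the tokenization rule are assumptions, and only the first three lines of the text are used.

```python
import re
from collections import defaultdict


def build_index(lines):
    """Map each lowercase word to the (line, position) pairs where it occurs."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        words = re.findall(r"[a-z'-]+", line.lower())
        for position, word in enumerate(words, start=1):
            index[word].append((line_number, position))
    return dict(index)


excerpt = [
    "In a village of La Mancha, the name of which I have no desire to call to",
    "mind, there lived not long since one of those gentlemen that keep a lance",
    "in the lance-rack, an old buckler, a lean hack, and a greyhound for",
]
index = build_index(excerpt)

# Print the first entries in alphabetical order, in the "word line,position" layout.
for word in sorted(index)[:5]:
    print(word, " ".join(f"{line},{pos}" for line, pos in index[word]))
```

Once the index exists, finding a word is a lookup in the index instead of a scan of the whole text.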
The database is \((s_1,\ldots,s_d)\)
We need two auxiliary variables. Let \(l \leftarrow 1,\ u \leftarrow d\)
To search in a sorted file, we start in the middle
In a sorted file, we can discard half of the database after every comparison
So we need to compare the query with about \(\log_2(d)\) subjects
Thus, the search cost is \(O(mn\log_2(d))\)
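The steps above can be sketched as a binary search. This assumes the database is a sorted Python list of strings and that subjects are compared with the usual string ordering; the function name `binary_search` and the word list are illustrative. The variables `l` and `u` follow the text and are 1-based.

```python
def binary_search(database, query):
    """Return (1-based position or None, number of subjects compared)."""
    l, u = 1, len(database)  # the two auxiliary variables
    comparisons = 0
    while l <= u:
        middle = (l + u) // 2  # start in the middle of what is left
        subject = database[middle - 1]
        comparisons += 1
        if subject == query:
            return middle, comparisons
        if subject < query:
            l = middle + 1  # discard the lower half
        else:
            u = middle - 1  # discard the upper half
    return None, comparisons


database = sorted(
    ["hack", "lance", "buckler", "greyhound", "olla", "pigeon", "salad"]
)
print(binary_search(database, "olla"))
print(binary_search(database, "quijote"))
```

With 7 subjects, at most 3 comparisons are needed, in line with \(\log_2(d)\).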
Let’s say that \(m=100, n=100.\) Then
| d | comparisons (plain database) | time | comparisons (with index) | time |
|---|---|---|---|---|
| 1000 | 1e+07 | 10 sec | 99658 | 0.1 sec |
| 10000 | 1e+08 | 1.7 min | 132877 | 0.13 sec |
| 1e+05 | 1e+09 | 16.7 min | 166096 | 0.17 sec |
| 1e+06 | 1e+10 | 2.8 hours | 199316 | 0.2 sec |
| 1e+07 | 1e+11 | 1.2 days | 232535 | 0.23 sec |
| 1e+08 | 1e+12 | 11.6 days | 265754 | 0.27 sec |
| 1e+09 | 1e+13 | 115.7 days | 298974 | 0.3 sec |
assuming \(10^{6}\) comparisons each second
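The numbers in the table can be reproduced from the two cost formulas, \(mnd\) and \(mn\log_2(d)\). The names `plain_cost` and `indexed_cost` are illustrative; the values of \(m\), \(n\) and the rate are the ones stated in the text.

```python
import math

M, N = 100, 100  # query and subject lengths
RATE = 10**6     # comparisons per second


def plain_cost(d):
    """Comparisons needed to scan a database of d subjects."""
    return M * N * d


def indexed_cost(d):
    """Comparisons needed when a sorted database allows binary search."""
    return M * N * math.log2(d)


for d in (10**3, 10**6, 10**9):
    plain, indexed = plain_cost(d), indexed_cost(d)
    print(d, plain, plain / RATE, round(indexed), round(indexed / RATE, 2))
```

Times are printed in seconds; for example \(10^{13}/10^{6}=10^{7}\) seconds is about 115.7 days.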
There are many strategies to index databases
This is one of the main differences between tools
A big part of bioinformatics is to know which index to use
All previous discussion assumed exact match
It can be extended to partial match
Let’s say that there are at most 3 mismatches (indels or gaps) between query and the best subject
To find a partial match with \(n\) mismatches
BLAST uses indices to look for an initial hit
(sometimes called seed)
Then it tries to extend the hit by building a dot matrix around it
The key parameter is word size
Words of larger size are faster to search
Small word size is more sensitive
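The effect of word size can be illustrated with a simplified seed search: index every word of a given size in the subject, then look up each word of the query. This is only a sketch of the idea of an initial hit, not BLAST's actual implementation; the function names and the two sequences are hypothetical.

```python
from collections import defaultdict


def word_index(subject, word_size):
    """Map every word of length `word_size` in the subject to its start positions."""
    index = defaultdict(list)
    for start in range(len(subject) - word_size + 1):
        index[subject[start:start + word_size]].append(start)
    return index


def find_seeds(query, subject, word_size):
    """Return (query_start, subject_start) pairs where a whole word matches exactly."""
    index = word_index(subject, word_size)
    seeds = []
    for q_start in range(len(query) - word_size + 1):
        word = query[q_start:q_start + word_size]
        for s_start in index.get(word, []):
            seeds.append((q_start, s_start))
    return seeds


query, subject = "GATTACA", "CCATTAGATT"
print(find_seeds(query, subject, 3))  # short words: several seeds
print(find_seeds(query, subject, 5))  # long words: no seed for this pair
```

With word size 3 there are more candidate hits to extend, which is slower but more sensitive. With word size 5 the same pair of sequences produces no seed at all.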
In particular Multiple Alignment
Good information systems assign an ID to everything
NCBI assigns a Request ID to each request
Write it down
Use it to recover the result up to 36 hours later
To speed things up, we have prepared some searches
One classic example is to look for ELVIS
We can also search these proteins: http://dry-lab.org/static/bioinfo/dhn3.faa
You can look for the results later
Long term storage
Must be logged into myNCBI
You can choose which columns to show
This is the subsequence that matches, not the whole subject