Looking for sequences in databases
We know how to compare two sequences
GROUSE
M-OUSE
- We can calculate a distance between sequences
- Smaller distances mean more similarity
- It works for global and semi-global alignment
- We can calculate a score for each alignment
- We can decide where to put gaps and substitutions to compare two sequences
- Works for global, semi-global, and local alignment
Alignment score depends on Substitution matrix
Score can change
If mismatches and gaps have different cost, the score will change
Sometimes the optimal alignment changes
Therefore alignments are meaningless without knowing the scoring matrices
Later we will discuss how to choose the “best” scoring matrix for each case
Looking on a database
We have one sequence, called query
We compare our sequences with each sequence in a Database
(sequences in the database are called subjects)
We get the score of each alignment
We report all subjects with score over a threshold
What is the best threshold?
We want big scores
How big is big enough?
We need to make several hypothesis
The most common hypothesis is statistical
Larger scores, less hits
A hit is a subject with score over a threshold
Larger score thresholds give less hits
We can estimate the number of hits in a given database, assuming randomness
That is called Expected value
Expected value as a threshold
In practice, we choose a small Expected value
(usually called E-value)
Something like 10-5 or 10-20
What we find is not random
and maybe it is biologically meaningful
E-value depends on the database
The formula for E-value depends on
- The substitution scoring matrix
- The query size
- The database size
Same alignments in different databases have different E-value
but the same score
Use the smallest relevant database
BLAST
The most common tool for local alignment is BLAST
Basic Local Alignment Search Tool
BLAST is not Global Alignment
Using BLAST
There are two ways of using BLAST
- Going to NCBI’s website: https://blast.ncbi.nlm.nih.gov/
- Runs in NCBI servers with NCBI databases
- Using a command line version
- Runs in your server with your databases
- can also send jobs in NCBI servers with NCBI databases
- Download it from NCBI website
For today we can look at NCBI page
Types of BLAST
Depending on the alphabet of the query and subject
- BlastN
- Search nucleotides in nucleotide databases
- BlastP
- Search proteins in protein databases
- BlastX
- Search nucleotide in protein databases.
- Each query is translated into 6 putative proteins
Types of BLAST
- TBlastN
- Search proteins in nucleotide databases.
- Each subject is translated into 6 putative proteins
- TblastX
- Search nucleotides in nucleotide databases
- Translate each query and each subject into 6 proteins
- Compares all the resulting proteins
NCBI protein databases
- nr
- Non-redundant protein sequences
- refseq_protein
- Reference proteins
- refseq_select
- Reference Select proteins
What is “Non-Redundant”?
These databases get data from several sources
Sometimes two people upload the same sequence but with different ID
For example, EMBL ID, GenBank ID, RefSeq ID, etc.
This database combines all identical entries into one, and keeps all the alternative IDs
NCBI protein databases
- landmark
- Model Organisms
- swissprot
- UniProtKB/Swiss-Prot
- pat_aa
- Patented protein sequences
NCBI protein databases
- pdb
- Protein Data Bank proteins
- env_nr
- Metagenomic proteins
- tsa_nr
- Transcriptome Shotgun Assembly proteins
NCBI nucleotide databases
- Human G+T
- Human genomic plus transcript
- Mouse G+T
- Mouse genomic plus transcript
- nr/nt
- Nucleotide collection
NCBI nucleotide databases
- Bacteria and Archaea
- 16S ribosomal RNA sequences
- refseq_select
- Reference Select sequences
- refseq_rna
- Reference RNA sequences
NCBI nucleotide databases
- refseq_representative_genomes
- RefSeq Representative genomes
- refseq_genomes
- RefSeq Genome Database
NCBI nucleotide (reads)
- SRA
- Sequence Read Archive
- TSA
- Transcriptome Shotgun Assembly
- HTGS
- High throughput genomic sequences
NCBI nucleotide databases
- pat
- Patent sequences
- pdb
- nucleotides in Protein Data Bank
- RefSeq_Gene
- Human RefSeqGene sequences
BlastN variants
- megablast
- Highly similar sequences
- discontiguous megablast
- More dissimilar sequences
- blastn
- Somewhat similar sequences
BlastP variants
- blastp
- protein-protein BLAST.
- PSI-BLAST
- Position-Specific Iterated BLAST.
- builds a position-specific scoring matrix.
- PHI-BLAST
- Pattern Hit Initiated BLAST.
- limits alignments to those that match a pattern in the query.
BlastP variants
- Quick BLASTP
- Accelerated protein-protein BLAST.
- very fast and works best if the target percent identity is 50% or more.
- DELTA-BLAST
- Domain Enhanced Lookup Time Accelerated BLAST.
- builds a PSSM using a Conserved Domain Database search.
- searches a sequence database.