Looking for sequences in databases
We know how to compare two sequences
GROUSE
M-OUSE
- We calculate a score for each alignment
- We can decide where to put gaps and substitutions to compare two sequences
- Works for global, semi-global, and local alignment
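As a minimal sketch of how such a score could be calculated, the snippet below scores the GROUSE / M-OUSE alignment shown above. The scoring values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions chosen for illustration, not values taken from any particular tool.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # one side has a gap in this column
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # substitution
    return score


# G/M is a substitution, R/- is a gap, and O, U, S, E are matches.
print(alignment_score("GROUSE", "M-OUSE"))
```

With other scoring values the same alignment gets a different score, which is why scores are only comparable under one fixed scoring scheme.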
Searching a database
We have one sequence, called the query
We compare our query with each sequence in a database
(sequences in the database are called subjects)
We get the score of each alignment
We report all subjects with a score over a threshold
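The search described above can be sketched as a simple loop. This is an exhaustive toy version that assumes a small in-memory database and a basic Smith-Waterman local score with made-up scoring values. It is not how BLAST itself searches, since BLAST uses heuristics to avoid aligning against every subject in full. The names `local_score` and `search` are hypothetical.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database, threshold):
    """Return (subject_id, score) for every subject scoring over the threshold."""
    hits = []
    for subject_id, subject in database.items():
        score = local_score(query, subject)
        if score > threshold:
            hits.append((subject_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy database; the IDs and sequences are invented for this example.
database = {"s1": "GROUSE", "s2": "HOUSE", "s3": "CAT"}
hits = search("MOUSE", database, threshold=3)
print(hits)
```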
What is the best threshold?
We want big scores
How big is big enough?
We need to make several hypotheses
The most common hypothesis is statistical
Larger scores, fewer hits
A hit is a subject with a score over a threshold
Larger score thresholds give fewer hits
We can estimate the number of hits in a given database, assuming randomness
That estimate is called the Expected value (E-value)
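One common form of this estimate is the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where S is the score, m the query length, and n the total database length. The sketch below evaluates it; the values used for `lam` and `k` are placeholders chosen for illustration, since the real ones depend on the scoring matrix and gap penalties.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `score` (Karlin-Altschul form)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Illustrative parameters only: lam and k are assumptions, not tool defaults.
LAM, K = 0.267, 0.041

for s in (40, 60, 80):
    e = expected_hits(s, query_len=300, db_len=10_000_000, lam=LAM, k=K)
    print(f"score {s}: expected hits by chance = {e:.3g}")
```

The expected number of chance hits falls exponentially as the score grows and rises in proportion to the database size, so the same score is less convincing in a larger database.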
BLAST
The most common tool for local alignment is BLAST
Basic Local Alignment Search Tool
BLAST is not Global Alignment
Using BLAST
There are two ways of using BLAST
- Going to NCBI’s website: https://blast.ncbi.nlm.nih.gov/
  - Runs on NCBI servers with NCBI databases
- Using a command-line version
  - Runs on your server with your databases
  - Can also send jobs to NCBI servers with NCBI databases
  - Download it from the NCBI website
For today we will look at the NCBI page
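For later reference, the command-line route might look like the sketch below, which builds a `blastp` call from Python. It assumes the BLAST+ package is installed and on the PATH. The file name `query.fasta`, the database name `my_proteins`, and the helper names are hypothetical. The flags used (`-query`, `-db`, `-evalue`, `-outfmt`, `-remote`) are standard BLAST+ options.

```python
import subprocess


def blastp_command(query_fasta, database, evalue=1e-5, remote=False):
    """Build the argument list for a blastp search with tabular output."""
    cmd = [
        "blastp",
        "-query", query_fasta,    # FASTA file with the query sequence(s)
        "-db", database,          # local database name, or an NCBI one with -remote
        "-evalue", str(evalue),   # report only hits at or below this E-value
        "-outfmt", "6",           # tab-separated table
    ]
    if remote:
        cmd.append("-remote")     # send the job to NCBI servers instead
    return cmd


def run_blastp(cmd):
    """Run the command and return the tabular output as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Only the command is printed here; call run_blastp(cmd) once BLAST+ is installed.
cmd = blastp_command("query.fasta", "my_proteins")
print(" ".join(cmd))
```

A local database would first have to be built, for example with the `makeblastdb` program that ships with BLAST+.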
Types of BLAST
Depending on the alphabet of the query and subject
- BlastN
  - Search nucleotides in nucleotide databases
- BlastP
  - Search proteins in protein databases
- BlastX
  - Search nucleotides in protein databases
  - Each query is translated into 6 putative proteins (one per reading frame)
- TBlastN
  - Search proteins in nucleotide databases
  - Each subject is translated into 6 putative proteins (one per reading frame)
- TBlastX
  - Search nucleotides in nucleotide databases
  - Translates each query and each subject into 6 proteins
  - Compares all the resulting proteins
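The "6 putative proteins" come from the three reading frames on each of the two DNA strands. A minimal sketch of that translation step, assuming the standard genetic code and plain A/C/G/T input (the function names are hypothetical, and `*` marks a stop codon):

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Three forward frames followed by three reverse-strand frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```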
NCBI protein databases
- nr
  - Non-redundant protein sequences
- refseq_protein
  - Reference proteins
- refseq_select
  - Reference Select proteins
What is “Non-Redundant”?
These databases get data from several sources
Sometimes two people upload the same sequence but with different IDs
For example, EMBL ID, GenBank ID, RefSeq ID, etc.
A non-redundant database combines all identical entries into one and keeps all the alternative IDs
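The idea of merging identical entries can be sketched as grouping records by their sequence. The record IDs and sequences below are invented for the example, and `make_non_redundant` is a hypothetical helper, not part of any NCBI tool.

```python
from collections import defaultdict


def make_non_redundant(records):
    """Map each distinct sequence to the list of all IDs that carry it."""
    ids_by_sequence = defaultdict(list)
    for seq_id, sequence in records:
        ids_by_sequence[sequence].append(seq_id)
    return dict(ids_by_sequence)


# Made-up records: the first two have identical sequences under different IDs.
records = [
    ("EMBL:X1", "MKTAYIAK"),
    ("GenBank:G7", "MKTAYIAK"),
    ("RefSeq:R3", "MSTNPKPQ"),
]

nr = make_non_redundant(records)
for sequence, ids in nr.items():
    print(sequence, ids)
```

Three uploaded records become two database entries, and no ID is lost.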
NCBI protein databases
- landmark
  - Model Organisms
- swissprot
  - UniProtKB/Swiss-Prot
- pat_aa
  - Patented protein sequences
- pdb
  - Protein Data Bank proteins
- env_nr
  - Metagenomic proteins
- tsa_nr
  - Transcriptome Shotgun Assembly proteins
NCBI nucleotide databases
- Human G+T
  - Human genomic plus transcript
- Mouse G+T
  - Mouse genomic plus transcript
- nr/nt
  - Nucleotide collection
- Bacteria and Archaea
  - 16S ribosomal RNA sequences
- refseq_select
  - Reference Select sequences
- refseq_rna
  - Reference RNA sequences
- refseq_representative_genomes
  - RefSeq Representative genomes
- refseq_genomes
  - RefSeq Genome Database
NCBI nucleotide (reads)
- SRA
  - Sequence Read Archive
- TSA
  - Transcriptome Shotgun Assembly
- HTGS
  - High throughput genomic sequences
NCBI nucleotide databases
- pat
  - Patent sequences
- pdb
  - Nucleotide sequences in the Protein Data Bank
- RefSeq_Gene
  - Human RefSeqGene sequences
BlastN variants
- megablast
  - Highly similar sequences
- discontiguous megablast
  - More dissimilar sequences
- blastn
  - Somewhat similar sequences
BlastP variants
- blastp
  - Protein-protein BLAST
- PSI-BLAST
  - Position-Specific Iterated BLAST
  - Builds a position-specific scoring matrix (PSSM)
- PHI-BLAST
  - Pattern Hit Initiated BLAST
  - Limits alignments to those that match a pattern in the query
- Quick BLASTP
  - Accelerated protein-protein BLAST
  - Very fast; works best if the target percent identity is 50% or more
- DELTA-BLAST
  - Domain Enhanced Lookup Time Accelerated BLAST
  - Builds a PSSM using a Conserved Domain Database search, then searches a sequence database
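To give an idea of what a position-specific scoring matrix is, the sketch below builds one from a few aligned sequences. It is a simplified illustration: it assumes an ungapped alignment, a uniform background frequency, and a flat pseudocount. The real PSI-BLAST and DELTA-BLAST procedures are more elaborate. The function names are hypothetical.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO)  # assumption: all amino acids equally likely
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO)
        scores = {}
        for aa in AMINO:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def pssm_score(pssm, segment):
    """Sum the position-specific scores of a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Made-up alignment of three short sequences.
pssm = build_pssm(["MKV", "MKI", "MRV"])
print(round(pssm_score(pssm, "MKV"), 2), round(pssm_score(pssm, "AAA"), 2))
```

Unlike a single substitution matrix, the score of a letter here depends on its position, so residues that are conserved in a column are rewarded more strongly there.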