Searching NCBI
Advanced queries
Automatic queries
Taxonomy
NCBI search has many more options than Google
(do you know Google's search options?)
By default, the query text is searched in every field of every database
But you can specify which fields to search
protease NOT hiv1[organism]
1000:2000[slen]
10000:100000[mlwt]
src specimen voucher[properties]
Quotes (") are important
The fields are written inside brackets []
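A minimal sketch of how such fielded terms can be assembled in a script, assuming a small helper named `fielded` (a hypothetical name, not part of any NCBI library). The `title` field in the last call is only an illustration of quoting a phrase.

```python
def fielded(term: str, field: str, phrase: bool = False) -> str:
    """Return an Entrez term restricted to one field, e.g. hiv1[organism]."""
    # Quotes force the words to be searched together as a phrase
    text = f'"{term}"' if phrase else term
    return f"{text}[{field}]"


query = f"protease NOT {fielded('hiv1', 'organism')}"
print(query)  # protease NOT hiv1[organism]
print(fielded("src specimen voucher", "properties"))
print(fielded("response element", "title", phrase=True))
```

Building the string in code makes it easy to save exactly the query that was sent.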
Each database page includes an Advanced Search option
Entrez queries can be single words, short phrases, sentences, database identifiers, gene symbols, or names
AND: Finds documents that contain the terms on both sides of the operator. The intersection of both searches.
OR: Finds documents that contain either term. The union of both searches.
NOT: Finds documents that contain the term on the left but not the term on the right of the operator. The subtraction of the right side from the left side
AND must be in uppercase. It is recommended to also use uppercase for OR and NOT
Operators are processed left-to-right
promoters OR response elements NOT human AND mammals
Parentheses can be used to control the evaluation order
g1p3 AND (response element OR promoter)
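The effect of left-to-right evaluation can be illustrated with Python sets, where AND is intersection, OR is union and NOT is subtraction. This is only a sketch: the document ids in `hits` are made up, and `evaluate_left_to_right` is a hypothetical helper, not how Entrez is implemented.

```python
# Hypothetical document ids matching each search term
hits = {
    "promoters": {1, 2, 3},
    "response elements": {3, 4},
    "human": {2, 4},
    "mammals": {1, 2, 4, 5},
}


def evaluate_left_to_right(first, steps):
    """Apply each (operator, term) pair to the running result, in order."""
    result = set(hits[first])
    for operator, term in steps:
        if operator == "AND":
            result &= hits[term]  # intersection
        elif operator == "OR":
            result |= hits[term]  # union
        elif operator == "NOT":
            result -= hits[term]  # subtraction
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result


# promoters OR response elements NOT human AND mammals
left_to_right = evaluate_left_to_right(
    "promoters",
    [("OR", "response elements"), ("NOT", "human"), ("AND", "mammals")],
)

# promoters OR ((response elements NOT human) AND mammals)
grouped = hits["promoters"] | (
    (hits["response elements"] - hits["human"]) & hits["mammals"]
)

print(sorted(left_to_right))  # [1]
print(sorted(grouped))        # [1, 2, 3]
```

The same four terms give different answers depending on grouping, which is why parentheses matter.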
Certain fields can accept ranges of values
Low and high numbers are entered with a colon “:” between them followed by the field
110:500[Sequence Length]
2015/3/1:2016/4/30[Publication Date]
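A short sketch of building range terms from values held in a script. `range_term` and `date_range_term` are hypothetical helper names; the field names are the ones used in the examples above.

```python
from datetime import date


def range_term(low, high, field: str) -> str:
    """Return low:high[field], the Entrez syntax for a range of values."""
    return f"{low}:{high}[{field}]"


def date_range_term(start: date, end: date, field: str) -> str:
    """Format two dates as YYYY/M/D and join them into a range term."""
    def fmt(d: date) -> str:
        return f"{d.year}/{d.month}/{d.day}"
    return range_term(fmt(start), fmt(end), field)


length = range_term(110, 500, "Sequence Length")
published = date_range_term(date(2015, 3, 1), date(2016, 4, 30), "Publication Date")

print(length)     # 110:500[Sequence Length]
print(published)  # 2015/3/1:2016/4/30[Publication Date]
print(f"{length} AND {published}")
```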
NCBI's public documentation explains this in a different way
All documents made by NCBI are public domain
When we design molecular biology experiments, or when we analyze their results, we need to chain several tools together
Today we are going to see an example using the NCBI website
Our challenge is to find proteins in legumes having F-Box and WD-40 domains
We need an official definition of each domain
Using NCBI Conserved Domain Architecture Retrieval Tool (https://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi)
Use the query:
[pfam00646,pfam00400]
to look for proteins that contain both domains in the specified order
Most of the time it is a good idea to check the Taxonomy database
Each sequence on GenBank is tagged with a taxon id
Using taxid is more precise than using common names
For example, a protein from human can be labeled “95% similar to mouse”
Is that a human or a mouse protein?
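A sketch of restricting a query by taxon id instead of by name. It assumes that 3803 is the taxid of Fabaceae (legumes); confirm this in the Taxonomy database before relying on it. `taxon_term` is a hypothetical helper. The `:exp` suffix asks Entrez to include all descendants of the taxon.

```python
# Assumption: 3803 is the NCBI taxid for Fabaceae (verify in the Taxonomy database)
FABACEAE_TAXID = 3803


def taxon_term(taxid: int, include_descendants: bool = True) -> str:
    """Return an organism filter such as txid3803[Organism:exp]."""
    mode = "exp" if include_descendants else "noexp"
    return f"txid{taxid}[Organism:{mode}]"


legumes = taxon_term(FABACEAE_TAXID)
print(legumes)  # txid3803[Organism:exp]

# A free-text mention of "mouse" in a human record cannot match this filter,
# because the filter tests the taxon tag of the record, not its description
print(f"1000:2000[slen] AND {legumes}")
```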
For your convenience you can download the sequences
In this case we only need accession ids
Now we use BLAST to find other proteins in legumes similar to the ones we have
Notice that CDART only has some proteins pre-processed. New sequences take time to be processed
How would you do that?
It is essential that your protocol can be replicated
It is a very good idea to save the search strategy in a file
It is also wise to save the output in a text file
Separate by tab or by comma
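One possible way to keep the search strategy and the output together is sketched here, using Python's standard `csv` module. The accession ids, the file name and the `save_search` helper are made up for illustration.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def save_search(path, database, query, rows):
    """Write the search strategy as comment lines, then the results as TSV."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(f"# database: {database}\n")
        handle.write(f"# query: {query}\n")
        handle.write(f"# date: {date.today().isoformat()}\n")
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["accession", "organism"])
        writer.writerows(rows)


# Hypothetical results, only to show the layout of the file
rows = [("ACC_0001", "Medicago truncatula"), ("ACC_0002", "Glycine max")]
out_path = Path(tempfile.gettempdir()) / "legume_search.tsv"
save_search(out_path, "protein", "1000:2000[slen]", rows)
print(out_path.read_text(encoding="utf-8"))
```

Changing `delimiter="\t"` to `delimiter=","` produces a comma-separated file instead.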
How many new proteins did we find?
Do all of them have the right domains?
Let’s use CDD again, this time looking for motifs on the new proteins: \(protein \to \{domains\}\) (the first time was \(domain \to \{proteins\}\))
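The two directions of the lookup are the same table read in opposite ways. A minimal sketch, with made-up protein ids and a hypothetical `invert` helper, of turning a domain-to-proteins table into a protein-to-domains table and checking which proteins carry both domains:

```python
from collections import defaultdict

# Hypothetical protein ids; the Pfam ids are the two domains from the query
domain_to_proteins = {
    "pfam00646": {"PROT_A", "PROT_B"},
    "pfam00400": {"PROT_A", "PROT_C"},
}


def invert(mapping):
    """Turn {key: {values}} into {value: {keys}}."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


protein_to_domains = invert(domain_to_proteins)

required = {"pfam00646", "pfam00400"}
with_both = sorted(
    protein
    for protein, domains in protein_to_domains.items()
    if required <= domains  # subset test: all required domains present
)
print(with_both)  # ['PROT_A']
```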
Doing this by hand takes a lot of time
It is easy to make mistakes
It is hard to replicate
Can we do it automatically?
ESearch -> ESummary;
ESearch -> EFetch;
EPost -> ESummary;
EPost -> EFetch;
ESearch -> ELink;
EPost -> ELink;
EPost -> ESearch;
ELink -> ESearch;
ESearch -> ELink -> ESummary;
ESearch -> ELink -> EFetch;
EPost -> ESearch -> ESummary;
EPost -> ESearch -> EFetch;
EPost -> ELink -> ESearch -> ESummary;
EPost -> ELink -> ESearch -> EFetch;
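As an illustration of the ESearch -> EFetch pipeline, the sketch below builds the two request URLs and passes the result set from one to the other through the history server (`usehistory`, `WebEnv`, `query_key`). It does not contact NCBI: `sample_response` is a made-up ESearch reply, the search term is only an example, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db: str, term: str) -> str:
    """URL for ESearch; usehistory=y stores the result set on the server."""
    return BASE + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"}
    )


def parse_history(xml_text: str):
    """Extract (query_key, WebEnv) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def efetch_url(db: str, query_key: str, web_env: str) -> str:
    """URL for EFetch that retrieves the stored result set as FASTA text."""
    return BASE + "efetch.fcgi?" + urlencode(
        {"db": db, "query_key": query_key, "WebEnv": web_env,
         "rettype": "fasta", "retmode": "text"}
    )


# Made-up reply, shaped like the XML that ESearch returns
sample_response = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>HYPOTHETICAL_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

search = esearch_url("protein", "1000:2000[slen] AND protease")
query_key, web_env = parse_history(sample_response)
fetch = efetch_url("protein", query_key, web_env)
print(search)
print(fetch)
```

In a real run the ESearch URL would be requested first, and its reply would replace `sample_response`.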
There are Entrez libraries for most languages
For example in R it is called rentrez
There is also a command-line version (Entrez Direct)