March 22nd, 2016
Start-Not Stop{repeated}-Stop
\[\{x\text{ is CDS}\}\subset\{x\text{ is ORF}\}\]
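The Start, not-Stop (repeated), Stop structure can be written as a regular expression. The following is a minimal sketch in base R, assuming ATG as the only start codon, the three standard stop codons, and the forward strand only; the names `orf_pattern` and `toy` are made up for illustration.

```r
# Start codon, then any number of non-stop codons, then a stop codon.
# The negative lookahead (?!...) needs perl = TRUE.
orf_pattern <- "ATG(?:(?!TAA|TAG|TGA)[ACGT]{3})*(?:TAA|TAG|TGA)"

# Made-up toy sequence, not real genome data
toy <- "CCATGAAATTTTAGGG"

# gregexpr() returns non-overlapping matches, scanning left to right
orfs <- regmatches(toy, gregexpr(orf_pattern, toy, perl = TRUE))[[1]]
orfs
```

This finds only non-overlapping ORFs on one strand, so it is a starting point rather than a complete ORF finder.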
What are the inputs?
What is the output?
What are the steps between them?
Which ones are CDS?
gff <- read.delim("NC_000913.gff", header=FALSE, comment.char="#")
summary(gff)
           V1               V2                 V3              V4
 NC_000913.3:9720   RefSeq:9720   gene         :4516   Min.   :      1
                                  CDS          :4382   1st Qu.:1162247
                                  repeat_region: 355   Median :2298492
                                  exon         : 178   Mean   :2313290
                                  tRNA         :  89   3rd Qu.:3453929
                                  ncRNA        :  65   Max.   :4640942
                                  (Other)      : 135
       V5            V6       V7       V8
 Min.   :    255   .:9720   -:4682   .:5338
 1st Qu.:1163115            +:5038   0:4370
 Median :2299578                     1:   7
 Mean   :2314649                     2:   5
 3rd Qu.:3455647
 Max.   :4641652
 V9
 ID=cds1219;Parent=gene1262;Note=pseudogene%2C transposase homolog;Dbxref=ASAP:ABE-0004159,ASAP:ABE-0004161,ASAP:ABE-0285106,UniProtKB%2FSwiss-Prot:P30192,EcoGene:EG11611,GeneID:4056037;gbkey=CDS;gene=insZ;pseudo=true;transl_table=11 : 3
 ID=cds1389;Parent=gene1435;Note=pseudogene%2C autotransporter homolog%7Einterrupted by IS2 and IS30;Dbxref=ASAP:ABE-0004680,ASAP:ABE-0004694,ASAP:ABE-0285093,UniProtKB%2FSwiss-Prot:P33666,EcoGene:EG11307,GeneID:2847750;gbkey=CDS;gene=ydbA;pseudo=true;transl_table=11 : 3
 ID=cds1922;Parent=gene1981;Note=pseudogene%2C IpaH%2FYopM family;Dbxref=ASAP:ABE-0006435,ASAP:ABE-0006437,ASAP:ABE-0006440,ASAP:ABE-0285096,UniProtKB%2FSwiss-Prot:P76321,EcoGene:EG13281,GeneID:2847704;gbkey=CDS;gene=yedN;pseudo=true;transl_table=11 : 3
 ID=cds3953;Parent=gene4120;Note=pseudogene%2C SopA-related%2C pentapeptide repeats-containing;Dbxref=ASAP:ABE-0013224,UniProtKB%2FSwiss-Prot:P32690,EcoGene:EG11927,GeneID:948546;gbkey=CDS;gene=yjbI;pseudo=true;transl_table=11 : 3
 ID=cds1153;Parent=gene1190;Note=pseudogene;Dbxref=ASAP:ABE-0003933,ASAP:ABE-0285042,UniProtKB%2FSwiss-Prot:P76000,EcoGene:EG13890,GeneID:1450255;gbkey=CDS;gene=ycgI;pseudo=true;transl_table=11 : 2
 ID=cds1302;Parent=gene1345;Note=pseudogene%7Eputative ATP-binding component of a transport system;Dbxref=ASAP:ABE-0004422,ASAP:ABE-0285045,UniProtKB%2FSwiss-Prot:P77481,EcoGene:EG13919,GeneID:4306141;gbkey=CDS;gene=ycjV;pseudo=true;transl_table=11 : 2
 (Other) :9704
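One way to answer "Which ones are CDS?" is to keep only the rows whose third column (feature type) equals CDS. The sketch below uses a made-up three-row table so it runs without the real file; with the real data, the same subsetting could be applied to `gff`. The names `gff_toy` and `cds` are illustrative only.

```r
# Made-up GFF-like rows (tab-separated), standing in for NC_000913.gff
gff_text <- paste(
  "toy\tRefSeq\tgene\t1\t12\t.\t+\t.\tID=gene0",
  "toy\tRefSeq\tCDS\t1\t12\t.\t+\t0\tID=cds0",
  "toy\tRefSeq\ttRNA\t20\t30\t.\t-\t.\tID=rna0",
  sep = "\n")
gff_toy <- read.delim(text = gff_text, header = FALSE, comment.char = "#")

# V3 holds the feature type; V4 and V5 are start and end; V7 is the strand
cds <- gff_toy[gff_toy$V3 == "CDS", ]
nrow(cds)
cds[, c("V4", "V5", "V7")]
```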
How can we combine a FASTA file and a GFF file to get the gene sequences?
Write a function that combines a FASTA of the full genome and a GFF.
The output must be a FASTA file with the amino acid sequences of the encoded proteins.
See write.fasta()
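One possible shape for such a function is sketched below. It assumes the seqinr package is installed (it provides `read.fasta()`, `comp()`, `translate()` and `write.fasta()`), that the FASTA file holds a single genome sequence, and that each CDS row can be translated on its own. It ignores the phase column (V8) and does not join CDS rows that share an ID, so pseudogenes split into several rows will come out wrong. The function name `cds_proteins` and the output names `cds1`, `cds2`, ... are made up for illustration.

```r
library(seqinr)

# Hypothetical helper: translate every CDS row of a GFF and write a protein FASTA
cds_proteins <- function(fasta_file, gff_file, out_file) {
  genome <- read.fasta(fasta_file, seqtype = "DNA")[[1]]  # vector of lowercase chars
  gff <- read.delim(gff_file, header = FALSE, comment.char = "#",
                    stringsAsFactors = FALSE)
  cds <- gff[gff$V3 == "CDS", ]
  proteins <- lapply(seq_len(nrow(cds)), function(i) {
    dna <- genome[cds$V4[i]:cds$V5[i]]
    # Genes on the "-" strand are read from the reverse complement
    if (cds$V7[i] == "-") dna <- rev(comp(dna, ambiguous = TRUE))
    translate(dna, numcode = 11)  # table 11, as in transl_table=11 in the GFF
  })
  write.fasta(proteins, names = paste0("cds", seq_len(nrow(cds))),
              file.out = out_file)
  invisible(proteins)
}
```

A call could look like `cds_proteins("genome.fasta", "NC_000913.gff", "proteins.fasta")`, where the two FASTA file names are placeholders.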
The sigma 70 factor of E. coli binds to:
TTGACA-N(15-19)-TATAAT
What does this mean?
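One reading of the notation: the -35 box TTGACA, then a spacer of 15 to 19 nucleotides of any kind, then the -10 box TATAAT. That reading translates directly into a regular expression. The sketch below is base R on a made-up toy sequence; real promoters usually deviate from the consensus, so an exact match is expected to miss most of them.

```r
# N(15-19) becomes [ACGT]{15,19}: any nucleotide, 15 to 19 times
promoter <- "TTGACA[ACGT]{15,19}TATAAT"

# Made-up sequence with a 17-nucleotide spacer between the two boxes
spacer <- paste(rep("A", 17), collapse = "")
toy <- paste0("CCC", "TTGACA", spacer, "TATAAT", "GGG")

hit <- regexpr(promoter, toy)
as.integer(hit)            # start position of the match
attr(hit, "match.length")  # length of the match: 6 + spacer + 6
```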
The most complete database of E.coli regulation
Look for Sigma Factor 70
See also: Downloads
Transcription Factor Dan has 5 binding sites in different parts of the genome.
The sequences are:
GTTAATT
GTGTATT
ATTCATT
GTTGATT
GTTAATT
How do we summarize this?
How to make an “average”? A model?
A string to represent many strings
GTTAATT
GTGTATT
ATTCATT
GTTGATT
GTTAATT
Pattern: [GA]T[TG][ACTG]ATT
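A quick check that the pattern [GA]T[TG][ACTG]ATT really covers all five binding sites, as a small base R sketch. The anchors `^` and `$` are added here so that each site must match the pattern in full.

```r
sites <- c("GTTAATT", "GTGTATT", "ATTCATT", "GTTGATT", "GTTAATT")
pattern <- "^[GA]T[TG][ACTG]ATT$"

grepl(pattern, sites)      # one logical value per site
grepl(pattern, "CTTAATT")  # starts with C, which [GA] does not allow
```

The pattern records which letters are allowed at each position, but not how often each one occurs.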
GTTAATT
GTGTATT
ATTCATT
GTTGATT
GTTAATT

A | 1 0 0 2 5 0 0
C | 0 0 0 1 0 0 0
G | 4 0 1 1 0 0 0
T | 0 5 4 1 0 5 5
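The count matrix can be rebuilt from the five sites by counting each nucleotide in each column. This is a minimal base R sketch; the variable names `chars` and `counts` are chosen here for illustration.

```r
sites <- c("GTTAATT", "GTGTATT", "ATTCATT", "GTTGATT", "GTTAATT")
nucl <- c("A", "C", "G", "T")

# One row per site, one column per position
chars <- do.call(rbind, strsplit(sites, ""))

# counts[n, j] = how many sites have nucleotide n at position j
counts <- matrix(0, nrow = length(nucl), ncol = ncol(chars),
                 dimnames = list(nucl, NULL))
for (j in seq_len(ncol(chars))) {
  for (n in nucl) {
    counts[n, j] <- sum(chars[, j] == n)
  }
}
counts
```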
Each column contributes its own score. The total score is the sum over all columns.
A PSSM (position-specific score matrix) gives the score of each nucleotide at each position in the window
M[nucl,pos]
For each start position we evaluate the sum of the scores of the nucleotides in the window
Write a function to evaluate the score of a position for a given matrix
Inputs:
pos: position in the genome
genome: vector of chars
mat: a position specific score matrix
Output: the score
Evaluate it at each position of the E. coli genome.
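A minimal sketch of such a function in base R, followed by a scan over every start position. It assumes the raw counts from the five Dan sites serve directly as scores (a real PSSM would typically use log-odds values), that `genome` is an uppercase character vector, and that the rows of `mat` are named A, C, G, T. The function name `score_at` and the toy genome are made up for illustration.

```r
nucl <- c("A", "C", "G", "T")
mat <- matrix(c(1, 0, 0, 2, 5, 0, 0,
                0, 0, 0, 1, 0, 0, 0,
                4, 0, 1, 1, 0, 0, 0,
                0, 5, 4, 1, 0, 5, 5),
              nrow = 4, byrow = TRUE, dimnames = list(nucl, NULL))

# Score of the window that starts at pos: sum of M[nucl, pos] over the window
score_at <- function(pos, genome, mat) {
  width <- ncol(mat)
  window <- genome[pos:(pos + width - 1)]
  rows <- match(window, rownames(mat))  # NA for letters other than A, C, G, T
  sum(mat[cbind(rows, seq_len(width))])
}

# Made-up toy genome that contains the site GTTAATT starting at position 3
genome <- strsplit("CCGTTAATTCC", "")[[1]]

# Valid start positions leave room for a full window
starts <- seq_len(length(genome) - ncol(mat) + 1)
scores <- sapply(starts, score_at, genome = genome, mat = mat)
scores
which.max(scores)
```

For the full genome, the same `sapply()` call would work on the character vector of the chromosome, converted to uppercase first if it was read in lowercase.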