The library seqinr
is needed in all the following
exercises.
Finding all ORFs in bacteria
Prepare a RMarkdown document with the following parts:
- Write a function named
find.codons()
with 3 inputs: a genome (a vector of char), a strand (a char, either “+” or “-”) and a frame number (either 0, 1 or 2). The function should return a vector with all codons in that reading frame. You can use the functionsplitseq()
. Verify that all codons have length 3. - Apply this function on the genome of E.coli for strands “+”
and “-” and for frame 0, 1 and 2. Store the six results on a list named
codons
with 6 elements. - Write a function that, given a list of codons, finds the position of
all start codons. You can use the function
which
. Apply this function to any of the elements of the listcodons
and store the result in a vector namedstart
. - Write a similar function to find the position of the stop
codons. Store them in a vector named
stop
. - (Optional) Write a function that, given the vectors
start
andstop
, returns a vector of the lengths of all ORFs. Notice that there may be more than one start codon for each stop codon. Which one is the real start?
Combine a FASTA file and a GFF file to get the CDS sequences
Write a RMarkdown document that combines the following steps
Read the genome of the bacteria from a FASTA file. For example you can use
NC_000913.fna
. Store the first element of the result in a vector namedgenome
.Read the description of the features from a GFF file and store them in a data frame named
features
.Write a function that takes the
features
data frame and returns a new data frame with one row for each CDS and three columns: start, stop and strand. Store the result in a variable namedCDS
.Write a function that takes a number
n
, a data frameCDS
and a vectorgenome
and returns the nucleotidic sequence of the gene described in the linen
of the data frameCDS
.Write a function that combines the
CDS
data frame and thegenome
vector to produce a list with the nucleotidic sequences of all genes. Each gene sequence is a vector of char.You can use
lapply
on the function from the previous question. Store the result in a variable namedCDS.sequences
.Write a function that takes the
CDS.sequences
list and produces a list with the aminoacidic sequences of all proteins coded genes. Store the result inprotein.sequences
.You can assume that the genetic code is the bacterial one.
Write the values of
CDS.sequences
into a FASTA file. See the functionwrite.fasta()
.Write another FASTA file with the aminoacidic sequence of the coded proteins. See the
translate()
function.
The names of the genes in the FASTA output files can be a
serial number. See the paste()
function if you want to do
something nicer.
Finding Binding Sites
Write a function
score.position()
to evaluate the score of a position for a given position weight matrix. These matrices have 4 rows and 5 to 30 columns. Rows have names corresponding to nucleotides.The function should work with any position weight matrix. For testing you can download the matrix here. You can read it with
PWM <- read.delim("PWM.txt", header=FALSE, row.names=1)
Inputs:
pos
: position in the genomegenome
: vector of charsmat
: a position specific score matrix
Output: the score
Evaluate it on each position of E.coli genome.