We need to do many exercises to be ready for the midterm. Here are several. Some can be answered quickly; others require more thinking. Start thinking about all of them. The deadline applies only to the short term questions. Long term questions should be answered before the midterm exam.
Please use the official template for answers.
Short term questions
Calculate the GC content for only part of the genome
Instead of the whole genome, we look only through a window. That is, we look at only one region of the genome, with a fixed size and starting at a given position. For example, we examine only the region starting at position 250000 and look at only 100 letters, that is, only the letters at the positions given by seq(from=250000, length.out=100).
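As a small illustration of how such a window selects letters, the sketch below uses a made-up 12-letter vector in place of a real genome. The names toy_sequence, window_index, and window_letters are hypothetical and are not part of the exercise.

```r
# Hypothetical toy sequence standing in for a genome (one letter per element)
toy_sequence <- c("a", "t", "g", "c", "g", "g", "a", "t", "c", "c", "a", "a")

# Indices of a window that starts at position 4 and covers 5 letters
window_index <- seq(from = 4, length.out = 5)

# Letters that fall inside the window
window_letters <- toy_sequence[window_index]

print(window_index)
print(window_letters)
```

The window covers positions 4 to 8, so its last position is the start plus the size minus one.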
The result should depend on:
- the genomic sequence
- the position of the window
- the size of the window
Write a function called window_gc_content() that takes sequence, position, and size as input and returns a single value with the window GC content. You can test this function with the genome of E. coli by following these steps:
- Download the genome of E. coli from NCBI or from the blog. Take note of the folder where the file is downloaded. Different web browsers may use different folders.
- Load library(seqinr). If you do not have it installed, please install it.
- Set your working directory to the folder where the file was downloaded.
- Read the sequences with the command sequences <- read.fasta("NC_000913.fna"). Be careful: the file may have a different name on your computer.
- Then you can test using the command
window_gc_content(sequences[[1]], 250000, 100)
Using window_gc_content() in many places
We want to evaluate window_gc_content() at different positions of the genome. Specifically, we want to evaluate it at these positions:
positions <- seq(from=1, to=length(genome)-window_size, by=window_size)
Obviously, the result depends on genome and window_size. Please write a function that takes genome and window_size as inputs and returns a vector with the GC content of the window at each of the positions.
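To see which windows this definition produces, here is a minimal sketch with a made-up genome of 20 letters and a window of 6 letters. The name toy_genome and both numbers are assumptions chosen only for illustration.

```r
# Hypothetical toy genome of 20 letters and a window of 6 letters
toy_genome <- rep(c("a", "c", "g", "t"), times = 5)
window_size <- 6

# Start positions, computed exactly as defined in the text
positions <- seq(from = 1, to = length(toy_genome) - window_size, by = window_size)
print(positions)

# Last position covered by each window
window_ends <- positions + window_size - 1
print(window_ends)
```

With these numbers the windows start at 1, 7, and 13 and end at 6, 12, and 18, so the returned vector would have three values and the last two letters of the toy genome fall outside every window.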
GC Skew
Write a function that takes a list of genes and calculates the ratio (nG-nC)/(nG+nC) for each gene, where nG and nC are the numbers of G and C letters in the gene. The function should be called gene_gc_skew and take only one input: a list called genes. What should be the output?
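The ratio can be checked by hand on a very short sequence. The sketch below does this for a single made-up gene, not for a list; toy_gene is a hypothetical name, and the letters are lowercase on the assumption that the sequences come from read.fasta() with its default settings.

```r
# Hypothetical toy gene, one lowercase letter per element
toy_gene <- c("g", "g", "g", "c", "a", "t")

# Count the G and C letters
nG <- sum(toy_gene == "g")
nC <- sum(toy_gene == "c")

# GC skew of this single gene
skew <- (nG - nC) / (nG + nC)
print(skew)
```

Here nG is 3 and nC is 1, so the skew is 2/4, that is, 0.5.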
Long term questions
Algorithm design
In many important cases we have a vector x with non-decreasing values. That is, each value is greater than or equal to the previous one, so x[i+1] >= x[i] for all values of the index i. It is easy to see that the position of the minimum value has to be 1. We also know that the position of the maximum value is the last position. What about the position of the half value?
The half value is the average of the minimum and the maximum. For example, if x is the vector c(1, 4, 4, 6, 10, 15), then the half value is (1+15)/2, that is, 8.
The position of the half value of the vector x is the index of the first value that is greater than or equal to the half value of x. In the example the position of the half value is 5, since x[5] is the smallest value that is greater than or equal to 8.
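The numbers in this example can be reproduced with a few lines of R. This is only a sketch of the arithmetic, not the requested function, and half_value and at_least_half are hypothetical names.

```r
# The example vector from the text
x <- c(1, 4, 4, 6, 10, 15)

# Average of the minimum and the maximum
half_value <- (min(x) + max(x)) / 2
print(half_value)

# Which elements are greater than or equal to the half value?
at_least_half <- x >= half_value
print(at_least_half)
```

The first TRUE appears in position 5, which agrees with the position of the half value given in the example.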
Please write a function called position_of_half() with one input called x. The function must return a single number, which is the index of the smallest value in x that is greater than or equal to the average of the minimum and maximum of x.
You can test your function with the following code.
x <- 1:9
position_of_half(x)
position_of_half(x + 20)
position_of_half(x * x)
position_of_half(sqrt(x))
The answers should be 5, 5, 7, 4, respectively.
Merge two sorted vectors
Please write a function called vector_merge(x, y) that receives two sorted vectors x and y and returns a new vector with the elements of x and y together, sorted. The output vector has size length(x)+length(y).
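One way to know what the merged vector should look like is to build a reference answer with base R. The sketch below uses two small made-up numeric vectors and sort(); it is not the requested merge and is only meant for checking results. The name expected is hypothetical.

```r
# Two hypothetical sorted input vectors
x <- c(1, 4, 9)
y <- c(2, 3, 10, 12)

# Reference answer: concatenate and sort with base R
expected <- sort(c(x, y))
print(expected)

# The output has as many elements as both inputs together
print(length(expected) == length(x) + length(y))
```

Once your own function is written, identical(vector_merge(x, y), expected) should give TRUE for these inputs.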
You must assume that each of the input vectors is already sorted. In your code you have to use three indices, i, j, and k, to point into x, y, and the output vector answer, respectively.
On each step you have to compare x[i] and y[j]. If x[i] < y[j], then you make answer[k] <- x[i]; otherwise make answer[k] <- y[j].
You have to increment i or j, and k, carefully. To test your function, you can use this code:
x <- c("a", "d", "e", "h", "i", "k", "m", "s", "t", "u", "v", "w", "z")
y <- c("b", "c", "f", "g", "j", "l", "n", "o", "p", "q", "r", "x", "y")
vector_merge(x, y)
The output must be a sorted alphabet.
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
"n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"