This question deals with Genome Analysis. We want to “draw the genome” using Turtle Graphics, and then find the replication origin using the GC-skew.
2.1 Load the bacterial genome
Load the FASTA file of the bacterial genome (not the genes) and assign the first element to a vector called
genome
.
This is very easy and has been discussed several times already. Just
remember that you want a vector, and read.fasta()
gives you
a list.
2.2 Draw the genome using Turtle Graphics
To make it easy, we will draw only the first 1000 nucleotides.
For each position in the genome between 1 and 1000, if the nucleotide is
"a"
, the turtle moves one step to the left. If the nucleotide is"c"
, the turtle moves one step down. If the nucleotide is"g"
, the turtle moves one step up. Finally, if the nucleotide is"t"
, the turtle moves one step to the right.
I think we did this on a class. It is very straightforward:
library(TurtleGraphics)
turtle_init(mode="clip")
turtle_hide()
turtle_do({
for(i in 1:1000) {
if(genome[i]=="a") {
turtle_setangle(-90)
else if(genome[i]=="c") {
} turtle_setangle(180)
else if(genome[i]=="g") {
} turtle_setangle(0)
else if(genome[i]=="t") {
} turtle_setangle(90)
}turtle_forward(1)
} })
Sometimes the genome sequence is written using upper case letters. Thus, a better version would be:
library(TurtleGraphics)
turtle_init(mode="clip")
turtle_hide()
turtle_do({
for(i in 1:1000) {
if(genome[i]=="a" | genome[i]=="A") {
turtle_setangle(-90)
else if(genome[i]=="c" | genome[i]=="C") {
} turtle_setangle(180)
else if(genome[i]=="g" | genome[i]=="G") {
} turtle_setangle(0)
else if(genome[i]=="t" | genome[i]=="T") {
} turtle_setangle(90)
}turtle_forward(1)
} })
The code genome[i]=="a" | genome[i]=="A"
means “the
nucleotide in position i
of genome
has the
value "a"
, or the nucleotide in position i
of
genome
has the value "A"
”.
2.3 Non-overlapping windows of size 1000
Create a vector called
genome_position
with a sequence of positions starting at 1, increasing by 1000 up to the last position in the genome minus 1000.(We stop at least 1000 nucleotide before the genome end to make sure that each window is inside the genome)
This is just translating English into R.
<- 1000
window_size <- seq(from=1, to=length(genome)-window_size, by=window_size) genome_position
If you do not want to use window_size
, you can say
<- seq(from=1, to=length(genome)-1000, by=1000) genome_position
2.4 GC-skew
Calculate the GC-skew of each window starting in the positions given by
genome_position
, and store them in a vector calledskew
.You can use the following function, that we used in a homework.
<- function(genome, position, window_size) { window_gc_skew <- nG <- 0 nC for(i in seq(from=position, length=window_size)) { if(genome[i]=="c" | genome[i]=="C") { <- nC + 1 nC else if(genome[i]=="g" | genome[i]=="G") { } <- nG + 1 nG } }if(nG+nC==0) { return(0) } else { return((nG-nC)/(nG+nC)) } }
As in many other cases, we apply the same function to each element of a list and assign the result to a vector.
<- rep(NA, length(genome_position))
skew for(j in 1:length(genome_position)) {
<- window_gc_skew(genome, genome_position[j], window_size)
skew[j] }
This is a pattern that we use often. Understand and use it.
2.5 Accumulated GC-skew
Calculate the accumulated GC-skew and store it in a vector called
acc_skew
This is taken directly from the example of balance and incoming money.
<- rep(NA, length(skew))
acc_skew 1] <- skew[1]
acc_skew[for(i in 2:lenght(skew)) {
<- acc_skew[i-1] + skew[i]
acc_skew[i] }
“The acc_skew
of today is the acc_skew
of
yesterday plus the skew
of today”.
For this particular case there is a special function to calculate cumulative sums:
<- cumsum(skew) acc_skew
2.6 Extremal acc_skew
values
Find the genome position of the window which has the maximum value of
acc_skew
. Do the same for the minimum value.
As in all question about maximum, it is important to understand if we
look for the value or the index of the largest
element. In this case we need the index so we use
which.max()
.
But this is an index in a vector, not a position in the genome.
Therefore, we must use this index to point inside the correct vector.
Since the question asks for the genome position, we use the vector
genome_position
.
which.max(acc_skew)]
genome_position[which.min(acc_skew)] genome_position[