March 14, 2018
FASTA files show only one strand
We call it the forward strand
The other strand is not shown, but can be calculated
We call it the reverse strand
According to Watson & Crick, the reverse strand is complementary
a
by t
and vice-versac
by g
and vice-versaMoreover, the reverse strand is reversed
opposite_strand <- function(dna) { ans <- rep(NA, lenght(dna)) for(i in 1:lenght(ans)) { j <- lenght(dna) - i + 1 if(dna[j]=='a') { ans[i] <- 't' } else if(dna[j]=='c') { ans[i] <- 'g' } else if(dna[j]=='g') { ans[i] <- 'c' } else if(dna[j]=='t') { ans[i] <- 'a' } } return(ans) }
Remember that we can give names to the elements of a vector
and we ca use the name as index
opposite_strand <- function(dna) { ans <- rep(NA, lenght(dna)) compl <- c(a='t', c='g', g='c', t='a') for(i in 1:lenght(ans)) { j <- lenght(dna) - i + 1 ans[i] <- compl[ dna[j] ] } return(ans) }
Will GC content change?
Will GC skew change?
In the previous class we calculated GC content and GC skew for the complete genome
Now we want to calculate them in regions of the genome
This way we can see if there are differences in DNA composition
These differences can have biological importance
We will calculate the same formula on several DNA regions
Since it is the same pattern, we will prepare a function
What should be
Remember that an R function is defined like this
name <- function(input1, input2, ...) { Calculate return(output) }
The GC skew result should depend on:
These are the parameters of the function. Do they have
Note: avoid names that are already used for R functions
Here we have two cases
For gc_content()
the output is a number with the value \[\frac{G+C}{A+T+G+C}\]
For gc_skew()
the output is a number with the value \[\frac{G-C}{G+C}\]
In both cases the values in the formula are the number of nucleotides on the region
How do we transform the input parameters into the output value?
We can use any R function available, such as
seq(from, to, by, length.out)
Note: each input of seq()
is optional
gc_content <- function(dna) { freq <- table(dna) ans <- (freq["g"]+freq["c"])/length(dna) return(ans) }
New Version:
gc_content <- function(dna, start=1, size=length(dna)) { freq <- table(dna[seq(from=start, length.out=size)]) ans <- (freq["g"]+freq["c"])/size return(ans) }
Can you do the same for gc_skew()
?
We want to evaluate gc_content()
and gc_skew()
on different positions of the genome
We traverse the genome with non-overlapping windows of fixed size
position <- seq(from=1, to=length(dna), by=window_size)
How do we apply gc_skew()
to each element of position
?
position <- seq(from=1, to=length(dna), by=window_size) for(i in 1:length(position)) { skew[i] <- gc_skew(dna, position[i], window_size) }
What is wrong here?
[]
if the vector does not existThe first time we say skew[i]
the vector skew
does not exist
It has to be created before used
We can create it with just
skew <- NA
It is better if we declare the size
skew <- rep(NA, lenght(position))
window_size <- 1E5 position <- seq(from=1, to=length(dna), by=window_size) skew <- rep(NA, length(position)) for(i in 1:length(position)) { skew[i] <- gc_skew(dna, position[i], window_size) }
plot(position, skew, type="l", main=paste("window_size", window_size))
window_size
Can you calculate the local G skew for different window sizes?
Are they equal or different?
Can you calculate local GC content?
Is local GC content different in different parts of DNA?
As we discussed on the last class,
The reason for this asymmetry is the DNA replication process
The origin of replication is in one of the sign changing places
We want to write a program so we can find Ori on many different genomes
We can answer questions like:
The values can change a lot between a window and the next one
Can you plot the GC-skew of each window versus the next window?
each point is (skew[i],skew[i+1])
skew
data is noisy. A lot of small random changes
Instead of looking directly at skew
, we can look at cumulative skew
\[\text{cum_skew}[i]=\sum_{j=1}^i \text{skew}[j]\]
Now, instead of the change of sign, we look for the minimum and maximum (Why?)
How can we calculate cum_skew
?
It is easy to see that
\[\text{cum_skew}[i]=\text{cum_skew}[i-1] + \text{skew}[i]\]
So we can write something like this:
cum_skew[1] <- skew[1] for(i in 2:length(skew)) { cum_skew[i] <- cum_skew[i-1] + skew[i] }
Can you write this as a function?
max(cum_skew)
[1] 4.9877
which.max(cum_skew)
[1] 155
position[which.max(cum_skew)]
[1] 1540001
Ori is located between position[which.max(cum_skew)]
and position[which.max(cum_skew)] + window_size
window_size
be?We can make the window very small
How would be skew
and cum_skew
if window_size=1