Statistics tells a story that makes sense of the data
In genomics, we look for biological sense
That story can be global: about the complete genome
Or can be local: about some region of the genome
Last class we saw global properties
Local statistics look only through a small window
A window is a part of the genome,
For example, we can analyze a region of 500 bp starting at position 3000 of the genome
library(seqinr)
chromosomes <- read.fasta("NC_000913.fna", seqtype="DNA", set.attributes = FALSE)
length(chromosomes)
[1] 1
[1] 4641652
This is version 3, but we use a shorter name
gene_GC_content <- function(dna) {
V <- toupper(dna)
count_C <- sum(V=="C")
count_G <- sum(V=="G")
GC_content <- (count_C +count_G)/length(V)
return(GC_content)
}
This is a global statistic. Test it with the complete genome
[1] 0.5079071
Calculate the GC content for only part of the genome
Instead of all the genome, we only look through a window
The result should depend on:
position
, size
, and dna
gene_GC_content()
dna
in a fragment starting at position
size
position
size
The new function looks like this
window_GC_content <- function(position, size, dna) {
# we need to use indices starting at `position`
# the vector of indices must be of length `size`
# we need to "cut" `dna` using the indices
# we use `gene_GC_content()` on the dna fragment
return(answer)
}
What should be the code?
seq()
functionAll inputs are optional
Choose carefully which ones to use
seq()
[1] 5 6 7 8 9 10 11 12 13 14
[1] 5 6 7 8 9 10 11 12 13 14
[1] 5 8 11 14 17 20 23 26 29
position 3000, size 500
In the general case, we can write
window_GC_content <- function(position, size, dna) {
# we need to use indices starting at `position`
# the vector of indices must be of length `size`
indices <- seq(from=position, length.out=size)
# we need to "cut" `dna` using the indices
window <- dna[indices]
# we use `gene_GC_content()` on the dna fragment
answer <- gene_GC_content(window)
return(answer)
}
We traverse the genome with several windows of fixed size
Each window starts after the end of the previous one
In other words, the windows do not overlap
We start at the first DNA position, until the last
Be careful to not step out of DNA
Now we apply window_GC_content
to each element of win_pos
The result depends on window_size
What is the ratio of \(G-C\) respect to \(G+C\)?
Calculate the ratio \[\frac{G-C}{G+C}\]
using several windows. What do you see?
calculate_GC_skew <- function(dna) {
V <- toupper(dna)
count_C <- sum(V=="C")
count_G <- sum(V=="G")
GC_skew <- (count_G-count_C)/(count_C+count_G)
return(GC_skew)
}
Test it with the complete genome
[1] -0.001125755
window_GC_skew <- function(position, size, dna) {
# we need to use indices starting at `position`
# the vector of indices must be of length `size`
indices <- seq(from=position, length.out=size)
# we need to "cut" `dna` using the indices
window <- dna[indices]
# we use `calculate_GC_skew()` on the dna fragment
answer <- calculate_GC_skew(window)
return(answer)
}
gc_skew()
window by windowGC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
The replication origin is usually called ori.
Marie Touchon, Eduardo P.C. Rocha, (2008), From GC skews to wavelets: A gentle guide to the analysis of compositional asymmetries in genomic data,
Biochimie, Volume 90, Issue 4,