April 27, 2018
experiment <- function(size) { sample(outcomes, prob, size, replace=TRUE) }
If the outcomes are integers 1 to N, we can use
experiment <- function(size) { sample.int(N, prob, size, replace=TRUE) }
Remember that we had
gc_content <- function(dna, start, window) { freq <- table(dna[seq(from=start, length.out=window)]) ans <- (freq["g"]+freq["c"])/window return(ans) }
We choose a fixed value of window
. Then we do
window <- 1000 position <- seq(from=1, to=length(dna), by=window) gc_cont <- rep(NA, length(position)) for(i in 1:length(position)) { gc_cont[i] <- gc_content(dna, position[i], window) }
Please read the data using this code, then plot it
library(seqinr) sequences <- read.fasta("NC_000913.fna") dna <- sequences[[1]]
Some GC content are very high or very low. Is that normal?
How can we find the GC content of a random window?
If we make an histogram of GC content we see the empirical frequency
h <- hist(gc_cont)
We can later use that frequency to simulate GC content
How can we look inside the histogram?
h
$breaks [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 [16] 0.58 0.60 0.62 0.64 0.66 0.68 $counts [1] 3 11 23 26 50 57 85 169 218 398 591 915 970 711 299 89 14 7 4 [20] 2 $density [1] 0.03231366 0.11848341 0.24773804 0.28005170 0.53856097 0.61395950 [7] 0.91555364 1.82033606 2.34812581 4.28694528 6.36579061 9.85566566 [13] 10.44808272 7.65833692 3.22059457 0.95863852 0.15079707 0.07539854 [19] 0.04308488 0.02154244 $mids [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57 [16] 0.59 0.61 0.63 0.65 0.67 $xname [1] "gc_cont" $equidist [1] TRUE attr(,"class") [1] "histogram"