May 9, 2018
The genome has a size \(G\), let's say \(10^7\).

```r
G <- 1E7   # genome size
```
We sequence \(N\) reads, let's say \(10^4\) of them, each of length \(L\).

```r
N <- 1E4   # number of reads
L <- 100   # read length
```
Each read has a `start` and `end` position on the genome. We choose `start` randomly, sort it, then we calculate `end`.

```r
start <- sample.int(G, size = N, replace = TRUE)  # random start positions
start <- sort(start)                              # order them along the genome
end   <- start + L                                # each read ends L bases later
```
```r
# gap between two consecutive reads; a negative value means they overlap
gap_size <- function(s1, e1, s2, e2) {
  return(s2 - e1)
}

gaps <- rep(NA, N - 1)
for (i in 1:(N - 1)) {
  gaps[i] <- gap_size(start[i], end[i], start[i + 1], end[i + 1])
}
```
```r
gap_sizes <- function(G, L, N) {
  start <- sample.int(G, N, replace = TRUE)
  start <- sort(start)
  end <- start + L
  gaps <- rep(NA, N - 1)
  for (i in 1:(N - 1)) {
    gaps[i] <- start[i + 1] - end[i]
  }
  return(gaps)
}
```
Homework: can we avoid the `for()` loop?
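One possible answer, as a sketch: the loop can be replaced by a single vectorized subtraction (the function name `gap_sizes_vec` is mine).

```r
# same result without a for() loop: subtract each read's end
# from the next read's start
gap_sizes_vec <- function(G, L, N) {
  start <- sort(sample.int(G, N, replace = TRUE))
  end <- start + L
  return(tail(start, -1) - head(end, -1))
}
```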
Negative gaps are overlaps: a negative gap means that two consecutive reads overlap.
We only see overlaps over a threshold \(T\).
The best \(T\) has to be big enough to guarantee that the reads do not overlap by chance.
Bigger values of \(T\) reduce the probability of an "overlap by chance".
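As a quick check (a sketch reusing `gap_sizes()` from above, with the threshold of 20 used later in these notes), we can count how many consecutive reads overlap at all versus how many overlap by at least \(T\) bases:

```r
gaps <- gap_sizes(G, L, N)
sum(gaps < 0)     # consecutive reads that overlap at all
T <- 20           # minimum overlap required to join two reads
sum(gaps <= -T)   # overlaps big enough to be detected
```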
A group of contiguous sequences is called a *contig*.
The goal of the assembly process is to find one contig.
The sequence of this contig will be the genome.
Most of the time we get several contigs.
```r
# count the gaps that are not bridged by an overlap of at least T,
# i.e. the breaks between contigs
num_contigs <- function(G, L, N, T) {
  num_gaps <- sum(gap_sizes(G, L, N) > -T)
  return(num_gaps)
}
```
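For example, with the values defined above (a usage sketch; the result is random and changes on every run):

```r
num_contigs(G, L, N, 20)
```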
```r
num.reads <- seq(from = 10, to = 5E5, by = 2E4)
replicas <- 1:10
NN <- rep(NA, length(num.reads) * length(replicas))
n.contigs <- rep(NA, length(num.reads) * length(replicas))
k <- 1
for (size in num.reads) {
  for (i in replicas) {
    n.contigs[k] <- num_contigs(G, L, size, 20)
    NN[k] <- size
    k <- k + 1
  }
}
```
```r
plot(n.contigs ~ NN, type = "b")
```
The values do not change very much between simulations (why?)
We do not need replicas here
```r
num.reads <- seq(from = 10, to = 5E5, by = 2E4)
n.contigs <- rep(NA, length(num.reads))
for (k in 1:length(num.reads)) {
  n.contigs[k] <- num_contigs(G, L, num.reads[k], 20)
}
```
```r
plot(n.contigs ~ num.reads, type = "b")
```
```r
L <- 600
```
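Presumably the point is to repeat the sweep with these longer reads and compare; a sketch of that re-run (same code as above, the only change is \(L\); the name `n.contigs.long` is mine):

```r
n.contigs.long <- rep(NA, length(num.reads))
for (k in 1:length(num.reads)) {
  n.contigs.long[k] <- num_contigs(G, L, num.reads[k], 20)
}
plot(n.contigs.long ~ num.reads, type = "b")
```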
We have used randomness for two purposes so far.
Now we will see a third usage: Genetic Algorithms
This is a strategy to find “optimal solutions” to complex problems
"Optimal solution" means either a minimum or a maximum.
If the problem is small we can just use `which.min` or `which.max`.
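For instance, a small one-dimensional problem can be solved by evaluating every candidate (a toy sketch; the objective function and the grid are invented for illustration):

```r
x <- seq(0, 10, by = 0.01)   # all candidate solutions
f <- -(x - 3)^2 + 5          # hypothetical objective function
x[which.max(f)]              # the candidate that maximizes f
```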
But if the problem is big we cannot test every combination
Genetic algorithms can solve this based on a model of evolution
```r
# initial population: 9 individuals with random (x, y) coordinates in 1..100
pop <- data.frame(x = sample.int(100, 9, replace = TRUE),
                  y = sample.int(100, 9, replace = TRUE))
```
| id | elevation | ranking |
|----|-----------|---------|
| 1 | 5.3446109 | 7 |
| 2 | 2.5057566 | 3 |
| 3 | 2.1018444 | 2 |
| 4 | 2.5254296 | 4 |
| 5 | 0.9899444 | 1 |
| 6 | 6.1518677 | 9 |
| 7 | 3.9584632 | 6 |
| 8 | 5.5782467 | 8 |
| 9 | 2.8489728 | 5 |
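The `elevation` and `ranking` columns come from evaluating each individual's position. A sketch of how they could be produced, assuming a purely hypothetical elevation surface (the actual one used in class is not shown here):

```r
# hypothetical elevation: distance from the point (20, 20), scaled
elevation <- sqrt((pop$x - 20)^2 + (pop$y - 20)^2) / 10
ranking <- rank(elevation)   # 1 = lowest individual, 9 = highest
```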
We want to replace the individuals that are too uphill.
The probability of replacement grows with elevation; in the code below it is computed from the ranking.
```r
survival <- rep(NA, 9)
for (i in 1:9) {
  p <- (ranking[i] - 1) / 8   # replacement probability, from 0 (lowest) to 1 (highest)
  survival[i] <- sample(c(FALSE, TRUE), size = 1, prob = c(p, 1 - p))
}
survival
```
```
[1] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
```
```r
parents <- pop[survival, ]                      # survivors become the parents
for (i in 1:9) {
  if (!survival[i]) {
    j <- sample.int(nrow(parents), size = 1)    # pick a random parent
    pop[i, ] <- parents[j, ]                    # copy it over the replaced individual
    k <- sample.int(ncol(parents), size = 1)    # choose x or y to mutate
    pop[i, k] <- pop[i, k] + sample(-10:10, 1)  # small random mutation
    pop[i, k] <- min(max(pop[i, k], 1), 100)
  }
}
```
The `min(max(...))` is just to keep the coordinate between 1 and 100.
Now our population contains only vectors of '0' and '1'.
We want to find a vector (of size \(m\)) with all elements equal to 1.
In this case the solution is trivial, but we will use it to learn the method.
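A minimal setup for this toy problem could look like this (a sketch; \(m\), the population size, and the fitness definition are my own assumptions, not taken from the notes):

```r
m <- 20          # length of each vector (assumed)
pop_size <- 9    # same population size as before
# each row is one individual: a random vector of 0s and 1s
pop <- matrix(sample(0:1, m * pop_size, replace = TRUE),
              nrow = pop_size, ncol = m)
# fitness: number of elements that are already 1 (maximal when all are 1)
fitness <- apply(pop, 1, sum)
fitness
```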