May 16, 2018
1 Empiric
2 Theoretical
3 Simulation Based
4 Data Based
First part was about deterministic simulations
This part is not in the exam. It will be on the makeup
This is what we did after midterm
We never see the population: it is too big
We only see samples: small size n
We study the relationship between two things:
We can simulate experiments with a given probability and see what can be the frequencies of the outcomes
This helps us to design our experiment
Events are functions returning TRUE or FALSE, like
We can use simulations to estimate the probability of an event
If we did an experiment or a simulation \(n\) times, and we saw an event \(m\) times, what did we learn about the population?
We cannot know the exact probability \(p\) but we can find a range for it
\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
This part is very practical
This is a specific technique to solve complex optimization problems
The important part is to choose the good fitness function
We will see more later
Imagine you are the main researcher in the ecology of a lake
You need to know how many fish are there in the lake
You catch some fish, put a mark on them and return them to the lake
You catch some fish again, how many of them are marked?
Let’s call L
to the number of fish in the lake
We catch a group of n
and we mark them
We catch another group of n
and find that m
of them have mark
What can be the value of m
?
We catch n
fish and we mark them 1 to n
The second time we catch n
, how many have a number 1 to n
?
Assuming that fishing is random, our catch is
sample.int(L, size=n)
We care how many of them are less or equal to n
sum(sample.int(L, size=n) <= n)
n
and L
The catch size n
plays two roles here
Fist, it fixes the probability for a marked fish: n/L
Second, it is the sample size that gives m
TRUE events
What is the relationship between frequency and probability?
Unless we catch all fish in the lake, we cannot know L
exactly
So we have to estimate n/L
based on m
and n
using the formula
\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
m <- 25 n <- 500 sigma <- sqrt( (m+2)/(n+4)^2 * (1-(m+2)/(n+4)) ) p_est <- c((m+2)/(n+4)-2*sigma, (m+2)/(n+4)+2*sigma) p_est
[1] 0.03351169 0.07363117
We are 95% sure that n/L
is on this interval
We got an interval for n/L
. It is easy to see that \[L = \frac{n}{p\_est}\] Thus we can estimate the number of fish as
n/p_est
[1] 14920.167 6790.603
In all our genetic algorithms we have
initial_pop
score_all
ranking
who_survives
next_generation
cross_over
initial_pop
initial_pop <- function(N, m, values) { pop <- list() for(i in 1:N) { pop[[i]] <- sample(values, size=m, replace=TRUE) } return(pop) }
you have to say which values
are valid
score_all
score_all <- function(pop, fitness) { score <- rep(NA, length(pop)) for(i in 1:length(pop)) { score[i] <- fitness(pop[[i]]) } return(score) }
you have to give a fitness
function
who_survives
who_survives <- function(pop, ranking, min) { survival <- rep(NA, length(pop)) range <- max(ranking)-min(ranking) for(i in 1:length(pop)) { p <- (ranking[i]-min(ranking))/range survival[i] <- sample(c(!min, min), size=1, prob=c(p,1-p)) } return(survival) }
min
is TRUE if you look for minimum, FALSE for maximum
next_gen
next_gen <- function(pop, survival, mutation) { parents <- pop[survival] for(i in 1:N) { if(!survival[i]){ j <- sample.int(length(parents), size=1) pop[[i]] <- parents[[j]] # cloning pop[[i]] <- mutation(pop[[i]]) # mutation } } return(pop) }
Give a proper mutation
function
crossover
crossover <- function(pop, survival, mutation) { parents <- pop[survival] for(i in 1:N) { if(!survival[i]) { mom_dad <- sample.int(length(parents), size=2) m <- length(pop[[i]]) # vector length co <- sample.int(m-1, size=1) # cross-over place pop[[i]] <- c(parents[[mom_dad[1]]][1:co], parents[[mom_dad[2]]][(co+1):m]) if(sample(c(FALSE,TRUE), 1)) { # maybe mutation pop[[i]] <- mutation(pop[[i]]) # mutation } } } return(pop) }
We know that \(\sqrt{2}\) is irrational. Nevertheless, we want to find two integer numbers x
and y
such that \[\frac{x}{y}\approx \sqrt{2}\] In that case we will have \[\frac{x^2}{y^2}-2\approx 0\] Since the result cannot be 0, we will look for an approximate solution
We want to minimize the absolute value \[\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N\]
To make it interesting, we want \(x>100\) and \(y>100\)
What would be a good fitness function?
fitness
can include logic conditionsfitness <- function(v) { x <- v[1] y <- v[2] too_small <- x<=100 | y<=100 return(abs(x^2/y^2-2) + too_small*1e6) }
mutation
functionAll mutation
functions are similar.
This one can be adapted easily to other cases
mutation <- function(v) { k <- sample.int(length(v), size=1) # do not change v[k] <- v[k] + sample(-10:10, size=1) # change this return(v) # do not change }
m
and values
The last thing we need to decide is the vectors’ size m
and the values that each component can take.
In this case
m <- 2 values <- 101:1000
Now we are ready to proceed
N <- 1000 num_generations <- 300 best_score <- rep(NA, num_generations) pop <- initial_pop(N, m, values) for(i in 1:num_generations) { score <- score_all(pop, fitness) best_score[i] <- max(score) ranking <- rank(score, ties.method = "random") survival <- who_survives(pop, ranking, min=TRUE) pop <- next_gen(pop, survival, mutation) }
plot(best_score)
min(score)
[1] 6.007287e-06
ranking==1
pop[ranking==1]
[[1]] [1] 816 577
This is a list with one vector. Let’s see if it works
x <- pop[ranking==1][[1]][1] y <- pop[ranking==1][[1]][2] x/y
[1] 1.414211
Not bad