April 25, 2018
We can simulate what a course like ours might look like:
epilepsy <- function(size) {
  return(sample(c("Yes", "No"), size = size, replace = TRUE, prob = c(0.01, 0.99)))
}
Here we fix the real probability at 1%.
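As a quick sanity check (this check is our addition, not part of the original notes), a large sample should give a "Yes" frequency near 1%:
table(epilepsy(10000))  # counts of "Yes"/"No" should be roughly 1%/99%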
Here courses is a matrix, not a data frame:
courses <- replicate(1000, epilepsy(97))
courses[,723]  # show replica 723
[1] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [13] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [25] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [37] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [49] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [61] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "Yes" [73] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" [85] "No" "No" "No" "Yes" "No" "No" "No" "No" "No" "No" "No" "No" [97] "No"
There are several ways to count the cases in a replica. For example, we can use table():
table(courses[,723])["Yes"]
What happens if there isn’t any "Yes"? We get NA, and we have to handle it:
cases[723] <- table(courses[,723])["Yes"]
if (is.na(cases[723])) {
  cases[723] <- 0
}
The best approach is to use logical vectors instead of character vectors:
courses[,723]=="Yes"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [85] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [97] FALSE
Logical values can be numeric
As we know, R has four basic data types: numeric, factor, logical, and character. If we mix them, they can change. For example:
c(TRUE, FALSE)
[1] TRUE FALSE
c(TRUE, FALSE, 2)
[1] 1 0 2
Mixing logical and numeric gives numbers.
What happens when you mix factor and numeric?
And mixing factor and character?
Explain why.
Counting TRUE values is easy: since TRUE becomes 1 and FALSE becomes 0, it is easy to count how many times the result was TRUE:
sum(courses[,723]=="Yes")
[1] 2
Now we just have to do this for each column of courses (that is, for each simulated course):
cases <- rep(NA, ncol(courses))
for (i in 1:ncol(courses)) {
  cases[i] <- sum(courses[,i] == "Yes")
}
cases
[1] 1 2 0 1 0 0 1 1 1 1 0 3 0 1 1 3 3 0 1 0 0 1 3 4 2 1 0 0 0 1 1 3 0 1 2 0 3 [38] 1 0 1 2 0 1 1 1 0 0 1 1 0 1 0 1 2 0 0 0 0 1 1 1 1 0 1 0 3 2 1 1 2 0 1 4 1 [75] 3 0 1 0 4 0 1 0 0 0 1 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 3 2 0 1 0 0 1 1 [112] 2 0 1 0 2 1 2 0 0 4 0 2 1 1 1 0 0 0 1 0 0 0 1 2 1 1 1 1 0 0 1 1 1 1 0 0 1 [149] 1 0 0 0 3 0 1 3 2 2 1 0 0 1 2 0 1 0 0 1 2 1 3 1 3 1 2 2 3 1 0 1 0 0 2 0 1 [186] 0 0 0 1 1 1 0 1 1 2 2 0 0 0 1 0 2 0 2 2 0 2 2 1 0 3 0 0 3 0 0 0 1 2 1 0 1 [223] 0 2 0 0 0 1 2 1 1 2 0 2 3 1 1 1 1 3 0 0 1 2 2 3 0 2 1 1 4 0 2 1 2 0 0 2 0 [260] 1 1 1 1 2 2 0 2 1 1 1 2 0 0 2 0 2 0 1 0 1 2 1 1 2 1 1 3 1 1 0 0 2 0 1 1 1 [297] 1 0 0 2 2 1 2 0 0 1 2 3 2 4 1 1 0 0 1 0 2 1 1 2 1 0 1 2 1 2 1 2 1 0 1 1 0 [334] 2 2 3 1 2 1 1 1 0 2 0 1 0 2 2 1 0 4 1 0 0 0 1 1 2 2 0 1 2 1 2 0 2 1 3 0 0 [371] 0 1 0 1 3 0 3 1 1 3 4 0 2 0 2 0 1 2 0 1 1 2 1 1 2 2 2 1 1 0 1 1 0 0 1 3 1 [408] 0 1 3 1 0 3 2 2 0 0 2 1 1 1 0 2 0 0 0 0 0 1 0 0 2 1 2 0 3 1 1 3 0 2 2 1 1 [445] 1 0 2 1 0 3 1 0 1 1 0 2 2 1 0 1 0 1 0 0 1 1 0 0 2 4 0 0 1 0 0 0 0 1 2 0 0 [482] 3 1 0 1 1 1 1 0 3 1 0 1 0 1 2 1 2 0 0 2 0 1 0 1 2 0 0 0 0 0 1 0 1 1 0 1 1 [519] 0 2 1 1 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 2 1 2 0 1 2 1 2 3 1 2 1 1 0 0 1 1 [556] 1 2 2 1 0 2 1 2 0 1 1 0 0 0 3 2 0 3 1 0 2 0 1 0 1 0 0 1 1 3 0 3 1 1 1 2 2 [593] 0 2 3 0 2 0 0 1 1 0 0 1 2 0 0 0 2 2 1 1 0 2 1 1 0 0 0 2 2 0 2 0 1 0 2 1 4 [630] 0 0 1 1 1 0 0 2 1 0 0 0 0 3 0 1 1 2 0 0 1 2 2 1 0 2 2 0 1 0 3 1 0 1 1 0 2 [667] 5 2 2 1 0 2 2 2 0 1 0 1 1 0 1 1 1 3 0 0 1 1 0 0 2 2 0 1 2 1 0 1 2 3 1 0 0 [704] 2 0 0 1 4 1 2 1 0 1 0 0 2 4 1 1 0 0 0 2 0 2 0 0 2 0 0 0 0 1 3 2 1 1 1 1 1 [741] 0 0 1 0 1 0 2 2 1 2 2 0 1 2 1 2 0 2 2 2 1 1 2 0 1 1 0 0 2 2 1 0 1 0 2 2 1 [778] 0 1 0 1 0 0 0 0 1 0 1 2 1 2 1 0 2 1 3 0 3 1 0 1 0 1 1 4 1 0 0 2 0 4 1 0 0 [815] 1 0 0 0 1 1 0 1 1 1 2 5 3 0 0 1 1 2 2 1 2 0 1 0 0 1 1 2 2 2 0 1 0 0 0 0 2 [852] 2 0 2 1 1 1 0 2 1 1 1 0 0 3 1 1 1 0 0 0 1 1 0 0 0 0 2 1 0 2 2 3 0 1 0 2 0 [889] 0 0 0 0 2 1 2 0 1 2 0 0 0 1 0 3 1 1 2 3 1 0 1 0 0 1 1 0 1 0 0 2 1 1 0 0 2 [926] 1 1 1 0 0 3 1 2 1 1 0 1 2 2 1 0 1 0 0 3 0 1 2 4 1 0 0 1 0 1 3 0 0 1 3 1 1 [963] 1 0 1 1 0 1 1 1 2 0 1 0 4 0 1 0 0 1 0 0 1 0 0 0 1 0 0 2 1 0 1 1 2 0 1 2 2 [1000] 1
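As a side note, the same counts can be obtained without an explicit loop; this vectorized version is our addition, not part of the original code:
cases <- colSums(courses == "Yes")  # sums the TRUE values in each column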
It is hard to see the pattern in the raw data
table(cases)
cases
  0   1   2   3   4   5
380 358 188  56  16   2
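For comparison, we can compute the counts we would expect under a Binomial(97, 0.01) model; this check is our addition:
round(1000 * dbinom(0:5, size = 97, prob = 0.01))  # expected counts out of 1000 courses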
barplot(table(cases))
What are your results?
We see that about 38% of courses have nobody with epilepsy.
Only about a third of the time do we see exactly one person with epilepsy among the 97.
How can this be consistent with a 1% rate in the world population? Did we do something wrong?
mean(courses=="Yes")
[1] 0.01006186
No problem here: the overall proportion is ~1%.
We saw the practical consequences of a theoretical probability.
The simulation allows us to see the meaning of a population property
These two problems are related:
From reality to theory: we do experiments to learn about the population, in particular about the probabilities of each outcome, that is, the probability distribution.
From theory to reality: assuming some probabilities, we can use the computer to explore what the consequences would be. If the consequences do not match reality, we have to change our hypothesis.
Many important or interesting questions can be answered YES or NO.
For example:
cases[723]>0
or directly
sum(courses[,723]=="Yes")>0
Even easier and clearer:
any(courses[,723]=="Yes")
We store the official result in the vector official
We store our bet in the vector our_bet.
If official and our_bet are sorted, we can check whether they match:
all(official==our_bet)
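For instance, a minimal sketch (the 6-out-of-49 format and all the numbers are our assumptions):
official <- sort(sample(1:49, size = 6))   # simulated official draw
our_bet <- sort(c(3, 7, 19, 24, 31, 42))   # our made-up bet
all(official == our_bet)                   # TRUE only if all six numbers match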
s <- sum_two_dice()  # sum_two_dice() is assumed defined earlier in the course
s==2 | s==4 | s==6 | s==8 | s==10 | s==12
It is better to use the remainder (modulo) operator:
s %% 2 == 0
We normally represent events with a function
event <- function(outcome) {...}
Then we can calculate the empirical frequency of an event
result <- replicate(N, event(simulation(...)))
freq <- sum(result)/N
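For example, a minimal sketch instantiating this pattern with the two-dice event above (the helpers sum_two_dice and is_even are defined here for illustration):
sum_two_dice <- function() sum(sample(1:6, size = 2, replace = TRUE))
is_even <- function(s) s %% 2 == 0
N <- 10000
result <- replicate(N, is_even(sum_two_dice()))
sum(result)/N  # should be close to 0.5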
We have to choose N carefully.
We do experiments to estimate the value of probabilities
We use the empirical frequencies
There is a lot of variability
How good is our experiment?
We will test different probabilities. Let’s start with 0.5.
p <- 0.5
We can simulate n experiments with success probability p:
empirical_freq <- function(n, p) {
  experiments <- sample(c(TRUE, FALSE), prob = c(p, 1-p), size = n, replace = TRUE)
  return(sum(experiments)/n)
}
N <- c(10, 100, 1000, 10000)
plot(x=range(N), y=c(0,1), log="x", type="n")
for (n in N) {
  y <- replicate(200, empirical_freq(n, p))
  points(x=rep(n, 200), y=y, pch=19)
}
abline(h=p)
N <- 10^seq(from=1, to=4, by=0.1)
plot(x=range(N), y=c(0,1), log="x", type="n")
for (n in N) {
  y <- replicate(200, empirical_freq(n, p))
  points(x=rep(n, 200), y=y, pch=19, cex=0.5)
}
abline(h=p)
For each N we can find intervals containing most of the experiments. With few data we can be far away from the real probability.
If we have more data, then we are probably closer
We can even know how far away.
The Central Limit Theorem tells us how big that distance is.
If we have enough data, the empirical frequency \(m/n\) cannot be far away from the real probability \(p\): with ~95% confidence, \(m/n\) lies in the interval \(p \pm 2\sqrt{p(1-p)/n}\).
The interval depends on \(p\), which we usually do not know.
If we only know \(m/n\), how far away can the real probability be?
We cannot use the same formula: it fails when \(m/n\) is 0, because the estimated standard deviation is then also 0 and the interval collapses to a single point.
The result depends on a value \(z\). Let \[\begin{aligned}\tilde{n} &= n + z^2\\ \tilde{p} &= \frac{m+z^2/2}{\tilde{n}} = \frac{m+z^2/2}{n+z^2}\end{aligned}\] Then \(\tilde{\sigma}(z) = \sqrt{\tilde{p}\left(1-\tilde{p}\right)/\tilde{n}}\), and the real probability \(p\) is in the range \(\tilde{p} \pm z\,\tilde{\sigma}(z)\).
Let \(z=2\). Then \(\tilde{n} = n + 4\) and \({\tilde {p}}=(m+2)/(n + 4).\)
The confidence interval for \(p\) is given by \[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
The real probability is in this range with 95% confidence.
This is the formula we have to remember
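A minimal R sketch of this interval (the function name agresti_coull is ours):
agresti_coull <- function(m, n, z = 2) {
  n_tilde <- n + z^2
  p_tilde <- (m + z^2/2) / n_tilde
  sigma_tilde <- sqrt(p_tilde * (1 - p_tilde) / n_tilde)
  c(lower = p_tilde - z * sigma_tilde, upper = p_tilde + z * sigma_tilde)
}
agresti_coull(m = 2, n = 97)  # e.g. 2 cases among 97 students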
any(...): was the event true at least once?
all(...): was it true every time?
sum(replicate(N, event(...))): how many times did the event happen?
sum(replicate(N, event(...)))/N: the empirical frequency of the event
Agresti, Alan; Coull, Brent A. (1998). “Approximate is better than ‘exact’ for interval estimation of binomial proportions”. The American Statistician. 52: 119–126. doi:10.2307/2685469. JSTOR 2685469. MR 1628435.