We have three levels of knowledge
We need the three levels
Replicating the experiment produces a sample
Experiments produce samples
We care about populations
Populations are BIG.
Like “all people on the planet” or “all experiments in all parallel universes”
In the class we use known populations
In real life, populations are (partially) unknown
Experiments give samples, but we care about populations
What happens in the sample depends on the population
By looking at the samples we can learn about the population
Populations are big
Simulate experiments using the sample() function
Prepare the outcomes vector
Use sample(outcomes, size=n) to get n random elements from outcomes
Most times (but not always) we use replace=TRUE
This allows size to be bigger than length(outcomes)
More important: probabilities do not change
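A minimal sketch of such a simulation. The six-sided die used as the `outcomes` vector and the sample size of 20 are assumptions chosen only for illustration.

```r
# Hypothetical population: the faces of a six-sided die
outcomes <- 1:6

set.seed(1)  # only to make this run reproducible
n <- 20

# With replace = TRUE, size can be bigger than length(outcomes)
rolls <- sample(outcomes, size = n, replace = TRUE)
rolls

# Without replacement we cannot take more elements than the vector has
too_many <- try(sample(outcomes, size = 7), silent = TRUE)
inherits(too_many, "try-error")
```

The last line shows that asking for 7 elements out of 6 without replacement is an error, which is one reason to use `replace=TRUE`.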
When populations have infinite size, taking a sample has no effect
When we replace the sample, there is no effect on the population
Sampling with replacement is the same as having an infinite population
Samples are never the same, but they are similar
Bigger sample sizes will produce more similar results
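To see this effect, we can replicate the same experiment several times with a small and a large sample size. This is only a sketch: the fair coin and the helper `prop_heads()` are hypothetical names introduced for the example.

```r
# Hypothetical population: a fair coin
outcomes <- c("H", "T")

set.seed(2)  # only to make this run reproducible

# Proportion of "H" in one simulated experiment of size n
prop_heads <- function(n) {
  mean(sample(outcomes, size = n, replace = TRUE) == "H")
}

# Five replicas with a small sample and five with a large sample
small <- replicate(5, prop_heads(10))
large <- replicate(5, prop_heads(10000))

small  # these proportions typically differ a lot from each other
large  # these proportions are typically all close to 0.5
```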
The probability of an outcome is the proportion of that outcome in the population
In real life we usually do not know the probabilities, and we want to find them
In some cases we do know the probability of each outcome
Then we can simulate the experiment
In sample(), each outcome can have a different probability
Use prob= to change the probability distribution
If prob= is not given, then all outcomes have the same probability
Use prob=p to change how sample() works
These simulations are called “Monte-Carlo Methods”
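A short sketch of sampling with unequal probabilities. The DNA letters and the values in `p` are made up for the example; `p` must have one probability for each element of `outcomes`.

```r
outcomes <- c("A", "C", "G", "T")
p <- c(0.1, 0.4, 0.4, 0.1)  # hypothetical probabilities, one per outcome

set.seed(3)  # only to make this run reproducible
dna <- sample(outcomes, size = 1000, replace = TRUE, prob = p)

# Proportion of each outcome in the sample, to compare with p
table(dna) / length(dna)
```

The proportions in the sample should be similar to `p`, but not identical.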
This allows us to explore cases that have too many combinations
We cannot see all possible genes of length 1000bp
There are \(4^{1000} = 2^{2000} = 2^{10\times 200} ≈ 10^{3\times 200} = 10^{600}\) combinations
The age of the universe is ≈ \(4.32\times 10^{17}\) seconds
Testing all possible cases is impossible
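The arithmetic can be checked with logarithms, since the number itself is too big for ordinary floating point. This is a sketch; the variable names are only for illustration.

```r
# 4^1000 overflows the usual numeric type
is.infinite(4^1000)

# Work with the exponent instead: log10(4^1000) = 1000 * log10(4)
log10_combinations <- 1000 * log10(4)
log10_combinations  # about 602, so roughly 10^600 combinations

# Age of the universe in seconds, value taken from the text
age_universe_s <- 4.32e17
log10(age_universe_s)  # about 17.6

# Orders of magnitude still missing if we test one gene per second
log10_combinations - log10(age_universe_s)
```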
Random sampling allows us to get an idea of all possible cases
More simulations give better approximations, but take more time
This is one of the most common uses of computers in Science
An event is either TRUE or FALSE
Outcomes and events are different
An event can be true for several different outcomes
An experiment produces only one outcome, and several events
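One way to represent events in R is as functions that take outcomes and return `TRUE` or `FALSE`. The sketch below assumes a six-sided die; the names `is_even()` and `is_high()` are hypothetical.

```r
# Hypothetical population: the faces of a six-sided die
outcomes <- 1:6

# Event "the result is even": TRUE for the outcomes 2, 4 and 6
is_even <- function(x) x %% 2 == 0

# Event "the result is at least 5": TRUE for the outcomes 5 and 6
is_high <- function(x) x >= 5

set.seed(4)  # only to make this run reproducible

# One experiment gives one outcome, and each event is TRUE or FALSE for it
roll <- sample(outcomes, size = 1)
roll
is_even(roll)
is_high(roll)

# Proportion of simulated outcomes where the event is TRUE
rolls <- sample(outcomes, size = 10000, replace = TRUE)
mean(is_even(rolls))
```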
Write a function to represent an event
The function returns TRUE or FALSE depending on the event rule
Its input is the outcomes vector
The population standard deviation measures the population width
Chebyshev's theorem says \[ℙ(|x_i-\bar{\mathbf x}|≥ k⋅\text{sd}(\mathbf x))≤ 1/k^2\] It can also be written as \[ℙ(|x_i-\bar{\mathbf x}|≤ k⋅\text{sd}(\mathbf x))≥ 1-1/k^2\]
Find the population width for different values of \(k\)
At least 75% of the population is within 2 standard deviations of the average \[ℙ(|x_i-\bar{\mathbf{x}}|≤ 2⋅\text{sd}(\mathbf x))≥ 1-1/2^2\] \[ℙ(\bar{\mathbf{x}} -2⋅\text{sd}(\mathbf x)≤ x_i ≤ \bar{\mathbf{x}} +2⋅\text{sd}(\mathbf x)) ≥ 0.75\]
At least 88.9% of the population is within 3 standard deviations of the average \[ℙ(\bar{\mathbf{x}} -3⋅\text{sd}(\mathbf x)≤ x_i ≤ \bar{\mathbf{x}} +3⋅\text{sd}(\mathbf x)) ≥ 0.889\]
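These bounds can be checked on a simulated population. This is a sketch under assumptions: the skewed (exponential) population is made up, and the standard deviation is computed dividing by the population size, while R's `sd()` divides by one less.

```r
set.seed(5)  # only to make this run reproducible

# Hypothetical known population with a skewed shape
x <- rexp(100000)

# Population standard deviation (divides by N, unlike sd())
pop_sd <- sqrt(mean((x - mean(x))^2))

k <- c(2, 3)
bound <- 1 - 1 / k^2  # 0.75 and about 0.889

# Observed proportion of the population inside each interval
observed <- sapply(k, function(ki) mean(abs(x - mean(x)) <= ki * pop_sd))

data.frame(k = k, bound = bound, observed = observed)
```

The observed proportions are never below the bound, and are usually well above it.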
Which value of \(k\) will give you an interval containing at least 99% of the population?
(this 99% is called the confidence level)
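One possible way to answer this kind of question is to solve \(1-1/k^2 = c\) for \(k\), which gives \(k = 1/\sqrt{1-c}\). A sketch, where `chebyshev_k()` is a hypothetical helper name:

```r
# Smallest k such that Chebyshev guarantees at least `conf`
# of the population within k standard deviations of the average
chebyshev_k <- function(conf) {
  1 / sqrt(1 - conf)
}

chebyshev_k(0.75)  # 2, as in the first example
chebyshev_k(0.99)  # 10, up to rounding error
```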
Everything we measure will be in an interval
The interval depends on the population standard deviation and the confidence level
Chebyshev's theorem is always true, but in some cases it is pessimistic
In some cases we can have better confidence levels
You can take the average of a sample to estimate the population average
The sample average is a random variable: it changes with every experiment
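A simulation makes this visible. The sketch below assumes a made-up known population (normal, average 50, standard deviation 10) and a sample size of 25.

```r
set.seed(6)  # only to make this run reproducible

# Hypothetical known population
population <- rnorm(100000, mean = 50, sd = 10)

# Repeat the experiment 1000 times and keep each sample average
sample_means <- replicate(
  1000,
  mean(sample(population, size = 25, replace = TRUE))
)

mean(population)     # the population average
head(sample_means)   # each experiment gives a different estimate
range(sample_means)  # the estimates stay around the population average
```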
Here outcomes are real numbers
Any real number is possible
Probability of any \(x\) is zero (!)
We look for probabilities of intervals
≈95% of a normal population is between \(\bar{\mathbf x}-2⋅\text{sd}(\mathbf x)\) and \(\bar{\mathbf x}+2⋅\text{sd}(\mathbf x)\)
≈99.7% of a normal population is between \(\bar{\mathbf x}-3⋅\text{sd}(\mathbf x)\) and \(\bar{\mathbf x}+3⋅\text{sd}(\mathbf x)\)
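These proportions can be computed with `pnorm()`, the cumulative distribution function of the normal distribution. The sketch uses the standard normal (average 0, standard deviation 1), so the limits are just \(±k\).

```r
# Proportion of a standard normal population between -2 and 2
within2 <- pnorm(2) - pnorm(-2)
within2  # about 0.954

# Proportion between -3 and 3
within3 <- pnorm(3) - pnorm(-3)
within3  # about 0.997
```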
If we have 95% of the population in the center, then we have 2.5% to the left and 2.5% to the right
We can find the \(k\) value using R
[1] -1.959964
[1] 1.959964
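The two values above match the 2.5% and 97.5% quantiles of the standard normal distribution. A sketch, assuming they were obtained with `qnorm()`:

```r
# Value that leaves 2.5% of the population to the left
k_low <- qnorm(0.025)
k_low

# Value that leaves 2.5% of the population to the right
k_high <- qnorm(0.975)
k_high
```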
If we have \(1-\alpha\) of the population in the center, then we have \(\alpha/2\) to the left and \(\alpha/2\) to the right
Now you can find any interval
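The same idea can be wrapped in a function. This is only a sketch for a normal population; `normal_interval()` is a hypothetical name, and the average 50 and standard deviation 10 in the usage example are made up.

```r
# Interval containing a proportion `conf` of a normal population,
# leaving alpha/2 on each side
normal_interval <- function(conf, mean = 0, sd = 1) {
  alpha <- 1 - conf
  qnorm(c(alpha / 2, 1 - alpha / 2), mean = mean, sd = sd)
}

normal_interval(0.95)                      # standard normal
normal_interval(0.95, mean = 50, sd = 10)  # hypothetical population
```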
You may have noticed that we never get 100% confidence
That is a fact of life that we have to accept
To have very high confidence, we need wide intervals
But wide intervals are less useful
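For a normal population, the trade-off can be seen by computing \(k\) for increasing confidence levels. A sketch; the list of confidence levels is arbitrary.

```r
# Confidence levels, ending with the impossible 100%
conf <- c(0.90, 0.95, 0.99, 0.999, 1)

# Half-width of the interval, in standard deviations
k <- qnorm(1 - (1 - conf) / 2)

data.frame(confidence = conf, k = k)
```

The half-width grows with the confidence level, and for 100% confidence `qnorm()` returns `Inf`: the interval would have to contain every real number.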