May 10, 2019
We have three levels of knowledge
We need the three levels
The sample() function
Use sample(outcomes, size=n) to get n random elements from the vector outcomes
With replace=TRUE, size can be bigger than length(outcomes)
In sample(), each outcome can have a different probability
Use prob=p to change the probability distribution, i.e. how sample() works
If we omit prob=, then all outcomes have the same probability
Testing all possible cases is impossible
Random sampling allows us to get an idea of all possible cases
More simulations give better approximations, but take more time
This is one of the most common uses of computers in Science
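A minimal sketch of these sample() options in R (the outcomes vector is invented for illustration):

```r
outcomes <- c("A", "B", "C")

# size bigger than length(outcomes) requires replace=TRUE
sample(outcomes, size = 10, replace = TRUE)

# prob= changes the probability distribution of the outcomes
sample(outcomes, size = 10, replace = TRUE, prob = c(0.5, 0.3, 0.2))

# without prob=, all outcomes have the same probability
sample(outcomes, size = 2)
```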
Events are either TRUE or FALSE
Write a function to represent an event
The function takes the outcomes vector and returns TRUE or FALSE, depending on the event rule
The population standard deviation measures the population width
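A sketch of such an event function, with its probability estimated by random sampling (the two-dice experiment and the function name are invented examples):

```r
# event rule: the sum of two dice is at least 10
event_sum_ge_10 <- function(outcomes) {
  sum(outcomes) >= 10
}

# estimate the probability of the event by simulation
set.seed(42)
results <- replicate(10000,
                     event_sum_ge_10(sample(1:6, size = 2, replace = TRUE)))
mean(results)   # close to the exact value 6/36
```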
Chebyshev theorem says \[\Pr(\vert x_i-\bar{\mathbf x}\vert\geq k\cdot\text{sd}(\mathbf x))\leq 1/k^2\] It can also be written as \[\Pr(\vert x_i-\bar{\mathbf x}\vert\leq k\cdot\text{sd}(\mathbf x))\geq 1-1/k^2\]
Find the population width for different values of \(k\)
Which value of \(k\) will give you an interval containing at least 99% of the population?
(this 99% is called the confidence level)
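One way to explore this in R; Chebyshev guarantees a confidence level of at least \(1-1/k^2\) for the interval of \(k\) standard deviations around the mean:

```r
# minimum fraction of the population within k standard deviations
k <- c(2, 3, 5, 10)
data.frame(k = k, min_confidence = 1 - 1 / k^2)
# k = 10 gives 1 - 1/100 = 0.99, i.e. at least 99% of the population
```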
Everything we measure will be in an interval
The interval depends on the population standard deviation and the confidence level
Chebyshev's theorem is always true, but in some cases it is pessimistic
In some cases we can have better confidence levels
Here outcomes are real numbers
Any real number is possible
Probability of any \(x\) is zero (!)
We look for probabilities of intervals
≈95% of normal population is between \(-2\cdot\text{sd}(\mathbf x)\) and \(2\cdot\text{sd}(\mathbf x)\)
≈99% of normal population is between \(-3\cdot\text{sd}(\mathbf x)\) and \(3\cdot\text{sd}(\mathbf x)\)
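We can check these proportions with pnorm(), the cumulative distribution function of the standard normal:

```r
# fraction of a normal population within k standard deviations of the mean
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997
```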
If we have 95% of population in the center, then we have 2.5% to the left and 2.5% to the right
We can find the \(k\) value using R
qnorm(0.025)
[1] -1.96
qnorm(0.975)
[1] 1.96
If we have \(1-\alpha\) of population in the center, then we have \(\alpha/2\) to the left and \(\alpha/2\) to the right
qnorm(alpha/2)
qnorm(1-alpha/2)
Now you can find any interval
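For example, for a 99% confidence level:

```r
alpha <- 0.01                          # confidence level is 1 - alpha = 99%
c(qnorm(alpha/2), qnorm(1 - alpha/2))  # about -2.58 and 2.58
```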
You may have noticed that we never get 100% confidence
That is a fact of life. We have to accept it
To have very high confidence, we need wide intervals
But wide intervals are less useful
Some definitions
Distribution: probability of each outcome
When the outcome is numeric (i.e. when it is a random variable), the distribution gives the probability of each interval of values
We will represent population averages like this \[\langle f(X)\rangle \] The brackets \(\langle \quad \rangle\) represent population average
Do not confuse it with \(\bar x=\text{mean}(x)\) (sample average)
The population is big, therefore \(N\to \infty\) and \[\frac{1}{N}\sum_{x\in\text{Population}} f(x)\quad\xrightarrow{N\to \infty}\quad \langle f(X)\rangle \] If we know the proportion of each outcome in the population, we can write \[\frac{1}{N}\sum_{x\in\text{Population}} f(x)= \sum_{a\in\text{Outcomes}} \frac{n_a}{N}f(a)= \sum_{a\in\text{Outcomes}} p_a \cdot f(a)\]
If the outcomes are numbers, like optical density or concentration, then the function \(f(X)\) can be simply \[f(X)=X\] In that case the population average is simply \[\langle X\rangle\] (it’s easy, no?)
If \(X\) and \(Y\) are two random variables, then \[\langle X+Y\rangle = \langle X\rangle + \langle Y\rangle \] i.e. the average of the sum is the sum of the averages
If \(k\) is a fixed (non-random) number, then \[\langle k\, X\rangle = k\, \langle X\rangle\] i.e. constants can get out of averages
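Both rules can be checked on a small finite population (the numbers are invented):

```r
X <- c(1, 2, 3, 4)     # paired values from a small population
Y <- c(10, 20, 30, 40)
k <- 5

mean(X + Y) == mean(X) + mean(Y)   # TRUE: average of sum = sum of averages
mean(k * X) == k * mean(X)         # TRUE: constants get out of averages
```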
Variance is the average squared error. It can be used for any random variable \(X\)
We will represent population variance like this \[\mathbb V(X)=\langle(X-\langle X\rangle)^2\rangle\] Do not confuse it with sample variance \(\text{var}(x)\)
Same idea, but different context
It is good to know that \[\mathbb V(X)=\langle(X-\langle X\rangle)^2\rangle=\langle X^2\rangle-\langle X\rangle^2\]
Variance is the average of squares minus the square of average
Exercise: verify that this is true
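A quick numeric check of the identity on a small invented population (note that R's var() divides by \(N-1\), so here we compute the population variance directly):

```r
X <- c(2, 4, 4, 4, 5, 5, 7, 9)      # a small population
v1 <- mean((X - mean(X))^2)         # <(X - <X>)^2>
v2 <- mean(X^2) - mean(X)^2         # <X^2> - <X>^2
v1; v2                              # both equal 4
```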
If \(X\) and \(Y\) are two independent random variables, then \[\mathbb V(X+Y) = \mathbb V(X) + \mathbb V(Y) \] i.e. the variance of the sum is the sum of the variances
If \(k\) is a fixed (non-random) number, then \[\mathbb V(k\, X) = k^2\, \mathbb V(X)\] i.e. the variance is multiplied by the square
Exercise: verify that this is true
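A simulation sketch of both rules, using independent normal random numbers (the parameters are invented):

```r
set.seed(1)
X <- rnorm(100000, mean = 0, sd = 2)   # V(X) = 4
Y <- rnorm(100000, mean = 0, sd = 3)   # V(Y) = 9, independent of X

var(X + Y)   # close to 4 + 9 = 13
var(5 * X)   # close to 25 * 4 = 100
```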
Variance is easy to calculate and has good properties. But its values are in squared units, so they are too big.
In practice we use the square root of variance, called standard deviation
We will represent population standard deviation like this \[\mathbb S(X)=\sqrt{\langle(X-\langle X\rangle)^2\rangle}=\sqrt{\langle X^2\rangle-\langle X\rangle^2}\] Do not confuse it with sample standard deviation \(\text{sd}(x)\)
If \(X\) and \(Y\) are two independent random variables, then \[\mathbb S(X+Y) = \sqrt{\mathbb{S}(X)^2 + \mathbb{S}(Y)^2} \] i.e. the same formula of Pythagoras’ theorem
If \(k\) is a fixed (non-random) number, then \[\mathbb S(k\, X) = \vert k\vert\, \mathbb S(X)\] i.e. the standard deviation is multiplied by the absolute value of the constant
Exercise: verify that this is true
Imagine that we repeat the experiment \(m\) times. In other words, the sample size is \(m\)
We get the values \(x_1, x_2,\ldots, x_m.\) All result from the same experiment, with the same probabilities
We want to see what happens with \[\sum_{i=1}^m x_i\]
Using the rules we discussed before, we have \[\left\langle\sum_{i=1}^m x_i\right\rangle=\sum_{i=1}^m \langle X\rangle=m\,\langle X\rangle\] The average of the sum is \(m\) times the average of each value.
We also have \[\mathbb V\left(\sum_{i=1}^m x_i\right)=\sum_{i=1}^m \mathbb V( X)=m\,\mathbb V(X)\] The variance of the sum is \(m\) times the variance of each value. Finally, \[\mathbb S\left(\sum_{i=1}^m x_i\right)=\sqrt{\mathbb V\left(\sum_{i=1}^m x_i\right)}=\sqrt{m}\,\mathbb S(X)\]
If now we take the average of each sample, we have \[\left\langle\frac{1}{m}\sum_{i=1}^m x_i\right\rangle=\sum_{i=1}^m \frac{1}{m}\langle X\rangle=\langle X\rangle\] The population average of the sample means is the population average of each value.
We also have \[\mathbb V\left(\frac{1}{m}\sum_{i=1}^m x_i\right)=\sum_{i=1}^m \frac{1}{m^2}\mathbb V( X)=\frac{1}{m}\mathbb V(X)\] The variance of the sample means is \(1/m\) times the variance of each value.
Finally, \[\mathbb S\left(\frac{1}{m}\sum_{i=1}^m x_i\right)=\sqrt{\mathbb V\left(\frac{1}{m}\sum_{i=1}^m x_i\right)}=\sqrt{\frac{1}{m}\mathbb V(X)}=\frac{\mathbb S(X)}{\sqrt{m}}\]
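A simulation showing the \(\mathbb S(X)/\sqrt{m}\) effect (the population parameters are invented):

```r
set.seed(7)
m <- 25
# 10000 samples of size m from a population with S(X) = 10
sample_means <- replicate(10000, mean(rnorm(m, mean = 50, sd = 10)))

sd(sample_means)   # close to 10 / sqrt(25) = 2
```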
If we know the average and the standard deviation of the population, we can say how the sample averages will behave
If we take averages from many samples, the sample average will follow a Normal distribution (this is the Central Limit Theorem)
All we said before is true, but cannot be used directly
Because we do not know the population variance
Thus, we also do not know the population standard deviation
What can we do instead?
The solution is to use the standard deviation of the sample
(do not confuse it with standard deviation of the sample means)
But we have to pay a price: lower confidence
Published by William Sealy Gosset in Biometrika (1908)
He worked at the Guinness Brewery in Ireland
Studied small samples (the chemical properties of barley)
He called it “frequency distribution of standard deviations of samples drawn from a normal population”
The story says that Guinness did not want their competitors to know about this quality control, so he used the pseudonym “Student”
x <- seq(-4, 4, length.out = 200)  # a plotting range (assumed)
plot(x, dnorm(x), type="l", lty=2)
lines(x, dt(x, df=3), lty=1)
legend("topleft", legend = c("t-Student", "Normal"), lty = 1:2)
Intervals are wider (but less than with Chebyshev)
Here we use the sample standard deviation to approximate the population standard deviation
As we have seen, if the sample is small, these two values may be different
Thus, Student’s distribution depends on the sample size
The key idea is that the sample has \(m\) elements, but they are constrained by 1 value: the sample average
We say that we have \(m-1\) degrees of freedom
If we have 95% of population in the center, then we have 2.5% to the left and 2.5% to the right
We can find the \(k\) value if the sample size is 5
qt(0.025, df=5-1)
[1] -2.78
qt(0.975, df=5-1)
[1] 2.78
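Putting it all together, a sketch of a confidence interval built with the sample standard deviation and Student's distribution (the data values are invented):

```r
x <- c(4.8, 5.1, 5.0, 5.3, 4.9)           # a small sample, m = 5
m <- length(x)
alpha <- 0.05                             # 95% confidence level
k <- qt(1 - alpha/2, df = m - 1)          # about 2.78, matching the output above

mean(x) + c(-1, 1) * k * sd(x) / sqrt(m)  # the confidence interval
```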