There are populations and samples
Do not confuse them
Identify them
Populations are big, and sometimes imaginary
They have mean, variance and standard deviation
Variance is the square of standard deviation
Standard deviation is the square root of variance
An outcome is a random element of the population
Any outcome will probably be close to the population mean
How close? It depends on the population variance
How far from the population mean an outcome can fall also depends on the confidence level we choose
An outcome will probably be between mean(pop)-k*sd(pop) and mean(pop)+k*sd(pop)
\[\bar{X} - k\,𝕊𝔻(X)≤\text{outcome}≤\bar{X} + k\,𝕊𝔻(X)\]
The key idea is: different values of k have different probabilities
When the population is Normal (this is not always true), then
k | Probability |
---|---|
2 | ≈ 95% |
3 | ≈ 99.7% |
qnorm(1-alpha/2) | 1-alpha |
Choose your own alpha
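As a quick check of the Normal case, here is a minimal R sketch. The helper name `prob_within` is made up for this illustration; `pnorm` and `qnorm` are base R functions.

```r
# Probability that a Normal outcome falls within k standard deviations
# of the population mean (prob_within is a hypothetical helper name)
prob_within <- function(k) pnorm(k) - pnorm(-k)

prob_within(2)   # about 0.954
prob_within(3)   # about 0.997

# Going the other way: choose alpha first, then get k
alpha <- 0.05
k <- qnorm(1 - alpha/2)   # about 1.96
prob_within(k)            # 1 - alpha
```

Changing `alpha` changes k, and so changes the width of the interval.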
For any population, Normal or not, Chebyshev’s inequality always works
k | Probability |
---|---|
2 | ≥ 1-1/2² = 75% |
3 | ≥ 1-1/3² = 88.9% |
10 | ≥ 1-1/10² = 99% |
k | ≥ 1-1/k² |
If you do not have a bell-shaped curve, this is your safety net
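The Chebyshev bound can be tried on a population that is clearly not bell shaped. This is only a sketch: the exponential population, the seed, and the name `chebyshev_bound` are assumptions chosen for the demonstration.

```r
# Chebyshev lower bound on P(|outcome - mean| <= k * sd)
# (chebyshev_bound is a hypothetical helper name)
chebyshev_bound <- function(k) 1 - 1 / k^2

chebyshev_bound(c(2, 3, 10))   # 0.75, 0.889, 0.99

# A skewed, non-Normal population, assumed only for illustration
set.seed(1)
x <- rexp(100000, rate = 1)

k <- 2
# Fraction of outcomes that fall within k standard deviations of the mean
observed <- mean(abs(x - mean(x)) <= k * sd(x))
observed >= chebyshev_bound(k)   # TRUE: the bound holds
```

The observed fraction is usually well above the bound, because Chebyshev is a worst-case guarantee.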
A sample is a group of outcomes
It has size, mean, variance, and standard deviation
Each sample is different (random)
Each sample mean is random
If the sample size is large, then the sample mean has an approximately Normal distribution
If the sample mean has a Normal distribution, we need to know its parameters
The average of the sample mean is the population mean
The standard deviation of the sample mean is the population standard deviation divided by the square root of the sample size
The variance of the sample mean is the population variance divided by the sample size
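These three facts can be checked by simulation. The following is a minimal sketch: the population values (mean 10, standard deviation 3), the sample size, and the seed are hypothetical.

```r
set.seed(42)
pop_mean <- 10   # hypothetical population mean
pop_sd   <- 3    # hypothetical population standard deviation
n        <- 25   # sample size

# Draw many samples of size n and keep the mean of each one
sample_means <- replicate(10000, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

mean(sample_means)   # close to pop_mean
sd(sample_means)     # close to pop_sd / sqrt(n) = 0.6
var(sample_means)    # close to pop_sd^2 / n = 0.36
```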
The sample mean will probably be between mean(pop)±k*sd(pop)/sqrt(n)
\[\bar{X} - k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}≤\text{sample mean}≤\bar{X} + k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}\]
The key idea is: different values of k have different probabilities
k | Probability |
---|---|
2 | ≈ 95% |
3 | ≈ 99.7% |
qnorm(1-alpha/2) | 1-alpha |
Choose your own alpha
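For a concrete case, the interval for the sample mean can be computed directly when the population parameters are known. The numbers below are hypothetical.

```r
pop_mean <- 10     # hypothetical population mean
pop_sd   <- 3      # hypothetical population standard deviation
n        <- 25     # sample size
alpha    <- 0.05   # chosen by you

k  <- qnorm(1 - alpha/2)   # about 1.96
se <- pop_sd / sqrt(n)     # standard deviation of the sample mean

interval <- c(lower = pop_mean - k * se, upper = pop_mean + k * se)
interval   # roughly 8.82 to 11.18
```

With these values, about 95% of the samples of size 25 should have a mean inside this interval.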
In real life we do not know the population mean, and we want to know it.
We only know the sample mean and the sample variance
We can approximate population variance by sample variance
But we have to pay a cost
The population mean will probably be between mean(sample)±k*sd(sample)/sqrt(n)
\[\text{sample mean} - k\frac{\text{sample sd}}{\sqrt{\text{n}}}≤ \bar{X} ≤\text{sample mean} + k\frac{\text{sample sd}}{\sqrt{\text{n}}}\]
The key idea is: different values of k have different probabilities
Now k comes from Student’s t distribution
It depends on the degrees of freedom (sample size - 1)
k | Probability |
---|---|
qt(1-alpha/2, df) | 1-alpha |
Choose your own alpha
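Putting the pieces together, an interval for the population mean can be built from a single sample. This is a sketch under assumptions: the data are simulated, and the seed and parameter values are hypothetical. The base R function `t.test` is used only to cross-check the result.

```r
set.seed(7)
x <- rnorm(12, mean = 50, sd = 8)   # a hypothetical sample of size 12

n     <- length(x)
alpha <- 0.05
k     <- qt(1 - alpha/2, df = n - 1)   # k from Student's t
se    <- sd(x) / sqrt(n)               # uses the sample standard deviation

ci <- c(mean(x) - k * se, mean(x) + k * se)
ci

# Cross-check: t.test() reports the same interval
t.test(x, conf.level = 1 - alpha)$conf.int
```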
The price to pay for not knowing the population variance is to use Student’s t instead of Normal distribution.
Intervals using Student’s t are wider (and less useful)
To avoid this problem and get useful results, we need large enough samples
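To see how the cost shrinks with sample size, compare the k from Student’s t with the k from the Normal distribution. The sample sizes below are arbitrary choices for illustration.

```r
alpha <- 0.05
n <- c(5, 10, 30, 100, 1000)   # arbitrary sample sizes

cmp <- data.frame(
  n        = n,
  k_t      = qt(1 - alpha/2, df = n - 1),   # Student's t
  k_normal = qnorm(1 - alpha/2)             # Normal, about 1.96
)
cmp
```

The `k_t` column is always larger than `k_normal`, and it gets closer to 1.96 as n grows.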