A “coin” (officially: a Bernoulli random variable) is an experiment that has two possible outcomes: success, with probability 𝑝, and failure, with probability 𝑞
Since these are the only two outcomes, 𝑝 + 𝑞 = 1
This can represent the effect of a virus in the population, where 𝑝 and 𝑞 correspond to the proportions of sick and healthy people
We would like to know 𝑝
If we are rich and powerful, we can test the whole population
Sometimes this is impossible, since the population may be
“all living organisms”
or at least
“all legumes in the last million years”
We need another strategy
If we represent success with 1 and failure with 0, then each individual is a random variable 𝑋. Then we have \[\begin{aligned} 𝔼X &= \sum_{x∈Ω} x⋅ℙ(X=x) \\ & = 1⋅ℙ(X=1) + 0⋅ℙ(X=0) \\ & = p\end{aligned}\]
That is, the expected value of this “coin” is the probability of success. We want to know \(𝔼X\)
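The claim that the expected value of a “coin” equals the probability of success can be checked with a short simulation. This is only a sketch: the value `p = 0.3`, the seed, and the helper name `bernoulli` are illustrative assumptions, not part of the original material.

```python
import random

def bernoulli(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)   # fixed seed so the run is reproducible
p = 0.3                   # hypothetical proportion of sick people

# Draw many independent "coins" and average them.
draws = [bernoulli(p, rng) for _ in range(100_000)]
empirical_mean = sum(draws) / len(draws)

# The average of many draws should be close to p = E[X].
print(empirical_mean, p)
```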
We know that \(p=𝔼X,\) that is, \(p\) is the mean value of successes in the population
So maybe it is a good idea to get a sample of size 𝑛 and calculate its mean value
But the sample is random, so the sample mean will be a random variable
What is the expected value of the sample mean?
What is its relation to the population mean?
Let’s assume that we have a small sample \((X_1,…,X_n).\)
All \(X_i\) are random variables taken from the same population. We take the average or sample mean: \[\text{mean}(X_1,…,X_n)=\bar{𝐗}=\frac{1}{n}\sum_i X_i\] Since the sample is random, \(\bar{𝐗}\) is also a random variable
What is the expected value of \(\bar{𝐗}\)?
We often assume that all outcomes in the sample are independent and identically distributed (i.i.d.)
In that case we can compute the expected value and the variance of \(\bar{𝐗}\)
Since \(𝔼(α X+βY)=α𝔼(X)+β𝔼(Y),\) we have \[𝔼(\bar{𝐗})=𝔼\left(\frac{1}{n}\sum_i X_i\right)=\frac{1}{n}𝔼\sum_i X_i=\frac{1}{n}\sum_i𝔼 X_i\] and since all \(X_i\) come from the same population \[𝔼(\bar{𝐗})=\frac{1}{n}\sum_i𝔼 X=\frac{n}{n}𝔼 X=𝔼 X\]
Good!
When \(X\) and \(Y\) are independent, we have \(𝕍(α X+βY)=α^2𝕍(X)+β^2𝕍(Y).\) Since the \(X_i\) are independent, \[𝕍(\bar{𝐗})=𝕍\left(\frac{1}{n}\sum_i X_i\right)=\frac{1}{n^2}𝕍\sum_i X_i=\frac{1}{n^2}\sum_i 𝕍 X_i\] and since all \(X_i\) come from the same population \[𝕍(\bar{𝐗})=\frac{1}{n^2}\sum_i 𝕍 X=\frac{n}{n^2}𝕍 X=\frac{1}{n}𝕍 X\] So averages of bigger samples have smaller variance
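Both results, \(𝔼(\bar{𝐗})=𝔼X\) and \(𝕍(\bar{𝐗})=𝕍(X)/n,\) can be checked empirically. The sketch below assumes a Bernoulli population with a hypothetical \(p=0.3\) and uses the standard fact (not derived in this text) that such a variable has variance \(p(1-p).\) The helper `sample_mean` and the sample sizes are illustrative choices.

```python
import random
import statistics

def sample_mean(n, p, rng):
    """Mean of n i.i.d. Bernoulli(p) draws."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n

rng = random.Random(0)    # fixed seed for reproducibility
p = 0.3                   # hypothetical population proportion
var_x = p * (1 - p)       # variance of one Bernoulli(p) draw

results = {}
for n in (10, 40, 160):
    # Repeat the experiment many times to see how the sample mean behaves.
    means = [sample_mean(n, p, rng) for _ in range(20_000)]
    results[n] = (statistics.fmean(means), statistics.pvariance(means))
    # Columns: n, average of sample means, their variance, predicted V(X)/n
    print(n, results[n][0], results[n][1], var_x / n)
```

Each time the sample size is multiplied by 4, the observed variance of the sample mean should drop by roughly a factor of 4.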
By Chebyshev’s inequality, for any random variable \(X\) with finite variance and any \(c>0,\) we have \[ℙ(|X-𝔼X| ≤ c\sqrt{𝕍X})≥ 1-1/c^2\] In the case of \(\bar{𝐗}\) we have \[ℙ\left(|\bar{𝐗}-𝔼\bar{𝐗}| ≤ c\sqrt{𝕍\bar{𝐗}}\right)≥ 1-1/c^2\] that is, using \(𝔼\bar{𝐗}=𝔼X\) and \(𝕍\bar{𝐗}=𝕍(X)/n,\) \[ℙ\left(|\bar{𝐗}-𝔼X| ≤ c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\]
We have \[ℙ\left(-c\sqrt{𝕍(X)/n}≤ 𝔼X-\bar{𝐗}≤ c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\] This can also be written as \[ℙ\left(\bar{𝐗}-c\sqrt{𝕍(X)/n}≤𝔼X ≤ \bar{𝐗}+c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\]
Thus, we have an interval that probably contains the population mean
We want to know the population mean 𝔼X (i.e. the proportion of sick people)
We take a random sample (we test 𝑛 people)
The population average is in the interval \[\left[\bar{𝐗}-c\sqrt{𝕍(X)/n}, \bar{𝐗}+c\sqrt{𝕍(X)/n}\right]\] with probability at least \(1-1/c^2\)
This is called a confidence interval
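The interval can be computed directly from a sample, and the bound \(1-1/c^2\) can be checked by repeating the experiment many times. This is a sketch under strong assumptions: the population variance is treated as known, the population is Bernoulli with a hypothetical \(p=0.3,\) and the function name `chebyshev_interval` is made up for illustration.

```python
import math
import random

def chebyshev_interval(sample, var_x, c):
    """Interval [mean - c*sqrt(V/n), mean + c*sqrt(V/n)] around the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    half_width = c * math.sqrt(var_x / n)
    return mean - half_width, mean + half_width

rng = random.Random(1)    # fixed seed for reproducibility
p = 0.3                   # hypothetical true proportion (unknown in practice)
var_x = p * (1 - p)       # assumed known here, which is rarely true
c, n, trials = 2, 100, 5_000

hits = 0
for _ in range(trials):
    sample = [1 if rng.random() < p else 0 for _ in range(n)]
    low, high = chebyshev_interval(sample, var_x, c)
    if low <= p <= high:
        hits += 1

coverage = hits / trials
# Observed fraction of intervals containing p, and the guaranteed lower bound.
print(coverage, 1 - 1 / c**2)
```

The guarantee is only a lower bound, so the observed coverage is expected to be at least \(1-1/c^2\) and may be noticeably higher.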
This is an important result
It says that
As you know, there are two schools of probability
The Law of Large Numbers shows that, if samples are large, both points of view give the same result
The margin of error \(c\sqrt{𝕍(X)/n}\) is inversely proportional to the square root of the sample size, \(\sqrt{n}\)
Thus, to double the precision (halve the margin of error), we need 4 times more data
To get one more decimal place (10 times more precision) we need 100 times more data
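Solving \(m=c\sqrt{𝕍(X)/n}\) for \(n\) gives \(n=c^2\,𝕍(X)/m^2,\) which makes the “4 times” and “100 times” rules easy to see numerically. In this sketch the variance `var_x = 0.25`, the value `c = 2`, and the function names are hypothetical choices for illustration only.

```python
import math

def margin_of_error(var_x, n, c):
    """Half-width of the interval: c * sqrt(V(X) / n)."""
    return c * math.sqrt(var_x / n)

def required_sample_size(var_x, margin, c):
    """Smallest integer n with c * sqrt(V(X) / n) <= margin."""
    return math.ceil(c**2 * var_x / margin**2)

var_x = 0.25   # hypothetical population variance
c = 2          # gives probability at least 1 - 1/4 = 0.75

n_base = required_sample_size(var_x, 0.10, c)
n_half = required_sample_size(var_x, 0.05, c)    # margin halved
n_tenth = required_sample_size(var_x, 0.01, c)   # margin divided by 10

# Ratios should be close to 4 and 100 respectively.
print(n_base, n_half, n_tenth)
print(n_half / n_base, n_tenth / n_base)
```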
How many people do you need to interview to estimate the average age of the Turkish population with a margin of error of 5 years?
… of 1 year?
… of 1 month?
The margin of error depends on the standard deviation of the population
That is, the square root of the population variance \(𝕍(X)\)
But we do not know the population variance
What can we do?