Each square is a sample. Volume is fixed. The cell count is an average of cell counts of some squares.
We want population cell density
We have a sample of cell densities
We care about sd(x)
because it tells us how close the
mean is to most of the population
It can be proved that, for any \(k>0\), \[\Pr(\vert x_i-\bar{\mathbf x}\vert\geq k\cdot\text{sd}(\mathbf x))\leq 1/k^2\]
In other words, the probability that “the distance between the mean \(\bar{\mathbf x}\) and any element \(x_i\) is at least \(k\cdot\text{sd}(\mathbf x)\)” is at most \(1/k^2\)
It is always valid, for any probability distribution
(Later we will see better rules valid only sometimes)
It can also be written as \[\Pr(\vert x_i-\bar{\mathbf x}\vert\leq k\cdot\text{sd}(\mathbf x))\geq 1-1/k^2\]
The probability that “the distance between the mean \(\bar{\mathbf x}\) and any element \(x_i\) is at most \(k\cdot\text{sd}(\mathbf x)\)” is at least \(1-1/k^2\)
Another way to understand the meaning of this theorem is \[\Pr(\bar{\mathbf x} -k\cdot\text{sd}(\mathbf x)\leq x_i \leq \bar{\mathbf x} +k\cdot\text{sd}(\mathbf x))\geq 1-1/k^2\] Substituting some values for \(k\), we get
\[\begin{aligned} \Pr(\bar{\mathbf x} -1\cdot\text{sd}(\mathbf x)\leq x_i \leq \bar{\mathbf x} +1\cdot\text{sd}(\mathbf x))&\geq 1-1/1^2 = 0\\ \Pr(\bar{\mathbf x} -2\cdot\text{sd}(\mathbf x)\leq x_i \leq \bar{\mathbf x} +2\cdot\text{sd}(\mathbf x))&\geq 1-1/2^2 = 0.75\\ \Pr(\bar{\mathbf x} -3\cdot\text{sd}(\mathbf x)\leq x_i \leq \bar{\mathbf x} +3\cdot\text{sd}(\mathbf x))&\geq 1-1/3^2 = 0.889 \end{aligned}\]
For any numerical data set
The Empirical Rule and Chebyshev’s Theorem. (2021, January 11). Retrieved May 25, 2021, from https://stats.libretexts.org/@go/page/559
pop_HD
These values should be at least 0, 0.75 and 0.889
[1] 0.583
[1] 1
[1] 1
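A check of this kind can be sketched in Python. This is only an illustration: the `pop_HD` data are not reproduced here, so the block draws a hypothetical skewed population, and the helper name `fraction_within` is invented for this example.

```python
import random
import statistics

def fraction_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # standard deviation of the whole data set
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

# Hypothetical skewed population (assumption: NOT the pop_HD data from the text).
rng = random.Random(1)
population = [rng.expovariate(1 / 20) for _ in range(10_000)]

for k in (1, 2, 3):
    bound = 1 - 1 / k**2  # Chebyshev lower bound
    print(k, round(fraction_within(population, k), 3), ">=", round(bound, 3))
```

Whatever the shape of the population, each observed fraction should be no smaller than the corresponding bound.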
Moreover, the sample average is often different from the population average
When the sample size is big,
the sample average tends to be closer to
the population average
(Intercept) log(size)
3.200 -0.516
\[\log(\text{sd_sample_mean}) = 3.2 - 0.516\cdot\log(\text{size})\] \[\begin{aligned}\text{sd_sample_mean} & = \exp(3.2) \cdot\text{size}^{-0.516}\\ & = 24.529\cdot\text{size}^{-0.516} \end{aligned}\]
\[\text{sd_sample_mean} = A\cdot \text{size}^B\]
Population | A | B | std dev population |
---|---|---|---|
pop_LD | 3.45 | -0.5261 | 3.188 |
pop_MD | 9.455 | -0.5167 | 8.919 |
pop_HD | 24.53 | -0.5158 | 23.14 |
Coefficient \(A\) is close to the standard
deviation of the population
Coefficient \(B\) is close to -0.5
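The fit behind a table like this can be approximated with a small simulation. The sketch below is a guess at the procedure, not the original analysis: the populations `pop_LD`, `pop_MD` and `pop_HD` are not available here, so it uses a hypothetical population, samples with replacement, and fits the line by ordinary least squares on the log scale.

```python
import math
import random
import statistics

rng = random.Random(42)
# Hypothetical population (assumption: stands in for pop_LD / pop_MD / pop_HD).
population = [rng.gauss(100, 20) for _ in range(5_000)]
pop_sd = statistics.pstdev(population)

def sd_of_sample_means(size, repeats=2_000):
    """Standard deviation of the means of many random samples of a given size."""
    means = [statistics.fmean(rng.choices(population, k=size))
             for _ in range(repeats)]
    return statistics.stdev(means)

sizes = [4, 9, 16, 25, 49, 100]
xs = [math.log(n) for n in sizes]
ys = [math.log(sd_of_sample_means(n)) for n in sizes]

# Least-squares fit of log(sd_sample_mean) = log(A) + B * log(size)
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
B = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
A = math.exp(y_bar - B * x_bar)
print(f"A = {A:.2f}, B = {B:.3f}, population sd = {pop_sd:.2f}")
```

With enough repetitions, `B` should land near -0.5 and `A` near the population standard deviation, matching the pattern in the table.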
If we know the population standard deviation, we can predict the standard deviation of the sample mean
\[\text{sd(sample mean)} = \frac{\text{sd(population)}}{\sqrt{\text{sample size}}}\]
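A short derivation shows where the exponent \(-0.5\) comes from. It assumes that the \(n\) sample values \(x_1,\dots,x_n\) are drawn independently from a population with standard deviation \(\sigma\); the symbols \(n\) and \(\sigma\) are introduced here only for this illustration.
\[\begin{aligned}\text{var}(\bar{\mathbf x}) &= \text{var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{var}(x_i) = \frac{\sigma^2}{n}\\ \text{sd}(\bar{\mathbf x}) &= \frac{\sigma}{\sqrt{n}} = \sigma\cdot n^{-0.5}\end{aligned}\]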
Using Chebyshev's inequality, we know that, with probability at least \(1-1/k^2\), \[\vert \text{mean(sample)} -\text{mean(population)}\vert < k\cdot\frac{\text{sd(population)}}{\sqrt{\text{sample size}}}\]
Therefore the population average is inside the interval \[\text{mean(sample)} \pm k\cdot\frac{\text{sd(population)}}{\sqrt{\text{sample size}}}\] (probably)
Remember that we know neither the population mean nor the population variance
So we do not know the population standard deviation 😕
In most cases we can use the sample standard deviation
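As a rough sketch of how the interval could be computed in practice, the block below plugs the sample standard deviation in place of the unknown population value. The function name `chebyshev_interval` and the cell counts are made up for illustration, and the substitution is an approximation, so the \(1-1/k^2\) guarantee holds only approximately.

```python
import math
import statistics

def chebyshev_interval(sample, k):
    """Interval mean(sample) +/- k * sd(sample) / sqrt(sample size)."""
    n = len(sample)
    centre = statistics.fmean(sample)
    # Assumption: the sample sd stands in for the unknown population sd.
    half_width = k * statistics.stdev(sample) / math.sqrt(n)
    return centre - half_width, centre + half_width

# Hypothetical cell counts from 8 squares (made-up numbers).
counts = [52, 47, 61, 55, 49, 58, 50, 53]
low, high = chebyshev_interval(counts, k=3)
print(f"mean = {statistics.fmean(counts):.2f}, interval = ({low:.2f}, {high:.2f})")
```

A larger `k` gives a wider interval and a higher guaranteed probability; a larger sample gives a narrower interval for the same `k`.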
Answering scientific questions
Highest kidney cancer rates in US (1980–1989) were in rural areas
Why? Maybe…
Lowest kidney cancer rates in US (1980–1989) were also in rural areas
Why? Maybe…
Something is wrong, of course. The rural lifestyle cannot explain both very high and very low incidence of kidney cancer.
What is the relationship between sample average and population average?
Can we learn the population average from the sample average?
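The kidney cancer puzzle can be explored with a small simulation. The sketch below assumes that every county has exactly the same true rate, so any difference between counties comes only from sample size; the rate, the county sizes and the name `observed_rate` are hypothetical and are not taken from the real data.

```python
import random

rng = random.Random(7)
TRUE_RATE = 0.02  # hypothetical rate, identical in every simulated county

def observed_rate(population):
    """Observed rate in one county where each person is a case with TRUE_RATE."""
    cases = sum(1 for _ in range(population) if rng.random() < TRUE_RATE)
    return cases / population

small = [observed_rate(200) for _ in range(100)]    # small (rural-like) counties
large = [observed_rate(5_000) for _ in range(100)]  # large (urban-like) counties

print("small counties: min", min(small), "max", max(small))
print("large counties: min", min(large), "max", max(large))
```

Under these assumptions the small counties show both the lowest and the highest observed rates, even though nothing about lifestyle differs between them.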