This is called “Pareidolia”
🙂
Let’s say we measured the differential expression of a gene several times
We want to know if the real differential expression is not zero
We want to find a confidence interval for the real differential expression
And we want to see if 0 is in the interval
Each measurement contains the biological signal plus noise from the instrument
In general, after normalization, we can assume that the noise follows a normal distribution
If the real expression is \(μ\), then we measure \[X ∼ Normal(μ, σ^2)\]
For each gene we calculate the average \(\bar{X}\)
We know that the average will follow a normal distribution \[\bar{X} ∼ Normal(μ, σ^2/n)\]
Thus we can make an interval for \(μ\) \[\bar{X}-k⋅σ/\sqrt{n} ≤ μ ≤ \bar{X}+k⋅σ/\sqrt{n}\]
We have to choose \(k\) depending on the confidence level we aim for
We can always use Chebyshev’s Theorem
k | Probability |
---|---|
2 | ≥ 1-1/2² = 75% |
3 | ≥ 1-1/3² = 88.9% |
10 | ≥ 1-1/10² = 99% |
31.6 | ≥ 1-1/1000 = 99.9% |
k | ≥ 1-1/k² |
but these intervals are too wide to be useful
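The bounds in the table can be reproduced with a short sketch. The function name `chebyshev_bound` is only illustrative; the formula is the \(1-1/k^2\) bound from the last row of the table.

```python
def chebyshev_bound(k: float) -> float:
    """Lower bound on P(|X - mean| <= k standard deviations), for any distribution."""
    return 1 - 1 / k**2


# Same values of k as in the table above
for k in (2, 3, 10, 31.6):
    print(f"k = {k:>4}: probability >= {chebyshev_bound(k):.1%}")
```

The bound makes no assumption about the shape of the distribution, which is why it needs such large values of k.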
In this case we know that the distribution is Normal, so
k | Probability |
---|---|
1.959964 | 95% |
2 | ≈ 95.4% |
2.5758293 | 99% |
3 | ≈ 99.7% |
(These values can be found in tables, or using the computer)
Take the Normal curve with mean 0 and variance 1
We want the blue area to be large.
So the white area should be small
If the blue area is 1-α, then the white area is α
The area of each white part is α/2.
We look for the points \(k\) where the area to the left is α/2 and 1-α/2
For example, 95% confidence means that 1-α=0.95
Therefore α=0.05, and α/2=0.025
We look for 0.025 and 0.975 in the table
[1] -1.959964
[1] 1.959964
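The table lookup can also be done by computer. This is a minimal sketch using `NormalDist` from Python's standard library; the helper name `normal_k` is an assumption made for illustration.

```python
from statistics import NormalDist


def normal_k(confidence: float) -> float:
    """Value of k such that the area between -k and k under Normal(0, 1) equals `confidence`."""
    alpha = 1 - confidence
    # inv_cdf gives the point with the requested area to its left
    return NormalDist(mu=0, sigma=1).inv_cdf(1 - alpha / 2)


standard_normal = NormalDist()          # mean 0, standard deviation 1
print(standard_normal.inv_cdf(0.025))   # lower point, area 0.025 to the left
print(standard_normal.inv_cdf(0.975))   # upper point, area 0.975 to the left
print(normal_k(0.99))                   # k for 99% confidence
```

By symmetry the two points for 95% differ only in sign, so a single value of k is enough.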
Until now we have assumed that we knew the population standard deviation
But we do not
We can approximate it with the sample standard deviation
But we have to pay a price
The price depends on the degrees of freedom
The price to pay for not knowing the population variance is to use Student’s t distribution instead of the Normal distribution
Intervals using Student’s t are wider (and less useful)
To avoid this problem and get useful results, we need large enough samples
k (df=2) | k (df=5) | k (df=10) | Normal | Probability |
---|---|---|---|---|
4.3 | 2.57 | 2.23 | 1.96 | 95% |
9.92 | 4.03 | 3.17 | 2.58 | 99% |
31.6 | 6.87 | 4.59 | 3.29 | 99.9% |
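The df=2 and Normal columns can be checked with a small sketch. It relies on the fact that Student's t with exactly 2 degrees of freedom has a closed-form quantile; other degrees of freedom need a statistics library, which is not used here. The name `t2_quantile` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def t2_quantile(p: float) -> float:
    """Point with area p to its left under Student's t with 2 degrees of freedom (df = 2 only)."""
    return (2 * p - 1) / sqrt(2 * p * (1 - p))


for confidence in (0.95, 0.99, 0.999):
    p = 1 - (1 - confidence) / 2             # area to the left of the upper point
    k_t = t2_quantile(p)
    k_normal = NormalDist().inv_cdf(p)
    print(f"{confidence:.1%}: k (df=2) = {k_t:.2f}, Normal = {k_normal:.2f}")
```

With only 2 degrees of freedom, k is several times larger than the Normal value, and the gap grows with the confidence level.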
Here we have the measured differential gene expression of several genes
Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|
-0.6356720 | 0.5445543 | 0.5056405 |
0.9198619 | -0.6887110 | -0.2273942 |
1.1870043 | 1.0710029 | 1.3180957 |
0.1376069 | 1.7086511 | 1.1611300 |
0.8551033 | -1.0060231 | 0.4222059 |
There are three biological replicates for each gene
The values of the first gene are
[1] -0.6356720 0.5445543 0.5056405
The mean is
[1] 0.1381743
The standard deviation is
[1] 0.6704529
We have 𝑛=3 values, and we are estimating 1 value (the mean)
Thus, we have 3-1=2 degrees of freedom
The quantile of the t distribution for 95% confidence and 2 degrees of freedom is
[1] 4.302653
Thus, the 95%-confidence interval for the expression, \(\bar{X} ± k⋅sd(X)/\sqrt{n}\), is
[1] -1.5273 1.8037
The interval contains 0, so we cannot conclude that the gene is differentially expressed
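The steps for the first gene can be sketched as follows, using the interval \(\bar{X} ± k⋅sd(X)/\sqrt{n}\) with the sample standard deviation. The function names are illustrative, and the closed-form quantile is valid only for 2 degrees of freedom, that is, for three replicates.

```python
from math import sqrt
from statistics import mean, stdev


def t2_quantile(p: float) -> float:
    """Point with area p to its left under Student's t with 2 degrees of freedom."""
    return (2 * p - 1) / sqrt(2 * p * (1 - p))


def t_interval_df2(values, confidence=0.95):
    """Confidence interval for the mean of exactly three values (2 degrees of freedom)."""
    n = len(values)
    if n != 3:
        raise ValueError("this sketch only handles n = 3 (df = 2)")
    k = t2_quantile(1 - (1 - confidence) / 2)
    center = mean(values)
    margin = k * stdev(values) / sqrt(n)     # stdev is the sample standard deviation
    return center - margin, center + margin


gene1 = [-0.6356720, 0.5445543, 0.5056405]   # first row of the table
low, high = t_interval_df2(gene1)
print(f"95% interval: {low:.4f} to {high:.4f}")
print("contains 0:", low <= 0 <= high)
```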
The values of the third gene are
[1] 1.187004 1.071003 1.318096
The mean is
[1] 1.192034
The standard deviation is
[1] 0.1236232
The quantile of the t distribution for 95% confidence and 2 degrees of freedom is
[1] 4.302653
Thus, the 95%-confidence interval for the expression, \(\bar{X} ± k⋅sd(X)/\sqrt{n}\), is
[1] 0.8849 1.4991
The interval does not contain 0, so it seems that the gene is differentially expressed
The quantile of the t distribution for 99.9% confidence and 2 degrees of freedom is
[1] 31.59905
Thus, the 99.9%-confidence interval for the expression is
[1] -1.0633 3.4474
Now the interval contains 0, so at this confidence level we cannot conclude that the gene is differentially expressed
In Case 2 (the third gene) we have different results depending on the confidence level
One can ask “What is the largest confidence level that will not include 0?”
In other words, what is the smallest α that will not include 0?
That is the 𝑝-value
The interval can be written as \[-k⋅sd(X)/\sqrt{n} ≤ \bar{X} - μ ≤ k⋅sd(X)/\sqrt{n}\]
In the limit case we have \[\bar{X}-μ = k⋅sd(X)/\sqrt{n}\] so \[k=\frac{\bar{X}-μ}{sd(X)/\sqrt{n}}\]
We are asking whether 0 is in the interval, so we take \(μ=0\)
In this case we have n=3, mean=1.192 and sd=0.1236, so \[k=\frac{1.192}{0.1236/\sqrt{3}} = 16.70\]
We use this value to find α, the total area of the two tails beyond \(±k\), in the table
[1] 0.003566
The best confidence level is 1-α=0.9964
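The 𝑝-value for the third gene can be sketched the same way, taking \(μ=0\) and counting the area of both tails. The closed-form tail area below holds only for 2 degrees of freedom, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t2_two_sided_alpha(k: float) -> float:
    """Area of both tails beyond -k and +k under Student's t with 2 degrees of freedom."""
    return 1 - abs(k) / sqrt(2 + k * k)


gene3 = [1.1870043, 1.0710029, 1.3180957]    # third row of the table
n = len(gene3)

# Limit case of the interval with mu = 0: k = mean / (sd / sqrt(n))
k = mean(gene3) / (stdev(gene3) / sqrt(n))
alpha = t2_two_sided_alpha(k)

print(f"k = {k:.2f}")
print(f"alpha (p-value) = {alpha:.6f}")
print(f"largest confidence level excluding 0 = {1 - alpha:.4f}")
```

Any confidence level above 1-α gives an interval that contains 0, and any level below it gives one that does not.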