In the cases we see here, we assume that the underlying distribution is Normal
(that is usually the case for averages)
We need to know the population mean and standard deviation
When we have a sample of size 𝑛, we calculate
and we use them to approximate the population values
Then we use Student’s t distribution
Genotype | Stress | logExpr | |
---|---|---|---|
1 | WT | Low | 4.439524 |
2 | WT | Low | 4.769823 |
3 | WT | Low | 6.558708 |
4 | WT | High | 7.070508 |
5 | WT | High | 7.129288 |
6 | WT | High | 8.715065 |
7 | KO | Low | 6.460916 |
8 | KO | Low | 4.734939 |
9 | KO | Low | 5.313147 |
10 | KO | High | 7.554338 |
11 | KO | High | 9.224082 |
12 | KO | High | 8.359814 |
We want to find the change in gene expression due to genotype and stress condition
A common approach is to model gene expression as \[\text{Gene expression} = β_0 + β_1⋅\text{Genotype} + β_2⋅\text{Stress}\]
Here Genotype and Stress are either 0 or 1, depending on the case, and we want to find \(β_0, β_1,\) and \(β_2\) (the coefficients)
(Intercept) | GenotypeKO | StressHigh | |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 1 |
5 | 1 | 0 | 1 |
6 | 1 | 0 | 1 |
7 | 1 | 1 | 0 |
8 | 1 | 1 | 0 |
9 | 1 | 1 | 0 |
10 | 1 | 1 | 1 |
11 | 1 | 1 | 1 |
12 | 1 | 1 | 1 |
Most people use the R package limma for gene expression
Here we show a simple version, with one single gene
Call:
lm(formula = logExpr ~ Genotype + Stress)
Coefficients:
(Intercept) GenotypeKO StressHigh
5.1325 0.4941 2.6293
Estimated coefficients are the averages from the sample
Call:
lm(formula = logExpr ~ Genotype + Stress)
Residuals:
Min 1Q Median 3Q Max
-0.8916 -0.6917 -0.3380 0.8641 1.4262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1325 0.4553 11.273 1.31e-06 ***
GenotypeKO 0.4941 0.5257 0.940 0.371872
StressHigh 2.6293 0.5257 5.001 0.000738 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9106 on 9 degrees of freedom
Multiple R-squared: 0.7421, Adjusted R-squared: 0.6848
F-statistic: 12.95 on 2 and 9 DF, p-value: 0.002247
Estimated coefficients follow a t distribution
Estimated coefficients follow a t distribution
We have 12 values, and we evaluate 3 coefficients
Thus, we have 12-3=9 degrees of freedom
Then we can use the 9 DF table of Student’s t to get a confidence interval
2.5 % 97.5 %
(Intercept) 4.1025560 6.162410
GenotypeKO -0.6952039 1.683310
StressHigh 1.4400824 3.818597
2.5 % 97.5 %
(Intercept) 4.1025560 6.162410
GenotypeKO -0.6952039 1.683310
StressHigh 1.4400824 3.818597
(Intercept)
is the base expression level for Wild Type and Low Stress
GenotypeKO
is the differential expression due to gene KO. It can be 0, so we do not have enough evidence to say “there is an effect”
StressHigh
is the differential expression due to stress. It seems greater than 0 at 95% confidence. Will still be >0 at 99% confidence?
Call:
lm(formula = logExpr ~ Genotype + Stress)
Residuals:
Min 1Q Median 3Q Max
-0.8916 -0.6917 -0.3380 0.8641 1.4262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1325 0.4553 11.273 1.31e-06 ***
GenotypeKO 0.4941 0.5257 0.940 0.371872
StressHigh 2.6293 0.5257 5.001 0.000738 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9106 on 9 degrees of freedom
Multiple R-squared: 0.7421, Adjusted R-squared: 0.6848
F-statistic: 12.95 on 2 and 9 DF, p-value: 0.002247
There are several approaches
The basic difference is the trade-off between False Positives and False Negatives
In every hypothesis test, we can be wrong in two ways
Usually improving one means worsening of the other
Family Wise Error Rate multiplies each p-value by the number of cases \[p.adj = p.value ⋅ N\]
It reduces False Positives and increases False Negatives
Sometimes we get nothing significant
FDR sorts the p-values and multiplies each by an increasing value \[\text{p.adj} = \text{p-value} \cdot\frac{i}{N}\]
If we get \(p.adj<0.05\) then the probability that it is a false positive is 5%
At this point we have a list of DE genes
Let’s say they are 10% of all genes
Each gene is part of one or more networks
Beyond pathways, it is useful to look at “Gene Ontology”
It has three branches
To find the relevant networks, we use Enrichment Analysis
For example, let’s say we have 10000 genes in total, and 1000 DE genes
Let’s say that carbon metabolism takes 3000 genes in total
If, among the DE genes, 300 genes are in carbon metabolism, we will not be surprised
3K carbon metabolism in 10K total genes is 30%
If we take 1K random genes, we expect to have 30% of them in carbon metabolism, just by randomness
If instead we have 600 carbon metabolism among the 1K DE genes, we are surprised
In that case we say that the set of carbon metabolism is enriched in the DE experiment
This problems follows an hypergeometric distribution
\[ℙ(k \text{ observed sucesses} | N,K,n)\]
\[ℙ(k \text{ DE genes in group} | N,K,n)\]