Raw data is light intensity (luminescence) for Control and Treatment \[C, T\]
We work with the logarithm (base 2) of these values \[LC=\log_2(C)\qquad LT=\log_2(T)\]
Then Average expression is \[AvgExp = \frac{LT + LC}{2}\] and Fold change is \[logFC = LT - LC=\log_2\left(\frac{T}{C}\right)\]
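These definitions are easy to check numerically. A minimal sketch with made-up luminescence values (all numbers hypothetical):

```python
import numpy as np

# Hypothetical luminescence readings for one gene, three replicates each
C = np.array([120.0, 135.0, 110.0])   # Control
T = np.array([480.0, 510.0, 450.0])   # Treatment

LC = np.log2(C)
LT = np.log2(T)

AvgExp = (LT + LC) / 2    # average expression
logFC = LT - LC           # log2 fold change

# The two ways of writing logFC agree: LT - LC == log2(T / C)
assert np.allclose(logFC, np.log2(T / C))
print(AvgExp, logFC)
```

Note that a logFC of 2 means the treatment signal is 2² = 4 times the control signal.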
Classical statistics. We have two scenarios, called hypotheses
But our experiment cannot tell us directly which one is true
Every experiment has 4 contributions
We want to test the hypothesis “black horses are taller or shorter than white horses”
We may have bad luck: maybe black and white horses have the same average height, but, by chance, the horses in our sample suggest otherwise
Therefore, we cannot be 100% sure that our results correspond to reality
But we can have a degree of confidence that we are not far away
In terms of hypothesis test, we have
The p-value is the probability of observing data at least as extreme as our experimental result \(X\), assuming that H0 is true \[ℙ(X|H_0)\]
Notice that \[ℙ(X|H_0)≠ℙ(H_0|X)\] In other words, the p-value is not the probability that the null hypothesis is true, given the experimental result
Ideally we would like to know this last probability, but it is hard to compute
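The meaning of \(ℙ(X|H_0)\) can be illustrated by simulation: generate many experiments in which H0 is true and count how often the result is at least as extreme as the one we observed. A sketch with invented numbers (group size, noise level, and observed difference are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed difference in mean height (cm) between two groups
observed_diff = 4.0
n, sigma = 10, 8.0            # assumed group size and noise level

# Simulate the difference of group means under H0 (true difference = 0)
sims = 100_000
diffs = (rng.normal(0, sigma, (sims, n)).mean(axis=1)
         - rng.normal(0, sigma, (sims, n)).mean(axis=1))

# Two-sided p-value: fraction of H0 experiments at least as extreme
p_value = np.mean(np.abs(diffs) >= observed_diff)
print(p_value)
```

Here a p-value around 0.26 would mean that, even with no real difference, about one experiment in four shows a gap of 4 cm or more.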
Under the null hypothesis, the height difference is 0
If we can also assume that the noise and variability follow a Normal distribution \(N(0,σ^2)\), we have
We have another problem. We do not know σ²
A clever biologist found a solution
We measure the variance in our data and we use it
But we have to pay a price: We have less confidence
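Estimating σ² from the data is exactly what Student's t-test does; the t distribution has heavier tails than the Normal, and that is the "less confidence" we pay for. A sketch with invented height data, using scipy's `ttest_ind`:

```python
import numpy as np
from scipy import stats

# Hypothetical heights (cm) of two small groups of horses
black = np.array([158.0, 162.5, 171.0, 155.0, 166.0])
white = np.array([150.0, 149.5, 160.0, 152.0, 157.0])

# Welch's t-test: the variance is estimated from the samples themselves
t_stat, p_value = stats.ttest_ind(black, white, equal_var=False)
print(t_stat, p_value)

# For the same statistic, the t distribution gives a larger p-value
# than the Normal would: the price of not knowing sigma^2
p_normal = 2 * stats.norm.sf(abs(t_stat))
```

With only a handful of animals per group, the gap between `p_value` and `p_normal` is noticeable; it shrinks as the sample size grows.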
Traditionally, 5% and 1% are used as p-value thresholds
But there is nothing to decide that these are good values
Indeed, in Gene Expression these values are usually too large, because we test thousands of genes at the same time
There are several approaches
The basic difference is the trade-off between False Positives and False Negatives
In every hypothesis test, we can be wrong in two ways
Usually improving one means worsening the other
The Family-Wise Error Rate (Bonferroni) correction multiplies each p-value by the number of tests \[p.adj = p.value \cdot N\]
It reduces False Positives and increases False Negatives
Sometimes we get nothing significant
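A minimal sketch of the FWER adjustment above, with hypothetical p-values (adjusted values are capped at 1, since they are probabilities):

```python
import numpy as np

p_values = np.array([0.001, 0.01, 0.03, 0.2])   # hypothetical raw p-values
N = len(p_values)

# FWER / Bonferroni: multiply every p-value by N, cap at 1
p_adj = np.minimum(p_values * N, 1.0)
print(p_adj.tolist())
```

With N in the thousands, as in a typical gene-expression experiment, only very small raw p-values survive this correction, which is why we sometimes get nothing significant.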
FDR (Benjamini–Hochberg) sorts the p-values and multiplies the \(i\)-th smallest by a decreasing factor \[p.adj = p.value \cdot\frac{N}{i}\]
If we select the cases with \(p.adj<0.05\), then we expect that around 5% of them are false positives
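The standard Benjamini–Hochberg procedure multiplies the \(i\)-th smallest p-value by \(N/i\) and then enforces monotonicity, so that adjusted values never decrease as raw p-values grow. A sketch with hypothetical p-values:

```python
import numpy as np

p_values = np.array([0.001, 0.01, 0.03, 0.2])   # hypothetical raw p-values
N = len(p_values)

order = np.argsort(p_values)        # indices that sort the p-values
ranked = p_values[order]
i = np.arange(1, N + 1)             # rank of each sorted p-value

# Scale the i-th smallest p-value by N / i, then enforce monotonicity
# by taking running minima from the largest rank downwards
adj = ranked * N / i
adj = np.minimum.accumulate(adj[::-1])[::-1]
adj = np.minimum(adj, 1.0)

# Put the adjusted values back in the original order
p_adj = np.empty(N)
p_adj[order] = adj
print(p_adj.tolist())
```

Compare with the FWER sketch: the smallest p-value is corrected just as harshly (factor N), but later ones are corrected more gently, which is the trade-off that makes FDR less prone to False Negatives.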