More precisely, mRNA concentration
We want to know
Measuring protein concentration is hard
We assume that protein concentration is proportional to mRNA concentration
If you have primers for each gene
Raw data: CT (cycle threshold) value for each gene/condition, and CT value for the calibration reference
Southern/Northern/Western blot can detect, but not quantify
(I think so. I’m not a biologist)
Instead, we have macro- and microarrays
Raw data: Light intensity (luminescence) at one or more wavelengths
This is measured in arbitrary units, and is a number between 0 and 65535
(that is, a 16-bit value)
Compare each gene with itself in a different condition
In other words, evaluate differential expression
mRNA is reverse-transcribed and fragmented.
Fragments are sequenced. Reads are aligned to the reference genome
Raw data: SAM/BAM file with location of each read in the reference genome
Processed data: Number of reads per gene
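To illustrate the step from aligned reads to counts, here is a minimal Python sketch. The gene coordinates and SAM records are made up for the example, and a read is assigned to a gene only by its leftmost position. Real counting tools such as featureCounts or htseq-count also handle strands, spliced reads and overlapping genes.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Made-up SAM records. Only the columns FLAG, RNAME and POS are used here.
SAM_TEXT = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t120\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t16\tchr1\t450\t60\t50M\t*\t0\t0\t*\t*",
    "read3\t0\tchr1\t900\t60\t50M\t*\t0\t0\t*\t*",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",          # unmapped read (flag 4)
    "read5\t0\tchr1\t600\t60\t50M\t*\t0\t0\t*\t*",  # mapped between the genes
])


def count_reads_per_gene(sam_text, genes):
    """Count mapped reads whose leftmost position falls inside each gene."""
    counts = Counter({name: 0 for name in genes})
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4:
            continue  # bit 4 of FLAG marks an unmapped read
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)


counts = count_reads_per_gene(SAM_TEXT, GENES)
print(counts)
```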
Solution 1: Normalization
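Reads per kilobase per million mapped reads (RPKM): \[\text{RPKM} = \frac{\text{reads mapped to the gene} \times 10^9}{\text{gene length in bp} \times \text{total mapped reads}}\]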
FPKM is the same, but with fragments instead of reads
Transcripts per million (TPM): similar to RPKM and FPKM
Just a different order of operations: divide by gene length first, then scale so that the values add up to one million
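The difference in the order of operations can be seen in a small Python sketch. The read counts and gene lengths below are invented for illustration, and the function names `rpkm` and `tpm` are just local helpers, not part of any library.

```python
# Hypothetical read counts and gene lengths (in base pairs) for one sample.
counts = {"geneA": 200, "geneB": 300, "geneC": 500}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}


def rpkm(counts, lengths_bp):
    """Scale by sequencing depth first, then by gene length."""
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / (lengths_bp[g] / 1e3) for g in counts}


def tpm(counts, lengths_bp):
    """Scale by gene length first, then make the sample add up to one million."""
    rate = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: rate[g] / scale for g in counts}


rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for gene in counts:
    print(gene, round(rpkm_values[gene], 1), round(tpm_values[gene], 1))

# TPM values always add up to one million within a sample; RPKM values do not.
print(round(sum(tpm_values.values())), round(sum(rpkm_values.values())))
```

Within one sample the two measures are proportional to each other. Because TPM always sums to the same total, the values are easier to compare between samples.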
Compare each gene with itself in a different condition
In other words, evaluate differential expression
Let’s say we have two conditions: wild type (WT) and mutant (M)
Let’s take wild type as the base condition.
For a given gene G, we want to calculate \[\frac{\text{Expression}(M)}{\text{Expression}(WT)}\]
This would be a number between 0 and ∞
When the gene is under-expressed, we get something between 0 and 1
When it is over-expressed, we get something between 1 and ∞
It is better to have a symmetrical result \[\log_2 \left(\frac{\text{Expression}(M)}{\text{Expression}(WT)}\right)\]
We use base 2, so the result counts doublings: +1 means twice the expression, −1 means half
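A short Python sketch shows the symmetry. The expression values are arbitrary numbers chosen for the example, and `log2_fold_change` is a helper name introduced here.

```python
import math


def log2_fold_change(mutant, wild_type):
    """log2 of the ratio mutant / wild type."""
    return math.log2(mutant / wild_type)


# Doubling and halving give ratios 2 and 0.5, which are not symmetric around 1.
# Their log2 values, +1 and -1, are symmetric around 0.
doubled = log2_fold_change(200.0, 100.0)
halved = log2_fold_change(50.0, 100.0)
unchanged = log2_fold_change(100.0, 100.0)

print(doubled, halved, unchanged)
```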
Gene Expression Omnibus
Let’s analyze gene_expression.csv
Each column is a gene, each row is a sample, taken from NCBI GSE95670. The values are absolute expression, we want to know
We can add a column with a factor describing the condition
We want to measure fold-change. That means, we want to calculate \(\log_2(\text{mutant}/\text{wild type})\)
Call:
lm(formula = log2(LOC100652730) ~ condition, data = m)
Coefficients:
    (Intercept)  conditionMutant
         7.2767           0.2465
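With only two conditions, the coefficients of this kind of model have a direct reading: the intercept is the mean log2 expression of the base condition, and the `conditionMutant` coefficient is the difference of the two group means, that is, the log2 fold-change. The following Python sketch checks this with least squares on a 0/1 dummy variable. The expression values are hypothetical, not the values from gene_expression.csv.

```python
import math
from statistics import mean

# Hypothetical expression values, three samples per condition.
wild_type = [128.0, 160.0, 150.0]
mutant = [256.0, 300.0, 280.0]


def fit_two_groups(base, other):
    """Least-squares fit of log2(expression) on a 0/1 condition indicator."""
    y = [math.log2(v) for v in base + other]
    x = [0.0] * len(base) + [1.0] * len(other)  # 0 = base, 1 = other condition
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_two_groups(wild_type, mutant)
mean_wt = mean(math.log2(v) for v in wild_type)
mean_mut = mean(math.log2(v) for v in mutant)

print(intercept, mean_wt)          # intercept equals the wild-type mean
print(slope, mean_mut - mean_wt)   # slope equals the log2 fold-change
```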
Call:
lm(formula = log2(LOC100652730) ~ condition, data = m)
Residuals:
      1       2       3       4       5       6
 0.5004 -0.2829 -0.2175  0.3705 -0.2239 -0.1466
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       7.2767     0.2211  32.912 5.08e-06 ***
conditionMutant   0.2465     0.3127   0.788    0.475
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.383 on 4 degrees of freedom
Multiple R-squared: 0.1344, Adjusted R-squared: -0.08195
F-statistic: 0.6213 on 1 and 4 DF, p-value: 0.4747
                     2.5 %   97.5 %
(Intercept)      6.6628636 7.890598
conditionMutant -0.6216828 1.114595
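Check if the confidence interval contains 0: if it does, the data are compatible with a log2 fold-change of 0, that is, no change in expression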
2.5 % 97.5 %
conditionMutant 0.5694141 3.608096
2.5 % 97.5 %
conditionMutant -1.433264 -0.2166465
2.5 % 97.5 %
conditionMutant 4.45331 8.712368
2.5 % 97.5 %
conditionMutant -4.362937 -2.184466
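The check can be sketched in Python using the numbers reported above. The first part rebuilds the 95% interval for LOC100652730 as estimate ± t × standard error, with the t quantile for 4 degrees of freedom hard-coded as a known constant. The second part applies the "contains 0" rule to the five intervals shown. The function names and the labels are introduced here for illustration only.

```python
# 97.5% quantile of Student's t distribution with 4 degrees of freedom.
T_CRIT_DF4 = 2.776445


def conf_interval(estimate, std_error, t_crit=T_CRIT_DF4):
    """95% confidence interval: estimate plus or minus t * standard error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


def classify(lower, upper):
    """Label a log2 fold-change interval by its position relative to 0."""
    if lower > 0:
        return "over-expressed"
    if upper < 0:
        return "under-expressed"
    return "no significant change"


# Estimate and standard error of conditionMutant from the summary above.
low, high = conf_interval(0.2465, 0.3127)
print(round(low, 3), round(high, 3))

# The five conditionMutant intervals reported above, in order of appearance.
intervals = [
    (-0.6216828, 1.114595),
    (0.5694141, 3.608096),
    (-1.433264, -0.2166465),
    (4.45331, 8.712368),
    (-4.362937, -2.184466),
]
labels = [classify(lower, upper) for lower, upper in intervals]
for (lower, upper), label in zip(intervals, labels):
    print(lower, upper, label)
```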