The data for today is from the paper
“Putative MET30 and WD40/YVTN Proteins Regulate NaCl and High Temperature Stress in Legumes”
by Haluk Çelik, Andres Aravena and Neslihan Turgut Kara
submitted this year to Frontiers in Plant Science, section Plant Abiotic Stress
To identify genes involved in early response to Salt and/or High Temperature stress in legumes
In particular, we selected 3 legume species and identified in them two clusters of ortholog of genes containing both an F-Box domain and a WD40 domain
We tested if these genes have an early response to two separate stresses
Three legumes:
Two Genes, named according to A. thaliana ortholog:
Today we will see an application using R
Next class we will do the same in Excel
It has a new way of writing complex formulas
Instead of writing this
we can (if we want) write this
We read “do the linear model AND THEN do summary”
primers represent all systematic technical distortions
\[ logFC(organism, time, tissue, stress, gene, replica) \]
The formula corresponds to the data table
Organism | Time | Tissue | Stress | Gene | Replica | logFC |
---|---|---|---|---|---|---|
P. vulgaris | Control | Root | Heat | MET30 | Rep1 | 7.588 |
P. vulgaris | Control | Root | Heat | MET30 | Rep2 | 7.044 |
P. vulgaris | Control | Root | Heat | MET30 | Rep3 | 7.282 |
P. vulgaris | Control | Stem | Heat | MET30 | Rep1 | 5.801 |
P. vulgaris | Control | Stem | Heat | MET30 | Rep2 | 5.172 |
P. vulgaris | Control | Stem | Heat | MET30 | Rep3 | 5.348 |
(we show only the first lines)
We compare one gene’s expression in Root for two conditions: Control and 1 Hour. First, we isolate the data
Organism Time Tissue Stress Gene Replica logFC
1 P. vulgaris Control Root Heat MET30 Rep1 7.588
2 P. vulgaris Control Root Heat MET30 Rep2 7.044
3 P. vulgaris Control Root Heat MET30 Rep3 7.282
10 P. vulgaris 1 hour Root Heat MET30 Rep1 4.156
11 P. vulgaris 1 hour Root Heat MET30 Rep2 3.612
12 P. vulgaris 1 hour Root Heat MET30 Rep3 3.804
Welch Two Sample t-test
data: hs_1$logFC[hs_1$Time == "Control"] and hs_1$logFC[hs_1$Time == "1 hour"]
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean of x mean of y
7.304667 3.857333
Welch Two Sample t-test
data: logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control mean in group 1 hour
7.304667 3.857333
Welch Two Sample t-test
data: logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control mean in group 1 hour
7.304667 3.857333
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "2 hours")
Residuals:
1 2 3 10 11 12
0.28333 -0.26067 -0.02267 0.29867 -0.24533 -0.05333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.1584 46.12 1.32e-06 ***
Time1 hour -3.4473 0.2240 -15.39 0.000104 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2743 on 4 degrees of freedom
Multiple R-squared: 0.9834, Adjusted R-squared: 0.9792
F-statistic: 236.9 on 1 and 4 DF, p-value: 0.000104
The p-values are the same
Welch Two Sample t-test
data: logFC by Time
t = 15.576, df = 2.6234, p-value = 0.00117
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.140307 8.073026
sample estimates:
mean in group Control mean in group 2 hours
7.304667 0.698000
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "1 hour")
Residuals:
1 2 3 19 20 21
0.28333 -0.26067 -0.02267 0.74400 -0.59600 -0.14800
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.2999 24.36 1.69e-05 ***
Time2 hours -6.6067 0.4241 -15.58 9.92e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5195 on 4 degrees of freedom
Multiple R-squared: 0.9838, Adjusted R-squared: 0.9797
F-statistic: 242.6 on 1 and 4 DF, p-value: 9.918e-05
This time the p values are different. Why?
We do not know the population variance
We estimate it using our data
If we assume that all values come from the same population, then we can use this extra information to get a better standard error
(we are dividing by a larger \(n\))
Two Sample t-test
data: logFC by Time
t = 15.576, df = 4, p-value = 9.918e-05
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.429051 7.784282
sample estimates:
mean in group Control mean in group 2 hours
7.304667 0.698000
Now the p-values are the same as in the linear model
If we can assume that all results come from the same distribution, then we can use a linear model and calculate all at the same time
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root")
Residuals:
Min 1Q Median 3Q Max
-0.59600 -0.24533 -0.05333 0.28333 0.74400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.2616 27.925 1.40e-07 ***
Time1 hour -3.4473 0.3699 -9.319 8.65e-05 ***
Time2 hours -6.6067 0.3699 -17.859 1.98e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4531 on 6 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9754
F-statistic: 159.6 on 2 and 6 DF, p-value: 6.283e-06
Under the same hypotheses we can do a one-way ANOVA
Df Sum Sq Mean Sq F value Pr(>F)
Time 2 65.51 32.76 159.6 6.28e-06 ***
Residuals 6 1.23 0.21
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that the F value and p value are the same as in the last line of the linear model (in previous slide)
The linear model summary gives us
Each p-value evaluates different null hypotheses