# A tibble: 129 x 2
distance time
<dbl> <dbl>
1 0 0
2 -0.00237 0.00417
3 0 0.00834
4 0 0.0125
5 0 0.0167
6 0 0.0208
7 0 0.0250
8 -0.0119 0.0292
9 -0.0332 0.0333
10 -0.0380 0.0375
# … with 119 more rows
Call:
lm(formula = distance ~ time, data = fall)
Coefficients:
(Intercept) time
0.2235 -2.8614
time
2 to the formula(Intercept) time time_sq
-0.0154099 -0.1528996 -5.0772634
The model is \[distance = A + B\cdot time + C\cdot time^2\]
𝐴: distance at time 0
𝐵: speed at time 0
𝐶: half of acceleration, 𝑔/2
# A tibble: 1 x 3
distance time time_sq
<dbl> <dbl> <dbl>
1 0 0 0
So \(distance=0\) when \(time=0\)
This is how we designed the experiment
We want to use another formula in the model \[distance = 0 + B\cdot time + C\cdot time^2\]
model_3 <- lm(distance ~ time + time_sq + 0, data=fall)
ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") +
geom_line(aes(y=predict(model_3)), color=3, width=2.5)
Warning: Ignoring unknown parameters: width
Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)
Coefficients:
time time_sq
-0.268 -4.898
time time_sq
-0.2679933 -4.8981729
2.5 % 97.5 %
time -0.2956294 -0.2403571
time_sq -4.9647940 -4.8315517
Or, in a more professional way, using 3 significant digits
2.5 % 97.5 %
time -0.296 -0.24
time_sq -4.960 -4.83
To know why, you must take my course “Methodology of Scientific Research”
summary()
gives more detail
Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)
Residuals:
Min 1Q Median 3Q Max
-0.0316174 -0.0116885 -0.0008487 0.0077115 0.0193887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
time -0.26799 0.01397 -19.19 <2e-16 ***
time_sq -4.89817 0.03367 -145.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01224 on 127 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
F-statistic: 2.155e+05 on 2 and 127 DF, p-value: < 2.2e-16
students
datastudents <- read_tsv("students2018-2020-tidy.tsv",
col_types = cols(sex = col_factor(levels = c("Female", "Male")),
handedness = col_factor(levels = c("Left", "Right"))))
# A tibble: 117 x 10
answer_date id english_level sex birthdate birthplace height_cm
<date> <chr> <chr> <fct> <date> <chr> <dbl>
1 2018-09-17 3e50… I can speak … Male 1993-02-01 -/Turkey 179
2 2018-09-17 479d… I can unders… Fema… 1998-05-21 Kahramanm… 168
3 2018-09-17 39df… I can read a… Fema… 1998-01-18 Batman/Tu… NA
4 2018-09-17 d2b0… I can read a… Male 1998-08-29 Antalya/T… 170
5 2018-09-17 f22b… I can read a… Fema… 1998-05-03 Izmir/Tur… 162
6 2018-09-17 849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu… 167
7 2018-09-17 8381… I can speak … Fema… 1997-09-19 Adıyaman/… 174
8 2018-09-17 b0dd… I can read a… Male 1997-11-27 Bursa/Tur… 180
9 2018-09-17 2972… I can read a… Fema… 1999-01-02 Istanbul/… 162
10 2018-09-17 72c0… I can read a… Fema… 1998-10-02 Istanbul/… 172
# … with 107 more rows, and 3 more variables: weight_kg <dbl>,
# handedness <fct>, hand_span <dbl>
NA
valuesIf any column has NA
, then all row is omitted
# A tibble: 96 x 10
answer_date id english_level sex birthdate birthplace height_cm
<date> <chr> <chr> <fct> <date> <chr> <dbl>
1 2018-09-17 3e50… I can speak … Male 1993-02-01 -/Turkey 179
2 2018-09-17 479d… I can unders… Fema… 1998-05-21 Kahramanm… 168
3 2018-09-17 d2b0… I can read a… Male 1998-08-29 Antalya/T… 170
4 2018-09-17 f22b… I can read a… Fema… 1998-05-03 Izmir/Tur… 162
5 2018-09-17 849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu… 167
6 2018-09-17 8381… I can speak … Fema… 1997-09-19 Adıyaman/… 174
7 2018-09-17 b0dd… I can read a… Male 1997-11-27 Bursa/Tur… 180
8 2018-09-17 2972… I can read a… Fema… 1999-01-02 Istanbul/… 162
9 2018-09-17 72c0… I can read a… Fema… 1998-10-02 Istanbul/… 172
10 2018-09-17 d292… I can read a… Male 1997-05-18 Van/Turkey 181
# … with 86 more rows, and 3 more variables: weight_kg <dbl>, handedness <fct>,
# hand_span <dbl>
(Intercept) sexMale
59.85484 18.17457
Here (Intercept)
is “standard weight”, and sexMale
is the extra weight when sex=='Male'
(on average)
\[\text{weight_kg} = \begin{cases} A + 0 & \text{if sex is “Female”}\\ A + B & \text{if sex is “Male”} \end{cases}\]
The model is
\[\text{weight_kg} = A + B\cdot\text{[sex is “Male”]}\]
where \(\text{[sex is “Male”]}\) is 1 when sex=="Male"
, and is 0 when sex=="Female"
\[\text{[sex is “Male”]} = \begin{cases} 0 & \text{if sex is “Female”}\\ 1 & \text{if sex is “Male”} \end{cases}\]
sexFemale sexMale
59.85484 78.02941
Now the formula includes +0
. This means “do not use intercept”.
Now
sexFemale
is the average weight when sex=="Female"
sexMale
is the average weight when sex=="Male"
Let’s analyze gene_expression.csv
id LOC100652730 IZUMO4 RHOU STC1 PDGFC
D D 260.2225 207.44155 227.0810 574.453525 432.0676
D2 D2 151.1980 301.53198 154.6539 817.333091 203.0373
D3 D3 158.2063 293.09798 143.2183 877.628619 183.1862
K K 200.4629 125.21024 333.8940 18.971249 2578.1927
K2 K2 132.7736 49.79009 278.8245 7.745125 2391.0307
K3 K3 140.0867 38.20547 300.3375 3.183789 2358.1266
Each column is a gene, each row is a microarray (or RNAseq)
These values are real, taken from NCBI GSE95670
The values are absolute expression, we want to know
We can add a column with a factor describing the condition
We want to measure fold-change. That means, we want to calculate log_2(mutant/wild type)
(Intercept) conditionMutant
7.2767306 0.2464562
2.5 % 97.5 %
conditionMutant -0.6216828 1.114595
Check if the confidence interval contains 0
2.5 % 97.5 %
conditionMutant 0.5694141 3.608096
2.5 % 97.5 %
conditionMutant -1.433264 -0.2166465
2.5 % 97.5 %
conditionMutant 4.45331 8.712368
2.5 % 97.5 %
conditionMutant -4.362937 -2.184466
Let’s start with the simplest model
Call:
lm(formula = weight_kg ~ height_cm, data = students)
Coefficients:
(Intercept) height_cm
-73.6833 0.8215
Obviously the weight also depends on the sex
Call:
lm(formula = weight_kg ~ height_cm + sex, data = students)
Coefficients:
(Intercept) height_cm sexMale
12.8319 0.2843 14.1486
There are two lines with the same slope: is the coefficient for height_cm
The “intercept” of each line are the coefficients sexFemale
and sexMale
Call:
lm(formula = weight_kg ~ height_cm:sex + sex + 0, data = students)
Coefficients:
sexFemale sexMale height_cm:sexFemale
71.30494 -23.12179 -0.06924
height_cm:sexMale
0.56342
What do you see?
+0
in the formula:
to indicate interaction*
in formulas