Class 28: Linear models with factors

Computing for Molecular Biology 1

Andrés Aravena, PhD

18 January 2021

We read the data

library(readr)
library(dplyr)
library(ggplot2)
fall <- read_csv("free-fall.csv")
fall
# A tibble: 129 x 2
   distance    time
      <dbl>   <dbl>
 1  0       0      
 2 -0.00237 0.00417
 3  0       0.00834
 4  0       0.0125 
 5  0       0.0167 
 6  0       0.0208 
 7  0       0.0250 
 8 -0.0119  0.0292 
 9 -0.0332  0.0333 
10 -0.0380  0.0375 
# … with 119 more rows

Let’s take a quick look

plot(distance ~ time, data=fall)

Use the same formula for the first model

model_1 <- lm(distance ~ time, data=fall)
model_1

Call:
lm(formula = distance ~ time, data = fall)

Coefficients:
(Intercept)         time  
     0.2235      -2.8614  

Compare experiment and model

ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_1)), color=1)

Adding time2 to the formula

fall <- mutate(fall, time_sq=time*time)
model_2 <- lm(distance ~ time + time_sq, data=fall)
ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_2)), color=2)

The result are the coefficients

coef(model_2)
(Intercept)        time     time_sq 
 -0.0154099  -0.1528996  -5.0772634 

The model is \[distance = A + B\cdot time + C\cdot time^2\]

𝐴: distance at time 0
𝐵: speed at time 0
𝐶: half of acceleration, 𝑔/2

But we know that 𝐴=0

slice_head(fall, n=1)
# A tibble: 1 x 3
  distance  time time_sq
     <dbl> <dbl>   <dbl>
1        0     0       0

So \(distance=0\) when \(time=0\)

This is how we designed the experiment

We want to use another formula in the model \[distance = 0 + B\cdot time + C\cdot time^2\]

Forcing A=0 in the model

model_3 <- lm(distance ~ time + time_sq + 0, data=fall)
ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_3)), color=3, width=2.5)
Warning: Ignoring unknown parameters: width

Result

model_3

Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)

Coefficients:
   time  time_sq  
 -0.268   -4.898  
coef(model_3)
      time    time_sq 
-0.2679933 -4.8981729 

Confidence interval

confint(model_3)
             2.5 %     97.5 %
time    -0.2956294 -0.2403571
time_sq -4.9647940 -4.8315517

Or, in a more professional way, using 3 significant digits

confint(model_3) %>% signif(digits=3)
         2.5 % 97.5 %
time    -0.296  -0.24
time_sq -4.960  -4.83

To know why, you must take my course “Methodology of Scientific Research”

summary() gives more detail

summary(model_3)

Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0316174 -0.0116885 -0.0008487  0.0077115  0.0193887 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
time    -0.26799    0.01397  -19.19   <2e-16 ***
time_sq -4.89817    0.03367 -145.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01224 on 127 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997 
F-statistic: 2.155e+05 on 2 and 127 DF,  p-value: < 2.2e-16

Modeling Using Factors

With students data

students <- read_tsv("students2018-2020-tidy.tsv",
  col_types = cols(sex = col_factor(levels = c("Female", "Male")),
            handedness = col_factor(levels = c("Left", "Right"))))
# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <fct> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman/Tu…        NA
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
# … with 107 more rows, and 3 more variables: weight_kg <dbl>,
#   handedness <fct>, hand_span <dbl>

Let’s drop all NA values

If any column has NA, then all row is omitted

students <- na.omit(students)
students
# A tibble: 96 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <fct> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 4 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 5 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 6 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 7 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 8 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
 9 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
10 2018-09-17  d292… I can read a… Male  1997-05-18 Van/Turkey       181
# … with 86 more rows, and 3 more variables: weight_kg <dbl>, handedness <fct>,
#   hand_span <dbl>

Boys are heavier than girls

plot(weight_kg ~ sex,
    data=students)

lm(weight_kg ~ sex,
    data=students)

Call:
lm(formula = weight_kg ~ sex, data = students)

Coefficients:
(Intercept)      sexMale  
      59.85        18.17  

Models with factors

model_A <- lm(weight_kg ~ sex, data = students)
coef(model_A)
(Intercept)     sexMale 
   59.85484    18.17457 

Here (Intercept) is “standard weight”, and sexMale is the extra weight when sex=='Male' (on average)

\[\text{weight_kg} = \begin{cases} A + 0 & \text{if sex is “Female”}\\ A + B & \text{if sex is “Male”} \end{cases}\]

In other words …

The model is

\[\text{weight_kg} = A + B\cdot\text{[sex is “Male”]}\]

where \(\text{[sex is “Male”]}\) is 1 when sex=="Male", and is 0 when sex=="Female"

\[\text{[sex is “Male”]} = \begin{cases} 0 & \text{if sex is “Female”}\\ 1 & \text{if sex is “Male”} \end{cases}\]

Model with factor, without intercept

model_B <- lm(weight_kg ~ sex + 0, data = students)
coef(model_B)
sexFemale   sexMale 
 59.85484  78.02941 

Now the formula includes +0. This means “do not use intercept”.

Now

  • sexFemale is the average weight when sex=="Female"
  • sexMale is the average weight when sex=="Male"

Application: gene expression analysis

Let’s analyze gene_expression.csv

   id LOC100652730    IZUMO4     RHOU       STC1     PDGFC
D   D     260.2225 207.44155 227.0810 574.453525  432.0676
D2 D2     151.1980 301.53198 154.6539 817.333091  203.0373
D3 D3     158.2063 293.09798 143.2183 877.628619  183.1862
K   K     200.4629 125.21024 333.8940  18.971249 2578.1927
K2 K2     132.7736  49.79009 278.8245   7.745125 2391.0307
K3 K3     140.0867  38.20547 300.3375   3.183789 2358.1266

Each column is a gene, each row is a microarray (or RNAseq)

Experiment description

These values are real, taken from NCBI GSE95670

The values are absolute expression, we want to know

  • The fold-change in expression
  • If that change is real, or just noise

Factors indicate the condition

  • The first three rows are mutants
  • The last three rows are wild-type, or control

We can add a column with a factor describing the condition

m <- mutate(m, condition=factor(
    rep(c("Mutant", "Control"), c(3, 3)),
    levels=c("Control","Mutant"))
)

Now we can do a model for each gene

We want to measure fold-change. That means, we want to calculate log_2(mutant/wild type)

gene_model <- lm(log2(LOC100652730) ~ condition, data=m)
coef(gene_model)
    (Intercept) conditionMutant 
      7.2767306       0.2464562 
confint(gene_model, parm = "conditionMutant")
                     2.5 %   97.5 %
conditionMutant -0.6216828 1.114595

Check if the confidence interval contains 0

Other cases

lm(log2(IZUMO4) ~ condition, data=m) %>% confint(parm = "conditionMutant")
                    2.5 %   97.5 %
conditionMutant 0.5694141 3.608096
lm(log2(RHOU) ~ condition, data=m) %>% confint(parm = "conditionMutant")
                    2.5 %     97.5 %
conditionMutant -1.433264 -0.2166465
lm(log2(STC1) ~ condition, data=m) %>% confint(parm = "conditionMutant")
                  2.5 %   97.5 %
conditionMutant 4.45331 8.712368
lm(log2(PDGFC) ~ condition, data=m) %>% confint(parm = "conditionMutant")
                    2.5 %    97.5 %
conditionMutant -4.362937 -2.184466

Combining factors and numeric

Weight v/s Height

Let’s start with the simplest model

model_4 <- lm(weight_kg ~ height_cm, data=students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_line(aes(y=predict(model_4)))

Interpretation

model_4

Call:
lm(formula = weight_kg ~ height_cm, data = students)

Coefficients:
(Intercept)    height_cm  
   -73.6833       0.8215  

Including the sex factor

Obviously the weight also depends on the sex

model_5 <- lm(weight_kg ~ height_cm + sex, data = students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_point(aes(y=predict(model_5)))

Interpretation

model_5

Call:
lm(formula = weight_kg ~ height_cm + sex, data = students)

Coefficients:
(Intercept)    height_cm      sexMale  
    12.8319       0.2843      14.1486  

There are two lines with the same slope: is the coefficient for height_cm

The “intercept” of each line are the coefficients sexFemale and sexMale

Using independent slopes for each sex

model_6 <- lm(weight_kg ~ height_cm:sex + sex + 0, data = students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_point(aes(y=predict(model_6)))

Interpretation

model_6

Call:
lm(formula = weight_kg ~ height_cm:sex + sex + 0, data = students)

Coefficients:
          sexFemale              sexMale  height_cm:sexFemale  
           71.30494            -23.12179             -0.06924  
  height_cm:sexMale  
            0.56342  

What do you see?

Summary

  • Factors can be used in formulas
  • The intercept can be taken out using +0 in the formula
  • Variables can have interactions (synergy)
    • Use : to indicate interaction
  • Homework: Learn about * in formulas