Class 28: Linear models with factors

Computing for Molecular Biology 1

Andrés Aravena, PhD

18 January 2021

We read the data

library(readr)
library(dplyr)
library(ggplot2)
fall <- read_csv("free-fall.csv")
fall

# A tibble: 129 x 2
   distance    time
      <dbl>   <dbl>
 1  0       0      
 2 -0.00237 0.00417
 3  0       0.00834
 4  0       0.0125 
 5  0       0.0167 
 6  0       0.0208 
 7  0       0.0250 
 8 -0.0119  0.0292 
 9 -0.0332  0.0333 
10 -0.0380  0.0375 
# … with 119 more rows

Let’s take a quick look

plot(distance ~ time, data=fall)

Use the same formula for the first model

model_1 <- lm(distance ~ time, data=fall)
model_1


Call:
lm(formula = distance ~ time, data = fall)

Coefficients:
(Intercept)         time  
     0.2235      -2.8614

Compare experiment and model

ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_1)), color=1)

Adding `time`² to the formula

fall <- mutate(fall, time_sq=time*time)
model_2 <- lm(distance ~ time + time_sq, data=fall)
ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_2)), color=2)

The result are the coefficients

coef(model_2)

(Intercept)        time     time_sq 
 -0.0154099  -0.1528996  -5.0772634

The model is \[distance = A + B\cdot time + C\cdot time^2\]

𝐴: distance at time 0
𝐵: speed at time 0
𝐶: half of acceleration, 𝑔/2

But we know that 𝐴=0

slice_head(fall, n=1)

# A tibble: 1 x 3
  distance  time time_sq
     <dbl> <dbl>   <dbl>
1        0     0       0

So \(distance=0\) when \(time=0\)

This is how we designed the experiment

We want to use another formula in the model \[distance = 0 + B\cdot time + C\cdot time^2\]

Forcing A=0 in the model

model_3 <- lm(distance ~ time + time_sq + 0, data=fall)
ggplot(fall, aes(x=time, y=distance)) + geom_point(color="steelblue") + 
    geom_line(aes(y=predict(model_3)), color=3, width=2.5)

Warning: Ignoring unknown parameters: width

Result

model_3


Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)

Coefficients:
   time  time_sq  
 -0.268   -4.898

coef(model_3)

      time    time_sq 
-0.2679933 -4.8981729

Confidence interval

confint(model_3)

             2.5 %     97.5 %
time    -0.2956294 -0.2403571
time_sq -4.9647940 -4.8315517

Or, in a more professional way, using 3 significant digits

confint(model_3) %>% signif(digits=3)

         2.5 % 97.5 %
time    -0.296  -0.24
time_sq -4.960  -4.83

To know why, you must take my course “Methodology of Scientific Research”

`summary()` gives more detail

summary(model_3)


Call:
lm(formula = distance ~ time + time_sq + 0, data = fall)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0316174 -0.0116885 -0.0008487  0.0077115  0.0193887 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
time    -0.26799    0.01397  -19.19   <2e-16 ***
time_sq -4.89817    0.03367 -145.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01224 on 127 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997 
F-statistic: 2.155e+05 on 2 and 127 DF,  p-value: < 2.2e-16

Modeling Using Factors

With `students` data

students <- read_tsv("students2018-2020-tidy.tsv",
  col_types = cols(sex = col_factor(levels = c("Female", "Male")),
            handedness = col_factor(levels = c("Left", "Right"))))

# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <fct> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman/Tu…        NA
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
# … with 107 more rows, and 3 more variables: weight_kg <dbl>,
#   handedness <fct>, hand_span <dbl>

Let’s drop all `NA` values

If any column has NA, then all row is omitted

students <- na.omit(students)
students

# A tibble: 96 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <fct> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 4 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 5 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 6 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 7 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 8 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
 9 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
10 2018-09-17  d292… I can read a… Male  1997-05-18 Van/Turkey       181
# … with 86 more rows, and 3 more variables: weight_kg <dbl>, handedness <fct>,
#   hand_span <dbl>

Boys are heavier than girls

plot(weight_kg ~ sex,
    data=students)

lm(weight_kg ~ sex,
    data=students)


Call:
lm(formula = weight_kg ~ sex, data = students)

Coefficients:
(Intercept)      sexMale  
      59.85        18.17

Models with factors

model_A <- lm(weight_kg ~ sex, data = students)
coef(model_A)

(Intercept)     sexMale 
   59.85484    18.17457

Here (Intercept) is “standard weight”, and sexMale is the extra weight when sex=='Male' (on average)

\[\text{weight_kg} = \begin{cases} A + 0 & \text{if sex is “Female”}\\ A + B & \text{if sex is “Male”} \end{cases}\]

In other words …

The model is

\[\text{weight_kg} = A + B\cdot\text{[sex is “Male”]}\]

where \(\text{[sex is “Male”]}\) is 1 when sex=="Male", and is 0 when sex=="Female"

\[\text{[sex is “Male”]} = \begin{cases} 0 & \text{if sex is “Female”}\\ 1 & \text{if sex is “Male”} \end{cases}\]

Model with factor, without intercept

model_B <- lm(weight_kg ~ sex + 0, data = students)
coef(model_B)

sexFemale   sexMale 
 59.85484  78.02941

Now the formula includes +0. This means “do not use intercept”.

Now

sexFemale is the average weight when sex=="Female"
sexMale is the average weight when sex=="Male"

Application: gene expression analysis

Let’s analyze gene_expression.csv

   id LOC100652730    IZUMO4     RHOU       STC1     PDGFC
D   D     260.2225 207.44155 227.0810 574.453525  432.0676
D2 D2     151.1980 301.53198 154.6539 817.333091  203.0373
D3 D3     158.2063 293.09798 143.2183 877.628619  183.1862
K   K     200.4629 125.21024 333.8940  18.971249 2578.1927
K2 K2     132.7736  49.79009 278.8245   7.745125 2391.0307
K3 K3     140.0867  38.20547 300.3375   3.183789 2358.1266

Each column is a gene, each row is a microarray (or RNAseq)

Experiment description

These values are real, taken from NCBI GSE95670

The values are absolute expression, we want to know

The fold-change in expression
If that change is real, or just noise

Factors indicate the condition

The first three rows are mutants
The last three rows are wild-type, or control

We can add a column with a factor describing the condition

m <- mutate(m, condition=factor(
    rep(c("Mutant", "Control"), c(3, 3)),
    levels=c("Control","Mutant"))
)

Now we can do a model for each gene

We want to measure fold-change. That means, we want to calculate log_2(mutant/wild type)

gene_model <- lm(log2(LOC100652730) ~ condition, data=m)
coef(gene_model)

    (Intercept) conditionMutant 
      7.2767306       0.2464562

confint(gene_model, parm = "conditionMutant")

                     2.5 %   97.5 %
conditionMutant -0.6216828 1.114595

Check if the confidence interval contains 0

Other cases

lm(log2(IZUMO4) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %   97.5 %
conditionMutant 0.5694141 3.608096

lm(log2(RHOU) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %     97.5 %
conditionMutant -1.433264 -0.2166465

lm(log2(STC1) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                  2.5 %   97.5 %
conditionMutant 4.45331 8.712368

lm(log2(PDGFC) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %    97.5 %
conditionMutant -4.362937 -2.184466

Combining factors and numeric

Weight v/s Height

Let’s start with the simplest model

model_4 <- lm(weight_kg ~ height_cm, data=students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_line(aes(y=predict(model_4)))

Interpretation

model_4


Call:
lm(formula = weight_kg ~ height_cm, data = students)

Coefficients:
(Intercept)    height_cm  
   -73.6833       0.8215

Including the sex factor

Obviously the weight also depends on the sex

model_5 <- lm(weight_kg ~ height_cm + sex, data = students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_point(aes(y=predict(model_5)))

Interpretation

model_5


Call:
lm(formula = weight_kg ~ height_cm + sex, data = students)

Coefficients:
(Intercept)    height_cm      sexMale  
    12.8319       0.2843      14.1486

There are two lines with the same slope: is the coefficient for height_cm

The “intercept” of each line are the coefficients sexFemale and sexMale

Using independent slopes for each sex

model_6 <- lm(weight_kg ~ height_cm:sex + sex + 0, data = students)
ggplot(students, aes(x=height_cm, y=weight_kg)) + 
    geom_point(aes(color=sex)) + geom_point(aes(y=predict(model_6)))

Interpretation

model_6


Call:
lm(formula = weight_kg ~ height_cm:sex + sex + 0, data = students)

Coefficients:
          sexFemale              sexMale  height_cm:sexFemale  
           71.30494            -23.12179             -0.06924  
  height_cm:sexMale  
            0.56342

What do you see?

Summary

Factors can be used in formulas
The intercept can be taken out using +0 in the formula
Variables can have interactions (synergy)
- Use : to indicate interaction
Homework: Learn about * in formulas

Class 28: Linear models with factors

Computing for Molecular Biology 1

Andrés Aravena, PhD

18 January 2021

We read the data

Let’s take a quick look

Use the same formula for the first model

Compare experiment and model

Adding time2 to the formula

The result are the coefficients

But we know that 𝐴=0

Forcing A=0 in the model

Result

Confidence interval

summary() gives more detail

Modeling Using Factors

With students data

Let’s drop all NA values

Boys are heavier than girls

Models with factors

In other words …

Model with factor, without intercept

Application: gene expression analysis

Experiment description

Factors indicate the condition

Now we can do a model for each gene

Other cases

Combining factors and numeric

Weight v/s Height

Interpretation

Including the sex factor

Interpretation

Using independent slopes for each sex

Interpretation

Summary

Adding `time`² to the formula

`summary()` gives more detail

With `students` data

Let’s drop all `NA` values