Class 3: Simple linear models

Systems Biology

Andrés Aravena, PhD

October 19, 2023

Summary of last class

Measured = Real ⊕ Instrument ⊕ Noise ⊕ Diversity \[y_m(x,i,t) = f(x)⊕I(x,t)⊕v(i)⊕b(x,i)\]

Normalized = Real ⊕ Noise ⊕ Diversity \[y(x,i) = f(x)⊕v(i)⊕b(x,i)\]

Weight

\[w_N(h,s,i) = w(h,s)⊕v(i)⊕b(h,s,i)\]

\[\underbrace{y_i}_{w_N(h,s,i)} = \underbrace{β_0 + β_1 x_i + β_2 s_i}_{w(h,s)} + \underbrace{e_i}_{v(i)⊕b(h,s,i)}\]

(To make it more general, we use \(x_i\) to represent height)

This way the formulas will not be only about weight and height

The error term

The \(e_i\) values will be important later

They are called “errors” or “residuals”

Since there is no systematic error, we must have \[\sum_i e_i=0\] (if the residuals had a nonzero mean, we could absorb it into \(\beta_0\) or \(\beta_1\) and make the error smaller)

Data from CMB1

      sex birth_date height weight handedness hand_span
1    Male 1993-02-01    179     67      Right        15
2  Female 1998-05-21    168     55      Right        14
4    Male 1998-08-29    170     74      Right        25
5  Female 1998-05-03    162     68      Right        13
6  Female 1995-10-09    167     58      Right        18
7  Female 1997-09-19    174     72      Right        16
8    Male 1997-11-27    180     68      Right        19
9  Female 1999-01-02    162     58      Right        19
10 Female 1998-10-02    172     55      Right        20
11   Male 1997-05-18    181     81      Right        20

Visualization

plot(survey$weight)

Modeling weight alone: Distribution

hist(survey$weight, nclass=30)

First simple model

This is the simplest model for the set \(𝐲\)

\[y_i = β_0 + e_i\]

We look for a constant \(β_0\) representing all values

The error made by this approximation is \(e_i\)

Basic descriptive statistics: Mean

The best \(β_0\), in the mean-square-error sense, turns out to be

\[\text{mean}(𝐲)=\overline{𝐲}=\frac{1}{n}\sum_i y_i\]

mean(survey$weight)
[1] 66.36869
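
As a quick check, we can minimize the mean square error numerically and see that the optimum agrees with the mean. This is just a sketch, assuming survey$weight has no missing values:

# find the constant b0 that minimizes the mean square error
optimize(function(b0) mean((survey$weight - b0)^2),
         interval = range(survey$weight))$minimum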

Variance: how bad is this model?

This is the mean square error of the model \[MSE_0=\text{var}(𝐲) =\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\]

mean(survey$weight^2) - mean(survey$weight)^2  # variance, dividing by n
[1] 170.2504
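
Note that R's var() divides by \(n-1\) instead of \(n\), which is why we computed the variance by hand above. The two differ only by a simple factor (a sketch, assuming no missing values):

n <- nrow(survey)
var(survey$weight) * (n - 1) / n   # same as mean(y^2) - mean(y)^2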

Using more information

Modeling weight knowing height

It is easy to see that taller people tend to be heavier

You need more bone to be taller, and more muscle to move it

plot(weight ~ height, data=survey)

Seems like a straight line

We will model using this expression \[y_i = β_0 + β_1 x_i + e_i\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]

In R

model_1 <- lm(weight ~ height, data=survey)
coef(model_1)
(Intercept)      height 
 -84.007888    0.882157 
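
We can verify these coefficients with the formulas from the previous slide; the \(n-1\) factors in cov() and var() cancel in the ratio. A sketch, assuming no missing values in height and weight:

b1 <- cov(survey$height, survey$weight) / var(survey$height)
b0 <- mean(survey$weight) - b1 * mean(survey$height)
c(b0, b1)               # matches coef(model_1)
sum(residuals(model_1)) # essentially zero, as required for the errors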

Prediction

plot(weight ~ height, data=survey)
points(predict(model_1) ~ height, data=survey, col="red", pch=16)

How bad is this model?

This time instead of \[MSE_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\] we have \[MSE_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]

Since the subtracted term is never negative, the mean square error of model 1 is never larger than that of model 0
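
We can check the identity numerically; R's var() and cov() divide by \(n-1\), hence the correction factor (a sketch, assuming no missing values):

mean(residuals(model_1)^2)  # MSE of model 1, directly from the residuals
y <- survey$weight
x <- survey$height
n <- length(y)
(var(y) - cov(x, y)^2 / var(x)) * (n - 1) / n  # same value, from the formula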

Relative improvement

The mean square error of the new model is smaller, but by how much? \[\frac{MSE_0-MSE_1}{MSE_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model

The name of this number is \(R^2\)

Correlation

The Pearson correlation coefficient between two variables is \[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sd}(𝐱)\text{sd}(𝐲)}\] so we have in this case that \[R^2 = r^2\] This is valid for linear models with a single independent variable. It will not be valid for larger models
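
A quick check in R (assuming no missing values):

cor(survey$height, survey$weight)^2  # r squared
summary(model_1)$r.squared           # R squared reported by the model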

The interesting part

These are random values

Our results include a random component (the \(e_i\))

Thus, the values of \(β_0\) and \(β_1\) are also random

But they are not far away from the “real” ones

More details in the next class

Confidence intervals

We can get a 95% or 99% confidence interval for the values

confint(model_1, level=0.99)
                   0.5 %     99.5 %
(Intercept) -132.7312228 -35.284553
height         0.5967649   1.167549

If the confidence interval does not contain 0, then the real value is not 0

“The evidence shows that the coefficient is not 0”

In the opposite case

If the confidence interval contains 0,
then the real value may be 0

“We do not have evidence that the coefficient is ≠0”

p-values

summary(model_1)
Call:
lm(formula = weight ~ height, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.016  -7.225   0.216   7.396  35.630 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -84.0079    18.5438  -4.530 1.68e-05 ***
height        0.8822     0.1086   8.122 1.48e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.17 on 97 degrees of freedom
Multiple R-squared:  0.4048,    Adjusted R-squared:  0.3986 
F-statistic: 65.96 on 1 and 97 DF,  p-value: 1.479e-12

This is a Student’s t-test

The estimates of \(β_0\) and \(β_1\), divided by their standard errors, follow a Student’s t distribution

(We will show that later)

Thus, we can use a t-test to check whether they are different from zero

In this case we let the computer do it for us
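
Still, we can reproduce what the computer does, using the coefficient table from summary(model_1); a sketch:

cf <- summary(model_1)$coefficients
t_height <- cf["height", "Estimate"] / cf["height", "Std. Error"]
p_height <- 2 * pt(-abs(t_height), df = df.residual(model_1))
c(t_height, p_height)  # matches the "t value" and "Pr(>|t|)" columns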

Confidence interval for the line

plot(weight ~ height, data=survey)
pred <- predict(model_1, interval = "confidence")
points(pred[,"fit"] ~ height, data=survey, pch=16, col="red")
points(pred[,"lwr"] ~ height, data=survey, pch=16, col="blue")
points(pred[,"upr"] ~ height, data=survey, pch=16, col="purple")

Confidence interval for the prediction

A prediction interval describes where a new individual observation is expected to fall, so it is wider than the interval for the mean line

plot(weight ~ height, data=survey)
pred <- predict(model_1, interval = "prediction", level=0.95)
points(pred[,"fit"] ~ height, data=survey, pch=16, col="red")
points(pred[,"lwr"] ~ height, data=survey, pch=16, col="blue")
points(pred[,"upr"] ~ height, data=survey, pch=16, col="purple")

Sex

Modeling weight knowing sex

plot(weight ~ sex, data=survey)

Sex is a factor, so we get a box plot

Two groups

We still have \(n\) individuals, but now we can split them into two groups

The groups are called “female” and “male”

There are \(n_0\) females and \(n_1\) males, so \(n=n_0+n_1.\)

How to model a factor?

The easiest way is to use a number \[s_i=\begin{cases}0\quad\text{if person }i\text{ is female}\\ 1\quad\text{if person }i\text{ is male}\end{cases}\]

It is easy to see that \(\sum_i s_i = n_1\)
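
This 0/1 coding is what R builds internally when a factor enters a formula. We can inspect the design matrix (a sketch, assuming sex has levels Female and Male):

head(model.matrix(~ sex, data = survey))  # the sexMale column holds the 0/1 values
sum(model.matrix(~ sex, data = survey)[, "sexMale"])  # equals n_1, the number of males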

How does it work?

We now have \[y_i = β_0 + β_1 s_i + e_i\] For all the female individuals, \(s_i=0,\) so \[y_i = β_0 + e_i\] Taking averages (the errors average out to zero) we get \[\frac{1}{n_0}\sum_{i\text{ female}}y_i = β_0\] so \(β_0\) is the average weight of the females \[β_0=\text{mean}(𝐲\text{ female})\]

Case of males

For all the male individuals, \(s_i=1,\) so \[y_i = β_0 + β_1 + e_i\] Taking averages we get \[\frac{1}{n_1}\sum_{i\text{ male}}y_i = β_0+ β_1\] In other words \[β_0 + β_1=\text{mean}(𝐲\text{ male})\]

Interpretation of \(β\)

We have \[β_0=\text{mean}(𝐲\text{ female})\] and \[β_0 + β_1=\text{mean}(𝐲\text{ male})\] Replacing the value of \(β_0\) we have \[β_1=\text{mean}(𝐲\text{ male})-\text{mean}(𝐲\text{ female})\]

In practice

model_2 <- lm(weight ~ sex, data=survey)
coef(model_2)
confint(model_2, level = 0.99)
(Intercept)     sexMale 
   59.57937    18.67063 
               0.5 %   99.5 %
(Intercept) 56.41406 62.74467
sexMale     13.42158 23.91969

(Intercept) is the average female weight
sexMale is the average extra weight for males
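
We can confirm this interpretation directly from the group means (a sketch, assuming no missing values):

m <- tapply(survey$weight, survey$sex, mean)
m["Female"]              # matches (Intercept)
m["Male"] - m["Female"]  # matches sexMale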

p-values in the new model

summary(model_2)
Call:
lm(formula = weight ~ sex, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.250  -5.579  -1.579   5.421  27.750 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   59.579      1.205  49.456  < 2e-16 ***
sexMale       18.671      1.998   9.346 3.47e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.562 on 97 degrees of freedom
Multiple R-squared:  0.4738,    Adjusted R-squared:  0.4684 
F-statistic: 87.34 on 1 and 97 DF,  p-value: 3.47e-15

What about handedness

model_3 <- lm(weight ~ handedness, data=survey)
coef(model_3)
confint(model_3, level = 0.99)
    (Intercept) handednessRight 
     67.0000000      -0.7022472 
                    0.5 %   99.5 %
(Intercept)      56.04894 77.95106
handednessRight -12.25216 10.84767

What are the coefficients?

What can we conclude here?

p-values for third model

summary(model_3)
Call:
lm(formula = weight ~ handedness, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.798 -10.649  -1.298   8.351  39.702 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      67.0000     4.1679   16.07   <2e-16 ***
handednessRight  -0.7022     4.3958   -0.16    0.873    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.18 on 97 degrees of freedom
Multiple R-squared:  0.000263,  Adjusted R-squared:  -0.01004 
F-statistic: 0.02552 on 1 and 97 DF,  p-value: 0.8734