Measured = Real ⊕ Instrument ⊕ Noise ⊕ Diversity \[y_m(x,i,t) = f(x)⊕I(x,t)⊕v(i)⊕b(x,i)\]
Normalized = Real ⊕ Noise ⊕ Diversity \[y(x,i) = f(x)⊕v(i)⊕b(x,i)\]
\[w_N(h,s,i) = w(h,s)⊕v(i)⊕b(h,s,i)\]
\[\underbrace{y_i}_{w_N(h,s,i)} = \underbrace{β_0 + β_1 x_i + β_2 s_i}_{w(h,s)} + \underbrace{e_i}_{v(i)⊕b(h,s,i)}\]
(To keep the notation general, we use \(x_i\) to represent height)
This way the formulas are not tied only to weight and height
The \(e_i\) values will be important later
They are called “errors” or “residuals”
Since there is no systematic error, we must have \[\sum_i e_i=0\] (otherwise we could reduce the error by adjusting \(\beta_0\) or \(\beta_1\))
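The slides use R, but the zero-sum property of least-squares residuals is easy to check numerically. A minimal numpy sketch, with made-up height/weight numbers (not the survey data):

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), for illustration only
x = np.array([162.0, 168.0, 170.0, 174.0, 179.0, 181.0])
y = np.array([55.0, 58.0, 68.0, 72.0, 67.0, 81.0])

# Least-squares fit y = b0 + b1*x, slope = cov(x,y)/var(x)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# The residuals sum to zero (up to floating-point error)
e = y - (b0 + b1 * x)
print(abs(e.sum()) < 1e-9)   # True
```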
| | sex | birth_date | height (cm) | weight (kg) | handedness | hand_span (cm) |
|---|---|---|---|---|---|---|
| 1 | Male | 1993-02-01 | 179 | 67 | Right | 15 |
| 2 | Female | 1998-05-21 | 168 | 55 | Right | 14 |
| 4 | Male | 1998-08-29 | 170 | 74 | Right | 25 |
| 5 | Female | 1998-05-03 | 162 | 68 | Right | 13 |
| 6 | Female | 1995-10-09 | 167 | 58 | Right | 18 |
| 7 | Female | 1997-09-19 | 174 | 72 | Right | 16 |
| 8 | Male | 1997-11-27 | 180 | 68 | Right | 19 |
| 9 | Female | 1999-01-02 | 162 | 58 | Right | 19 |
| 10 | Female | 1998-10-02 | 172 | 55 | Right | 20 |
| 11 | Male | 1997-05-18 | 181 | 81 | Right | 20 |
This is the simplest model for the set \(𝐲\)
\[y_i = β_0 + e_i\]
We look for a constant \(β_0\) representing all values
The error made by this approximation is \(e_i\)
The best \(β_0\) turns out to be the mean of the data
\[β_0=\text{mean}(𝐲)=\overline{𝐲}=\frac{1}{n}\sum_i y_i\]
[1] 66.36869
This is the mean square error of the model \[MSE_0=\text{var}(𝐲) =\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\]
[1] 170.2504
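The mean really is the best constant: any other value gives a larger mean square error. A quick numpy check with made-up weights (not the survey values):

```python
import numpy as np

# Made-up weights (kg); not the actual survey values
y = np.array([55.0, 58.0, 67.0, 68.0, 74.0, 81.0])

b0 = y.mean()                    # the constant model y_i = b0 + e_i
mse0 = np.mean((y - b0) ** 2)    # mean square error = var(y)

# Shifting the constant in either direction only increases the error
worse_lo = np.mean((y - (b0 - 1.0)) ** 2)
worse_hi = np.mean((y - (b0 + 1.0)) ** 2)
print(mse0 < worse_lo and mse0 < worse_hi)   # True
```

(In fact, using \(β_0+c\) instead of \(β_0=\overline{𝐲}\) increases the error by exactly \(c^2\).)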
It is easy to see that taller people tend to be heavier
More bone is needed to be taller, and more muscle to move it
We will model using this expression \[y_i = β_0 + β_1 x_i + e_i\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
(Intercept) height
-84.007888 0.882157
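The closed-form expressions for \(β_0\) and \(β_1\) agree with what a least-squares routine produces. A numpy sketch with illustrative (height, weight) pairs, not the survey data:

```python
import numpy as np

# Illustrative (height cm, weight kg) pairs, not the actual survey
x = np.array([162.0, 167.0, 170.0, 174.0, 180.0, 181.0])
y = np.array([55.0, 58.0, 74.0, 72.0, 68.0, 81.0])

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # cov(x,y)/var(x)
b0 = y.mean() - b1 * x.mean()                    # mean(y) - b1*mean(x)

# Same numbers as numpy's least-squares line fit
b1_pf, b0_pf = np.polyfit(x, y, 1)
print(np.allclose([b1, b0], [b1_pf, b0_pf]))     # True
```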
This time instead of \[MSE_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\] we have \[MSE_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]
Since \(\text{cov}^2(𝐱,𝐲)/\text{var}(𝐱)\geq 0\), the mean square error of model 1 is never larger than that of model 0
The error in the new model is smaller, but by how much? \[\frac{MSE_0-MSE_1}{MSE_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model
The name of this number is \(R^2\)
The Pearson correlation coefficient between two variables is \[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sd}(𝐱)\text{sd}(𝐲)}\] so we have in this case that \[R^2 = r^2\] This is valid for linear models with a single independent variable. It will not be valid for larger models
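The identity \(R^2=r^2\) for a single predictor can be verified directly. A numpy sketch with the same made-up pairs as above:

```python
import numpy as np

# Illustrative (height cm, weight kg) pairs, not the actual survey
x = np.array([162.0, 167.0, 170.0, 174.0, 180.0, 181.0])
y = np.array([55.0, 58.0, 74.0, 72.0, 68.0, 81.0])

mse0 = np.var(y)
mse1 = np.var(y) - np.cov(x, y, bias=True)[0, 1] ** 2 / np.var(x)
r2 = (mse0 - mse1) / mse0          # fraction of variance explained

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
print(np.isclose(r2, r ** 2))      # True
```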
Our results include a random component (the \(e_i\))
Thus, the values of \(β_0\) and \(β_1\) are also random
But they are not far away from the “real” ones
More details in the next class
We can compute a confidence interval for the coefficients, for example at 95% or 99% (below, 99%)
0.5 % 99.5 %
(Intercept) -132.7312228 -35.284553
height 0.5967649 1.167549
If the confidence interval does not contain 0, then the real value is not 0
“The evidence shows that the coefficient is not 0”
If the confidence interval contains 0,
then the real value may be 0
“We do not have evidence that the coefficient is ≠0”
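Such an interval is built from the summary table as estimate ± t·(std. error). A sketch using the slope estimate and standard error reported for `weight ~ height`, with the t quantile taken from a table (≈2.6275 for 97 degrees of freedom at 0.995):

```python
# Values from the lm() summary of weight ~ height
est, se, df = 0.8822, 0.1086, 97

# 99% interval: estimate ± t_{0.995, df} * std.error
t_995 = 2.6275           # approximate t quantile, from a t table
lo, hi = est - t_995 * se, est + t_995 * se

print(round(lo, 3), round(hi, 3))   # close to the confint() output
print(lo > 0)                       # the interval excludes 0
```

The result matches the `(0.5968, 1.1675)` interval shown above up to rounding, and since it excludes 0 we conclude the slope is nonzero.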
Call:
lm(formula = weight ~ height, data = survey)
Residuals:
Min 1Q Median 3Q Max
-18.016 -7.225 0.216 7.396 35.630
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -84.0079 18.5438 -4.530 1.68e-05 ***
height 0.8822 0.1086 8.122 1.48e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.17 on 97 degrees of freedom
Multiple R-squared: 0.4048, Adjusted R-squared: 0.3986
F-statistic: 65.96 on 1 and 97 DF, p-value: 1.479e-12
The estimates of \(β_0\) and \(β_1\), divided by their standard errors, follow a Student’s t distribution
(We will show that later)
Thus, we can use a t-test to see if they are not zero
In this case we let the computer do it for us
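Even so, the t value column is nothing mysterious: it is just the estimate divided by its standard error. A sketch using the numbers from the summary above:

```python
# Coefficient (estimate, std. error) pairs from the weight ~ height summary
coefs = {
    "(Intercept)": (-84.0079, 18.5438),
    "height": (0.8822, 0.1086),
}

# t value = estimate / std. error
for name, (est, se) in coefs.items():
    print(name, round(est / se, 2))   # ≈ -4.53 and 8.12, as in the table
```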
Sex is a factor, so we get a box plot
We still have \(n\) individuals, but now we can split them into two groups
The groups are called “female” and “male”
There are \(n_0\) females and \(n_1\) males, so \(n=n_0+n_1.\)
The easiest way is to use a number \[s_i=\begin{cases}0\quad\text{if person }i\text{ is female}\\ 1\quad\text{if person }i\text{ is male}\end{cases}\]
It is easy to see that \(\sum_i s_i = n_1\)
We have now \[y_i = β_0 + β_1 s_i + e_i\] So for all the female individuals, we have \[y_i = β_0 + e_i\] Taking averages (the errors average to zero) we have \[\frac{1}{n_0}\sum_{i\text{ female}}y_i = β_0\] so \(β_0\) is the average weight of the females \[β_0=\text{mean}(𝐲\text{ female})\]
So for all the male individuals, we have \[y_i = β_0 + β_1 + e_i\] Taking averages we have \[\frac{1}{n_1}\sum_{i\text{ male}}y_i = β_0+ β_1\] In other words \[β_0 + β_1=\text{mean}(𝐲\text{ male})\]
We have \[β_0=\text{mean}(𝐲\text{ female})\] and \[β_0 + β_1=\text{mean}(𝐲\text{ male})\] Replacing the value of \(β_0\) we have \[β_1=\text{mean}(𝐲\text{ male})-\text{mean}(𝐲\text{ female})\]
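The same least-squares formulas as before, applied to a 0/1 dummy, recover exactly the two group means. A numpy sketch with made-up weights (not the survey data):

```python
import numpy as np

# Made-up weights (kg); s = 0 for female, 1 for male
y = np.array([55.0, 58.0, 62.0, 68.0, 74.0, 81.0])
s = np.array([0, 0, 0, 1, 1, 1])

# Usual least-squares formulas, with the dummy s as the predictor
b1 = np.cov(s, y, bias=True)[0, 1] / np.var(s)
b0 = y.mean() - b1 * s.mean()

print(np.isclose(b0, y[s == 0].mean()))        # b0 = female mean
print(np.isclose(b0 + b1, y[s == 1].mean()))   # b0 + b1 = male mean
```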
(Intercept) sexMale
59.57937 18.67063
0.5 % 99.5 %
(Intercept) 56.41406 62.74467
sexMale 13.42158 23.91969
(Intercept)
is the average female weight
sexMale
is the average extra weight for males
Call:
lm(formula = weight ~ sex, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.250 -5.579 -1.579 5.421 27.750
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.579 1.205 49.456 < 2e-16 ***
sexMale 18.671 1.998 9.346 3.47e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.562 on 97 degrees of freedom
Multiple R-squared: 0.4738, Adjusted R-squared: 0.4684
F-statistic: 87.34 on 1 and 97 DF, p-value: 3.47e-15
(Intercept) handednessRight
67.0000000 -0.7022472
0.5 % 99.5 %
(Intercept) 56.04894 77.95106
handednessRight -12.25216 10.84767
What are the coefficients?
What can we conclude here?
Call:
lm(formula = weight ~ handedness, data = survey)
Residuals:
Min 1Q Median 3Q Max
-23.798 -10.649 -1.298 8.351 39.702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.0000 4.1679 16.07 <2e-16 ***
handednessRight -0.7022 4.3958 -0.16 0.873
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.18 on 97 degrees of freedom
Multiple R-squared: 0.000263, Adjusted R-squared: -0.01004
F-statistic: 0.02552 on 1 and 97 DF, p-value: 0.8734