So far we have only described a single dataset, nothing more
Now we will make inferences about a larger population
We have \[y = X\beta + e\]
We assume \(X\) is not random, but \(e\) is random, with \[\mathbb{E}(e)=0\qquad \mathbb{V}(e)=\sigma^2 I\]
(How do we calculate \(\mathbb{V}(e)\)? For a random vector it is the covariance matrix \(\mathbb{E}\big[(e-\mathbb{E}e)(e-\mathbb{E}e)^\top\big]\))
Thus \(y\) is random
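A minimal simulation may make these assumptions concrete; the sizes \(n\), \(m\) and the values of \(\beta\) and \(\sigma\) below are illustrative choices, not part of the original setup:

```python
import numpy as np

# Simulate the model y = X beta + e with E(e) = 0, V(e) = sigma^2 I.
# All concrete numbers here are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 100, 3                      # n observations, m coefficients
X = rng.normal(size=(n, m))        # design matrix, treated as fixed
beta = np.array([1.0, -2.0, 0.5])  # "true" coefficients
sigma = 0.7

e = rng.normal(scale=sigma, size=n)  # errors: mean 0, variance sigma^2, independent
y = X @ beta + e                     # y inherits its randomness from e
```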
From the sample we compute \(\hat\beta\), an estimator of \(\beta\); since it is a function of the random \(y\), \(\hat\beta\) is itself random
What is its expected value?
Variance?
Distribution?
This should help us design better experiments
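Under the assumptions above, the standard answers for the least-squares estimator are \(\mathbb{E}(\hat\beta)=\beta\) and \(\mathbb{V}(\hat\beta)=\sigma^2 (X^\top X)^{-1}\); here is a minimal Monte Carlo sketch that checks both empirically (all concrete numbers are illustrative):

```python
import numpy as np

# Monte Carlo check of E(beta_hat) = beta and
# V(beta_hat) = sigma^2 (X^T X)^{-1} for least squares.
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))        # X stays fixed; only e is redrawn
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.7

XtX_inv = np.linalg.inv(X.T @ X)
estimates = []
for _ in range(5000):              # repeat the "experiment" many times
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ y)   # beta_hat = (X^T X)^{-1} X^T y
estimates = np.array(estimates)

print(estimates.mean(axis=0))      # should be close to beta
print(np.cov(estimates.T))         # should be close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```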
Let \(C\) be an invertible matrix whose dimension matches the number of coefficients
Let \(\gamma = C\beta\) be a linear combination (a reparametrization) of the coefficients
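The natural estimator is then \(\hat\gamma = C\hat\beta\), with \(\mathbb{E}(\hat\gamma)=C\beta\) and \(\mathbb{V}(\hat\gamma)=\sigma^2\, C (X^\top X)^{-1} C^\top\); a small sketch, where \(C\) is an arbitrary invertible example chosen only for illustration:

```python
import numpy as np

# If gamma = C beta, estimate it by gamma_hat = C beta_hat.
# Then V(gamma_hat) = sigma^2 C (X^T X)^{-1} C^T.
rng = np.random.default_rng(1)
n, m = 100, 3
X = rng.normal(size=(n, m))
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.7
y = X @ beta + rng.normal(scale=sigma, size=n)

C = np.array([[1.0, 1.0, 0.0],    # e.g. first row targets beta_1 + beta_2
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = C @ beta_hat
var_gamma_hat = sigma**2 * C @ np.linalg.inv(X.T @ X) @ C.T
print(gamma_hat)
print(var_gamma_hat)
```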
Overfitting
Instead of minimizing \[\sum (y_i - X_i\beta)^2\] we minimize \[\sum (y_i - X_i\beta)^2 + P(n,m,\dots)\] where \(P\) is a penalty term that favors “simpler” models
Examples:
We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \beta_j^2\] that is, we ask the coefficients to be small (ridge regression)
\(\lambda\) controls how much we restrict the coefficients
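For this criterion the minimizer has the closed form \(\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y\); a minimal numpy sketch (note this version penalizes every coefficient, whereas in practice an intercept is often left unpenalized):

```python
import numpy as np

# Ridge regression: minimize sum (y_i - X_i beta)^2 + lam * sum beta_j^2,
# whose solution is beta_hat = (X^T X + lam I)^{-1} X^T y.
def ridge(X, y, lam):
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

for lam in (0.0, 1.0, 100.0):      # larger lambda -> smaller coefficients
    print(lam, ridge(X, y, lam))
```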
We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \vert\beta_j\vert\] (the lasso)
That is, we penalize with the \(\ell_1\) (Manhattan) norm instead of the squared \(\ell_2\) (Euclidean) norm
This tends to force more coefficients to be exactly zero, effectively selecting variables
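Unlike ridge, the lasso has no closed form; a minimal coordinate-descent sketch for the scaling \(\tfrac{1}{2}\sum (y_i - X_i\beta)^2 + \lambda \sum \vert\beta_j\vert\) follows (library implementations such as sklearn.linear_model.Lasso normalize the objective slightly differently):

```python
import numpy as np

# Coordinate descent for the lasso: each coordinate update is a
# soft-thresholding step, which is what produces exact zeros.
def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            # partial residual: leave coordinate j out
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

for lam in (0.1, 10.0, 50.0):      # larger lambda -> more exact zeros
    print(lam, lasso(X, y, lam))
```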