So far we have only described a single dataset, nothing more
Now we will make inferences about a larger population
We have \[y = X\beta + e\]
We assume \(X\) is not random, but \(e\) is random, with \[\mathbb{E}(e)=0\qquad \mathbb{V}(e)=\sigma^2 I\]
(How do we calculate \(\mathbb{V}(e)\)? For a random vector it is the covariance matrix \(\mathbb{E}\big[(e-\mathbb{E}e)(e-\mathbb{E}e)^\top\big]\))
Thus \(y\) is random
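A minimal simulation may make these assumptions concrete; the sizes \(n\), \(m\) and the values of \(\beta\) and \(\sigma\) below are illustrative choices, not part of the original setup:

```python
import numpy as np

# Simulate the model y = X beta + e with E(e) = 0, V(e) = sigma^2 I.
# All concrete numbers here are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 100, 3                      # n observations, m coefficients
X = rng.normal(size=(n, m))        # design matrix, treated as fixed
beta = np.array([1.0, -2.0, 0.5])  # "true" coefficients
sigma = 0.7

e = rng.normal(scale=sigma, size=n)  # errors: mean 0, variance sigma^2, independent
y = X @ beta + e                     # y inherits its randomness from e
```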
From the sample we compute \(\hat\beta\), an estimator of \(\beta\); since it is a function of the random \(y\), \(\hat\beta\) is itself random
What is its expected value?
Variance?
Distribution?
This should help us design better experiments
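Under the assumptions above, the standard answers for the least-squares estimator are \(\mathbb{E}(\hat\beta)=\beta\) and \(\mathbb{V}(\hat\beta)=\sigma^2 (X^\top X)^{-1}\); here is a minimal Monte Carlo sketch that checks both empirically (all concrete numbers are illustrative):

```python
import numpy as np

# Monte Carlo check of E(beta_hat) = beta and
# V(beta_hat) = sigma^2 (X^T X)^{-1} for least squares.
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))        # X stays fixed; only e is redrawn
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.7

XtX_inv = np.linalg.inv(X.T @ X)
estimates = []
for _ in range(5000):              # repeat the "experiment" many times
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ y)   # beta_hat = (X^T X)^{-1} X^T y
estimates = np.array(estimates)

print(estimates.mean(axis=0))      # should be close to beta
print(np.cov(estimates.T))         # should be close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```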
Let \(C\) be an invertible matrix whose dimension matches the number of coefficients
Let \(\gamma = C\beta\) be a linear combination (a reparametrization) of the coefficients
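The natural estimator is then \(\hat\gamma = C\hat\beta\), with \(\mathbb{E}(\hat\gamma)=C\beta\) and \(\mathbb{V}(\hat\gamma)=\sigma^2\, C (X^\top X)^{-1} C^\top\); a small sketch, where \(C\) is an arbitrary invertible example chosen only for illustration:

```python
import numpy as np

# If gamma = C beta, estimate it by gamma_hat = C beta_hat.
# Then V(gamma_hat) = sigma^2 C (X^T X)^{-1} C^T.
rng = np.random.default_rng(1)
n, m = 100, 3
X = rng.normal(size=(n, m))
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.7
y = X @ beta + rng.normal(scale=sigma, size=n)

C = np.array([[1.0, 1.0, 0.0],    # e.g. first row targets beta_1 + beta_2
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = C @ beta_hat
var_gamma_hat = sigma**2 * C @ np.linalg.inv(X.T @ X) @ C.T
print(gamma_hat)
print(var_gamma_hat)
```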
Overfitting
Instead of minimizing \[\sum (y_i - X_i\beta)^2\] we minimize \[\sum (y_i - X_i\beta)^2 + P(n,m,\dots)\] where \(P\) is a penalty term that favors “simpler” models
Examples:
We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \beta_j^2\] that is, we ask the coefficients to be small (ridge regression)
\(\lambda\) controls how much we restrict the coefficients
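For this criterion the minimizer has the closed form \(\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y\); a minimal numpy sketch (note this version penalizes every coefficient, whereas in practice an intercept is often left unpenalized):

```python
import numpy as np

# Ridge regression: minimize sum (y_i - X_i beta)^2 + lam * sum beta_j^2,
# whose solution is beta_hat = (X^T X + lam I)^{-1} X^T y.
def ridge(X, y, lam):
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

for lam in (0.0, 1.0, 100.0):      # larger lambda -> smaller coefficients
    print(lam, ridge(X, y, lam))
```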
We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \vert\beta_j\vert\] (the lasso)
That is, we penalize with the \(\ell_1\) (Manhattan) norm instead of the squared \(\ell_2\) (Euclidean) norm
This tends to force more coefficients to be exactly zero, effectively selecting variables
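Unlike ridge, the lasso has no closed form; a minimal coordinate-descent sketch for the scaling \(\tfrac{1}{2}\sum (y_i - X_i\beta)^2 + \lambda \sum \vert\beta_j\vert\) follows (library implementations such as sklearn.linear_model.Lasso normalize the objective slightly differently):

```python
import numpy as np

# Coordinate descent for the lasso: each coordinate update is a
# soft-thresholding step, which is what produces exact zeros.
def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            # partial residual: leave coordinate j out
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)

for lam in (0.1, 10.0, 50.0):      # larger lambda -> more exact zeros
    print(lam, lasso(X, y, lam))
```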