Abstraction: forget details to make it generic
Formalization: avoid ambiguity and logic errors
Solution: use a computer or ask a friend
Interpret: this is the biology
23 September 2016
2, 9, 6, 2, 5, 10, 4, 6, 8, 7, 11, 7, 6, 7, 8, 4, 7, 4, 2, 8, 3, 5, 7, 3, 7, 6, 7, 2, 4, 6, 5, 6, 12, 8, 6, 5, 6, 8, 3, 2, 7, 7, 6, 8, 9, 8, 11, 2, 6, 3, 11, 11, 5, 9, 7, 5, 8, 6, 11, 5, 7, 4, 5, 7, 3, 2, 10, 10, 10, 3, 3, 8, 7, 10, 10, 2, 2, 6, 7, 6, 4, 4, 9, 11, 12, 5, 2, 9, 2, 5, 10, 12, 7, 8, 6, 7, 3, 9, 7 and 5
value | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 11 | 8 | 7 | 11 | 14 | 17 | 10 | 6 | 7 | 6 | 3 |
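For instance, a small Python sketch can reproduce these counts from the raw data:

```python
from collections import Counter

# the 100 observations listed above
y = [2, 9, 6, 2, 5, 10, 4, 6, 8, 7, 11, 7, 6, 7, 8, 4, 7, 4, 2, 8,
     3, 5, 7, 3, 7, 6, 7, 2, 4, 6, 5, 6, 12, 8, 6, 5, 6, 8, 3, 2,
     7, 7, 6, 8, 9, 8, 11, 2, 6, 3, 11, 11, 5, 9, 7, 5, 8, 6, 11, 5,
     7, 4, 5, 7, 3, 2, 10, 10, 10, 3, 3, 8, 7, 10, 10, 2, 2, 6, 7, 6,
     4, 4, 9, 11, 12, 5, 2, 9, 2, 5, 10, 12, 7, 8, 6, 7, 3, 9, 7, 5]

# count how many times each value appears
counts = Counter(y)
for value in sorted(counts):
    print(value, counts[value])
```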
The model is \[\mathbf{y} = \beta\mathbf{1} + \mathbf{e}\]
Which one is the best \(\beta\)?
Mean absolute error \[\mathrm{MAE}(\beta, \mathbf{y})=\frac{1}{n}\sum_i |y_i-\beta|\]
Mean square error \[\mathrm{MSE}(\beta, \mathbf{y})=\frac{1}{n}\sum_i (y_i-\beta)^2\]
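Both error measures are one-liners in Python; here is a minimal sketch (using numpy, with the data rebuilt from the frequency table above) that evaluates them for any candidate \(\beta\):

```python
import numpy as np

# the 100 observations, rebuilt from the frequency table (in sorted order)
y = np.repeat(np.arange(2, 13), [11, 8, 7, 11, 14, 17, 10, 6, 7, 6, 3])

def mae(beta, y):
    """Mean absolute error of representing y by the single number beta."""
    return np.mean(np.abs(y - beta))

def mse(beta, y):
    """Mean square error of representing y by the single number beta."""
    return np.mean((y - beta) ** 2)

print(mae(6, y), mse(6, y))   # errors of the candidate beta = 6
print(mae(7, y), mse(7, y))   # errors of the candidate beta = 7
```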
\[\beta^* = 6\]
Show that the median of \(\mathbf{y}\) is a minimizer of the mean absolute error
\[\beta^* = 6.38\]
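A quick numerical check (a sketch that scans a grid of candidate values) confirms that the two error measures pick different representatives: the median for the MAE and the mean for the MSE:

```python
import numpy as np

y = np.repeat(np.arange(2, 13), [11, 8, 7, 11, 14, 17, 10, 6, 7, 6, 3])

# scan a fine grid of candidate betas and keep the best one for each error
betas = np.linspace(0, 14, 2801)
maes = np.array([np.mean(np.abs(y - b)) for b in betas])
mses = np.array([np.mean((y - b) ** 2) for b in betas])

print(betas[np.argmin(maes)], np.median(y))  # both give 6
print(betas[np.argmin(mses)], np.mean(y))    # both give (approximately) 6.38
```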
The error is \[\mathrm{MSE}(\beta, \mathbf{y})=\frac{1}{n}\sum_i (y_i-\beta)^2\]
To find the minimal value we can take the derivative of \(\mathrm{MSE}\) with respect to \(\beta\)
\[\frac{d}{d\beta} \mathrm{MSE}(\beta, \mathbf{y})= -\frac{2}{n}\sum_i (y_i - \beta)\]
The minima of a smooth function are located where its derivative is zero
Now we find the value of \(\beta\) that makes the derivative equal to zero.
\[\frac{d}{d\beta} \mathrm{MSE}(\beta, \mathbf{y})= -\frac{2}{n}\sum_i (y_i - \beta)\]
Setting this last formula equal to zero, that is \(\sum_i (y_i-\beta)=0\), and solving for \(\beta\), we find that the best one is
\[\beta^* = \frac{1}{n} \sum_i y_i = \bar{\mathbf{y}}\]
If \(\bar{\mathbf{y}}\) is the best representative, the error is
\[\mathrm{MSE}(\bar{\mathbf{y}}, \mathbf{y})=\frac{1}{n}\sum_i (y_i-\bar{\mathbf{y}})^2\]
This is sometimes called the variance of the sample. We then write
\[\mathrm{S}_n(\mathbf{y})=\frac{1}{n}\sum_i (y_i-\bar{\mathbf{y}})^2\]
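In Python this is exactly what `numpy.var` computes with its default \(1/n\) normalization; a short check (illustrative only, using the data rebuilt from the frequency table):

```python
import numpy as np

y = np.repeat(np.arange(2, 13), [11, 8, 7, 11, 14, 17, 10, 6, 7, 6, 3])

y_bar = np.mean(y)                  # the best representative under MSE
s_n = np.mean((y - y_bar) ** 2)     # MSE at the mean = variance of the sample

print(y_bar, s_n)                   # 6.38 and S_n(y)
print(np.isclose(s_n, np.var(y)))   # numpy's var uses the same 1/n formula
```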
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12 and 12
Each \(y_i\) appears \(N(y_i)\) times, given by the table
\(y\) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
\(N(y)\) | 11 | 8 | 7 | 11 | 14 | 17 | 10 | 6 | 7 | 6 | 3 |
With \(N(y_i)\) given by the table, we have
\[\bar{\mathbf{y}} = \frac{1}{n} \sum_i y_i = \sum_y y\cdot \frac{N(y)}{n} = \sum_y y \cdot p(y)\] \[\mathrm{MSE}(\bar{\mathbf{y}}, \mathbf{y})=\frac{1}{n}\sum_i (y_i-\bar{\mathbf{y}})^2 =\sum_y (y-\bar{\mathbf{y}})^2\cdot p(y)\]
The empirical frequency \(p(y)=N(y)/n\) contains all the information of \(\mathbf{y}\)
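As a small illustration, the mean and the mean square error can indeed be computed from \(p(y)\) alone, without ever touching the individual observations:

```python
import numpy as np

values = np.arange(2, 13)                               # the distinct values y
counts = np.array([11, 8, 7, 11, 14, 17, 10, 6, 7, 6, 3])
p = counts / counts.sum()                               # empirical frequency p(y)

y_bar = np.sum(values * p)                              # mean from p(y) alone
mse_at_mean = np.sum((values - y_bar) ** 2 * p)         # variance from p(y) alone

print(y_bar, mse_at_mean)                               # 6.38 and S_n(y)
```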
Now we have a second vector \(\mathbf{x}\)
The new model is \[{y}_i = \beta_0 + \beta_1{x}_i + {e}_i\] for \(i=1,\ldots,n\). All these equations can be written in one as \[\mathbf{y} = \beta_0\mathbf{1} + \beta_1\mathbf{x} + \mathbf{e}\]
Now we want to minimize \[\mathrm{MSE}(\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix}, \mathbf{y}, \mathbf{x}) = \frac{1}{n}\sum_i (y_i-\beta_0 - \beta_1{x}_i)^2\] which can also be written as \[\frac{1}{n}\sum_i e_i^2\] Indeed, we are minimizing the square of errors (like before)
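The new error measure translates directly into code. In the sketch below the vector \(\mathbf{x}\) is made up (none is given in the text), so the numbers are purely illustrative:

```python
import numpy as np

def mse2(beta0, beta1, y, x):
    """Mean square error of the model y_i = beta0 + beta1 * x_i + e_i."""
    e = y - beta0 - beta1 * x            # the residuals e_i
    return np.mean(e ** 2)

# illustrative data: x is NOT part of the original example
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=100)

print(mse2(1.0, 0.5, y, x))              # error of one candidate pair
print(mse2(0.0, 0.0, y, x))              # a worse candidate pair
```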
Since ancient times (Pythagoras) it is known that \(a^2+b^2=c^2\) for the sides of a right triangle
Using this idea we see that \[\sum_i e_i^2 = \mathrm{length}(\mathbf{e})^2\]
So we want to minimize the length of the vector \(\mathbf{e}\).
We want to find the “good” \(\beta_0\) and \(\beta_1\) that minimize the length of \(\mathbf{e}\)
Ancient knowledge again:
The vector \(\mathbf{e}\) is perpendicular to \(\mathbf{x}\) if and only if
\[\mathbf{x}^T\mathbf{e}=0\]
In the same way, \(\mathbf{e}\) is perpendicular to \(\mathbf{1}\) if \[\mathbf{1}^T\mathbf{e}=0\]
We can see the big picture if we use matrices: \[\begin{pmatrix}1 & x_1\\ \vdots & \vdots \\1 & x_n\end{pmatrix}= \begin{pmatrix}\mathbf{1} & \mathbf{x}\end{pmatrix}=\mathbf{A}\] \[\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix}=\mathbf{b}\] then the smallest \(\mathbf{e}\) obeys \[ \mathbf{A}^T \mathbf{e} = 0\]
The model was \[\mathbf{y} = \mathbf{Ab} + \mathbf{e}\] so the error is \[\mathbf{e} = \mathbf{y} - \mathbf{Ab}\] Multiplying by \(\mathbf{A}^T\) we have \[\mathbf{A}^T \mathbf{e} = \mathbf{A}^T \mathbf{y} - \mathbf{A}^T \mathbf{Ab}\]
To have \(\mathbf{A}^T \mathbf{e} = 0\) we need to make \[\mathbf{A}^T \mathbf{y} = \mathbf{A}^T \mathbf{Ab}^*\] We write \(\mathbf{b}^*\) because these are the “good” \(\beta_0^*\) and \(\beta_1^*\)
Now, if \(\mathbf{A}^T \mathbf{A}\) is “well behaved”, \[\mathbf{b}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}\]
Replacing \(\mathbf{b}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}\) in the formula of error \[\mathbf{e} = \mathbf{y} - \mathbf{Ab}\] we have \[\mathbf{e}^* = (\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T )\mathbf{y}\] (no surprise, simple substitution)
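Putting the pieces together, a minimal numpy sketch (again with a made-up \(\mathbf{x}\)) builds \(\mathbf{A}\), solves the normal equations \(\mathbf{A}^T \mathbf{Ab}^* = \mathbf{A}^T \mathbf{y}\), and checks that the resulting \(\mathbf{e}^*\) is perpendicular to the columns of \(\mathbf{A}\):

```python
import numpy as np

# illustrative data: x is not given in the text
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=100)

A = np.column_stack([np.ones_like(x), x])   # A = (1  x)

# solve A^T A b* = A^T y (solving is numerically safer than forming the inverse)
b_star = np.linalg.solve(A.T @ A, A.T @ y)

e_star = y - A @ b_star                     # e* = y - A b*
print(b_star)                               # beta0*, beta1*
print(A.T @ e_star)                         # essentially 0: e* is perpendicular to 1 and x
```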
What happens to the mean square error \(\mathrm{MSE}(\mathbf{b}^*,\mathbf{y}, \mathbf{x})=\frac{1}{n}\sum_i e_i^2\)?
\[\begin{aligned} \mathrm{MSE}(\mathbf{b}^*,\mathbf{y}, \mathbf{x})&=\frac{1}{n}\sum_i e_i^2=\frac{1}{n}\mathbf{e}^T\mathbf{e}\\ &=\frac{1}{n}\mathbf{y}^T (\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T)^T(\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T) \mathbf{y}\\ &=\frac{1}{n}\mathbf{y}^T (\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T) \mathbf{y}\end{aligned}\] (do the algebra and see that many things vanish)
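The same cancellation can be checked numerically: the sketch below (with the same made-up data as before) compares \(\frac{1}{n}\sum_i e_i^2\) with \(\frac{1}{n}\mathbf{y}^T (\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T) \mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)             # illustrative x, as before
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=100)

A = np.column_stack([np.ones_like(x), x])
P = A @ np.linalg.inv(A.T @ A) @ A.T         # the projection A (A^T A)^-1 A^T
Id = np.eye(len(y))

mse_direct = np.mean((y - P @ y) ** 2)       # (1/n) sum_i e_i^2
mse_formula = (y @ (Id - P) @ y) / len(y)    # (1/n) y^T (I - P) y

print(np.isclose(mse_direct, mse_formula))   # True
```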
So the Mean Square Error depends on \(\mathbf{y}\) and \(\mathbf{A}\), which depends on \(\mathbf{x}\). Choose them carefully
The whole argument remains valid when \(\mathbf{A}\) has any number of columns
If \(\mathbf{A} =\mathbf{1}\) (no independent variable), then \[\begin{aligned} \mathbf{b}^* &= (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}\\ \mathbf{b}^* &= (\mathbf{1}^T \mathbf{1})^{-1}\mathbf{1}^T \mathbf{y}\\ \mathbf{b}^* &=\frac{1}{n}\sum_i{y}_i = \bar{\mathbf{y}} \end{aligned}\] just as before
\[\begin{aligned} \frac{1}{n}\mathbf{e}^T\mathbf{e} &= \frac{1}{n}\mathbf{y}^T (\mathbf{I} - \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T) \mathbf{y}\\ &= \frac{1}{n}\mathbf{y}^T (\mathbf{I} - \mathbf{1}(\mathbf{1}^T \mathbf{1})^{-1}\mathbf{1}^T) \mathbf{y}\\ &= \frac{1}{n}\mathbf{y}^T \mathbf{y} - \frac{1}{n}\mathbf{y}^T\mathbf{1}(n)^{-1}\mathbf{1}^T \mathbf{y}\\ &=\frac{1}{n}\sum_i{y}_i^2 - \left(\frac{1}{n}\sum_i{y}_i\right)^2 = \mathrm{S}_n(\mathbf{y}) \end{aligned}\] just as before
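This special case is easy to reproduce: fitting with \(\mathbf{A}=\mathbf{1}\) returns the mean, and the resulting error matches \(\frac{1}{n}\sum_i y_i^2 - (\frac{1}{n}\sum_i y_i)^2\). A short sketch using the original data:

```python
import numpy as np

y = np.repeat(np.arange(2, 13), [11, 8, 7, 11, 14, 17, 10, 6, 7, 6, 3]).astype(float)
n = len(y)

A = np.ones((n, 1))                              # A = 1: no independent variable
b_star = np.linalg.solve(A.T @ A, A.T @ y)       # should equal the mean
e = y - A @ b_star

print(b_star[0], y.mean())                       # both 6.38
print(np.mean(e ** 2), np.mean(y ** 2) - y.mean() ** 2)  # the same number: S_n(y)
```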
The only condition to have a solution is that the matrix \(\mathbf{A}^T \mathbf{A}\) has an inverse. This is equivalent to
All columns of \(\mathbf{A}\) are linearly independent
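A concrete way to see this condition (an illustrative sketch): when one column of \(\mathbf{A}\) is a multiple of another, the rank drops and \(\mathbf{A}^T \mathbf{A}\) no longer has an inverse:

```python
import numpy as np

x = np.arange(10, dtype=float)

A_good = np.column_stack([np.ones(10), x])                 # columns 1 and x: independent
A_bad = np.column_stack([np.ones(10), 3 * np.ones(10)])    # second column = 3 * first

print(np.linalg.matrix_rank(A_good.T @ A_good))   # 2: A^T A is invertible
print(np.linalg.matrix_rank(A_bad.T @ A_bad))     # 1: A^T A is singular, no unique b*
```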