Assume we have a vector of \(n\) values \[𝐲=\{y_1, y_2, …, y_n \}\] If we want to describe the set \(𝐲\) with a single number \(x\), which would it be?
If we have to replace each one of the \(y_i\) with a single number, which number is “the best”?
Better to choose the one that is the “least wrong”
How can \(x\) be wrong?
There are many ways to measure the error, and maybe others we have not considered
Today we will use the square of the error
The squared error when \(x\) represents \(𝐲\) is \[\mathrm{SE}(x)=\sum_i (y_i-x)^2\] Which \(x\) minimizes the squared error?
We can write \[\begin{aligned} \mathrm{SE}(x)&=\sum_i (y_i-x)^2 =\sum_i (y_i^2 - 2y_ix + x^2)\\ &=\sum_i y_i^2 - \sum_i 2 y_ix + \sum_i x^2\\ &=\sum_i y_i^2 - x\sum_i 2 y_i + n x^2\\ \end{aligned}\]
This is a second-degree polynomial in \(x\), so its graph is a parabola
We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-\sum_i 2 y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] which has the form \(ax^2+ bx + c\)
Let’s explore it in GeoGebra
When we have \(ax^2+ bx + c =0\) then the two roots are \[\begin{aligned} x_1 &= \frac{-b-\sqrt{b^2-4ac} }{2a}\\ x_2 &= \frac{-b+\sqrt{b^2-4ac} }{2a} \end{aligned}\] and their midpoint is \[\frac{x_1 + x_2}{2} = \frac{-b}{2a}\]
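For example, with the made-up quadratic \(x^2-5x+6=0\) (so \(a=1\), \(b=-5\), \(c=6\)): \[x_1=2,\qquad x_2=3,\qquad \frac{x_1+x_2}{2}=\frac{5}{2}=\frac{-b}{2a}\]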
We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-\sum_i 2 y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] so the vertex of the parabola, where \(\mathrm{SE}\) is smallest, is at \[\frac{-b}{2a}=\frac{\sum_i 2 y_i}{2n}=\frac{\sum_i y_i}{n}\]
We get the minimum squared error when \(x\) is the mean
The arithmetic mean of \(𝐲\) is \[\text{mean}(𝐲) = \frac{1}{n}\sum_{i=1}^n y_i\] where \(n\) is the size of the set \(𝐲\).
Sometimes it is written as \(\bar{𝐲}\)
This value is usually called the mean, and sometimes the average
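As a quick numerical check, here is a small Python sketch (with made-up numbers; `squared_error` is a helper defined just for this example) that evaluates \(\mathrm{SE}(x)\) on a grid and locates its minimum:

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])      # made-up data

def squared_error(x, y):
    """SE(x) = sum of (y_i - x)^2 when x represents y."""
    return np.sum((y - x) ** 2)

# Evaluate SE on a fine grid of candidate values for x
grid = np.linspace(y.min(), y.max(), 10001)
errors = np.array([squared_error(x, y) for x in grid])

print("mean of y      :", y.mean())                  # 5.6
print("grid minimizer :", grid[np.argmin(errors)])   # approximately 5.6
```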
A function is a rule that takes a number and gives another number
In this case \(\mathrm{SE}(β)\) takes \(β\) and returns the squared error
The red and blue lines correspond to equations of the form \[y=ax+b\] where \(a\) is the slope and \(b\) is the intercept
This is called the equation of a straight line, or a linear equation
For any value \(β\) we can find the slope of \(\mathrm{SE}\) at position \(β\)
This is called the derivative of \(\mathrm{SE}\)
In general, we can use Wolfram Alpha (https://www.wolframalpha.com/)
We focus on the idea, not on the technique
To find the value of \(β\) that minimizes \(\mathrm{SE}(β)\) we follow two steps:
Calculate the derivative of \(\mathrm{SE}(β)\), written as \[\frac{d\mathrm{SE}}{dβ}(β)\]
Find \(β\) such that the derivative is zero. That is, solve \[\frac{d\mathrm{SE}}{dβ}(β)=0\]
We have \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\). The derivative is \[\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= 2nβ - 2\sum_i y_i\] Then we need to find \(β\) such that \[2nβ - 2\sum_i y_i = 0\]
The equation we want to solve is \[2nβ - 2\sum_i y_i = 0\]
The smallest squared error is obtained when \[β = \frac{1}{n} \sum_i y_i\] that is, when \(β\) is the mean of \(𝐲\)
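As an illustration only, the same two steps can be done symbolically with SymPy (made-up data):

```python
import sympy as sp

y = [2, 3, 5, 7, 11]                         # made-up data
beta = sp.symbols('beta')

SE = sum((yi - beta) ** 2 for yi in y)       # squared error as a symbolic expression
dSE = sp.diff(SE, beta)                      # derivative with respect to beta

print(sp.solve(sp.Eq(dSE, 0), beta))         # [28/5], the mean of y
print(sp.Rational(sum(y), len(y)))           # 28/5 again
```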
\[\begin{aligned} \mathrm{var}(𝐲)&=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i (y_i^2-2\bar{𝐲}y_i+ \bar{𝐲}^2)\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\frac 1 n \sum_i y_i+ \bar{𝐲}^2\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\bar{𝐲}+ \bar{𝐲}^2\frac 1 n n\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}^2+ \bar{𝐲}^2\\ &=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\\ \end{aligned}\]
\[\mathrm{var}(𝐲)=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\]
“The average of the squares minus the square of the average”
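A quick Python check of the shortcut formula on made-up numbers:

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])      # made-up data

var_definition = np.mean((y - y.mean()) ** 2)      # (1/n) sum of (y_i - mean)^2
var_shortcut   = np.mean(y ** 2) - y.mean() ** 2   # mean of squares minus square of mean

print(var_definition, var_shortcut)   # both 10.24
print(np.var(y))                      # NumPy's population variance agrees
```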
\[\begin{aligned} \mathrm{var}(𝐱+𝐲)&=\frac 1 n \sum_i (x_i+ y_i-\bar{𝐱}-\bar{𝐲})^2\\ &=\frac 1 n \sum_i ((x_i-\bar{𝐱})+ (y_i-\bar{𝐲}))^2\\ &=\frac 1 n \sum_i \left((x_i-\bar{𝐱})^2 +(y_i-\bar{𝐲})^2+ 2(x_i-\bar{𝐱})(y_i-\bar{𝐲})\right)\\ &=\frac 1 n \sum_i (x_i-\bar{𝐱})^2 +\frac 1 n \sum_i (y_i-\bar{𝐲})^2+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\\ &=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲}) \end{aligned}\]
The expression \[\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\] is called covariance of \(𝐱\) and \(𝐲\)
We write it as \[\mathrm{cov}(𝐱,𝐲)\]
\[ \mathrm{var}(𝐱+𝐲)=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\mathrm{cov}(𝐱,𝐲) \]
The variance of the sum is the sum of the variances plus twice the covariance
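A small Python sketch checking this identity on made-up data (`var` and `cov` are helpers defined for the example):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # made-up data
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

def var(v):
    return np.mean((v - v.mean()) ** 2)

def cov(u, v):
    return np.mean((u - u.mean()) * (v - v.mean()))

print(var(x + y))                             # 74.56
print(var(x) + var(y) + 2 * cov(x, y))        # 74.56 as well
```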
\[\begin{aligned} \frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})&=\frac 1 n \sum_i (x_i y_i-\bar{𝐱}y_i-x_i\bar{𝐲}+\bar{𝐱}\bar{𝐲})\\ &=\frac 1 n \sum_i x_i y_i-\frac 1 n \sum_i\bar{𝐱}y_i-\frac 1 n \sum_i x_i\bar{𝐲}+\frac 1 n \sum_i\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\frac 1 n \sum_i y_i - \bar{𝐲}\frac 1 n \sum_i x_i + \bar{𝐱}\bar{𝐲}\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}- \bar{𝐱}\bar{𝐲}+\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\\ \end{aligned}\]
\[\mathrm{cov}(𝐱,𝐲)=\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\]
The second formula is easier to calculate
“The average of the products minus the product of the averages”
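Again, a quick Python check of the shortcut formula (same made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # made-up data
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

cov_definition = np.mean((x - x.mean()) * (y - y.mean()))   # average of products of deviations
cov_shortcut   = np.mean(x * y) - x.mean() * y.mean()       # average of products minus product of averages

print(cov_definition, cov_shortcut)   # both 17.28
```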
If \(𝐱\) and \(𝐲\) go in the same direction,
then the covariance is positive
If \(𝐱\) and \(𝐲\) go in opposite directions,
then the covariance is negative
It is easy to see that, for any constants \(a\) and \(b\), we have \[\begin{aligned} \mathrm{cov}(a\, 𝐱,𝐲)&=a\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(𝐱, b\,𝐲)&=b\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(a\, 𝐱, b\,𝐲)&=ab\, \mathrm{cov}(𝐱,𝐲)\\ \end{aligned}\] It would be nice to have a “covariance” value that is independent of the scale
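A minimal Python check of the scaling property (made-up data and constants):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # made-up data
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
a, b = 3.0, -2.0                               # arbitrary constants

def cov(u, v):
    return np.mean((u - u.mean()) * (v - v.mean()))

print(cov(a * x, b * y))      # -103.68
print(a * b * cov(x, y))      # -103.68 as well
```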
One way to be independent of the scale is to use \[\mathrm{corr}(𝐱,𝐲)=\frac{\mathrm{cov}(𝐱,𝐲)}{\mathrm{sdev}(𝐱)\,\mathrm{sdev}(𝐲)}\] where \(\mathrm{sdev}(𝐱)=\sqrt{\mathrm{var}(𝐱)}\) is the standard deviation. This is the correlation between \(𝐱\) and \(𝐲\)
It is always a value between \(-1\) and \(1\)
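A small Python sketch (made-up data) computing the correlation from this formula and comparing it with NumPy's built-in:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # made-up data
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
sdev_x = np.sqrt(np.mean((x - x.mean()) ** 2))
sdev_y = np.sqrt(np.mean((y - y.mean()) ** 2))

print(cov_xy / (sdev_x * sdev_y))   # about 0.99, between -1 and 1
print(np.corrcoef(x, y)[0, 1])      # NumPy agrees
```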
\[SE(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2\] This time we need two derivatives \[\begin{aligned} \frac{d}{dβ_0} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned}\] Each one must be equal to 0
The first equation to solve is \(\frac{d}{dβ_0} SE(β_0, β_1) = 0\)
That is, we look for \(β_0\) such that \[-2\sum_i (y_i - β_0 - β_1 x_i) = 0\] We can divide by \(-2\) and expand the parentheses \[\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\]
If \(\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\) then \[\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i\] Therefore, dividing by \(n\), we have \[\overline{𝐲} =β_0 + β_1 \overline{𝐱}\] In other words, we have \[β_0 = \overline{𝐲} - β_1 \overline{𝐱}\]
We want to solve \(\frac{d}{dβ_1} SE(β_0, β_1) = 0\)
That is, we want to find \(β_1\) such that \[-2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0\] Dividing by \(-2\) and expanding the parentheses we have \[\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0\]
We have \[\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0\] It is convenient to divide everything by \(n\) \[\begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}\]
Since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[\begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}\]
The best \(β_1\) is the solution of \[\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0\] We recognize these formulas from before, so the equation becomes \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\]
If \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\] Then the best \(β_1\) is \[β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\]
The best straight line is
\[y = β_0 + β_1 x\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
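As a sketch of how these formulas can be used in practice (made-up data), compared with NumPy's least-squares polynomial fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # made-up data
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x  = np.mean((x - x.mean()) ** 2)

beta1 = cov_xy / var_x                 # slope
beta0 = y.mean() - beta1 * x.mean()    # intercept

print(beta0, beta1)            # about 2.0 and 0.58
print(np.polyfit(x, y, 1))     # [beta1, beta0], the same line
```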