Measured = Real ⊕ Instrument ⊕ Noise ⊕ Diversity \[y_m(x,i,t) = f(x)⊕I(x,t)⊕v(i)⊕b(x,i)\]
Normalized = Real ⊕ Noise ⊕ Diversity \[y(x,i) = f(x)⊕v(i)⊕b(x,i)\]
\[w_N(h,s,i) = w(h,s)⊕v(i)⊕b(h,s,i)\]
\[\underbrace{y_i}_{w_N(h,s,i)} = \underbrace{β_0 + β_1 x_i + β_2 s_i}_{w(h,s)} + \underbrace{e_i}_{v(i)⊕b(h,s,i)}\]
(To keep the notation general, we use \(x_i\) to represent height)
This way the formulas are not tied only to weight and height
The \(e_i\) values will be important later
They are called “errors” or “residuals”
Since there is no systematic error, we must have \[\sum_i e_i=0\] (otherwise we could reduce the error by adjusting \(\beta_0\) or \(\beta_1\))
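The slides use R, but the zero-sum property of least-squares residuals is easy to check numerically. A minimal numpy sketch, with made-up height/weight numbers (not the survey data):

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), for illustration only
x = np.array([162.0, 168.0, 170.0, 174.0, 179.0, 181.0])
y = np.array([55.0, 58.0, 68.0, 72.0, 67.0, 81.0])

# Least-squares fit y = b0 + b1*x, slope = cov(x,y)/var(x)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# The residuals sum to zero (up to floating-point error)
e = y - (b0 + b1 * x)
print(abs(e.sum()) < 1e-9)   # True
```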
| | sex | birth_date | height (cm) | weight (kg) | handedness | hand_span (cm) |
|---|---|---|---|---|---|---|
| 1 | Male | 1993-02-01 | 179 | 67 | Right | 15 |
| 2 | Female | 1998-05-21 | 168 | 55 | Right | 14 |
| 4 | Male | 1998-08-29 | 170 | 74 | Right | 25 |
| 5 | Female | 1998-05-03 | 162 | 68 | Right | 13 |
| 6 | Female | 1995-10-09 | 167 | 58 | Right | 18 |
| 7 | Female | 1997-09-19 | 174 | 72 | Right | 16 |
| 8 | Male | 1997-11-27 | 180 | 68 | Right | 19 |
| 9 | Female | 1999-01-02 | 162 | 58 | Right | 19 |
| 10 | Female | 1998-10-02 | 172 | 55 | Right | 20 |
| 11 | Male | 1997-05-18 | 181 | 81 | Right | 20 |
This is the simplest model for the set \(𝐲\)
\[y_i = β_0 + e_i\]
We look for a constant \(β_0\) representing all values
The error made by this approximation is \(e_i\)
The best \(β_0\) turns out to be the mean of the data
\[β_0=\text{mean}(𝐲)=\overline{𝐲}=\frac{1}{n}\sum_i y_i\]
[1] 66.36869
This is the mean square error of the model \[MSE_0=\text{var}(𝐲) =\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\]
[1] 170.2504
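The mean really is the best constant: any other value gives a larger mean square error. A quick numpy check with made-up weights (not the survey values):

```python
import numpy as np

# Made-up weights (kg); not the actual survey values
y = np.array([55.0, 58.0, 67.0, 68.0, 74.0, 81.0])

b0 = y.mean()                    # the constant model y_i = b0 + e_i
mse0 = np.mean((y - b0) ** 2)    # mean square error = var(y)

# Shifting the constant in either direction only increases the error
worse_lo = np.mean((y - (b0 - 1.0)) ** 2)
worse_hi = np.mean((y - (b0 + 1.0)) ** 2)
print(mse0 < worse_lo and mse0 < worse_hi)   # True
```

(In fact, using \(β_0+c\) instead of \(β_0=\overline{𝐲}\) increases the error by exactly \(c^2\).)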
It is easy to see that taller people tend to be heavier
More bone is needed to be taller, and more muscle to move it
We will model using this expression \[y_i = β_0 + β_1 x_i + e_i\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
(Intercept) height
-84.007888 0.882157
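The closed-form expressions for \(β_0\) and \(β_1\) agree with what a least-squares routine produces. A numpy sketch with illustrative (height, weight) pairs, not the survey data:

```python
import numpy as np

# Illustrative (height cm, weight kg) pairs, not the actual survey
x = np.array([162.0, 167.0, 170.0, 174.0, 180.0, 181.0])
y = np.array([55.0, 58.0, 74.0, 72.0, 68.0, 81.0])

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # cov(x,y)/var(x)
b0 = y.mean() - b1 * x.mean()                    # mean(y) - b1*mean(x)

# Same numbers as numpy's least-squares line fit
b1_pf, b0_pf = np.polyfit(x, y, 1)
print(np.allclose([b1, b0], [b1_pf, b0_pf]))     # True
```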
This time instead of \[MSE_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\] we have \[MSE_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]
Since \(\text{cov}^2(𝐱,𝐲)/\text{var}(𝐱)\geq 0\), the mean square error of model 1 is never larger than that of model 0
The error in the new model is smaller, but by how much? \[\frac{MSE_0-MSE_1}{MSE_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model
The name of this number is \(R^2\)
The Pearson correlation coefficient between two variables is \[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sd}(𝐱)\text{sd}(𝐲)}\] so we have in this case that \[R^2 = r^2\] This is valid for linear models with a single independent variable. It will not be valid for larger models
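The identity \(R^2=r^2\) for a single predictor can be verified directly. A numpy sketch with the same made-up pairs as above:

```python
import numpy as np

# Illustrative (height cm, weight kg) pairs, not the actual survey
x = np.array([162.0, 167.0, 170.0, 174.0, 180.0, 181.0])
y = np.array([55.0, 58.0, 74.0, 72.0, 68.0, 81.0])

mse0 = np.var(y)
mse1 = np.var(y) - np.cov(x, y, bias=True)[0, 1] ** 2 / np.var(x)
r2 = (mse0 - mse1) / mse0          # fraction of variance explained

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
print(np.isclose(r2, r ** 2))      # True
```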
Our results include a random component (the \(e_i\))
Thus, the values of \(β_0\) and \(β_1\) are also random
But they are not far away from the “real” ones
More details in the next class
We can compute a confidence interval for the coefficients, for example at 95% or 99% (below, 99%)
0.5 % 99.5 %
(Intercept) -132.7312228 -35.284553
height 0.5967649 1.167549
If the confidence interval does not contain 0, then the real value is not 0
“The evidence shows that the coefficient is not 0”
If the confidence interval contains 0,
then the real value may be 0
“We do not have evidence that the coefficient is ≠0”
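Such an interval is built from the summary table as estimate ± t·(std. error). A sketch using the slope estimate and standard error reported for `weight ~ height`, with the t quantile taken from a table (≈2.6275 for 97 degrees of freedom at 0.995):

```python
# Values from the lm() summary of weight ~ height
est, se, df = 0.8822, 0.1086, 97

# 99% interval: estimate ± t_{0.995, df} * std.error
t_995 = 2.6275           # approximate t quantile, from a t table
lo, hi = est - t_995 * se, est + t_995 * se

print(round(lo, 3), round(hi, 3))   # close to the confint() output
print(lo > 0)                       # the interval excludes 0
```

The result matches the `(0.5968, 1.1675)` interval shown above up to rounding, and since it excludes 0 we conclude the slope is nonzero.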
Call:
lm(formula = weight ~ height, data = survey)
Residuals:
Min 1Q Median 3Q Max
-18.016 -7.225 0.216 7.396 35.630
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -84.0079 18.5438 -4.530 1.68e-05 ***
height 0.8822 0.1086 8.122 1.48e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.17 on 97 degrees of freedom
Multiple R-squared: 0.4048, Adjusted R-squared: 0.3986
F-statistic: 65.96 on 1 and 97 DF, p-value: 1.479e-12
The estimates of \(β_0\) and \(β_1\), divided by their standard errors, follow a Student’s t distribution
(We will show that later)
Thus, we can use a t-test to see if they are not zero
In this case we let the computer do it for us
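Even so, the t value column is nothing mysterious: it is just the estimate divided by its standard error. A sketch using the numbers from the summary above:

```python
# Coefficient (estimate, std. error) pairs from the weight ~ height summary
coefs = {
    "(Intercept)": (-84.0079, 18.5438),
    "height": (0.8822, 0.1086),
}

# t value = estimate / std. error
for name, (est, se) in coefs.items():
    print(name, round(est / se, 2))   # ≈ -4.53 and 8.12, as in the table
```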
Sex is a factor, so we get a box plot
We still have \(n\) individuals, but now we can split them into two groups
The groups are called “female” and “male”
There are \(n_0\) females and \(n_1\) males, so \(n=n_0+n_1.\)
The easiest way is to use a number \[s_i=\begin{cases}0\quad\text{if person }i\text{ is female}\\ 1\quad\text{if person }i\text{ is male}\end{cases}\]
It is easy to see that \(\sum_i s_i = n_1\)
We have now \[y_i = β_0 + β_1 s_i + e_i\] So for all the female individuals, we have \[y_i = β_0 + e_i\] Taking averages (the errors average to zero) we have \[\frac{1}{n_0}\sum_{i\text{ female}}y_i = β_0\] so \(β_0\) is the average weight of the females \[β_0=\text{mean}(𝐲\text{ female})\]
So for all the male individuals, we have \[y_i = β_0 + β_1 + e_i\] Taking averages we have \[\frac{1}{n_1}\sum_{i\text{ male}}y_i = β_0+ β_1\] In other words \[β_0 + β_1=\text{mean}(𝐲\text{ male})\]
We have \[β_0=\text{mean}(𝐲\text{ female})\] and \[β_0 + β_1=\text{mean}(𝐲\text{ male})\] Replacing the value of \(β_0\) we have \[β_1=\text{mean}(𝐲\text{ male})-\text{mean}(𝐲\text{ female})\]
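The same least-squares formulas as before, applied to a 0/1 dummy, recover exactly the two group means. A numpy sketch with made-up weights (not the survey data):

```python
import numpy as np

# Made-up weights (kg); s = 0 for female, 1 for male
y = np.array([55.0, 58.0, 62.0, 68.0, 74.0, 81.0])
s = np.array([0, 0, 0, 1, 1, 1])

# Usual least-squares formulas, with the dummy s as the predictor
b1 = np.cov(s, y, bias=True)[0, 1] / np.var(s)
b0 = y.mean() - b1 * s.mean()

print(np.isclose(b0, y[s == 0].mean()))        # b0 = female mean
print(np.isclose(b0 + b1, y[s == 1].mean()))   # b0 + b1 = male mean
```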
(Intercept) sexMale
59.57937 18.67063
0.5 % 99.5 %
(Intercept) 56.41406 62.74467
sexMale 13.42158 23.91969
(Intercept)
is the average female weight
sexMale
is the average extra weight for males
Call:
lm(formula = weight ~ sex, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.250 -5.579 -1.579 5.421 27.750
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.579 1.205 49.456 < 2e-16 ***
sexMale 18.671 1.998 9.346 3.47e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.562 on 97 degrees of freedom
Multiple R-squared: 0.4738, Adjusted R-squared: 0.4684
F-statistic: 87.34 on 1 and 97 DF, p-value: 3.47e-15
(Intercept) handednessRight
67.0000000 -0.7022472
0.5 % 99.5 %
(Intercept) 56.04894 77.95106
handednessRight -12.25216 10.84767
What are the coefficients?
What can we conclude here?
Call:
lm(formula = weight ~ handedness, data = survey)
Residuals:
Min 1Q Median 3Q Max
-23.798 -10.649 -1.298 8.351 39.702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.0000 4.1679 16.07 <2e-16 ***
handednessRight -0.7022 4.3958 -0.16 0.873
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.18 on 97 degrees of freedom
Multiple R-squared: 0.000263, Adjusted R-squared: -0.01004
F-statistic: 0.02552 on 1 and 97 DF, p-value: 0.8734