If you have to describe the vector v with a single number x, which would it be?
If we have to replace each one of v[i] by a single number, which number is “the best”?
Better choose the one that is the “least wrong”
How can x be wrong?
November 8th, 2016
Many alternatives
sum(x!=v)
sum(abs(v-x))
sum((v-x)^2)
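To make the three alternatives concrete, here is a minimal sketch that evaluates each of them for one guess. The toy vector and the guess x = 4 are assumptions chosen only for illustration; they are not data from the original slides.

```r
# Hypothetical toy data, chosen only for illustration
v <- c(1, 3, 4, 8, 20)
x <- 4

# Number of elements of v that differ from x
mismatches <- sum(x != v)

# Sum of the absolute differences between v and x
abs_error <- sum(abs(v - x))

# Sum of the squared differences between v and x
sq_error <- sum((v - x)^2)

c(mismatches = mismatches, abs_error = abs_error, sq_error = sq_error)
```

Each measure penalizes the same guess differently: the squared error is dominated by the single large value 20, while the mismatch count ignores how far off x is.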
Absolute error when \(x\) represents \(\mathbf v\) \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\] or, in R code
sum(abs(v-x))
Which \(x\) minimizes absolute error?
For the rivers data, we get the minimum absolute error when \(x=425\)
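One way to check this value is a brute-force search. The sketch below assumes that v is the built-in rivers data summarized later in this document, and it only tries integer candidates; the helper name abs_error is made up for this example.

```r
# Assumption: v is the built-in rivers dataset that ships with R
v <- rivers

# Absolute error of a single candidate x (hypothetical helper name)
abs_error <- function(x, v) sum(abs(v - x))

# Try every integer between the smallest and the largest value of v
candidates <- seq(min(v), max(v))
errors <- sapply(candidates, abs_error, v = v)

# Candidate with the smallest absolute error
best_x <- candidates[which.min(errors)]
best_x

# Compare with the median
median(v)
```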
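If x is the median of v, then
half of the values in v are smaller than x
half of the values in v are bigger than x
Moving x away from the median brings it closer to fewer than half of the values and farther from more than half of them, so the total error grows.
The median minimizes the absolute error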
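The argument can be tried on a small case. This is only a sketch with a made-up vector of odd length, so that the median is one of its elements; all variable names are hypothetical.

```r
# Hypothetical toy data with an odd number of elements
v <- c(1, 3, 4, 8, 20)
x <- median(v)

# How many values lie on each side of the median
n_smaller <- sum(v < x)
n_bigger <- sum(v > x)

# Absolute error at the median and one unit to either side
ae_at_median <- sum(abs(v - x))
ae_left <- sum(abs(v - (x - 1)))
ae_right <- sum(abs(v - (x + 1)))

c(n_smaller = n_smaller, n_bigger = n_bigger)
c(left = ae_left, median = ae_at_median, right = ae_right)
```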
The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] or, in R code
sum((v-x)^2)
Which \(x\) minimizes the squared error?
For the rivers data, we get the minimum squared error when \(x=591.1843972\)
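This value can also be found numerically. The sketch below uses R's one-dimensional minimizer optimize() from the stats package and assumes that v is the built-in rivers data; sq_error is a hypothetical helper name. The result is only accurate up to the tolerance of optimize().

```r
# Assumption: v is the built-in rivers dataset that ships with R
v <- rivers

# Squared error of a single candidate x (hypothetical helper name)
sq_error <- function(x, v) sum((v - x)^2)

# Search for the minimum between the smallest and largest value of v
fit <- optimize(sq_error, interval = range(v), v = v)

fit$minimum   # numerical minimizer
mean(v)       # for comparison
```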
The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]
To find the minimum we can take the derivative of \(\mathrm{SE}\) with respect to \(x\)
\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= 2nx - 2\sum_i v_i\]
The minimum of a differentiable function is located where the derivative is zero
Now we find the value of \(x\) that makes the derivative equal to zero.
\[2nx - 2\sum_i v_i = 0\]
Solving this last formula for \(x\) we find that the best one is
\[x = \frac{1}{n} \sum_i v_i\]
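As a sanity check of this result, the derivative of the squared error with respect to \(x\), which is \(-2\sum_i (v_i - x)\), can be evaluated in R. This is a sketch assuming v is the built-in rivers data; d_se is a hypothetical helper name.

```r
# Assumption: v is the built-in rivers dataset that ships with R
v <- rivers
n <- length(v)

# Derivative of the squared error with respect to x (hypothetical helper)
d_se <- function(x, v) -2 * sum(v - x)

# At x = (1/n) * sum(v) the derivative vanishes, up to rounding error
slope_at_mean <- d_se(sum(v) / n, v)

# One unit below the slope is negative, one unit above it is positive,
# so the error decreases and then increases: this point is a minimum
slope_below <- d_se(sum(v) / n - 1, v)
slope_above <- d_se(sum(v) / n + 1, v)

c(below = slope_below, at_mean = slope_at_mean, above = slope_above)
```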
The mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).
Sometimes it is written as \(\bar{\mathbf v}\)
This value is called the mean
In R we write mean(v)
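A short sketch comparing the formula with the built-in function, assuming v is the built-in rivers data. The variable names are made up for this example.

```r
# Assumption: v is the built-in rivers dataset that ships with R
v <- rivers

# The definition: sum of the values divided by how many there are
by_formula <- sum(v) / length(v)

# The built-in function
by_function <- mean(v)

# Both agree up to floating point rounding
all.equal(by_formula, by_function)
```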
summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  135.0   310.0   425.0   591.2   680.0  3710.0
What are these values?
The easiest to understand are the minimum and the maximum
min(rivers)
[1] 135
max(rivers)
[1] 3710
Sometimes it is useful to get both together
range(rivers)
[1] 135 3710
Quartile comes from the Latin quartus, meaning fourth.
If we sort the values and split them into four subsets of the same size,
what are the limits of these subsets?
\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller
It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
Generalizing, we can ask, for each percentage \(p\), which value in the vector v is greater than \(p\)% of the values.
The function in R for that is called quantile()
By default it gives us the quartiles
quantile(rivers)
  0%  25%  50%  75% 100%
 135  310  425  680 3710
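The claim that \(Q_0\), \(Q_2\) and \(Q_4\) are the minimum, the median and the maximum can be checked directly. This is a small sketch; it relies on the element names "0%", "50%" and "100%" that quantile() attaches to its default result, and the variable names are made up.

```r
# Quartiles of the built-in rivers dataset
q <- quantile(rivers)

# unname() drops the "0%", "50%", ... labels so only the numbers are compared
q0_is_min <- isTRUE(all.equal(unname(q["0%"]), min(rivers)))
q2_is_median <- isTRUE(all.equal(unname(q["50%"]), median(rivers)))
q4_is_max <- isTRUE(all.equal(unname(q["100%"]), max(rivers)))

c(q0_is_min, q2_is_median, q4_is_max)
```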
quantile(rivers, seq(0, 1, by=0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
 135  255  291  330  375  425  505  610  735 1054 3710
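To see what a quantile means in practice, we can count which fraction of the values is at or below it. A minimal sketch for the 90% quantile follows; the variable names are made up, and the fraction is only approximately 0.9 because the data has a finite number of values.

```r
# 90% quantile of the built-in rivers dataset
q90 <- quantile(rivers, 0.9)

# Fraction of the rivers whose length is at or below that value
# (the mean of a logical vector is the proportion of TRUE values)
frac_at_or_below <- mean(rivers <= q90)
frac_at_or_below
```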