survey <- read.table("survey1-tidy.txt")
We can even take a vector from our data
height <- survey$height
So what?
October 18, 2018
We have data, and we want to say something about it
What can we tell about this set of numbers?
How can we make a summary of all the values in a few numbers?
length() gives us the number of elements of a vector
nrow() gives us the number of rows of a data frame
dim() gives us rows and columns
length(height)
[1] 51
nrow(survey)
[1] 51
dim(survey)
[1] 51 8
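The same functions can be tried on a small made-up data frame. This is only a sketch: the name `toy` and all its values are hypothetical and are not part of the survey data.

```r
# Hypothetical data frame standing in for the survey (values are made up)
toy <- data.frame(
  height = c(160, 172, 181, 165),
  gender = c("Female", "Male", "Male", "Female")
)

length(toy$height)  # number of elements in one column (a vector)
nrow(toy)           # number of rows of the data frame
ncol(toy)           # number of columns of the data frame
dim(toy)            # rows and columns together
length(toy)         # careful: for a data frame this is the number of columns
```

Note that length() applied to the whole data frame counts columns, not rows, so nrow() is the safer choice for "how many people".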
table() should really be called count: it counts how many times each value appears. It is good for factors
table(survey$handness)
 Left Right 
    4    47 
table(survey$Gender)
Female   Male 
    30     21 
table(survey$handness, survey$Gender)
        Female Male
  Left       3    1
  Right     27   20
This looks more like a table
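A minimal sketch of one-way and two-way counts, using two made-up vectors (`hand` and `gender` are hypothetical, not the survey columns):

```r
# Hypothetical answers of six people (values are made up)
hand   <- c("Right", "Left", "Right", "Right", "Left", "Right")
gender <- c("Female", "Male", "Male", "Female", "Female", "Male")

one_way <- table(hand)          # counts for one variable
two_way <- table(hand, gender)  # counts for each combination of two variables

one_way
two_way

# Individual counts can be extracted by name
two_way["Left", "Female"]
```

With two arguments, the first one defines the rows and the second one the columns.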
table() is not good with numeric values
table(height)
height
155 157 158 159 160 162 163 164 165 166 167 168 170 
  1   1   2   1   3   3   3   1   3   2   2   1   2 
171 172 173 174 175 176 177 178 179 180 181 182 183 
  1   1   3   3   4   1   1   2   1   2   1   1   1 
184 185 188 195 
  1   1   1   1 
It is not a good summary
What can we say?
We can count TRUE values
How many people are taller than 165 cm?
sum(height > 165)
[1] 33
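This works because a comparison returns a logical vector, and sum() treats each TRUE as 1 and each FALSE as 0. A small sketch with made-up heights (the vector `h` is hypothetical):

```r
# Hypothetical heights in cm (values are made up)
h <- c(158, 163, 166, 171, 175, 182)

taller <- h > 165   # logical vector: one TRUE or FALSE per person
taller

sum(taller)         # how many values are TRUE
mean(taller)        # proportion of values that are TRUE
```

The same idea gives proportions for free: mean() of a logical vector is the fraction of TRUE values.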
If you have to describe the vector v with a single number x, which would it be?
If we have to replace every v[i] with a single number, which number is “the best”?
Better to choose the one that is the “least wrong”
How can x be wrong?
There are many alternatives to measure the error
One is to count how many times x != v[i]
Absolute error when \(x\) represents \(\mathbf v\) \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\]
Which \(x\) minimizes absolute error?
We get the minimum absolute error when \(x=171\)
If x is the median of v, then half of the values in v are smaller than x and half of the values in v are bigger than x
The median minimizes the absolute error
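This can be checked numerically. The sketch below uses a small made-up vector `v` (hypothetical values, not the survey heights) and a helper `abs_error()` that is introduced here only for illustration. It tries every integer candidate in a range and keeps the one with the smallest absolute error.

```r
# Hypothetical vector (values are made up)
v <- c(155, 160, 163, 171, 175, 180, 195)

# Absolute error when the single number x represents the vector v
abs_error <- function(x, v) sum(abs(v - x))

# Try every integer candidate between 150 and 200
candidates <- seq(150, 200, by = 1)
errors <- sapply(candidates, abs_error, v = v)

# Candidate with the smallest absolute error
best <- candidates[which.min(errors)]
best
median(v)
```

For this vector the best candidate and median(v) coincide. When the vector has an even number of elements, every value between the two middle elements gives the same minimal error.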
The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] Which \(x\) minimizes the squared error?
We get the minimum squared error when \(x=170.6862745\)
The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]
To find the minimal value we can take the derivative of \(SE\) with respect to \(x\)
\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= -2\sum_i v_i + 2nx\]
A minimum of a function is located where the derivative is zero
Now we find the value of \(x\) that makes the derivative equal to zero.
\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i v_i + 2nx\]
Setting this last formula equal to zero and solving for \(x\), we find that the best value is
\[x = \frac{1}{n} \sum_i v_i\]
The mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).
Sometimes it is written as \(\bar{\mathbf v}\)
This value is called the mean
In R we write mean(v)
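The result of the derivation can be compared with a numerical minimization. This is a sketch on a made-up vector `v` (hypothetical values); `sq_error()` is a helper defined here for illustration, and optimize() is the one-dimensional minimizer from base R.

```r
# Hypothetical vector (values are made up)
v <- c(155, 160, 163, 171, 175, 180, 195)

# Squared error when the single number x represents the vector v
sq_error <- function(x, v) sum((v - x)^2)

# Search numerically for the x that minimizes the squared error
fit <- optimize(sq_error, interval = range(v), v = v)
fit$minimum

# Compare with the formula: the sum of the values divided by their number
sum(v) / length(v)
mean(v)

# The derivative of the squared error, evaluated at the mean, is (numerically) zero
-2 * sum(v - mean(v))
```

The numerical answer only agrees with mean(v) up to the tolerance of optimize(), so the two values should be compared approximately, not with ==.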
summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0 
What are these values?
The easiest to understand are minimum and maximum
min(height)
[1] 155
max(height)
[1] 195
Sometimes it is useful to get both together
range(height)
[1] 155 195
Quart means one fourth in Latin.
If we split the set of values into four subsets of the same size, what are the limits of these subsets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]
It is easy to know \(Q_0, Q_2\) and \(Q_4\)
\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller
It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
Generalizing, for each percentage \(p\) we can ask which value is greater than \(p\)% of the values in the vector v.
The function in R for that is called quantile()
By default it gives us the quartiles
quantile(height)
   0%   25%   50%   75%  100% 
155.0 163.0 171.0 176.5 195.0 
unless we ask for something else
quantile(height, seq(0, 1, by=0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 155  160  162  165  167  171  174  175  178  182  195 
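A short sketch of both uses of quantile() on a made-up vector `v` (hypothetical values, not the survey heights). The second argument is named `probs` and takes proportions between 0 and 1, not percentages.

```r
# Hypothetical vector (values are made up)
v <- c(155, 160, 163, 171, 175, 180, 195)

# Default: the five quartile limits Q0 to Q4
q <- quantile(v)
q

# Q0, Q2 and Q4 are the minimum, the median and the maximum
q[["0%"]]
q[["50%"]]
q[["100%"]]

# Any other proportions can be requested through probs
quantile(v, probs = c(0.1, 0.9))
```

When a requested quantile falls between two elements, quantile() interpolates between them by default, which is why values such as 176.5 can appear even if nobody has that height.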
Now we can read the output of summary(): it shows the quartiles, plus the mean
summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0 
The function cut() splits the range of a numeric vector into intervals and returns a factor that tells which interval each value belongs to:
cut(height, 4)
 [1] (175,185] (165,175] (165,175] (175,185] (155,165]
 [6] (165,175] (165,175] (155,165] (165,175] (185,195]
[11] (175,185] (175,185] (155,165] (155,165] (175,185]
[16] (155,165] (165,175] (155,165] (155,165] (155,165]
[21] (165,175] (155,165] (165,175] (155,165] (155,165]
[26] (155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] (155,165] (165,175] (155,165]
[41] (175,185] (155,165] (175,185] (165,175] (165,175]
[46] (155,165] (155,165] (185,195] (155,165] (175,185]
[51] (175,185]
Levels: (155,165] (165,175] (175,185] (185,195]
Used this way, the range is split into intervals of the same width, not into groups with the same number of people
table(cut(height, 4))
(155,165] (165,175] (175,185] (185,195] 
       18        19        12         2 
These are not the quartiles
We can specify the cut points using quantile()
table(cut(height, quantile(height), include.lowest = TRUE))
[155,163] (163,171] (171,176] (176,195] 
       14        12        12        13 
Now every group has (almost) the same size
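The two ways of cutting can be compared side by side. This sketch uses a made-up vector `v` of twelve distinct values (hypothetical, not the survey heights), chosen so that the quartile groups come out exactly equal.

```r
# Hypothetical vector with 12 distinct values (values are made up)
v <- c(155, 158, 160, 163, 165, 168, 171, 174, 176, 180, 185, 195)

# Four intervals of the same width
equal_width <- cut(v, 4)
table(equal_width)

# Four intervals with (about) the same number of values:
# the quartiles are used as cut points, and include.lowest = TRUE
# keeps the minimum inside the first interval
equal_count <- cut(v, quantile(v), include.lowest = TRUE)
table(equal_count)
```

Without include.lowest = TRUE the first interval is open on the left, so the smallest value would fall outside every interval and become NA. Groups are only "almost" equal in general because repeated values always land in the same interval.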