Telling stories

October 22, 2019

We have our own data

survey <- read.table("survey1-tidy.txt")

We can even take a vector from our data

height <- survey$height

So what?

Descriptive Statistics

We have data, and we want to say something about it

What can we tell about this set of numbers?

How can we make a summary of all the values in a few numbers?

Standard Data Descriptors

Number of elements (How many?)
Location (Where?)
Dispersion (Are they homogeneous? Are they similar to each other?)

How many in total

For vectors we use length()
For matrices and data frames we use nrow()
dim() gives us rows and columns

Counting how many in total

length(height)

[1] 51

nrow(survey)

[1] 51

dim(survey)

[1] 51  8

Counting how many of each

table() could just as well be called count: it counts how many times each value appears. It works well for factors

table(survey$handness)

 Left Right 
    4    47

table(survey$Gender)

Female   Male 
    30     21

We can count combinations

table(survey$handness, survey$Gender)

       
        Female Male
  Left       3    1
  Right     27   20

This looks more like a table

`table()` is not good with numeric values

table(height)

height
155 157 158 159 160 162 163 164 165 166 167 168 170 
  1   1   2   1   3   3   3   1   3   2   2   1   2 
171 172 173 174 175 176 177 178 179 180 181 182 183 
  1   1   3   3   4   1   1   2   1   2   1   1   1 
184 185 188 195 
  1   1   1   1

It is not a good summary

What can we say?

Counting `TRUE` values

How many people are taller than 165 cm?

sum(height > 165)

[1] 33
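This works because the comparison produces a logical vector, and sum() counts each TRUE as 1 and each FALSE as 0. A minimal sketch, using a small made-up vector h instead of the survey data:

```r
# Hypothetical heights in cm, only for illustration (not the survey data)
h <- c(158, 163, 170, 175, 181)

taller <- h > 165          # logical vector, one TRUE/FALSE per person
n_taller <- sum(taller)    # TRUE counts as 1, so this is how many are taller
p_taller <- mean(taller)   # the mean of a logical vector is the proportion of TRUE

print(taller)
print(n_taller)
print(p_taller)
```

The same idea works for any condition that can be written as a comparison.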

Location

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] with a single number, which number is “the best”?

Better to choose the one that is the “least wrong”

How can `x` be wrong?

There are many ways to measure the error

Number of times that x != v[i]
Sum of the absolute values of the errors
Sum of the squares of the errors
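The three measures can be written in one line of R each. This is only a sketch: the vector v and the candidate x are made-up values, not taken from the survey.

```r
# Hypothetical vector and candidate value, only for illustration
v <- c(2, 3, 3, 7, 10)
x <- 3

n_different <- sum(x != v)       # number of times that x differs from v[i]
abs_error   <- sum(abs(v - x))   # sum of the absolute values of the errors
sq_error    <- sum((v - x)^2)    # sum of the squares of the errors

print(c(n_different, abs_error, sq_error))
```

Changing x and running the lines again shows how each measure reacts to a different candidate.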

Absolute error

Absolute error when \(x\) represents \(\mathbf v\) \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\]

Which \(x\) minimizes absolute error?

Median: minimum Absolute Error

We get the minimum absolute error when \(x=171\)

Median

If x is the median of v, then

half of the values in v are smaller than x
half of the values in v are bigger than x

The median minimizes the absolute error
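We can check this claim by brute force. The sketch below rebuilds the heights from the table(height) output shown earlier (so the original order of the people is lost), and tries every whole centimetre as a candidate. The names values, counts, abs_error and candidates are chosen here for illustration.

```r
# Heights rebuilt from the table(height) output; the order is not the original one
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, counts)

# Absolute error when the single number x represents the vector v
abs_error <- function(x, v) sum(abs(v - x))

# Try every whole centimetre between the minimum and the maximum
candidates <- seq(min(height), max(height), by = 1)
errors <- sapply(candidates, abs_error, v = height)

best <- candidates[which.min(errors)]
print(best)             # candidate with the smallest absolute error
print(median(height))   # the median, for comparison
```

Both printed values should be the same number, the 171 quoted above.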

Squared error

The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

Which \(x\) minimizes the squared error?

Mean: minimum Squared error

We get the minimum squared error when \(x=170.6862745\)
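This can also be checked numerically. The sketch below rebuilds the heights from the table(height) output shown earlier (the original order is lost) and lets R's optimize() search for the x with the smallest squared error. The helper sq_error is a name chosen here for illustration.

```r
# Heights rebuilt from the table(height) output; the order is not the original one
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, counts)

# Squared error when the single number x represents the vector v
sq_error <- function(x, v) sum((v - x)^2)

# Numerical search for the minimum between the smallest and largest height
fit <- optimize(sq_error, interval = range(height), v = height)

print(fit$minimum)    # x found by the numerical search
print(mean(height))   # the mean, for comparison
```

The search is approximate, so the two numbers should agree only up to the tolerance used by optimize().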

Median and mean are different

usually
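A small made-up example shows when they agree and when they do not. The vectors v and w below are hypothetical; they differ only in the last value.

```r
# Hypothetical symmetric data: mean and median coincide
v <- c(160, 165, 170, 175, 180)
mean_v <- mean(v)
median_v <- median(v)

# Same data, but the last value is replaced by an extreme one
w <- c(160, 165, 170, 175, 250)
mean_w <- mean(w)       # pulled towards the extreme value
median_w <- median(w)   # the middle value does not change

print(c(mean_v, median_v))
print(c(mean_w, median_w))
```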

Minimizing SE using math

The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

To find the minimum we can take the derivative of \(\mathrm{SE}\) with respect to \(x\)

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= -2\sum_i v_i + 2nx\]

A minimum of a smooth function can only be located where the derivative is zero

Minimizing SE using math

Now we find the value of \(x\) that makes the derivative equal to zero.

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i v_i + 2nx = 0\]

Solving this equation for \(x\), we find that the best value is

\[x = \frac{1}{n} \sum_i v_i\]

The second derivative is \(2n > 0\), so this value is indeed a minimum.

Arithmetic Mean

The mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).

Sometimes it is written as \(\bar{\mathbf v}\)

This value is called the mean

In R we write mean(v)
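The formula can be followed step by step in R. This is a minimal sketch with a made-up vector v; it also checks that the errors around the mean cancel out, which is the condition that came out of the derivative.

```r
# Hypothetical vector, only for illustration
v <- c(2, 3, 3, 7, 10)

n <- length(v)           # n is the length of the vector
by_hand <- sum(v) / n    # (1/n) times the sum of all v[i]

print(by_hand)
print(mean(v))           # the built-in function gives the same value
print(sum(v - by_hand))  # the errors around the mean add up to zero
```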

In summary

summary(height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0

What are these values?

Minimum, Maximum and Range

The easiest to understand are minimum and maximum

min(height)

[1] 155

max(height)

[1] 195

Range: min and max together

Sometimes it is useful to get both at once

range(height)

[1] 155 195

Quartiles

“Quart” comes from the Latin word for “fourth”.

If we split the set of values into four subsets of the same size, what are the limits of these subsets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

Quartiles

\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum

Quartiles and Quantiles

Generalizing, for each percentage \(p\) we can ask which value in the vector v is greater than \(p\)% of the other values.

The function in R for that is called quantile()

By default it gives us the quartiles

quantile(height)

   0%   25%   50%   75%  100% 
155.0 163.0 171.0 176.5 195.0
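The first, middle and last of these numbers are \(Q_0\), \(Q_2\) and \(Q_4\), so they should match the minimum, the median and the maximum. A short check with a made-up vector v (not the survey data):

```r
# Hypothetical vector, only for illustration
v <- c(2, 3, 3, 7, 10)

q <- quantile(v)   # five values: 0%, 25%, 50%, 75%, 100%
print(q)

# unname() drops the "0%", "50%", "100%" labels before comparing
is_min    <- unname(q[1]) == min(v)
is_median <- unname(q[3]) == median(v)
is_max    <- unname(q[5]) == max(v)
print(c(is_min, is_median, is_max))
```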

`quantile()` gives quartiles

unless we ask for something else

quantile(height, seq(0, 1, by=0.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 155  160  162  165  167  171  174  175  178  182  195
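To see what one of these numbers means, we can ask for a single quantile and then count how many values are at or below it. The vector v is made up for this sketch, and q80 is just a name chosen here.

```r
# Hypothetical vector with ten different values, only for illustration
v <- c(12, 15, 11, 19, 14, 18, 13, 17, 16, 20)

q80 <- quantile(v, 0.8)       # the value that leaves 80% of the data below it
share_below <- mean(v <= q80) # proportion of values at or below that limit

print(q80)
print(share_below)
```

With real data, and especially with repeated values, the proportion will usually be close to the requested one rather than exactly equal.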

In `summary()`

Summary

summary(height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0

Which is my quartile?

Dividing the vector into groups

The function cut() splits the values of a vector into intervals and returns a factor that says which interval each value belongs to:

cut(height, 4)

 [1] (175,185] (165,175] (165,175] (175,185] (155,165]
 [6] (165,175] (165,175] (155,165] (165,175] (185,195]
[11] (175,185] (175,185] (155,165] (155,165] (175,185]
[16] (155,165] (165,175] (155,165] (155,165] (155,165]
[21] (165,175] (155,165] (165,175] (155,165] (155,165]
[26] (155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] (155,165] (165,175] (155,165]
[41] (175,185] (155,165] (175,185] (165,175] (165,175]
[46] (155,165] (155,165] (185,195] (155,165] (175,185]
[51] (175,185]
Levels: (155,165] (165,175] (175,185] (185,195]

We need an extra option

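The first interval is written as (155,165], open on the left, as if the smallest value were left out. The option include.lowest = TRUE closes it, and it becomes essential when we give the cut points ourselves.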

cut(height, 4, include.lowest = TRUE)

 [1] (175,185] (165,175] (165,175] (175,185] [155,165]
 [6] (165,175] (165,175] [155,165] (165,175] (185,195]
[11] (175,185] (175,185] [155,165] [155,165] (175,185]
[16] [155,165] (165,175] [155,165] [155,165] [155,165]
[21] (165,175] [155,165] (165,175] [155,165] [155,165]
[26] [155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] [155,165] (165,175] [155,165]
[41] (175,185] [155,165] (175,185] (165,175] (165,175]
[46] [155,165] [155,165] (185,195] [155,165] (175,185]
[51] (175,185]
Levels: [155,165] (165,175] (175,185] (185,195]

Are these the quartiles?

Used this way, the range is split into intervals of the same width, not into groups with the same number of people

table(cut(height, 4, include.lowest = TRUE))

[155,165] (165,175] (175,185] (185,195] 
       18        19        12         2

These are not the quartiles. They do not have the same number of elements
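The limits of these intervals depend only on the minimum and the maximum. A rough sketch of how four intervals of equal width come out of the range 155 to 195; the names lo, hi and breaks are chosen here for illustration, and cut() itself also moves the two outer limits slightly outwards.

```r
# Minimum and maximum height, as reported by range(height)
lo <- 155
hi <- 195

# Five equally spaced limits give four intervals of the same width
breaks <- seq(lo, hi, length.out = 5)
widths <- diff(breaks)

print(breaks)
print(widths)
```

Every interval is 10 cm wide, no matter how many people fall in each one.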

Using the real quartiles

We can specify the cut points using quantile()

table(cut(height, quantile(height),
          include.lowest = TRUE))

[155,163] (163,171] (171,176] (176,195] 
       14        12        12        13

Now every group has (almost) the same number of elements
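The same cut points can answer "which is my quartile?" for a single person. The sketch below rebuilds the heights from the table(height) output shown earlier (the original order is lost). The labels "1st" to "4th" and the value my_height are made up for illustration.

```r
# Heights rebuilt from the table(height) output; the order is not the original one
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, counts)

group_names <- c("1st", "2nd", "3rd", "4th")   # hypothetical labels

# One group per person, using the quartiles as cut points
groups <- cut(height, quantile(height), include.lowest = TRUE,
              labels = group_names)
print(table(groups))

# Group of one hypothetical person
my_height <- 172
my_group <- cut(my_height, quantile(height), include.lowest = TRUE,
                labels = group_names)
print(as.character(my_group))
```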
