# A tibble: 117 x 10
answer_date id english_level sex birthdate birthplace
<date> <chr> <chr> <chr> <date> <chr>
1 2018-09-17 3e50… I can speak … Male 1993-02-01 turkey
2 2018-09-17 479d… I can unders… Fema… 1998-05-21 Kahramanm…
3 2018-09-17 39df… I can read a… Fema… 1998-01-18 Batman, T…
4 2018-09-17 d2b0… I can read a… Male 1998-08-29 Antalya,T…
5 2018-09-17 f22b… I can read a… Fema… 1998-05-03 izmir
6 2018-09-17 849c… İngilizce bi… Fema… 1995-10-09 Türkiye /…
7 2018-09-17 8381… I can speak … Fema… 1997-09-19 Adıyaman,…
8 2018-09-17 b0dd… I can read a… Male 1997-11-27 Bursa
9 2018-09-17 2972… I can read a… Fema… 1999-01-02 İstanbul/…
10 2018-09-17 72c0… I can read a… Fema… 1998-10-02 İstanbul,…
# … with 107 more rows, and 4 more variables: height_cm <dbl>,
# weight_kg <dbl>, handness <chr>, hand_span <dbl>
Today we will not use NA values
[1] 67.0 55.0 74.0 68.0 58.0 72.0 68.0 58.0 55.0
[10] 81.0 42.5 69.0 58.0 47.0 78.0 57.0 55.0 55.0
[19] 65.0 60.0 50.0 52.0 54.0 75.0 105.0 56.0 50.0
[28] 67.0 59.0 75.0 60.0 60.0 106.0 94.0 63.0 54.0
[37] 53.0 75.0 70.0 65.0 65.0 55.0 68.0 55.0 80.0
[46] 77.0 85.0 65.0 64.0 64.0 60.0 76.0 56.0 78.0
[55] 77.0 72.0 58.0 66.0 52.0 73.0 82.0 55.0 86.0
[64] 63.0 85.0 58.0 65.0 65.0 70.0 47.0 82.0 70.0
[73] 75.0 47.0 72.0 61.0 79.0 55.0 74.0 47.0 54.0
[82] 60.0 74.0 56.0 65.0 49.0 63.0 65.0 47.0 90.0
[91] 90.0 76.0 88.0 80.0 72.0 47.0 61.0 95.0 67.0
[100] 80.0
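The vector printed above has 100 values, presumably because the missing (NA) weights were dropped first. A minimal sketch of that step, using a short hypothetical vector `weight_raw` instead of the real survey column:

```r
# Hypothetical weights in kg; NA marks a missing answer
weight_raw <- c(67, NA, 55, 74, NA, 68)

# Keep only the values that are not NA
weight <- weight_raw[!is.na(weight_raw)]

length(weight_raw)  # counts every element, including the NA ones
length(weight)      # counts only the values we kept
```

Many R functions, such as `mean()`, also accept `na.rm = TRUE` to skip missing values without building a new vector.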
We have data, and we want to say something about it
What can we tell about this set of numbers?
How can we make a summary of all the values using only a few numbers?
nrow()
[1] 117
dim() gives us rows and columns
[1] 117 10
length()
[1] 100
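The three size functions can be tried on a small example. This is only a sketch: the data frame `survey` and its contents are hypothetical, made up to stand in for the real survey.

```r
# Hypothetical small data frame standing in for the survey
survey <- data.frame(
  sex       = c("Female", "Male", "Female", "Male"),
  weight_kg = c(55, 74, NA, 68)
)

nrow(survey)  # number of rows
ncol(survey)  # number of columns
dim(survey)   # rows and columns together

# length() counts the elements of a vector
weight <- survey$weight_kg[!is.na(survey$weight_kg)]
length(weight)
```

Note that `length()` of the weights is smaller than `nrow()` here because one missing value was removed.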
table() should really be called count: it counts how many times each value appears
Left Right
12 105
Female Male
77 39
This looks more like a table
Female Male
Left 9 3
Right 68 36
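The calls behind these outputs are not shown, so the following is a sketch of how such tables are usually produced. The vectors `handness` and `sex` below are hypothetical answers, not the real survey.

```r
# Hypothetical answers, not the real survey
handness <- c("Right", "Left", "Right", "Right", "Left", "Right")
sex      <- c("Female", "Male", "Female", "Male", "Female", "Female")

# One vector: how many times each value appears
counts <- table(handness)
counts

# Two vectors: a cross table, rows from the first, columns from the second
cross <- table(handness, sex)
cross
```

With two vectors, each cell counts the people who have both characteristics at the same time.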
table() is not good with numeric values
weight
42.5 47 49 50 52 53 54 55 56 57 58 59
1 6 1 2 2 1 3 8 3 1 5 1
60 61 63 64 65 66 67 68 69 70 72 73
5 2 3 2 8 1 3 3 1 3 4 1
74 75 76 77 78 79 80 81 82 85 86 88
3 4 2 2 2 1 3 1 2 2 1 1
90 94 95 105 106
2 1 1 1 1
It is not a good summary
What can we say?
Where are the values?
The easiest are minimum and maximum
[1] 42.5
[1] 106
Sometimes it is useful to get both together
[1] 42.5 106.0
(This is like dim(): it combines two functions into one)
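The outputs above are consistent with `min()`, `max()` and `range()`, although the calls themselves are not shown. A small sketch, using a short illustrative vector of weights:

```r
# Short illustrative vector of weights in kg
weight <- c(67, 55, 74, 68, 58, 72, 68, 58, 55)

min(weight)    # smallest value
max(weight)    # largest value
range(weight)  # both at once: c(min, max)
```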
If you have to describe the vector v with a single number x, which would it be?
If we have to replace every v[i] with a single number, which number is “the best”?
There are several possible answers to that question.
There are several possible averages
If \(\mathbf v=(v_1,\ldots,v_n)\) is a vector with \(n\) elements, then the mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\]
This value is also called the arithmetic mean of \(\mathbf v\)
In R we write mean(v)
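To connect the formula with the function, here is a minimal sketch that computes the mean by hand and compares it with `mean()`. The vector `v` is a short illustrative example.

```r
# Short illustrative vector
v <- c(67, 55, 74, 68, 58)

n <- length(v)         # number of elements
by_hand <- sum(v) / n  # (1/n) times the sum of all v[i]

by_hand
mean(v)                # same value
```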
Besides the arithmetic mean, there are many other kinds of mean
We use them only in a few specific places
If x is the median of v, then
half of the values in v are smaller than x
half of the values in v are bigger than x
The median is often used as “average”
Like in “the average person thinks he/she is smarter than the average person”
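The definition of the median can be checked by sorting. This sketch assumes a vector with an odd number of elements, so there is exactly one middle value. The vector `v` is illustrative.

```r
# Short illustrative vector with an odd number of elements
v <- c(67, 55, 74, 68, 58)

sorted <- sort(v)                        # values from smallest to largest
middle <- sorted[(length(v) + 1) / 2]    # the element in the middle position

middle
median(v)                                # same value

sum(v < middle)  # how many values are smaller
sum(v > middle)  # how many values are bigger
```

When the number of elements is even, `median()` returns the average of the two middle values.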
These are the mean and the median of our values
[1] 66.485
[1] 65
The problem with the mean is that it is very sensitive to extreme values
Imagine that one day an elephant comes to our class
What happens to “the average weight”?
[1] 125.2327
[1] 65
Which one represents us better?
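The experiment can be repeated on a small scale. This is a sketch with a short illustrative vector. The elephant's weight of 6000 kg is an assumption; it is consistent with the outputs above, but the original call is not shown.

```r
# Short illustrative vector of weights in kg
weight <- c(67, 55, 74, 68, 58, 72, 68, 58, 55)

# Assumption: the elephant weighs 6000 kg
with_elephant <- c(weight, 6000)

mean(weight)           # mean without the elephant
mean(with_elephant)    # the mean jumps far away from every student

median(weight)         # median without the elephant
median(with_elephant)  # the median barely moves
```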
What are these values?
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.50 56.00 65.00 66.48 75.00 106.00
What are 1st Qu. and 3rd Qu.?
Quart means one fourth in Latin.
If we split the set of values into four subsets of the same size, what are the limits of these sets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]
It is easy to know \(Q_0, Q_2\) and \(Q_4\)
\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller
It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
Generalizing, we can ask, for each percentage \(p\): which value is greater than \(p\)% of all the values in the vector v?
These values are called Quantiles, or sometimes Percentiles
The function in R for quantiles is called quantile()
By default it gives us the quartiles
0% 25% 50% 75% 100%
42.5 56.0 65.0 75.0 106.0
unless we ask for something else
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
42.5 51.8 55.0 58.0 61.0 65.0 68.0 73.3 77.0 82.3
100%
106.0
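The second output looks like deciles, which can be requested with the `probs` argument of `quantile()`. The exact call is not shown, so this is a sketch under that assumption, with a short illustrative vector.

```r
# Short illustrative vector of weights in kg
weight <- c(67, 55, 74, 68, 58, 72, 68, 58, 55)

# Default: the five quartile limits, from 0% to 100%
quartiles <- quantile(weight)
quartiles

# Asking for something else: every 10%
deciles <- quantile(weight, probs = seq(0, 1, by = 0.1))
deciles

# A single quantile: 50% is the median
quantile(weight, probs = 0.5)
```

The `probs` values are fractions between 0 and 1, not percentages.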
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.50 56.00 65.00 66.48 75.00 106.00
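As a sketch of how the pieces fit together, `summary()` on a numeric vector reports the quartiles plus the mean. The vector below is illustrative, not the survey data.

```r
# Short illustrative vector of weights in kg
weight <- c(67, 55, 74, 68, 58, 72, 68, 58, 55)

s <- summary(weight)
s

# Each entry can be compared with the function we already know
s["Min."]     # compare with min(weight)
s["Median"]   # compare with median(weight)
s["Mean"]     # compare with mean(weight)
s["Max."]     # compare with max(weight)
```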