Class 16: Telling stories

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

November 23, 2019

We have our own data

library(readr)
students <- read_tsv("students2018-2020.tsv")
students

# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace
   <date>      <chr> <chr>         <chr> <date>     <chr>     
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 turkey    
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman, T…
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya,T…
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 izmir     
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Türkiye /…
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman,…
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa     
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 İstanbul/…
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 İstanbul,…
# … with 107 more rows, and 4 more variables: height_cm <dbl>,
#   weight_kg <dbl>, handness <chr>, hand_span <dbl>

We can take a vector from our data

Today we will ignore the missing (NA) values

weight <- students$weight_kg[!is.na(students$weight_kg)]
weight

  [1]  67.0  55.0  74.0  68.0  58.0  72.0  68.0  58.0  55.0
 [10]  81.0  42.5  69.0  58.0  47.0  78.0  57.0  55.0  55.0
 [19]  65.0  60.0  50.0  52.0  54.0  75.0 105.0  56.0  50.0
 [28]  67.0  59.0  75.0  60.0  60.0 106.0  94.0  63.0  54.0
 [37]  53.0  75.0  70.0  65.0  65.0  55.0  68.0  55.0  80.0
 [46]  77.0  85.0  65.0  64.0  64.0  60.0  76.0  56.0  78.0
 [55]  77.0  72.0  58.0  66.0  52.0  73.0  82.0  55.0  86.0
 [64]  63.0  85.0  58.0  65.0  65.0  70.0  47.0  82.0  70.0
 [73]  75.0  47.0  72.0  61.0  79.0  55.0  74.0  47.0  54.0
 [82]  60.0  74.0  56.0  65.0  49.0  63.0  65.0  47.0  90.0
 [91]  90.0  76.0  88.0  80.0  72.0  47.0  61.0  95.0  67.0
[100]  80.0
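The filter `!is.na(...)` used above can be tried on a small vector first. This is only a sketch: the values in `x` are made up for illustration and are not taken from the class data.

```r
# Hypothetical vector with two missing values (not the students data)
x <- c(67, NA, 55, 74, NA)

# is.na() marks the missing positions with TRUE
missing <- is.na(x)

# Negating it with ! keeps only the positions that have a value
clean <- x[!missing]

length(x)      # counts every element, including the NA ones
length(clean)  # counts only the elements that are left
```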

So what?

Descriptive Statistics

We have data

We want to tell something about it

What can we tell about this set of numbers?

How can we make a summary of all the values using only a few numbers?

Standard Data Descriptors

Number of elements
- How many?
Location
- Where?
Dispersion
- Are they homogeneous?
- Are they similar to each other?

How many in total

For data frames and tibbles we use nrow()

nrow(students)

[1] 117

dim() gives us rows and columns

dim(students)

[1] 117  10

For vectors we use length()

length(weight)

[1] 100

Counting how many of each

table() counts how many times each value appears (it really should be called count)

table(students$handness)


 Left Right 
   12   105

table(students$sex)


Female   Male 
    77     39
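The counts for sex add up to 116, not 117. One possible explanation is a missing answer, because by default `table()` does not count `NA` values. This is an assumption, since the raw column is not shown here. The sketch below uses a made-up vector called `answers` to show the effect of the `useNA` argument.

```r
# Hypothetical answers with one missing value (not the students data)
answers <- c("Left", "Right", "Right", NA, "Right")

# By default the NA is dropped, so the counts add up to 4, not 5
default_counts <- table(answers)
sum(default_counts)

# With useNA = "ifany" the missing values get their own column
all_counts <- table(answers, useNA = "ifany")
sum(all_counts)
```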

We can count combinations

This looks more like a table

table(students$handness, students$sex)

       
        Female Male
  Left       9    3
  Right     68   36

`table()` is not good with numeric values

table(weight)

weight
42.5   47   49   50   52   53   54   55   56   57   58   59 
   1    6    1    2    2    1    3    8    3    1    5    1 
  60   61   63   64   65   66   67   68   69   70   72   73 
   5    2    3    2    8    1    3    3    1    3    4    1 
  74   75   76   77   78   79   80   81   82   85   86   88 
   3    4    2    2    2    1    3    1    2    2    1    1 
  90   94   95  105  106 
   2    1    1    1    1

It is not a good summary

What can we say?

Location

Where are the values?

Minimum and Maximum

The easiest descriptors are the minimum and the maximum

min(weight)

[1] 42.5

max(weight)

[1] 106

Range: min and max together

Sometimes it is useful to have both together

range(weight)

[1]  42.5 106.0

(This is like dim(), which combines nrow() and ncol(): range() combines min() and max() into one function)
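A quick way to convince ourselves is to build the same result by hand. The vector `v` below is made up for illustration; it is not the class data.

```r
# Hypothetical example vector (not the students data)
v <- c(67, 55, 74, 42.5, 106)

# range() returns the minimum and the maximum as a vector of length 2
range(v)

# The same two numbers, built by hand from min() and max()
by_hand <- c(min(v), max(v))

# Both ways give the same answer
identical(range(v), by_hand)
```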

Average

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] by a single number, which number is “the best”?

There are several possible answers to that question.

There are several possible averages

Arithmetic Mean

If \(\mathbf v=(v_1,…,v_n)\) is a vector, then the mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\]

This value is called the arithmetic mean of \(\mathbf v\), or simply the mean

In R we write mean(v)
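The formula can be followed step by step in R. This is a small sketch with a made-up vector `v`; the name `manual_mean` is only used here for illustration.

```r
# Hypothetical example vector
v <- c(2, 4, 6, 8)

# n is the number of elements
n <- length(v)

# Add all the values and divide by n, exactly as in the formula
manual_mean <- sum(v) / n

# The built-in function gives the same result
all.equal(manual_mean, mean(v))
```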

Other means

Besides arithmetic mean, we have

Geometric mean
Harmonic mean
Quadratic mean
Cubic mean

and many others

We use them only in a few specific places
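For reference, these means can be written in one line each using their usual textbook definitions. Base R has no built-in functions for them, so the variable names below are chosen here for illustration. The sketch assumes that all values are positive, which the geometric and harmonic means need.

```r
# Hypothetical vector of positive values
v <- c(1, 2, 4)

# Arithmetic mean, for comparison
arithmetic_mean <- mean(v)

# Geometric mean: n-th root of the product, computed through logarithms
geometric_mean <- exp(mean(log(v)))

# Harmonic mean: inverse of the mean of the inverses
harmonic_mean <- 1 / mean(1 / v)

# Quadratic mean: square root of the mean of the squares
quadratic_mean <- sqrt(mean(v^2))

# Cubic mean: cube root of the mean of the cubes
cubic_mean <- mean(v^3)^(1/3)

c(harmonic_mean, geometric_mean, arithmetic_mean, quadratic_mean, cubic_mean)
```

For positive values that are not all equal, these five numbers come out in increasing order, as printed by the last line.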

Median

If x is the median of v, then

half of the values in v are smaller than x
half of the values in v are bigger than x

The median is often used as “average”

Like in “the average person thinks he/she is smarter than the average person”
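One way to find the median by hand is to sort the values and take the one in the middle. The sketch below uses a made-up vector `v`; when the number of values is even, it takes the mean of the two middle values, which is the usual convention.

```r
# Hypothetical example vector with an odd number of values
v <- c(50, 47, 106, 65, 58)

# Put the values in increasing order
sorted <- sort(v)
n <- length(sorted)

if (n %% 2 == 1) {
  # Odd number of values: take the one in the middle
  middle <- sorted[(n + 1) / 2]
} else {
  # Even number of values: take the mean of the two central ones
  middle <- mean(sorted[c(n / 2, n / 2 + 1)])
}

# Same answer as the built-in function
middle == median(v)

# As many values below the median as above it
sum(v < middle)
sum(v > middle)
```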

Median and mean are usually different

In our case

These are our values

mean(weight)

[1] 66.485

median(weight)

[1] 65

Median is robust

The problem with the mean is that it is too sensitive to extreme values

Imagine that one day an elephant comes to our class

What happens with “the average weight”?

An elephant joins us

mean( c(weight, 6000) )

[1] 125.2327

median( c(weight, 6000) )

[1] 65
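The jump in the mean can be reproduced from numbers already shown: 100 weights with mean 66.485, plus one new value of 6000. The variable names below are chosen here for illustration.

```r
# Values taken from the results above
n <- 100            # length(weight)
old_mean <- 66.485  # mean(weight)
elephant <- 6000

# The old total is n * old_mean; add the elephant and divide by n + 1
new_mean <- (n * old_mean + elephant) / (n + 1)
new_mean
```

A single value moves the mean by almost 60 kg, while the median stays at 65.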

Which one represents us better?

In summary

What are these values?

summary(weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   56.00   65.00   66.48   75.00  106.00

What are 1st Qu. and 3rd Qu.?

First and Third Quartiles

Quart means one fourth in Latin.

If we split the set of values into four subsets of the same size, what are the limits of these subsets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

Quartiles

\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
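This can be checked in R with `quantile()`, which returns the five quartiles by default. The vector `v` is made up for illustration.

```r
# Hypothetical example vector
v <- c(47, 50, 58, 65, 106)

# By default quantile() returns Q0 to Q4, named "0%", "25%", "50%", "75%", "100%"
q <- quantile(v)

# Q0 is the minimum, Q2 is the median, Q4 is the maximum
unname(q["0%"]) == min(v)
unname(q["50%"]) == median(v)
unname(q["100%"]) == max(v)
```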

Quartiles and Quantiles

Generalizing, we can ask, for each percentage \(p\),

which is the value in the vector v that is greater than \(p\)% of all the values?

These values are called Quantiles, or sometimes Percentiles
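In R the percentage is written as a fraction between 0 and 1, so 90% becomes 0.9. The sketch below uses the numbers 1 to 100 as a made-up example and checks which fraction of the values is at or below the 90% quantile.

```r
# Hypothetical data: the numbers from 1 to 100
v <- 1:100

# The 90% quantile; the probability is given as 0.9, not as 90
q90 <- quantile(v, probs = 0.9)

# Fraction of the values that are smaller than or equal to that quantile
fraction_below <- mean(v <= q90)
fraction_below
```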

Quartiles and Quantiles

The function in R for quantiles is called quantile()

By default it gives us the quartiles

quantile(weight)

   0%   25%   50%   75%  100% 
 42.5  56.0  65.0  75.0 106.0

`quantile()` gives quartiles

unless we ask for something else

quantile(weight, seq(from=0, to=1, by=0.1))

   0%   10%   20%   30%   40%   50%   60%   70%   80%   90% 
 42.5  51.8  55.0  58.0  61.0  65.0  68.0  73.3  77.0  82.3 
 100% 
106.0

In `summary()`

summary(weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   56.00   65.00   66.48   75.00  106.00
