Class 21: Exploring data with plots

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

21 December 2020

Base plots in R

There are several ways to plot in R

In this class we show the basic R graphics commands

We will use them only for data exploration

Making these plots look nicer takes more time

We will not study that in this course, because we will use a better system

We load the data

library(readr)
students <- read_tsv("students2018-2020-tidy.tsv")
students
# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman/Tu…        NA
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
# … with 107 more rows, and 3 more variables: weight_kg <dbl>,
#   handedness <chr>, hand_span <dbl>

Data Visualization

“A picture is worth a thousand words”

Graphics

Sometimes the best way to understand the data is a graphic

plot(students$weight_kg)

How to read it

  • Each element is placed at a different position on the horizontal axis

  • That position is the element’s index, a number from 1 to length(vector)

  • The vertical axis represents the value of the element

  • So if vector[3] contains the value 170, we will have a point at the coordinates (3, 170)
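
As a minimal sketch of this rule, the following uses a made-up vector `v` (not the course data). Plotting a single vector should give the same picture as plotting its values against the positions 1, 2, ..., length(v).

```r
# Hypothetical values, invented for illustration only
v <- c(165, 180, 170, 158, 175)

# One vector: the horizontal axis is the index of each element
plot(v)

# The same points, giving the index explicitly as the first argument
index <- seq_along(v)   # 1 2 3 4 5
plot(index, v)
```

Here `v[3]` is 170, so both plots have a point at (3, 170).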

Another example

plot(students$height_cm)

Plot types: lines and overplotted points

plot(students$height_cm, type = "l")

plot(students$height_cm, type = "o")

Notice the broken line when there are missing values
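
The effect of missing values can be reproduced with a small invented vector. This is only a sketch: `h` and `h_complete` are hypothetical names, and dropping the missing values is one possible workaround, not necessarily the right one for every analysis.

```r
# Hypothetical heights; NA marks a missing answer
h <- c(170, 168, NA, 175, 162, 180)

# Points joined by lines; the line is interrupted around the NA
plot(h, type = "o")

# One possible workaround: drop the missing values before plotting.
# Note that this shifts the index of every later element.
h_complete <- h[!is.na(h)]
plot(h_complete, type = "o")
```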

Plot Type

The type depends on the story you want to tell

  • Lines are mostly used to tell a story of change through time

    • So they are not a good choice for the students data, where each row is a different person
  • Using type="o" (“overplotted”) makes it easier to see the individual points on the line

  • If you do not specify a type, the default is type="p" (points)

Use lines to show change through time

For example, the number of new COVID-19 cases each day

plot(Turkey$New_cases, type="l")

Zooming—Choosing the range

plot(Turkey$New_cases,
     type="l")

plot(Turkey$New_cases, xlim=c(80,150),
     type="l")

Use xlim for horizontal range and ylim for vertical
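
The Turkey data is not included here, so the sketch below uses simulated daily counts instead. The vector `cases` is an assumption made only for illustration. It shows xlim alone, and then xlim together with a ylim fitted to the zoomed region.

```r
# Simulated daily counts (not real COVID-19 data), 200 days
set.seed(1)
cases <- round(1000 + 500 * sin((1:200) / 20) + rnorm(200, sd = 50))

# Full range
plot(cases, type = "l")

# Days 80 to 150 only; the vertical range is still computed from all the data
plot(cases, type = "l", xlim = c(80, 150))

# Vertical range fitted to the values in the zoomed part
plot(cases, type = "l", xlim = c(80, 150),
     ylim = range(cases[80:150]))
```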

You can make barplots of numeric vectors

plot(Turkey$New_cases, xlim=c(80,150),
                       type="l")

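# barplot() places the bars on its own horizontal scale,
# so we select elements 80 to 150 by subsetting instead of using xlim
barplot(Turkey$New_cases[80:150])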
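
One detail worth checking: barplot() does not draw bar number i at horizontal position i. With the default settings each bar is 1 unit wide and is preceded by a gap of 0.2, and the function returns the position of the bar centres. The sketch below uses made-up counts to show this, and selects elements by subsetting the vector.

```r
# Hypothetical values, invented for illustration
counts <- c(3, 5, 2, 8, 6)

# barplot() returns the horizontal position of the centre of each bar
centres <- barplot(counts)
as.numeric(centres)   # 0.7 1.9 3.1 4.3 5.5 with default width and spacing

# To show only some elements, subset the vector rather than using xlim
barplot(counts[2:4])
```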

Barplots

Numeric vectors are shown element by element

  • bars start at 0
  • hard to read when the vector is long

barplot(Turkey$New_cases)

barplot() works well with table()

barplot(table(students$english_level))

This can also be written as

library(magrittr)   # provides the %>% pipe
students$english_level %>% table() %>% barplot()
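
To see what table() hands over to barplot(), here is a small sketch with invented answers. The labels are hypothetical and are not the real values of english_level.

```r
# Hypothetical answers, invented for illustration
level <- c("basic", "good", "basic", "fluent", "good", "basic")

# table() counts how many times each distinct value appears
counts <- table(level)
counts

# One bar per distinct value, in sorted (alphabetical) order
barplot(counts)
```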

plot() can handle factors

When the vector is a factor, plot() does all the hard work

plot(students$english_factor)

Level order is important here
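
These slides do not show how english_factor was built, so the following is only a sketch with made-up labels. It illustrates why level order matters: by default factor() sorts the levels alphabetically, and the levels argument lets us choose a more meaningful order.

```r
# Hypothetical answers, invented for illustration
level <- c("basic", "good", "basic", "fluent", "good", "basic")

# Default: levels sorted alphabetically
f_default <- factor(level)
levels(f_default)   # "basic" "fluent" "good"

# Levels given in the order we want on the plot
f_ordered <- factor(level, levels = c("basic", "good", "fluent"))
levels(f_ordered)   # "basic" "good" "fluent"

# The bars follow the level order
plot(f_ordered)
```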

Can we do the same for a numeric vector?

In a numeric vector usually all values are different

We have to group them in “similar” sets
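
One way to do this grouping by hand is to combine cut() and table(). This is a sketch with invented weights and with class limits chosen arbitrarily for the example.

```r
# Hypothetical weights in kg
w <- c(52, 58, 61, 64, 67, 70, 73, 78, 85, 90)

# cut() assigns each value to an interval such as (50,60]
groups <- cut(w, breaks = c(50, 60, 70, 80, 90))

# table() counts how many values fall in each interval
table(groups)

# The counts can be drawn like any other table
barplot(table(groups))
```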

Histograms

Histograms group and count in one step

plot(students$weight_kg)

hist(students$weight_kg)

Histograms

Numeric data can be grouped into classes

  • The default number of classes is automatic
    • but you can change it
  • Frequency means “how many times”

Histogram bars are not separated

  • This is because numerical values are continuous, and there is no “space” between them

Numeric data is grouped in N classes

hist(students$weight_kg) 

hist(students$weight_kg, nclass = 20)
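
When a single number of classes is given, hist() treats it as a suggestion and chooses round class limits close to it. The sketch below uses simulated weights (not the course data) to show how to inspect the limits that were actually used, and how to fix them exactly.

```r
# Simulated weights, not the course data
set.seed(1)
w <- rnorm(100, mean = 65, sd = 10)

# Roughly 20 classes; R picks "pretty" class limits
h <- hist(w, breaks = 20)
h$breaks   # the class limits actually used
h$counts   # how many values fall in each class

# Exact class limits: a vector of breaks that covers all the values
hist(w, breaks = seq(floor(min(w)), ceiling(max(w)) + 5, by = 5))
```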

Another way to show a vector

boxplot(students$weight_kg)

This is called a boxplot

It is a graphical version of summary().

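  • The thick line inside the box is the median
  • The box goes from the first to the third quartile (the central 50% of cases)
  • Each whisker extends to the most extreme value that is within 1.5 times the box height from the box
  • Points beyond the whiskers are drawn one by one, as possible outliers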
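
boxplot() also returns the numbers it draws, which makes it easy to compare the picture with summary(). This is a minimal sketch with made-up values. By default R draws each whisker up to the most extreme value lying within 1.5 times the box height from the box.

```r
# Hypothetical weights, with one extreme value
x <- c(50, 55, 60, 62, 65, 68, 70, 75, 120)

summary(x)
b <- boxplot(x)

# Lower whisker, bottom of box, median, top of box, upper whisker
b$stats

# Values drawn as individual points
b$out
```

In this example the box goes from 60 to 70, so the whiskers can reach at most 15 units beyond it. The value 120 is farther than that and is drawn as a separate point.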

Summary

  • plot() shows a graphic of a vector
  • Use lines only for time-dependent data
  • barplot() works well for short vectors
    • Use barplot(table()) for categorical data
    • Or just plot() the factor
  • hist() gives a better view of large numeric vectors
  • boxplot() is another way to see large vectors
    • Do not confuse barplot() and boxplot()

Sooner or later you will forget to use $

and you will plot the whole data frame instead of a single vector

Exploring all data: plot data frame

plot(students)
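
What happens can be tried on a small data frame. The one below is hypothetical and has only numeric columns. For such a data frame with more than two columns, plot() draws a grid of scatter plots, one for each pair of columns, which is what pairs() does.

```r
# Small hypothetical data frame
d <- data.frame(
  height_cm = c(179, 168, 170, 162, 180),
  weight_kg = c(75, 60, 68, 55, 82),
  hand_span = c(22, 19, 21, 18, 23)
)

# A grid of scatter plots, one for each pair of columns
plot(d)

# The same kind of picture, asked for explicitly
pairs(d)

# What we usually meant: a single column
plot(d$height_cm)
```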

Scatter plots

Comparing two numeric vectors

If we plot two numeric vectors, the first goes on the horizontal axis and the second on the vertical axis

plot(students$height_cm, students$weight_kg)

Horizontal factor, vertical numeric

plot(factor(students$sex), students$weight_kg)
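
When the first argument is a factor, plot() draws one boxplot per level. The sketch below uses a small invented data frame `d` to show this, together with the group medians that appear as the thick lines.

```r
# Small hypothetical data frame
d <- data.frame(
  sex       = c("Male", "Female", "Female", "Male", "Female", "Male"),
  weight_kg = c(75, 60, 55, 82, 58, 70)
)

# One box per level of the factor
plot(factor(d$sex), d$weight_kg)

# A similar picture, asked for directly with boxplot()
boxplot(weight_kg ~ sex, data = d)

# The median of each group
medians <- tapply(d$weight_kg, d$sex, median)
medians
```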

Formulas in R

A formula is a compact description of a relationship: y ~ x is read “y depends on x”

Instead of

plot(students$height_cm, students$weight_kg)

we can write

plot(students$weight_kg ~ students$height_cm)

or even

plot(weight_kg ~ height_cm, data = students)
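
A formula is an ordinary R object that can be stored and inspected. This sketch uses a small hypothetical data frame `d`; the column names are borrowed from the students data only to make the example familiar.

```r
# Small hypothetical data frame
d <- data.frame(
  height_cm = c(179, 168, 170, 162, 180),
  weight_kg = c(75, 60, 68, 55, 82)
)

# Read as "weight_kg depends on height_cm"
f <- weight_kg ~ height_cm
class(f)      # "formula"
all.vars(f)   # "weight_kg" "height_cm"

# Vertical ~ horizontal; the names are looked up in the data frame d
plot(f, data = d)
```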

Plotting a formula

plot(weight_kg ~ height_cm, data = students)

Summary

  • plot(), one vector, two vectors, or a formula
  • plot(y ~ x) draws the same plot as plot(x, y)
  • Formulas are nice:
    • plot(y ~ x, data = dframe) is easier to read than
    • plot(dframe$x, dframe$y)
  • In general the defaults are good
    • axis labels are the names of the plotted variables
    • ranges are automatic
  • You can choose the horizontal and vertical ranges with xlim and ylim