October 15, 2018

Basic objects in R

  • There are several data types:
    • numeric, character, logical, factor
  • They are stored in one of many data structures
    • vectors
    • lists
    • matrices
    • data frames
  • Each element can be accessed using indices
    • numeric vectors (positive or negative)
    • logical vectors
    • character vectors
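
The three kinds of index can be tried on a small vector. This is only a sketch: the vector `heights` and its values are made up for illustration and are not part of the course data.

```r
# A small named numeric vector (hypothetical values, for illustration only)
heights <- c(ann = 170, bob = 185, cem = 162, dila = 165)

heights[c(1, 3)]           # positive numbers: keep positions 1 and 3
heights[-2]                # negative numbers: drop position 2
heights[heights > 164]     # logical vector: keep elements where the test is TRUE
heights[c("bob", "dila")]  # character vector: select elements by name
```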

Reading text files

The function used to read text files is

read.table(file, header = FALSE, sep = "", quote = "\"'",
           row.names, col.names, na.strings = "NA",
           stringsAsFactors = TRUE,
           dec = ".", comment.char = "#", ...)

Please take a look at the help page of read.table().

help(read.table)

Reading text files

The output of this function is a data.frame. The only mandatory argument is:

file
the name of the file to read. It can also be a URL

Other useful options

header
if TRUE, the first line of the file contains the names of the columns
sep
the character used to separate columns. Use "\t" for Tab

Other useful options

stringsAsFactors
Logical option. If it is TRUE (the default in R versions before 4.0.0), text columns are read as factors

Set it to FALSE to read text as character vectors

dec
the character used in the file as the decimal point
use dec="," for numbers written in Turkish (European) format
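
A minimal sketch of these options working together. To keep it self-contained it reads from a string through the `text` argument of read.table() instead of a file; the contents of `raw` are invented for illustration.

```r
# Hypothetical file contents: a header line, Tab-separated columns,
# and numbers written with a decimal comma
raw <- "name\theight_m\nAyse\t1,70\nBora\t1,85"

people <- read.table(text = raw, header = TRUE, sep = "\t", dec = ",",
                     stringsAsFactors = FALSE)

str(people)  # name is read as character, height_m as numeric
```

Without dec = "," the second column would not be recognized as numbers, so it is worth checking the column types after reading.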

Example data

We read the data with

survey <- read.table("survey1-tidy.txt")

What can we say about this data?

Selecting columns

Data frames always have column names

Each column can be accessed by its name

colnames(survey)
[1] "Gender"       "birth_day"    "birth_month" 
[4] "birth_year"   "height_cm"    "weight_kg"   
[7] "handness"     "hand_span_cm"

Selecting columns

Each column is a vector

survey$handness
 [1] Right Right Left  Right Right Right Right Left 
 [9] Right Right Right Right Right Right Right Right
[17] Right Right Right Right Right Right Right Right
[25] Right Right Right Right Left  Right Right Right
[33] Right Right Right Right Right Right Right Right
[41] Right Right Right Right Right Right Right Right
[49] Left  Right Right
Levels: Left Right
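
As a sketch, a column can be extracted by name in more than one way. The data frame `toy` below is a made-up stand-in for `survey`, with invented values.

```r
# Hypothetical data frame standing in for `survey`
toy <- data.frame(height_cm = c(170, 185, 162),
                  weight_kg = c(56, 65, 75))

toy$height_cm        # column by name with $
toy[["height_cm"]]   # the same column with [[ ]]
toy[, "height_cm"]   # the same column with [row, column] indexing

is.vector(toy$height_cm)  # TRUE: a numeric column is a plain vector
```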

Choosing some rows

We can always compare a vector to a constant

survey$handness=="Left"
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49]  TRUE FALSE FALSE

(notice that we use == for comparisons, not =)

Selecting rows

We can use the result of this comparison as a row index

survey[survey$handness=="Left", ]
     Gender birth_day birth_month birth_year height_cm
st3  Female        28           1       1995       170
st8  Female        14           1       1997       162
st29   Male        28           7       1998       185
st49 Female         2           5       1999       165
     weight_kg handness hand_span_cm
st3         56     Left           18
st8         75     Left           18
st29        65     Left           22
st49        63     Left           17
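
The same idea on a smaller scale, as a sketch: the data frame `toy`, its row names and its values are all hypothetical. Storing the comparison in a variable first makes each step visible.

```r
# Hypothetical data frame standing in for `survey`
toy <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male"),
  handness  = c("Right", "Left", "Left", "Right"),
  height_cm = c(170, 185, 162, 178),
  row.names = c("st1", "st2", "st3", "st4")
)

is_left <- toy$handness == "Left"     # logical vector, one value per row
toy[is_left, ]                        # rows where is_left is TRUE, all columns
toy[is_left & toy$height_cm > 170, ]  # conditions can be combined with &
```

Leaving the column position empty after the comma keeps every column.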

What is the difference?

survey[survey$handness=="Left", "Gender"]
[1] Female Female Male   Female
Levels: Female Male
survey$Gender[survey$handness=="Left"]
[1] Female Female Male   Female
Levels: Female Male

There is no difference: both expressions give the same result in different ways
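
One way to check that two expressions really agree is identical(). The sketch below uses a made-up data frame `toy`; the last lines show that adding drop = FALSE keeps the result as a one-column data frame instead of a vector.

```r
# Hypothetical data frame standing in for `survey`
toy <- data.frame(Gender   = c("Female", "Male", "Female"),
                  handness = c("Left", "Right", "Left"))

a <- toy[toy$handness == "Left", "Gender"]  # select rows and one column at once
b <- toy$Gender[toy$handness == "Left"]     # take the column, then filter it
identical(a, b)                             # TRUE

d <- toy[toy$handness == "Left", "Gender", drop = FALSE]
class(d)                                    # "data.frame"
```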

Summary statistics

We recommend that you verify the result with summary() immediately after every call to read.table()

summary(survey)
    Gender     birth_day      birth_month    
 Female:30   Min.   : 1.00   Min.   : 1.000  
 Male  :21   1st Qu.: 5.00   1st Qu.: 3.500  
             Median :13.00   Median : 6.000  
             Mean   :13.59   Mean   : 6.353  
             3rd Qu.:20.00   3rd Qu.: 9.000  
             Max.   :31.00   Max.   :12.000  
   birth_year     height_cm       weight_kg     
 Min.   :1991   Min.   :155.0   Min.   : 42.50  
 1st Qu.:1997   1st Qu.:163.0   1st Qu.: 55.00  
 Median :1997   Median :171.0   Median : 64.00  
 Mean   :1998   Mean   :170.7   Mean   : 65.56  
 3rd Qu.:1998   3rd Qu.:176.5   3rd Qu.: 74.50  
 Max.   :2018   Max.   :195.0   Max.   :106.00  
  handness   hand_span_cm  
 Left : 4   Min.   : 8.00  
 Right:47   1st Qu.:16.00  
            Median :19.00  
            Mean   :18.98  
            3rd Qu.:21.00  
            Max.   :30.00  
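
When a minimum or maximum in the summary looks unexpected, the rows behind it can be inspected with a logical index. This is a sketch on invented data; the cut-off 2005 is an arbitrary assumption chosen only for the example.

```r
# Hypothetical data with one doubtful birth year
toy <- data.frame(birth_year   = c(1997, 1998, 2018, 1996),
                  hand_span_cm = c(19, 21, 8, 20))

str(toy)                          # column types at a glance
range(toy$birth_year)             # smallest and largest value
odd <- toy[toy$birth_year > 2005, ]  # rows worth a second look
odd
```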

Meaning of summary()

The result depends on the type of column

For a factor we get

summary(survey$handness)
 Left Right 
    4    47 

Other ways of counting

nrow(survey)
[1] 51
dim(survey)
[1] 51  8
table(survey$handness)
 Left Right 
    4    47 
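
Counts can also be obtained from a logical vector, because TRUE counts as 1 and FALSE as 0. A minimal sketch with a made-up factor:

```r
# Hypothetical factor, for illustration only
handness <- factor(c("Right", "Left", "Right", "Right", "Left"))

n_left <- sum(handness == "Left")   # number of TRUE values: 2
p_left <- mean(handness == "Left")  # proportion of TRUE values: 0.4
counts <- table(handness)           # counts for every level
counts
```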

Meaning of summary()

The result depends on the type of column

For a numeric column we get

summary(survey$hand_span_cm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   16.00   19.00   18.98   21.00   30.00
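
Each number in the numeric summary can also be computed on its own. The sketch below uses a short invented vector, not the survey data.

```r
# Hypothetical hand spans, for illustration only
span <- c(8, 16, 19, 21, 30)

s <- summary(span)                 # all six numbers at once
s
min(span)                          # 8
median(span)                       # 19
mean(span)                         # 18.8
max(span)                          # 30
q <- quantile(span, c(0.25, 0.75)) # first and third quartiles: 16 and 21
q
```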