- There are several data types:
  - numeric, character, logical, factor
- They are stored in one of several data structures:
  - vectors
  - lists
  - matrices
  - data frames
October 22, 2019
The function used to read text files is
read.table(file, header = FALSE, sep = "", quote = "\"'", row.names, col.names, na.strings = "NA", stringsAsFactors = TRUE, dec = ".", comment.char = "#", ...)
Please take a look at the help page of read.table():
help(read.table)
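The output of this function is a data.frame. The only mandatory argument is:
- file: the file to read
Some of the other arguments:
- sep: the field separator, for example "\t" for Tab
- stringsAsFactors: if TRUE (the default in R versions before 4.0.0), text is read as factors. Set it to FALSE to read text as character
- dec: the decimal separator, for example dec="," for numbers in Turkish (European) format
Today we will use data from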
https://anaraven.bitbucket.io/static/2018/cmb1/survey1-tidy.txt
Please download it to your computer and save it in a folder you can find again (the command below assumes the file is in R's working directory)
We read data with
survey <- read.table("survey1-tidy.txt")
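To see how the arguments described earlier change the result, here is a minimal sketch that writes a tiny file and reads it back. The file contents and the names `tmp`, `demo` and `demo_f` are made up for illustration and are not part of the survey data.

```r
# Hypothetical data: a small tab-separated file with decimal commas
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\theight_m\tgroup",
             "a\t1,70\tx",
             "b\t1,62\ty",
             "c\t1,85\tx"), tmp)

# header = TRUE            : the first line holds the column names
# sep = "\t"               : fields are separated by tabs
# dec = ","                : numbers use a decimal comma
# stringsAsFactors = FALSE : text columns stay as character
demo <- read.table(tmp, header = TRUE, sep = "\t", dec = ",",
                   stringsAsFactors = FALSE)
str(demo)

# Same file, but now text columns are converted to factors
demo_f <- read.table(tmp, header = TRUE, sep = "\t", dec = ",",
                     stringsAsFactors = TRUE)
class(demo_f$group)

unlink(tmp)  # remove the temporary file
```

Setting stringsAsFactors explicitly makes the result independent of the R version in use.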
What can we say about this data?
Data frames always have column names
Each column can be accessed by its name
colnames(survey)
[1] "Gender"       "birth_day"    "birth_month"
[4] "birth_year"   "height_cm"    "weight_kg"
[7] "handness"     "hand_span_cm"
Each column is a vector
survey$handness
 [1] Right Right Left  Right Right Right Right Left
 [9] Right Right Right Right Right Right Right Right
[17] Right Right Right Right Right Right Right Right
[25] Right Right Right Right Left  Right Right Right
[33] Right Right Right Right Right Right Right Right
[41] Right Right Right Right Right Right Right Right
[49] Left  Right Right
Levels: Left Right
We can always compare a vector to a constant
survey$handness=="Left"
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49]  TRUE FALSE FALSE
(notice that we use == for comparisons)
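The comparison can be tried on a small vector. This is only a sketch: `hand` is a made-up vector standing in for a column such as survey$handness. Note that a single = would assign a value instead of comparing.

```r
# Hypothetical vector standing in for a column of the survey
hand <- c("Right", "Left", "Right", "Right", "Left")

# Element-wise comparison: one TRUE/FALSE per element of the vector
is_left <- hand == "Left"
is_left

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(is_left)

# which() gives the positions of the TRUE values
which(is_left)
```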
We can use the result of this comparison as a row index
survey[survey$handness=="Left", ]
     Gender birth_day birth_month birth_year height_cm
st3  Female        28           1       1995       170
st8  Female        14           1       1997       162
st29   Male        28           7       1998       185
st49 Female         2           5       1999       165
     weight_kg handness hand_span_cm
st3         56     Left           18
st8         75     Left           18
st29        65     Left           22
st49        63     Left           17
survey[survey$handness=="Left", "Gender"]
[1] Female Female Male   Female
Levels: Female Male
survey$Gender[survey$handness=="Left"]
[1] Female Female Male   Female
Levels: Female Male
Same result, different ways
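The equivalence of the two forms can be checked directly. The sketch below uses a made-up miniature data frame called `mini`, with invented row names and values, not the real survey.

```r
# Hypothetical miniature of the survey data
mini <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  handness = c("Left", "Right", "Right", "Left"),
  row.names = c("st1", "st2", "st3", "st4"),
  stringsAsFactors = TRUE
)

# Logical vector with one element per row
left <- mini$handness == "Left"

# All columns, only the rows where 'left' is TRUE
rows_left <- mini[left, ]

# Two ways to get the Gender of the selected rows
a <- mini[left, "Gender"]   # index rows and pick one column
b <- mini$Gender[left]      # pick the column, then index the vector

identical(a, b)
```

The empty position after the comma in mini[left, ] means "all columns".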
We recommend that every time you use read.table, you verify the result immediately
summary(survey)
    Gender     birth_day      birth_month
 Female:30   Min.   : 1.00   Min.   : 1.000
 Male  :21   1st Qu.: 5.00   1st Qu.: 3.500
             Median :13.00   Median : 6.000
             Mean   :13.59   Mean   : 6.353
             3rd Qu.:20.00   3rd Qu.: 9.000
             Max.   :31.00   Max.   :12.000
   birth_year     height_cm       weight_kg
 Min.   :1991   Min.   :155.0   Min.   : 42.50
 1st Qu.:1997   1st Qu.:163.0   1st Qu.: 55.00
 Median :1997   Median :171.0   Median : 64.00
 Mean   :1998   Mean   :170.7   Mean   : 65.56
 3rd Qu.:1998   3rd Qu.:176.5   3rd Qu.: 74.50
 Max.   :2018   Max.   :195.0   Max.   :106.00
  handness   hand_span_cm
 Left : 4   Min.   : 8.00
 Right:47   1st Qu.:16.00
            Median :19.00
            Mean   :18.98
            3rd Qu.:21.00
            Max.   :30.00
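A summary like this one helps to spot values that deserve a second look, such as a maximum birth_year of 2018. A few other base R functions can support the same check. The sketch below uses a made-up data frame `checked`, and the cutoff year 2005 is an arbitrary assumption chosen only for illustration.

```r
# Hypothetical data with one implausible birth year
checked <- data.frame(
  birth_year = c(1995, 1997, 2018),
  height_cm  = c(170, 162, 185)
)

str(checked)       # type of each column and the first values
head(checked, 2)   # first two rows

# Range of one column: minimum and maximum
range(checked$birth_year)

# Rows with a birth year above an assumed cutoff
suspicious <- checked[checked$birth_year > 2005, ]
suspicious
```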
summary()
The result depends on the type of column
For a factor we get
summary(survey$handness)
 Left Right
    4    47
Number of rows
nrow(survey)
[1] 51
Number of rows and columns (dimensions)
dim(survey)
[1] 51 8
This command counts how many of each value
table(survey$handness)
 Left Right
    4    47
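For a factor, summary() and table() give the same counts, and the counts add up to the number of elements. This is a small sketch with a made-up factor `hand`, not the survey column.

```r
# Hypothetical factor standing in for survey$handness
hand <- factor(c("Right", "Left", "Right", "Right"))

counts <- table(hand)   # how many of each value
counts

# One count can be extracted by the name of its level
counts[["Left"]]

# The counts add up to the number of elements
sum(counts) == length(hand)

# Proportions instead of counts
prop.table(counts)
```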
summary()
The result depends on the type of column
For a numeric column we get
summary(survey$hand_span_cm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   8.00   16.00   19.00   18.98   21.00   30.00
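The six numbers reported for a numeric column can also be computed one at a time. The sketch below uses a made-up vector `span`. The values are invented and only illustrate which function produces which entry.

```r
# Hypothetical measurements standing in for survey$hand_span_cm
span <- c(8, 16, 18, 19, 21, 22, 30)

s <- summary(span)
s

# Individual entries can be extracted by name
s[["Median"]]

# The same numbers, computed separately
min(span)
max(span)
mean(span)
quantile(span, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile
```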