Today we will use the clean version of our data
[1] "answer_date" "id" "english_level" "sex"
[5] "birthdate" "birthplace" "height_cm" "weight_kg"
[9] "handedness" "hand_span"
There are 10 columns
birthplace was a problem
We got different names for the same city
It took a lot of work to correct it
It would be better to enforce using only one standard name for each city
That is, to use a controlled vocabulary
birthplace
[1] "-/Turkey" "Kahramanmaraş/Turkey" "Batman/Turkey"
[4] "Antalya/Turkey" "Izmir/Turkey" "Yalova/Turkey"
[7] "Adıyaman/Turkey" "Bursa/Turkey" "Istanbul/Turkey"
[10] "Istanbul/Turkey" "Van/Turkey" NA
[13] NA "Istanbul/Turkey" "Istanbul/Turkey"
[16] "Samsun/Turkey" "Mardin/Turkey" "Gaziantep/Turkey"
[19] "Istanbul/Turkey" "Bursa/Turkey" "Istanbul/Turkey"
[22] "Bursa/Turkey" "Yalova/Turkey" "Ordu/Turkey"
[25] "Istanbul/Turkey" "Istanbul/Turkey" "Edirne/Turkey"
[28] "Malatya/Turkey" NA "Hatay/Turkey"
(For this class we show only the first 30 values)
There is another data type in R, called factors
They are also known as categorical variables
They are used for discrete values, for example when there is no natural order
These are variables that you would never average
To make a factor we start with a character vector
[1] -/Turkey Kahramanmaraş/Turkey Batman/Turkey
[4] Antalya/Turkey Izmir/Turkey Yalova/Turkey
[7] Adıyaman/Turkey Bursa/Turkey Istanbul/Turkey
[10] Istanbul/Turkey Van/Turkey <NA>
[13] <NA> Istanbul/Turkey Istanbul/Turkey
[16] Samsun/Turkey Mardin/Turkey Gaziantep/Turkey
[19] Istanbul/Turkey Bursa/Turkey Istanbul/Turkey
[22] Bursa/Turkey Yalova/Turkey Ordu/Turkey
[25] Istanbul/Turkey Istanbul/Turkey Edirne/Turkey
[28] Malatya/Turkey <NA> Hatay/Turkey
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey
Notice that there are no " marks, and there is a line describing the levels
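A minimal sketch of this step. The vector `places` is a hypothetical sample holding a few of the values printed above, not the full course data.

```r
# Hypothetical sample: a few of the birthplace values shown above
places <- c("-/Turkey", "Batman/Turkey", "Istanbul/Turkey",
            "Istanbul/Turkey", NA, "Bursa/Turkey")

# factor() converts the character vector into a factor
places_factor <- factor(places)

print(places)           # character vector: values are printed inside " marks
print(places_factor)    # factor: no " marks, plus a final "Levels:" line

class(places)           # "character"
class(places_factor)    # "factor"
nlevels(places_factor)  # 4 distinct non-missing values
```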
To see the difference between text and factor, we will add a new column called place_factor
[1] "answer_date" "id" "english_level" "sex"
[5] "birthdate" "birthplace" "height_cm" "weight_kg"
[9] "handedness" "hand_span" "place_factor"
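One way the new column could be added is sketched below. The small `students` data frame built here is a hypothetical stand-in with only two of the original columns, so the example can run on its own.

```r
# Hypothetical miniature of the students data frame (only two columns)
students <- data.frame(
  id = 1:4,
  birthplace = c("Istanbul/Turkey", "Bursa/Turkey", NA, "Istanbul/Turkey"),
  stringsAsFactors = FALSE
)

# Keep the text column and add a factor version next to it
students$place_factor <- factor(students$birthplace)

colnames(students)  # the new column appears last
ncol(students)      # one more column than before
```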
summary()
Let’s compare text and factor vectors with the same data
  birthplace          place_factor
 Length:117         Istanbul/Turkey:35
 Class :character   Bursa/Turkey   : 7
 Mode  :character   Tekirdağ/Turkey: 4
                    Yalova/Turkey  : 4
                    -/Turkey       : 3
                    (Other)        :57
                    NA's           : 7
In this case factors are more useful
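The comparison can be tried on a small scale. This is only a sketch: `birthplace` below is a hypothetical six-value sample, so the counts differ from the ones in the class data.

```r
# Hypothetical sample of birthplace values, including one missing value
birthplace <- c("Istanbul/Turkey", "Bursa/Turkey", "Istanbul/Turkey",
                NA, "Istanbul/Turkey", "Yalova/Turkey")
place_factor <- factor(birthplace)

# For text, summary() only reports length, class and mode
summary(birthplace)

# For a factor, summary() counts how many times each level appears,
# and reports the missing values separately
counts <- summary(place_factor)
counts
```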
Factor is a Latin word, from facere (to do)
A factor is someone doing an action
More generally, a factor is something that has an effect on another thing
The name “factor” was first used by plant researchers to describe the things that affect the growth of plants
Let’s look again at the last line of place_factor
[1] -/Turkey Kahramanmaraş/Turkey Batman/Turkey
[4] Antalya/Turkey Izmir/Turkey Yalova/Turkey
[7] Adıyaman/Turkey Bursa/Turkey Istanbul/Turkey
[10] Istanbul/Turkey Van/Turkey <NA>
[13] <NA> Istanbul/Turkey Istanbul/Turkey
[16] Samsun/Turkey Mardin/Turkey Gaziantep/Turkey
[19] Istanbul/Turkey Bursa/Turkey Istanbul/Turkey
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey
Printing a factor shows what the valid values are
The valid values are called levels
We can ask what the levels of a factor are
[1] "-/Azerbaijan" "-/Syria" "-/Turkey"
[4] "-/Turkmenistan" "Adana/Turkey" "Adıyaman/Turkey"
[7] "Afyonkarahisar/Turkey" "Aleppo/Syria" "Almaty/Kazakhstan"
[10] "Ankara/Turkey" "Antalya/Turkey" "Aydın/Turkey"
[13] "Balıkesir/Turkey" "Batman/Turkey" "Bursa/Turkey"
[16] "Çorum/Turkey" "Edirne/Turkey" "Gaziantep/Turkey"
[19] "Hannover/Germany" "Hatay/Turkey" "Istanbul/Turkey"
[22] "Izmir/Turkey" "Kahramanmaraş/Turkey" "Karabük/Turkey"
[25] "Kırklareli/Turkey" "Konya/Turkey" "Malatya/Turkey"
[28] "Manisa/Turkey" "Mardin/Turkey" "Mersin/Turkey"
[31] "Muğla/Turkey" "Nakhchivan/Azerbaijan" "Ordu/Turkey"
[34] "Samsun/Turkey" "Sivas/Turkey" "Tekirdağ/Turkey"
[37] "Tunceli/Turkey" "Van/Turkey" "Yalova/Turkey"
The result of levels() is a character vector
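A short sketch of asking for the levels, using a hypothetical four-value factor instead of the full place_factor column:

```r
# Hypothetical small factor with one repeated value
place_factor <- factor(c("Van/Turkey", "Istanbul/Turkey",
                         "Bursa/Turkey", "Istanbul/Turkey"))

# levels() returns each valid value once, in sorted order
place_levels <- levels(place_factor)
place_levels

class(place_levels)    # "character": the levels themselves are text
nlevels(place_factor)  # number of levels
```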
We can decide the levels when we create the factor
[1] black black black
Levels: black blue white
In this case we know all possible levels, even if not all are present in the character vector
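The output above could be produced along these lines. The variable names `colors` and `color_factor` are assumptions made for this sketch.

```r
# The data contains only "black"
colors <- c("black", "black", "black")

# The levels argument declares every valid value, present or not
color_factor <- factor(colors, levels = c("black", "blue", "white"))
color_factor

# Levels that never appear are still counted, with a count of zero
table(color_factor)
```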
We can give new names to the levels
[1] siyah siyah siyah
Levels: siyah mavi beyaz
The factor is the same; only the names of the levels change
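One possible way to rename the levels is to assign to levels(). This is a sketch with an assumed variable name `color_factor`; the slides may have used a different approach.

```r
color_factor <- factor(c("black", "black", "black"),
                       levels = c("black", "blue", "white"))

# Remember the underlying values before renaming
codes_before <- as.integer(color_factor)

# Replace the level names, keeping the same positions
levels(color_factor) <- c("siyah", "mavi", "beyaz")
color_factor

# The underlying values did not change, only the names
identical(codes_before, as.integer(color_factor))
```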
English is my native language
4
I can read and understand technical papers
56
I can speak fluently
18
I can understand movies without subtitles
26
I can write poetry better than Shakespeare
1
İngilizce bilmiyorum
12
The result is not ordered by “level of knowledge”.
We do not want alphabetic order in this case.
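The default ordering can be seen in a small example. The `sizes` vector is hypothetical and unrelated to the class data; it only illustrates that factor() sorts the levels unless told otherwise.

```r
# Hypothetical values with a natural order: small < medium < large
sizes <- c("small", "large", "medium", "small")

# By default the levels are sorted alphabetically
default_factor <- factor(sizes)
levels(default_factor)

# Giving the levels explicitly sets the order we want
size_factor <- factor(sizes, levels = c("small", "medium", "large"))
levels(size_factor)
summary(size_factor)  # counts follow the order of the levels
```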
We can re-code the factor levels in the order we want
# Rebuild the factor, listing the levels from least to most knowledge of English
students$english_factor <- factor(students$english_factor,
    levels=c("İngilizce bilmiyorum",
             "I can read and understand technical papers",
             "I can understand movies without subtitles",
             "I can speak fluently",
             "English is my native language",
             "I can write poetry better than Shakespeare"))
# The counts now follow the order of the levels
summary(students$english_factor)
İngilizce bilmiyorum
12
I can read and understand technical papers
56
I can understand movies without subtitles
26
I can speak fluently
18
English is my native language
4
I can write poetry better than Shakespeare
1
Text is not very efficient
"123" takes 3 times more memory than 123
We use text when there is no better option
It is used for data that does not repeat a lot
Inside the computer, factors are encoded as numbers
[1] -/Turkey Kahramanmaraş/Turkey Batman/Turkey
[4] Antalya/Turkey Izmir/Turkey Yalova/Turkey
[7] Adıyaman/Turkey Bursa/Turkey Istanbul/Turkey
[10] Istanbul/Turkey Van/Turkey <NA>
[13] <NA> Istanbul/Turkey Istanbul/Turkey
[16] Samsun/Turkey Mardin/Turkey Gaziantep/Turkey
[19] Istanbul/Turkey Bursa/Turkey Istanbul/Turkey
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey
[1] 3 23 14 11 22 39 6 15 21 21 38 NA NA 21 21 34 29 18 21 15 21
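The numeric encoding can be inspected with as.integer(). The sketch below uses a hypothetical four-value factor, so the numbers differ from the ones printed above.

```r
# Hypothetical small factor with one missing value
place_factor <- factor(c("Istanbul/Turkey", "Bursa/Turkey",
                         NA, "Istanbul/Turkey"))

levels(place_factor)  # "Bursa/Turkey" is level 1, "Istanbul/Turkey" is level 2

# Each value is stored as the position of its level
codes <- as.integer(place_factor)
codes

# Looking the codes up in the levels gives back the original text
rebuilt <- levels(place_factor)[codes]
rebuilt
```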
Factors are so useful that classic R functions like read.table() can produce factors instead of text (this was the default before R 4.0.0, and is now requested with stringsAsFactors = TRUE)
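A small sketch of this behaviour. The text in `csv_text` is hypothetical data made up for the example, read with the `text` argument instead of a file.

```r
# Hypothetical data, written inline so no file is needed
csv_text <- "id,birthplace
1,Istanbul/Turkey
2,Bursa/Turkey
3,Istanbul/Turkey"

# Ask read.table() to turn text columns into factors
as_factors <- read.table(text = csv_text, header = TRUE, sep = ",",
                         stringsAsFactors = TRUE)
class(as_factors$birthplace)

# Ask read.table() to keep text columns as text
as_text <- read.table(text = csv_text, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)
class(as_text$birthplace)
```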
In the tidyverse we can choose which columns are text and which ones are factors
Text vectors show " marks when printed; factors do not