November 22, 2018
Tidying data is one of the important cleaning processes in the practice of data science
Tidy data sets have structure and working with them is easy
They are easy to manipulate, model and visualize
Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row
“Tidy datasets are all alike but every messy dataset is messy in its own way.”
– Hadley Wickham
n_marbles  length  repetition
0          10      1
1          10.8    1
2          12.2    1
3          14      1
4          15.7    1
5          18.1    1
6          20.5    1
0          10      2
1          10.8    2
2          12.2    2
3          14      2
4          15.7    2
5          18.1    2
6          20.5    2

n_marbels  length  repetition
0          8.0     1
0          8.0     2
0          8.0     3
1          8.5     1
1          8.8     2
1          8.8     3
2          8.9     1
2          8.9     2
2          9.0     3
3          9.4     1
3          9.2     2
3          9.1     3

n_coins  length  repetition
0        7.5cm   1
5        8 cm    1
10       9 cm    1
15       11 cm   1
0        8 cm    2
5        9 cm    2
10       10 cm   2
15       11 cm   2
0        8 cm    3
5        9 cm    3
10       10 cm   3
15       11 cm   3

n_marble  length  repetition
0         5cm     1
1         5.5cm   1
2         6cm     1
3         6.5cm   1
4         7cm     1
5         7.5cm   1
6         8.1cm   1
7         8.6cm   1
0         5cm     1
1         5.4cm   1
2         6cm     1
3         6.4cm   1
4         6.9cm   1
5         7.5cm   1
6         8.2cm   1
7         8.7cm   1

n_coins  length_cm  repetition
0        7,5        1
5        8          1
10       9          1
15       11         1
0        8          2
5        9          2
10       10         2
15       11         2
0        8          3
5        9          3
10       10         3
15       11         3

n_marbles  length  repetition
0          10      2
1          10.8    2
2          12.2    2
3          14      2
4          15.7    2
5          18.1    2
6          20.5    2

      Empty  1_Marble  2_Marbles  3_Marbles
Exp1  50.5   65.5      81.5       96.0
Exp2  50.5   67.0      82.5       98.0
Exp3  51.5   67.5      83.0       98.0
What are the units here?
             0 marble  1 marbles  2 marbles  3 marbles
Repetition1  8.4       9.5        10         10.8
Repetition2  8.3       9          10.1       10.8
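A wide table like the one above can be rearranged so that each variable is a column and each observation is a row. The following is only a sketch in base R: the data frame wide and the column names marbles_0 to marbles_3 are names assumed here for illustration, and the unit of length is left out because the table does not state it.

```r
# Hypothetical re-entry of the wide table above (column names are assumptions)
wide <- data.frame(
  repetition = c(1, 2),
  marbles_0  = c(8.4, 8.3),
  marbles_1  = c(9.5, 9.0),
  marbles_2  = c(10.0, 10.1),
  marbles_3  = c(10.8, 10.8)
)

# One row per measurement: unlist() reads the value columns one after another,
# so n_marbles repeats each count once per repetition
long <- data.frame(
  n_marbles  = rep(0:3, each = nrow(wide)),
  length     = unlist(wide[, -1], use.names = FALSE),
  repetition = rep(wide$repetition, times = 4)
)
long
```

The result has 8 rows and three columns, the same layout as the long tables shown earlier.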
Data that is easy to model, visualize and aggregate
Only one kind of object in a data frame (e.g. experiment)
Variables in columns, observations in rows
Only one unit of measurement in each column
Do not mix numbers and text
Units can be in the column name
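As an illustration of these rules, the sketch below cleans a column that mixes numbers, units and a decimal comma. The data frame messy and its values are made up for this example, and the cleaning steps are one possible approach, not the only one.

```r
# Hypothetical messy measurements: text units and a decimal comma mixed in
messy <- data.frame(
  n_coins    = c(0, 5, 10, 15),
  length     = c("7,5", "8 cm", "9cm", "11 cm"),
  repetition = 1,
  stringsAsFactors = FALSE
)

# Remove the unit, trim spaces, and turn the decimal comma into a point
clean <- gsub("cm", "", messy$length, fixed = TRUE)
clean <- trimws(clean)
clean <- sub(",", ".", clean, fixed = TRUE)

# Numbers only in the column; the unit moves to the column name
tidy <- data.frame(
  n_coins    = messy$n_coins,
  length_cm  = as.numeric(clean),
  repetition = messy$repetition
)
tidy
```

After this step length_cm is numeric, so it can be used directly in plot() and lm().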
We draw the figure using a formula
plot(weight_kg ~ height_cm, data=survey)
Using the same formula we can get a linear model
lm(weight_kg ~ height_cm, data=survey)
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -81.5045       0.8616
In science we work by creating models of how nature works
There are several kinds of models
Among the easiest and most commonly used are linear models
We approximate all our data by a straight line that shows the relationship between some variables, with a formula like \[y=a + b\cdot x\]
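To see how the formula relates to lm(), here is a small sketch with invented data that lies exactly on the line y = 2 + 3x. The data frame toy and the names a and b are assumptions made for this example; real data will not fit a line this perfectly.

```r
# Hypothetical data that follows y = 2 + 3*x exactly (values are made up)
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x
toy <- data.frame(x = x, y = y)

# lm() finds the intercept a and the slope b of the straight line
fit <- lm(y ~ x, data = toy)
a <- coef(fit)[["(Intercept)"]]  # intercept a in y = a + b*x
b <- coef(fit)[["x"]]            # slope b in y = a + b*x
c(a = a, b = b)
```

Because the points are exactly on a line, the fitted values of a and b match the 2 and 3 used to build the data, up to rounding error.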
We draw the trend line using abline() with the coefficients of lm()
plot(weight_kg ~ height_cm, data=survey)
abline(a=-81.5045, b=0.8616)

model <- lm(weight_kg ~ height_cm, data=survey)
model

Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -81.5045       0.8616
coef(model)
(Intercept)   height_cm
-81.5044904   0.8616001
coef(model)[1]
coef(model)[2]
plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(a=coef(model)[1], b=coef(model)[2])

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(model)
Beyond describing the data, models are often used to predict the output of the system for new data
In this case we need to provide a data.frame with at least one column. The column name must be the same as the one used to create the model. For example
data.frame(height_cm=155:205)
plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
guess <- predict(model, newdata = data.frame(height_cm=155:205))
points(155:205, guess, col="red", pch=19)
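For a linear model, the prediction is just the straight-line formula applied to the new values. The sketch below checks this on a small invented data set; toy_survey and its numbers are assumptions standing in for the real survey data.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
toy_survey <- data.frame(
  height_cm = c(155, 160, 168, 172, 180, 185),
  weight_kg = c(52, 58, 63, 70, 74, 82)
)
model <- lm(weight_kg ~ height_cm, data = toy_survey)

# New data must use the same column name as the model formula
new_people <- data.frame(height_cm = c(165, 175))

by_predict <- predict(model, newdata = new_people)
by_hand    <- coef(model)[1] + coef(model)[2] * new_people$height_cm

# Both routes give the same numbers
all.equal(unname(by_predict), unname(by_hand))
```

If the column in newdata had a different name, predict() would not find the variable it needs, which is why the names must match.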
n_marbles is 0, 5, 10, 20, and 50
plot(survey$height_cm, survey$weight_kg)
plot(survey$weight_kg ~ survey$height_cm)
The data= option gives the context for the formula
plot(survey$weight_kg ~ survey$height_cm)
plot(weight_kg ~ height_cm, data = survey)
You can do it like this
plot(survey$height_cm[ survey$Gender=="Female"], survey$weight_kg[ survey$Gender=="Female"])
or like this
plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])
Instead of this
plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])
We can do this
grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)
We can select using indices …
grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)
… or with subset()
grl <- subset(survey, Gender=="Female")
plot(grl$height_cm, grl$weight_kg)
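The two ways of selecting rows usually agree, but they treat missing values differently. The sketch below uses an invented data frame, toy_survey, with one missing Gender to show the difference; the data are an assumption made only for this example.

```r
# Hypothetical data with one missing Gender value (values are made up)
toy_survey <- data.frame(
  Gender    = c("Female", "Male", NA, "Female"),
  height_cm = c(160, 178, 170, 165),
  weight_kg = c(55, 80, 68, 60)
)

# Selecting with indices keeps a row full of NA for the missing Gender
by_index <- toy_survey[toy_survey$Gender == "Female", ]

# subset() drops rows where the condition is NA
by_subset <- subset(toy_survey, Gender == "Female")

nrow(by_index)   # 3
nrow(by_subset)  # 2
```

When the selection column has no missing values, both approaches return the same rows.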
subset() with formula
You don’t need to use $
girls <- subset(survey, Gender=="Female")
plot(girls$weight_kg ~ girls$height_cm)
data= is the formula’s context
girls <- subset(survey, Gender=="Female")
plot(weight_kg ~ height_cm, data = girls)
subset= option
Instead of using subset() …
plot(weight_kg ~ height_cm, data = subset(survey, Gender=="Female"))
you can use the subset= option
plot(weight_kg ~ height_cm, data = survey, subset = Gender=="Female")
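The subset= option is also accepted by lm(), so the same selection can be used for the model. This is a sketch on invented data; toy_survey and its values are assumptions, not the real survey.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
toy_survey <- data.frame(
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  height_cm = c(158, 165, 171, 170, 178, 186),
  weight_kg = c(52, 58, 64, 66, 75, 84)
)

# Model for one group using the subset= option of lm()
model_option <- lm(weight_kg ~ height_cm, data = toy_survey,
                   subset = Gender == "Female")

# The same model using subset() on the data
model_function <- lm(weight_kg ~ height_cm,
                     data = subset(toy_survey, Gender == "Female"))

# Both give the same coefficients
all.equal(coef(model_option), coef(model_function))
```

Using subset= keeps the original data frame in data=, so the same call can be repeated for another group by changing only the condition.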
par(mfrow=c(1,2))
plot(weight_kg ~ height_cm, data=survey, subset=Gender=="Female", main="Girls")
plot(weight_kg ~ height_cm, data=survey, subset=Gender=="Male", main="Boys")
There are many graphical parameters that can be changed with the function par()
It is a good idea to read the manual page help(par)
Here we use the parameter mfrow: a vector c(num_rows, num_columns)
After doing par(mfrow=c(num_rows, num_columns)) all following figures will be drawn in a num_rows-by-num_columns grid
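A change made with par() stays active for later figures, so it is often convenient to restore the old value afterwards. The sketch below shows one possible way to do that; the function draw_by_gender and the data frame toy_survey are hypothetical names and values introduced for this example.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
toy_survey <- data.frame(
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  height_cm = c(158, 165, 171, 170, 178, 186),
  weight_kg = c(52, 58, 64, 66, 75, 84)
)

# Draw two panels side by side, then put the layout back as it was
draw_by_gender <- function(data) {
  old_par <- par(mfrow = c(1, 2))  # par() returns the previous setting
  on.exit(par(old_par))            # restore it when the function finishes
  plot(weight_kg ~ height_cm, data = data,
       subset = Gender == "Female", main = "Girls")
  plot(weight_kg ~ height_cm, data = data,
       subset = Gender == "Male", main = "Boys")
  invisible(NULL)
}
```

Calling draw_by_gender(toy_survey) draws the two panels, and the next figure after it goes back to using the layout that was active before the call.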