December 3, 2019
id rep N x1 x2 y1 y2
1 a 1 40 125 143 225
2 a 1 50 155 175 280
3 a 1 40 195 215 375
4 a 1 10 192 212 400
5 a 1 110 258 278 435
1 b 1 40 125 143 225
2 b 1 50 156 174 280
3 b 1 40 156 175 375
4 b 1 10 238 258 400
5 b 1 110 261 281 435
1 c 1 40 125 143 225
2 c 1 50 155 174 280
3 c 1 40 196 214 375
4 c 1 10 193 212 400
5 c 1 110 260 279 435

id rep N x1 x2 y1 y2
1 a 1 171 180 182 188
2 a 1 184 176 181 208
3 a 1 188 195 191 212
5 a 1 179 180 177 205
1 b 2 191 195 202 199
2 b 2 190 192 189 205
3 b 2 191 194 207 214
4 b 2 210 208 210 222
5 b 2 220 217 205 206
1 c 2 191 195 202 211
2 c 2 190 192 189 202
3 c 2 191 194 207 205
4 c 2 210 208 210 212
5 c 2 220 217 205 207

id rep N x1 x2 x3
1 a 2 40 15.7 13.0 11.3
2 a 2 50 21.3 15.0 13.7
3 a 2 60 25.0 18.8 16.2
4 a 2 70 28.8 21.8 19.4
5 a 2 80 32.2 25.2 22.8
1 b 2 40 15.7 12.9 11.4
2 b 2 50 20.7 15.7 13.6
3 b 2 60 24.8 18.5 16.6
4 b 2 70 28.9 21.7 19.2
5 b 2 80 32.1 25.3 22.6
1 c 2 40 15.8 12.9 11.3
2 c 2 50 20.5 15.7 13.8
3 c 2 60 24.3 19.3 16.4
4 c 2 70 28.7 21.9 19.4
5 c 2 80 32.3 25.3 22.4

id rep x1 x2 y1 y2
1 a 1 150 255 275 370
2 a 1 100 200 220 300
3 a 1 200 290 310 396
4 a 1 210 305 325 410
5 a 1 150 274 293 400
1 b 1 150 256 274 370
2 b 1 100 192 210 300
3 b 1 200 290 309 396
4 b 1 210 303 321 410
5 b 1 150 275 293 400
1 c 1 150 256 274 370
2 c 1 100 192 210 300
3 c 1 200 290 309 396
4 c 1 210 303 321 410
5 c 1 150 275 293 400

id rep N x1 x2 x2 y1 y2
1 a 1 250 346 370 475
2 a 1 250 365 385 495
3 a 1 250 378 398 515
4 a 1 250 365 384 495
5 a 1 250 359 378 475
1 b 1 250 349 369 475
2 b 1 250 350 370 495
3 b 1 250 365 385 515
4 b 1 250 357 397 495
5 b 1 250 349 369 475
1 c 1 250 348 368 475
2 c 1 250 356 376 495
3 c 1 250 364 384 515
4 c 1 250 356 376 495
7 c 1 250 350 370 475

id rep x1 x2 y1 y2 n
1 a 1 355 356 372 416
2 a 1 384 450 380 382
3 a 1 420 446 740 775
4 a 1 434 442 775 670
5 a 1 425 460 755 759
1 b 1 256 290 632 705
2 b 1 295 306 630 650
3 b 1 285 296 630 660
4 b 1 260 280 650 680
5 b 1 280 350 700 720
1 c 1 300 290 725 733
2 c
3 c
4 c
5 c
Tidying data is one of the most important cleaning processes in the practice of data science
Tidy data sets have structure and working with them is easy
They are easy to manipulate, model and visualize
Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row
“Tidy datasets are all alike but every messy dataset is messy in its own way.”
– Hadley Wickham
Data that is easy to model, visualize and aggregate
Only one kind of object in a data frame (e.g. experiment)
Variables in columns, observations in rows
Only one unit of measurement in each column
Do not mix numbers and text
Units can be in the column name
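As a minimal sketch of these rules, the following R code builds a small tidy data frame. The values and the names `measurements`, `mass_g` and `length_mm` are hypothetical; they are not taken from the tables above.

```r
# Hypothetical example: one row per observation, one column per variable,
# units written in the column names, and no text mixed into numeric columns.
measurements <- data.frame(
  id        = c(1, 2, 3),
  rep       = c("a", "a", "a"),
  mass_g    = c(12.5, 13.1, 11.8),
  length_mm = c(40, 50, 60)
)

# Check that the measured columns are numeric.
# A value such as "12.5 g" would turn the whole column into text.
sapply(measurements[c("mass_g", "length_mm")], is.numeric)
```

Keeping the unit in the column name leaves the column itself purely numeric, so it can be used directly in plots and models.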
We draw the figure using a formula
plot(weight_kg ~ height_cm, data=survey)
Using the same formula we can get a linear model
lm(weight_kg ~ height_cm, data=survey)
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -77.1697       0.8382
In science we work by creating models of how nature works
There are several kinds of models
Linear models are among the easiest and most commonly used
We approximate all our data by a straight line that shows the relationship between some variables, with a formula like \[y=a + b\cdot x\]
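To see what the coefficients a and b mean, here is a small sketch with made-up data. The data frame `toy` and the values 2 and 3 are assumptions chosen for illustration; the points lie exactly on the line y = 2 + 3x, so lm() should recover those two numbers.

```r
# Hypothetical data that follow y = 2 + 3*x exactly
toy <- data.frame(x = 1:10)
toy$y <- 2 + 3 * toy$x

# Fit the straight line y = a + b*x
toy_model <- lm(y ~ x, data = toy)

# The first coefficient is the intercept a, the second is the slope b
coef(toy_model)
```

With real data the points do not lie exactly on a line, and lm() returns the line that best approximates them.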
We draw the trend line using abline() with the coefficients of lm()
plot(weight_kg ~ height_cm, data=survey)
abline(a=-77.169742, b=0.838182)
model <- lm(weight_kg ~ height_cm, data=survey)
model
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -77.1697       0.8382
coef(model)
(Intercept)   height_cm
 -77.169742    0.838182
coef(model)[1]
coef(model)[2]
plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(a=coef(model)[1], b=coef(model)[2])

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(model)
Beyond describing the data, models are often used to predict the output of the system for new data
In this case we need to provide a data.frame with at least one column. The column name must be the same as the one used to create the model. For example
data.frame(height_cm=155:205)
plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
guess <- predict(model, newdata=data.frame(height_cm=155:205))
points(155:205, guess, col="red", pch=19)
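The same prediction step can be tried without the survey data. The sketch below uses a hypothetical data frame `toy` whose points lie exactly on the line weight_kg = -80 + 0.85 * height_cm; these numbers are assumptions, not the survey results.

```r
# Hypothetical data on an exact straight line
toy <- data.frame(height_cm = c(150, 160, 170, 180, 190))
toy$weight_kg <- -80 + 0.85 * toy$height_cm

toy_model <- lm(weight_kg ~ height_cm, data = toy)

# The column in newdata must be called height_cm, as in the formula
new_people <- data.frame(height_cm = c(155, 175, 205))
guess <- predict(toy_model, newdata = new_people)
guess
```

Note that 205 cm is outside the range of heights used to fit the model; predict() still returns a value, but such extrapolations should be read with care.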
dx=x2-x1 and dy=y2-y1
d2 ~ d1 when N==1
d2 when d1 is 100, 200, and 500
plot(survey$height_cm, survey$weight_kg)
plot(survey$weight_kg ~ survey$height_cm)
The data= option gives the context for the formula
plot(survey$weight_kg ~ survey$height_cm)
plot(weight_kg ~ height_cm, data = survey)
You can do it like this
plot(survey$height_cm[ survey$Gender=="Female"], survey$weight_kg[ survey$Gender=="Female"])
or like this
plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])
Instead of this
plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])
We can do this
grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)
We can select using indices …
grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)
… or with subset()
grl <- subset(survey, Gender=="Female")
plot(grl$height_cm, grl$weight_kg)
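The two ways of selecting rows can be compared directly. This is a small sketch with a hypothetical data frame `toy` that has the same column names as the survey; the values are invented.

```r
# Hypothetical data with the same column names as the survey
toy <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male"),
  height_cm = c(160, 175, 165, 180),
  weight_kg = c(55, 70, 60, 82)
)

by_index  <- toy[toy$Gender == "Female", ]   # select rows using indices
by_subset <- subset(toy, Gender == "Female") # select rows using subset()

# Compare the two results
all.equal(by_index, by_subset)
```

One difference to keep in mind: when the condition is NA for some rows, subset() drops those rows, while logical indexing returns them as rows filled with NA.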
subset() with formula
You don’t need to use $
girls <- subset(survey, Gender=="Female")
plot(girls$weight_kg ~ girls$height_cm)
data= is the formula’s context
girls <- subset(survey, Gender=="Female")
plot(weight_kg ~ height_cm, data = girls)
subset= option
Instead of using subset() …
plot(weight_kg ~ height_cm, data = subset(survey, Gender=="Female"))
you can use the subset= option
plot(weight_kg ~ height_cm, data = survey, subset = Gender=="Female")
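lm() also accepts a subset= argument, so the same idea works for models. The sketch below checks, on a hypothetical data frame `toy` with invented values, that the subset= option and the subset() function lead to the same coefficients.

```r
# Hypothetical data with the same column names as the survey
toy <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female"),
  height_cm = c(160, 175, 165, 180, 170),
  weight_kg = c(55, 70, 61, 82, 63)
)

# Model fitted on the rows selected with the subset= option
with_option <- lm(weight_kg ~ height_cm, data = toy, subset = Gender == "Female")

# Model fitted on the rows selected with the subset() function
with_function <- lm(weight_kg ~ height_cm, data = subset(toy, Gender == "Female"))

# Compare the coefficients of both models
all.equal(coef(with_option), coef(with_function))
```

As in plot(), the condition in subset= is evaluated in the context given by data=, so there is no need to write toy$Gender.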
par(mfrow=c(1,2), mar=c(5,4,2,0))
plot(weight_kg ~ height_cm, survey, subset=Gender=="Female", main="Girls")
plot(weight_kg ~ height_cm, survey, subset=Gender=="Male", main="Boys")
There are many graphical parameters that can be changed with the function par()
It is a good idea to read the manual page help(par)
Here we use the parameter mfrow: a vector c(num_rows, num_columns)
After doing par(mfrow=c(num_rows, num_columns)) all figures will be drawn in a num_rows-by-num_columns grid
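Since par() changes the settings for all later figures, it can be useful to save the old values and restore them afterwards. The following is a sketch under the assumption of a hypothetical data frame `toy`; the names `old` and `layout_used` are also only illustrative.

```r
# Hypothetical data, only used to have something to draw
toy <- data.frame(x = 1:10, y = (1:10)^2)

# 1 row and 2 columns; par() returns the previous values of what it changes
old <- par(mfrow = c(1, 2))

plot(y ~ x, data = toy, main = "Left panel")
plot(x ~ y, data = toy, main = "Right panel")

layout_used <- par("mfrow")  # the layout that was active while drawing
par(old)                     # restore the previous layout
```

Without the last line, every following figure would still be drawn in the 1-by-2 layout.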