people <- list(
    c(60, 72, 57, 90, 95, 72),
    c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
    c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"),
    TRUE,
    factor(rep("M", 6), levels=c("M", "F")))
people
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"

[[4]]
[1] TRUE

[[5]]
[1] M M M M M M
Levels: M F
people[1:2]
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[1]
[[1]]
[1] 60 72 57 90 95 72
people[[1]]
[1] 60 72 57 90 95 72
people <- list(
    weight=c(60, 72, 57, 90, 95, 72),
    height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
    names=c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"),
    valid=TRUE,
    gender=factor(rep("M", 6), levels=c("M", "F")))
(How else can we assign names?)
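One possible answer, as a minimal sketch: names can also be assigned after the list exists, using names() or setNames(). The short lists p and q below are made up for illustration and are not part of the lecture data.

```r
# Illustrative only: a short unnamed list, with names assigned afterwards
p <- list(c(60, 72, 57), c(1.75, 1.80, 1.65), c("Peter", "John", "Frank"))
names(p) <- c("weight", "height", "names")
p$weight

# A single name can be changed by position
names(p)[3] <- "first_name"
names(p)

# setNames() builds and names a list in one expression
q <- setNames(list(1:3, letters[1:3]), c("num", "chr"))
names(q)
```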
people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"

$valid
[1] TRUE

$gender
[1] M M M M M M
Levels: M F
people[1:2]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91
This is a sublist:
people[1]
$weight
[1] 60 72 57 90 95 72
This is a single element:
people[[1]]
[1] 60 72 57 90 95 72
people[["weight"]]
people$weight
Try these
people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]
people[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[2]
$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[[2]][3]
[1] 1.65
people[2][3]
$<NA> NULL
people[[1:3]]
Error in people[[1:3]]: recursive indexing failed at level 2
people[1:3]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"
people[["weight"]]
[1] 60 72 57 90 95 72
people$weight
[1] 60 72 57 90 95 72
people["weight"]
$weight
[1] 60 72 57 90 95 72
If key <- "names", what is the difference between the following?
people[[key]]
people[[names]]
people$key
people$names
Explain
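A small sketch that may help check your explanation. It uses a shortened, made-up list pp instead of people, and wraps the failing call in try() so the script keeps running.

```r
# Shortened, illustrative list (not the full lecture data)
pp <- list(weight = c(60, 72), names = c("Peter", "John"))
key <- "names"

a <- pp[[key]]   # key is evaluated first, so this is pp[["names"]]
b <- pp$names    # $ takes the word after it literally
d <- pp$key      # looks for an element literally called "key": NULL

# Here names is not a string but the built-in function names(),
# which is not a valid index, so this call fails
res <- try(pp[[names]], silent = TRUE)
class(res)
```

The design point is that [[ ]] evaluates its argument, while $ never does.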
Indices can also be used to change specific parts of a list.
Try each of the following and explain the result:
people$names <- toupper(people$names)
people$BMI <- people$weight/people$height^2
people$valid <- NULL
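The same three assignments, sketched on a shortened, made-up list pp so that the effect of each one is easy to see.

```r
# Shortened, illustrative list (not the full lecture data)
pp <- list(weight = c(60, 72), height = c(1.75, 1.80),
           names = c("Peter", "John"), valid = TRUE)

pp$names <- toupper(pp$names)      # replaces an existing element
pp$BMI <- pp$weight / pp$height^2  # a new name appends a new element
pp$valid <- NULL                   # assigning NULL removes the element

names(pp)
```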
ppl <- data.frame(
    weight=c(60, 72, 57, 90, 95, 72),
    height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
    names=c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"),
    gender=factor(rep("M", 6), levels=c("F", "M")))
ppl
  weight height names gender
1     60   1.75 Peter      M
2     72   1.80  John      M
3     57   1.65 Frank      M
4     90   1.90  Huey      M
5     95   1.74 Dewey      M
6     72   1.91 Louie      M
Look for the documentation of read.table()
Read the file birth.txt
into the data.frame birth
Do summary(birth)
What is this?
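A minimal sketch of the reading step. The layout of birth.txt is not shown here, so the example writes a small made-up file first. The column names, the space separator and header=TRUE are assumptions; check the real file before copying them.

```r
# Hypothetical contents: the real birth.txt may have other columns or separators
tmp <- tempfile(fileext = ".txt")
writeLines(c("head sex", "34 M", "33 F", "35 F"), tmp)

# header = TRUE tells read.table() that the first line holds the column names
demo <- read.table(tmp, header = TRUE)
summary(demo)
```

For the real data, the call would look like read.table("birth.txt", header=TRUE), again assuming the file has a header line.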
We have data, and we want to say something about them
How can we summarize all the values in a few numbers?
Let’s use the vector birth$head.
To make it easier, let’s rename it to v
v <- birth$head
length(v)
nrow(birth)
dim(birth)
table(birth$sex)
If you have to describe the vector v with a single number X, which would it be?
If we have to replace every v[i] by a single number, which number is “the best”?
Better to choose the one that is the “least wrong”
How can X be wrong?
Many alternatives
x!=v[i]
sum(abs(v-x))
sum((v-x)^2)
sum(abs(v-x))
Which x
minimizes absolute error?
If \(x\) is the median of v, then
half of the values in v are smaller than x
half of the values in v are bigger than x
sum((v-x)^2)
Which x
minimizes squared error?
The mean value of v is \[\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of v.
Sometimes it is written as \(\bar{v}\)
This value is usually called the average
median(v)
mean(v)
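A quick numerical check of both claims, as a sketch. The vector v below is made up for illustration (in the lecture v is birth$head), and the search simply tries a grid of candidate values for x.

```r
# Made-up numbers for illustration only
v <- c(31, 33, 34, 34, 35, 36, 40)

abs_err <- function(x) sum(abs(v - x))   # total absolute error
sq_err  <- function(x) sum((v - x)^2)    # total squared error

# Try every candidate between min(v) and max(v) in steps of 0.01
candidates <- seq(min(v), max(v), by = 0.01)
best_abs <- candidates[which.min(sapply(candidates, abs_err))]
best_sq  <- candidates[which.min(sapply(candidates, sq_err))]

c(best_abs, median(v))   # the two values should be close
c(best_sq, mean(v))      # the two values should be close
```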
If we approximate v by x, how good is this approximation?
Mean of absolute error \[\mathrm{abs.err}(v,x) = \frac{1}{n}\sum_{i=1}^n \vert v_i-x\vert\] In R code we write
sum(abs(v-x))/length(v)
It is minimized when x==median(v).
If \(n\) is the length of \(v\), then \[\mathrm{quad.err}(v,x) = \frac{1}{n}\sum_{i=1}^n (v_i-x)^2\] In R code we write
sum((v-x)^2)/length(v)
It is minimized when x==mean(v)
In that case this number is called the variance of the sample.
The variance of the sample v is
var(v) = sum((v-mean(v))^2)/length(v)
which is a number in squared units, so it is hard to compare with the mean value
The standard deviation of the sample is the square root of it
sd(v) = sqrt(sum((v-mean(v))^2)/length(v))
\[\mathrm{sd}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}\]
In many cases, including R's sd() function, people use a slightly different formula \[\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}\] Explaining the reason is for a later course.
(It is because dividing by \(n\) gives a biased estimate of the variance of the population)
This value is used as an estimate of the standard deviation of the population
The difference is small, especially when \(n\) is big
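The two formulas can be compared directly, as in this sketch. The vector v is made up for illustration; sd_n and sd_n1 are names chosen here, not R functions.

```r
# Made-up numbers for illustration only
v <- c(31, 33, 34, 34, 35, 36, 40)
n <- length(v)

sd_n  <- sqrt(sum((v - mean(v))^2) / n)        # divides by n
sd_n1 <- sqrt(sum((v - mean(v))^2) / (n - 1))  # divides by n - 1

sd(v)          # R's built-in sd() agrees with sd_n1
sd_n / sd_n1   # equals sqrt((n - 1) / n), which approaches 1 as n grows
```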
Quart means one fourth.
If we split v into four sets of the same size,
what are the limits of these sets?
\[ Q_0, Q_1, Q_2, Q_3, Q_4\]
It is easy to know \(Q_0, Q_2\) and \(Q_4\)
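A short sketch with R's quantile() function, which returns these five limits when called with its default arguments. The vector v is made up for illustration.

```r
# Made-up numbers for illustration only
v <- c(31, 33, 34, 34, 35, 36, 40)

# With default arguments, quantile() uses the probabilities 0, 0.25, 0.5, 0.75 and 1
q <- quantile(v)
q

# The easy ones: Q0 is the minimum, Q2 the median, Q4 the maximum
c(min(v), median(v), max(v))
```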