December 17, 2019
plot(sra)
:
plot(day~log(bases), data = sra)
:
plot(log(bases)~log(day), data=sra)
:
plot(bases~day, data=sra, log="y")
plot(log(bases) ~ day, data = sra)
sra <- read.table("sra_bases.txt", header=TRUE) model <- lm(log(bases) ~ day, data=sra) model
Call: lm(formula = log(bases) ~ day, data = sra) Coefficients: (Intercept) day 28.34146 0.00248
In a semi-log model, we have \[\log(\text{bases}) = \underbrace{\log(\text{initial})}_{\texttt{coef(model)[1]}} + \underbrace{\log(\text{rate})}_{\texttt{coef(model)[2]}} \cdot \text{days}\]
Undoing log()
, we have \[\text{bases} = \text{initial}\cdot\text{rate}^\text{days}\]
Thus, \(\text{rate}=\)exp(coef(model)[2])
\(=1.0025\)
If \(\text{rate}=1.0025\) it means that the database grows \(0.25\%\) every day
To make predictions using the model, we usepredict(model, newdata)
We need a data frame with one column named day
, because we used day
in the formula
The last day in sra
is max(sra$day)
Two years after the last will be max(sra$day) + 2 * 365
so newdata=data.frame(day = max(sra$day) + 2 * 365)
height | speed | time |
---|---|---|
2.0000 | -0.20 | 0.0001 |
1.9778 | -0.69 | 0.0496 |
1.9310 | -1.18 | 0.0999 |
1.8598 | -1.67 | 0.1492 |
1.7640 | -2.16 | 0.2004 |
1.6438 | -2.65 | 0.2512 |
1.4990 | -3.14 | 0.3002 |
1.3297 | -3.63 | 0.3500 |
1.1360 | -4.12 | 0.4000 |
0.9177 | -4.61 | 0.4490 |
0.6750 | -5.10 | 0.5008 |
We can use a logarithmic scale to see if we get a straight line
There are many possible functions that connect \(x\) and \(y\)
We will try a polynomial formula
A polynomial is a formula like \[y=A + B\cdot x + C\cdot x^2 +\cdots+ D\cdot x^n\]
The highest exponent is the degree of the polynomial
We will try a 2-degree formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\]
We need to add an extra column, with the square of the time
ball$time_sq <- ball$time^2 head(ball)
height speed time time_sq 1 2.000 -0.20 0.00010 1.000e-08 2 1.978 -0.69 0.04961 2.461e-03 3 1.931 -1.18 0.09986 9.972e-03 4 1.860 -1.67 0.14919 2.226e-02 5 1.764 -2.16 0.20037 4.015e-02 6 1.644 -2.65 0.25118 6.309e-02
You may be tempted to use
model <- lm(height ~ time^2, data=ball)
But that does not work.
Exponents and multiplications in formulas have a different meaning
(that we will see later)
Now we use two independent variables in the model. This is different from the plot()
function
model <- lm(height ~ time + time_sq, data=ball) model
Call: lm(formula = height ~ time + time_sq, data = ball) Coefficients: (Intercept) time time_sq 2.000 -0.195 -4.912
We used the formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\] The linear model gave us \[\begin{aligned} A&=2\\ B&=-0.195\\ C&=-4.9 \end{aligned}\]
plot(height ~ time, data=ball) lines(predict(model) ~ time, data=ball, col="red")
fall <- read.table("free-fall.txt", header=TRUE) fall
msec position replica 1 2355 10 A 2 2457 20 A 3 2533 30 A 4 2558 40 A 5 2639 50 A 6 2690 60 A 7 2742 70 A 8 2742 80 A 9 2797 90 A 10 1837 10 B 11 1970 20 B 12 2022 30 B 13 2104 40 B 14 2129 50 B 15 2210 60 B 16 2237 70 B 17 2262 80 B 18 2313 90 B
plot(position~msec, data=fall)
msec
. It means millisecondsLet’s add a new column, called time
, with the correct value of seconds
fall$time <- fall$msec/1000
Now, for each replica, we change time
to start at 0
old <- fall$time[fall$replica=="A"] fall$time[fall$replica=="A"] <- old - min(old) old <- fall$time[fall$replica=="B"] fall$time[fall$replica=="B"] <- old - min(old)
We should also transform position to meters. The first data should be at 0
Each black/white strip was a little more than 19 centimeters
fall$height <- fall$position/10 * 0.192 old <- fall$height[fall$replica=="A"] fall$height[fall$replica=="A"] <- min(old) - old old <- fall$height[fall$replica=="B"] fall$height[fall$replica=="B"] <- min(old) - old
fall[fall$replica=="A",]
msec position replica time height 1 2355 10 A 0.000 0.000 2 2457 20 A 0.102 -0.192 3 2533 30 A 0.178 -0.384 4 2558 40 A 0.203 -0.576 5 2639 50 A 0.284 -0.768 6 2690 60 A 0.335 -0.960 7 2742 70 A 0.387 -1.152 8 2742 80 A 0.387 -1.344 9 2797 90 A 0.442 -1.536
fall[fall$replica=="B",]
msec position replica time height 10 1837 10 B 0.000 0.000 11 1970 20 B 0.133 -0.192 12 2022 30 B 0.185 -0.384 13 2104 40 B 0.267 -0.576 14 2129 50 B 0.292 -0.768 15 2210 60 B 0.373 -0.960 16 2237 70 B 0.400 -1.152 17 2262 80 B 0.425 -1.344 18 2313 90 B 0.476 -1.536
plot(height~time, data=fall, pch=as.character(replica))
This is a parabola, which can be modeled with a second degree polynomial \[y=A + B\cdot x + C\cdot x^2\] When we use a polynomial model, we need to add new columns to our data frame
fall$time_sq <- fall$time^2
This is used only for the model
model_A <- lm(height ~ time + time_sq, data=fall, subset=replica=="A") model_A
Call: lm(formula = height ~ time + time_sq, data = fall, subset = replica == "A") Coefficients: (Intercept) time time_sq -0.00131 -1.55751 -4.26664
model_B <- lm(height ~ time + time_sq, data=fall, subset=replica=="B") model_B
Call: lm(formula = height ~ time + time_sq, data = fall, subset = replica == "B") Coefficients: (Intercept) time time_sq 0.00377 -1.08373 -4.57079
Since we choose that at time=0
we have height=0
, we know that (Intercept)
is 0
In other words, our model here is \[y= B\cdot x + C\cdot x^2\] The old \(A\) has a fixed value of 0
How do we do that in R?
new_model_A <- lm(height ~ time + time_sq + 0, data=fall, subset=replica=="A") new_model_A
Call: lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == "A") Coefficients: time time_sq -1.57 -4.25
new_model_B <- lm(height ~ time + time_sq + 0, data=fall, subset=replica=="B") new_model_B
Call: lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == "B") Coefficients: time time_sq -1.06 -4.61
If you remember your physics lessons, you can recognize that in the formula \[y= B\cdot x + C\cdot x^2\] the value \(B\) is the initial speed, and \(C=-g/2\)
Thus, we can find the gravitational acceleration in the second coefficient
The gravitational acceleration is constant and independent of the replica
But the initial speed depends on the replica
We want to combine both replicas to get
combi <- lm(height ~ time:replica + time_sq + 0, data=fall) combi
Call: lm(formula = height ~ time:replica + time_sq + 0, data = fall) Coefficients: time_sq time:replicaA time:replicaB -4.46 -1.49 -1.12
The symbol :
means interaction
+0
in the formula forces the intercept to 0:
allows us to combine two variables and measure their interactionsurvey
datasurvey <- read.table("survey1-tidy.txt") colnames(survey)
[1] "Gender" "birth_day" [3] "birth_month" "birth_year" [5] "height_cm" "weight_kg" [7] "handness" "hand_span_cm"
plot(weight_kg~Gender, data=survey)
lm(weight_kg~Gender, data=survey)
Call: lm(formula = weight_kg ~ Gender, data = survey) Coefficients: (Intercept) GenderMale 57.8 18.7
model <- lm(weight_kg ~ Gender, data = survey)
Call: lm(formula = weight_kg ~ Gender, data = survey) Coefficients: (Intercept) GenderMale 57.8 18.7
Here (Intercept)
is “normal weight”, and GenderMale
is the extra weight when Gender=='Male'
(on average)
\[\text{weight_kg} = coef(model)[1] + coef(model)[2]\cdot\text{[Gender is 'Male']}\]
lm(weight_kg ~ Gender + 0, data = survey)
Call: lm(formula = weight_kg ~ Gender + 0, data = survey) Coefficients: GenderFemale GenderMale 57.8 76.6
Now the formula includes +0
. This means “do not use intercept”. Now
GenderMale
is the average weight when Gender=='Male'
GenderFemale
is the average weight when Gender=='Female'
Let’s start with the simplest model
plot(weight_kg ~ height_cm, data=survey, col=Gender) model1 <- lm(weight_kg ~ height_cm, data=survey) points(predict(model1) ~ height_cm, data=survey, col="green3", pch=16)
model1
Call: lm(formula = weight_kg ~ height_cm, data = survey) Coefficients: (Intercept) height_cm -81.504 0.862
Obviously the weight also depends on the gender
plot(weight_kg ~ height_cm, data=survey, col=Gender) model2 <- lm(weight_kg ~ height_cm + Gender+0, data = survey) points(predict(model2) ~ height_cm, data=survey, col="green3", pch=16)
model2
Call: lm(formula = weight_kg ~ height_cm + Gender + 0, data = survey) Coefficients: height_cm GenderFemale GenderMale 0.279 11.688 26.896
There are two lines with the same slope: is the coefficient for height_cm
The “intercept” of each line are the coefficients GenderFemale
and GenderMale
+0
in the formula:
to indicate interaction*
in formulas