To improve our understanding of t-test and ANOVA in linear models, we can use simulation.
First, we need a function to create a data frame of random values. We can call it `create_random_data(n)`. The input `n` indicates the number of rows. It must return a data frame with 3 columns called `x1`, `x2`, and `y`. The values should be chosen randomly following a Normal distribution with mean zero and variance 1.

Then we need a function that takes a data frame as input and returns a vector with values taken from `summary(lm(y ~ x1 + x2, data))`. In particular we want to get:

- The coefficients estimated by the linear model. This is the first column of the field `coefficients` of the output of `summary`. They will be `(Intercept)`, `x1` and `x2`. I would like to call them \(β_0, β_1, β_2\), or at least `B0`, `B1`, `B2`.
- The t-values computed by the linear model. This is the third column of the field `coefficients` of the output of `summary`. Let’s call them `t0`, `t1`, `t2`.
- The p-values computed by the linear model. This is the fourth column of the field `coefficients` of the output of `summary`. Let’s call them `p0`, `p1`, `p2`.
- The F statistic and the degrees of freedom, taken from the field `fstatistic` of the output of `summary`. We call them `f`, `df1`, `df2`.
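
As a minimal sketch of the first helper, `create_random_data(n)` could be written with `rnorm()`. The seed and the 5-row usage example are assumptions added only for illustration.

```r
# Sketch of create_random_data(n): a data frame with n rows and
# columns x1, x2, y, each drawn independently from a Normal
# distribution with mean 0 and variance 1 (so sd = 1).
create_random_data <- function(n) {
  data.frame(
    x1 = rnorm(n, mean = 0, sd = 1),
    x2 = rnorm(n, mean = 0, sd = 1),
    y  = rnorm(n, mean = 0, sd = 1)
  )
}

# Usage: a small data frame with 5 rows
set.seed(1)  # assumed seed, only to make the example reproducible
example_data <- create_random_data(5)
str(example_data)
```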
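
The second helper could look roughly like the sketch below. The function name `get_model_values` is hypothetical (the text does not name it), and the 20-row example data frame is an assumption used only to show the output.

```r
# Hypothetical name: the text does not say what to call this function.
get_model_values <- function(data) {
  s  <- summary(lm(y ~ x1 + x2, data = data))
  co <- s$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
  fs <- s$fstatistic    # named vector: value, numdf, dendf
  ans <- c(co[, 1], co[, 3], co[, 4], fs)
  names(ans) <- c("B0", "B1", "B2",
                  "t0", "t1", "t2",
                  "p0", "p1", "p2",
                  "f", "df1", "df2")
  ans
}

# Usage on an assumed example data frame with 20 rows
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20), y = rnorm(20))
values <- get_model_values(d)
print(values)
```

With two predictors and 20 rows, `df1` is 2 and `df2` is 17 (20 rows minus 3 estimated coefficients).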
Now we want to make several hundred replicas of the full process of generating a random data frame, building a linear model on it, and getting the relevant parameters from the model. We can use `n=3` initially. We collect all results in a data frame, one row for each simulation, one column for each of the 12 parameters.

We add an extra column `pval`, with the p-value for the F statistic. We need this calculation because the object returned by `summary()` does not store it for us, even though the printed summary shows it.

Finally we plot `B1` versus `B2`, coloring the points according to the significance of `p1`, `p2`, or `pval`, one plot for each. We can draw similar plots using `t1` versus `t2` for the \(x\) and \(y\) positions.
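
The simulation step could be sketched as follows. The helper name `get_model_values`, the seed, and the number of replicas (500) are assumptions. Note that with `n = 3` the model estimates three coefficients and has zero residual degrees of freedom, so the t and F statistics are not well defined; this sketch therefore uses an assumed `n = 10` so that every column is finite.

```r
# Compact versions of the two helpers (get_model_values is a hypothetical name)
create_random_data <- function(n) {
  data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
}
get_model_values <- function(data) {
  s <- summary(lm(y ~ x1 + x2, data = data))
  ans <- c(s$coefficients[, 1], s$coefficients[, 3],
           s$coefficients[, 4], s$fstatistic)
  names(ans) <- c("B0", "B1", "B2", "t0", "t1", "t2",
                  "p0", "p1", "p2", "f", "df1", "df2")
  ans
}

set.seed(2024)  # assumed seed, only for reproducibility
n_sim <- 500    # assumed number of replicas
n     <- 10     # assumed sample size (see the note above about n = 3)

# replicate() returns a 12 x n_sim matrix; transpose it so that
# each simulation is one row and each parameter is one column.
results <- as.data.frame(
  t(replicate(n_sim, get_model_values(create_random_data(n))))
)

# p-value of the F statistic: upper tail of the F distribution
results$pval <- pf(results$f, results$df1, results$df2, lower.tail = FALSE)

head(results)
```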
We will discuss the results in class.
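
The plots of `B1` versus `B2` and `t1` versus `t2` could be drawn along these lines with base R graphics. This is only a sketch: the helper `one_run`, the 5% significance threshold, the colors, the sample size, and the number of replicas are all assumptions.

```r
# Hypothetical helper: one replica, returning only the values needed for plotting
one_run <- function(n) {
  d  <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
  s  <- summary(lm(y ~ x1 + x2, data = d))
  co <- s$coefficients
  fs <- s$fstatistic
  c(B1 = co["x1", "Estimate"],  B2 = co["x2", "Estimate"],
    t1 = co["x1", "t value"],   t2 = co["x2", "t value"],
    p1 = co["x1", "Pr(>|t|)"],  p2 = co["x2", "Pr(>|t|)"],
    pval = pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]],
              lower.tail = FALSE))
}

set.seed(2024)                                   # assumed seed
results <- as.data.frame(t(replicate(300, one_run(10))))  # assumed sizes

alpha <- 0.05  # assumed significance threshold
op <- par(mfrow = c(2, 3))
for (v in c("p1", "p2", "pval")) {   # top row: coefficients
  sig <- results[[v]] < alpha
  plot(results$B1, results$B2, pch = 16,
       col = ifelse(sig, "red", "grey50"),
       xlab = "B1", ylab = "B2", main = paste("red:", v, "<", alpha))
}
for (v in c("p1", "p2", "pval")) {   # bottom row: t-values
  sig <- results[[v]] < alpha
  plot(results$t1, results$t2, pch = 16,
       col = ifelse(sig, "red", "grey50"),
       xlab = "t1", ylab = "t2", main = paste("red:", v, "<", alpha))
}
par(op)
```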