As usual, margin notes contain code and comments that are not essential to the discussion, but that can still be interesting.
We have seen before that linear models allow us to interpolate and
extrapolate from experimental data, and define confidence intervals for
these predictions. Moreover, we have seen that the coefficients of the
linear model can reveal useful informationRaw data is an ingredient, like salt and oil. Knowledge
is the prepared dish that combines all relevant data into a
delicious/relevant dish.
, with their corresponding confidence intervals.
All these models take one or more independent variables —the
parameters that we can control— and combine them to approximate one
dependent variable —the result of the experiment. In all the
cases we have seen before, the independent variables were numerical
values in a rational scaleRemember that in a rational scale we can say
a is twice b, or b is half of a. In
contrast, other scales can be interval based, ordinal,
or nominal.
. The essential assumption of a linear model is that the
dependent variable is proportional to the independent variables, except
for some additive constant. Thus, twice the input must result in twice
the output. For example, when we measured the sound speed, twice the
distance resulted in twice the echo delay.
In this article we will explore the case when some of the independent variables are in a nominal scale and not in a ratio scale. Remember that a nominal variable can only take one of a few possible values, called levels. Examples of nominal variables are: color, country, sex, or any other variable where the possible answers are known a priory.
One advantage of using factors instead of text variables is that factors use a controlled vocabulary, where only the allowed words can be used. Therefore, there is a lower risk of input-data errors. In the language R these nominal variables are called factors. There is nothing equivalent in Excel.
Modeling the price of Internet plans using linear models
Somehow, I have been using a lot of internet in the last month. So
I’m exploring alternatives of service plans. I found a table on Turk
Telekom website, that can be tidy upThe tidying up process is interesting. I may make a
video later.
into an Excel spreadsheet that you
can download. We can analyze the data directly in Excel, or with R, or
with any other tool. Interestingly, we can read the spreadsheet directly
into R using the command read_excel()
from the library
readxl
, as in the following code.
library(readxl)
<- read_excel("internet_turk_telekom.xlsx") internet
The data frame internet
has 29 rows. They are not too
many, and we can see them in the following table
internet
package | speed | limit | price |
---|---|---|---|
fibernet | 100 | Limitsiz | 330 |
fibernet | 100 | 300GB | 200 |
fibernet | 100 | 200GB | 176 |
fibernet | 100 | 100GB | 157 |
fibernet | 50 | Limitsiz | 310 |
fibernet | 50 | 300GB | 190 |
fibernet | 50 | 200GB | 166 |
fibernet | 50 | 100GB | 147 |
fibernet | 35 | Limitsiz | 290 |
fibernet | 35 | 300GB | 187 |
fibernet | 35 | 200GB | 163 |
fibernet | 35 | 100GB | 142 |
fibernet | 24 | Limitsiz | 270 |
fibernet | 24 | 300GB | 185 |
fibernet | 24 | 200GB | 161 |
fibernet | 24 | 100GB | 137 |
ultranet | 16 | Limitsiz | 255 |
ultranet | 16 | 300GB | 183 |
ultranet | 16 | 200GB | 159 |
ultranet | 16 | 100GB | 135 |
net | 12 | Limitsiz | 254 |
net | 8 | Limitsiz | 252 |
net | 8 | 200GB | 155 |
net | 8 | 100GB | 132 |
net | 6 | Limitsiz | 200 |
net | 6 | 60GB | 100 |
net | 4 | 60GB | 95 |
net | 4 | 40GB | 90 |
net | 4 | 20GB | 85 |
To visualize the data, it is convenient to define a new column with the symbol to plot on each case. We take the first letter of the limit variable. In other words, “L” means Limitsiz (Turkish for “limitless”), “3” means “300GB”, and so on.
$symbol <- substr(internet$limit, 1, 1) internet
The full data set looks like the following plot.
plot(price~speed, internet, pch=symbol,
main="Raw Data")
We can see that, for a given limit, the price seems to follow a straight line. That will be our model.
Tidying up the data
For the purpose of this article, we are going to focus only on the
cases where the speed is at least 8 Mbps, since the smaller cases seem
to have a different behavior. Thus, we only keep the subset of cases
where speed
≥8.
<- subset(internet, speed>=8) internet
We will drop the column package
, since we will not use
it for modeling in this articleYou may decide to use package
as
independent variable in other models.
. Finally, we need to transform the column limit
from text type into a factor type, that is, a categorical variable. This
will allow us to use limit as part of the linear modelAs we will see, R linear models can handle numeric or
factor variables, but not text variables.
.
<- transform(internet, limit=as.factor(limit), package=NULL) internet
The new tidy-up table looks like this
internet
speed | limit | price | symbol |
---|---|---|---|
100 | Limitsiz | 330 | L |
100 | 300GB | 200 | 3 |
100 | 200GB | 176 | 2 |
100 | 100GB | 157 | 1 |
50 | Limitsiz | 310 | L |
50 | 300GB | 190 | 3 |
50 | 200GB | 166 | 2 |
50 | 100GB | 147 | 1 |
35 | Limitsiz | 290 | L |
35 | 300GB | 187 | 3 |
35 | 200GB | 163 | 2 |
35 | 100GB | 142 | 1 |
24 | Limitsiz | 270 | L |
24 | 300GB | 185 | 3 |
24 | 200GB | 161 | 2 |
24 | 100GB | 137 | 1 |
16 | Limitsiz | 255 | L |
16 | 300GB | 183 | 3 |
16 | 200GB | 159 | 2 |
16 | 100GB | 135 | 1 |
12 | Limitsiz | 254 | L |
8 | Limitsiz | 252 | L |
8 | 200GB | 155 | 2 |
8 | 100GB | 132 | 1 |
and the new graphics looks like this
plot(price~speed, internet, pch=symbol,
main="Filtered Data")
Basic linear model with a numeric independent variable
Since the only numeric independent variable is speed, it is
natural to try a first model using speed to predict
price. In R, these relationships are written as
price ~ speed
, that we read “price depends on
speed”. We fitThe model is given by the formula, since it describes
the relationship among the variables. Then we fit the model to
the data, and we find the coefficients.
You can find more details at Regression
Analysis Tutorial and Examples (The Minitab Blog)
our first model with this code
<- lm(price ~ speed, data=internet) model0
The model can also be fit in Excel or Google Sheets, using the
=LINEST()
function. You can see an example on the sheet
model 0, at the online
spreadsheet. Remember that =LINEST()
has an option to
return advanced statistical results, such as standard error and the
coefficient of determination R².
The fitted model has the following parameters
summary(model0)
Call:
lm(formula = price ~ speed, data = internet)
Residuals:
Min 1Q Median 3Q Max
-59.03 -43.45 -23.66 64.30 113.97
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 185.3823 19.8484 9.340 4.12e-09 ***
speed 0.3064 0.4018 0.763 0.454
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.68 on 22 degrees of freedom
Multiple R-squared: 0.02576, Adjusted R-squared: -0.01853
F-statistic: 0.5816 on 1 and 22 DF, p-value: 0.4538
We can see that the main component is (Intercept)
, with
a very small 𝑝-value. On the other side, the 𝑝-value of
speed
is large, so we do not have any strong evidence that
speed has any effect on price. Moreover, the
R² value is smallA good explanation of R² and Adjusted
R² can be found at Wikipedia.
, which means that most of the variance cannon be
explained by model 0. We can see it graphically here
plot(price~speed, internet, pch=symbol,
main="Model 0")
<- "#CC5450"
red points(predict(model0)~speed, internet,
col=red, type="o", lwd=2)
This initial model shows a small tendency to increase the price when the speed increases. But the real values are far away from the values predicted by the model.
Model with a factor
The model using only speed did not fit well to the data. On
the other side we have a clear effect of limit in
price. Thus, we can build a new model expressing this
relationship as price ~ limit
. We can call this model
1, and read it like “price depends on limit”.
Since limit is a factor, we can fit the new model in R using the same command as before
<- lm(price ~ limit, data=internet) model1
This can also be done in a spreadsheet, as we will discuss later. The result now has several coefficients, as we can see below
summary(model1)
Call:
lm(formula = price ~ limit, data = internet)
Residuals:
Min 1Q Median 3Q Max
-28.143 -7.083 -2.167 6.464 49.857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 141.667 7.387 19.179 2.40e-14 ***
limit200GB 21.667 10.446 2.074 0.051196 .
limit300GB 47.333 10.956 4.320 0.000333 ***
limitLimitsiz 138.476 10.066 13.756 1.17e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared: 0.9186, Adjusted R-squared: 0.9064
F-statistic: 75.22 on 3 and 20 DF, p-value: 4.554e-11
There is an intercept value and values for 200GB, 300GB, and Limitsiz. There is nothing labeled 100GB. All these coefficients have small 𝑝-values, suggesting that there is a significant effect of limit on price.
The R-squared
value shows that model 1 can
explain 91% of the varianceIf we think a little, we can see that increasing the
number of independent variables should always increase the coefficient
of determination R2, since we have more degrees
of freedom. That is why it is better to look at the Adjusted
R2, which balances this effect.
. Much better than in model 0.
Graphically, fitted model 1 looks like this
plot(price~speed, internet, pch=symbol,
main="Model 1")
par(col=red, lwd=2)
points(predict(model1)~speed, internet,
type="o", subset=limit=="100GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="200GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="300GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="Limitsiz")
Interpreting the coefficients
Let us look again the coefficients found using limit as the
independent variable in model1
. The coefficients are
coef(model1)
(Intercept) | limit200GB | limit300GB | limitLimitsiz |
---|---|---|---|
141.7 | 21.67 | 47.33 | 138.5 |
As we said before, there is an intercept value and values for 200GB, 300GB, and Limitsiz, but nothing for 100GB. To understand these values we need to compare them to the averages of price for each level of limit.
<- tapply(internet$price,
averages $limit, mean)
internet averages
100GB | 200GB | 300GB | Limitsiz |
---|---|---|---|
141.7 | 163.3 | 189 | 280.1 |
We can immediately see that the intercept value is equal to the average for 100GB, that is, the missing level. ith a little arithmetic, we can notice that the other coefficients are the difference between the level average and the intercept.
\[ \begin{aligned} 163.3333333 & = 141.6666667 + 21.6666667 \\ 189 & = 141.6666667 + 47.3333333 \\ 280.1428571 & = 141.6666667 + 138.4761905 \\ \end{aligned} \]
In other words, the model coefficients are
- Intercept: average of the “first” category
- Other coefficients: difference between the first category and the other categories.
2] - averages[1] = coef(model1)[2]
averages[3] - averages[1] = coef(model1)[3]
averages[4] - averages[1] = coef(model1)[4] averages[
We have to keep in mind that the first level plays a special role in these kind of models, different from other levels’ role.
How to model factors in a linear model
We know from previous articles that linear models can take several independent variables and find the best linear combination to approximate a dependent variables. This idea can be extended to be used with factors. Each factor level is represented by an independent variable, that can take only 0 or 1 values.
This set of new variables is called model matrix or
design matrix. R does this automatically. We can see how it is
done in R using the function model.matrix()
and the same
formula used to describe the linear model. Indeed, this is done
internally when we fit the model, but will have other applications
later. The R code is this
<- model.matrix( ~ limit, data=internet) design
and the resulting model matrix is
design
(Intercept) | limit200GB | limit300GB | limitLimitsiz |
---|---|---|---|
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
The first column represents the intercept, and it is 1 in all cases. The second column is 1 when limit is equal to 200GB and 0 otherwise. The third column is 1 only when limit is equal to 300GB, and the fourth column is 1 only when limit is Limitsiz.
We can see that there is no column for 100GB. Instead, the rows in the model matrix corresponding to 100GB have 0 in all columns, except of course on column intercept, which always has 1.
In Excel we can build the design matrix by first writing the level
names in the columns’ headers —let’s say, in cells C1
to
E1
—, and then using a formula like
=IF($B2 = $C1, 1, 0)
We write this formula on cell C2
and then we copy it to
all cell positions in the matrix. Here we assume that column
B
has the factor values. This approach can be seen online,
in the sheet model 1.
Model with all categories
There is a technical reasonThe reason is that the columns must be linearly
independent.
that forbids us to have all factor levels and
the intercept at the same time. If we want to include an explicit value
for 100GB category, then we need to omit the
intercept. Thus, we have a new model, called model
2.
In Excel and Google Sheets, the function =LINEST()
has
an option “make B=0
” that really means “do not use
intercept”. If we use that option, we can include a new column for
100GB. You can see this model in the sheet model 2 in
the online
spreadsheet.
In R we omit the intercept by writing +0
in the the
linear model formula. Then, model 2 is described by the formula
price ~ limit + 0
. We fit model 2 to the data with
the same R function as in the following code
<- lm(price ~ limit + 0, data=internet) model2
which gives us these new coefficients
summary(model2)
Call:
lm(formula = price ~ limit + 0, data = internet)
Residuals:
Min 1Q Median 3Q Max
-28.143 -7.083 -2.167 6.464 49.857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
limit100GB 141.667 7.387 19.18 2.40e-14 ***
limit200GB 163.333 7.387 22.11 1.58e-15 ***
limit300GB 189.000 8.092 23.36 5.48e-16 ***
limitLimitsiz 280.143 6.839 40.96 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared: 0.9935, Adjusted R-squared: 0.9923
F-statistic: 770.1 on 4 and 20 DF, p-value: < 2.2e-16
This time the linear model’s coefficients correspond one-to-one to
each category’s averages. Nevertheless, R² in model 2
is the same as in model 1, so they have the same
explanatory power. The predictions will be the same. The real
difference is on the meaning of the coefficientsThis is important. The value of the
model is not always on the numeric prediction. The value can be
in the qualitative interpretation of the coefficients.
.
The model matrix is easy to find. Just one column for each level of
limit, cells filled with 1 when the value is equal to the
column name, and 0 otherwise. In case of doubt, it is easy to find this
matrix in R using the command
model.matrix(~limit+0, data=internet)
.
Advantages
A reasonable question here is “why don’t we just take the average and skip all the linear model?” Using a linear model has some advantages.
The first advantage of using a linear model, is that we can use all data to estimate the standard error. If we assume that all dependent values are affected by the same kind of noise, a linear model is better than evaluating case by case. If the hypothesis of same noise for all error is valid, a linear model is a good way to estimate it.
The second advantage is that we can evaluate the statistical significance of coefficients representing differences. For example, in model 1 we saw that the average for 200GB was near 21.7₺ higher than the average for 100GB, but that difference is not statistically significative: the 𝑝-value was more than 5%.
Another advantage of using a linear model for factors is that they can be easily combined with regular numeric independent variables.
Combined model
So far we have model 0 with a small increasing slope for
speed, and models 1 and 2 with a different
constant for each level of limit. We can combine them, by
declaring that the price depends on speed plus
limit and without (Intercept). Thus, model 3
is represented by the formula
price ~ speed + limit + 0
.
The design matrix will contain one column for speed and several columns for limit. You can see how to do it with Excel, on the sheet model 3 online. In R we just write the formula and we get the fitted values
<- lm(price ~ speed + limit + 0, data=internet) model3
summary(model3)
Call:
lm(formula = price ~ speed + limit + 0, data = internet)
Residuals:
Min 1Q Median 3Q Max
-17.079 -6.763 1.795 4.793 23.491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
speed 0.42444 0.07969 5.326 3.85e-05 ***
limit100GB 125.18437 5.71069 21.921 5.97e-15 ***
limit200GB 146.85104 5.71069 25.715 3.16e-16 ***
limit300GB 169.90034 6.36411 26.697 < 2e-16 ***
limitLimitsiz 265.28757 5.24632 50.566 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.76 on 19 degrees of freedom
Multiple R-squared: 0.9974, Adjusted R-squared: 0.9967
F-statistic: 1465 on 5 and 19 DF, p-value: < 2.2e-16
This time the coefficient for speed is 0.42 ₺/Mbps, a value not too
different from 0.31 ₺/Mbps found in model 0. The difference is
in the 𝑝-value. In model 0 the coefficient was not
significantly different from 0. In model 3 the difference is
statistically significant. Moreover, the confidence interval for
speed
coefficient is narrower in model 3, and does
not include 0.
confint(model0)["speed",]
2.5 % 97.5 %
-0.5268568 1.1397308
confint(model3)["speed",]
2.5 % 97.5 %
0.2576540 0.5912197
The coefficients for each level of limit are smaller than in model 2, since we discount the contribution of speed. Their standard error is smaller, so we get narrower confidence intervals. It is a win-win situation. We can see this in the following plot.
plot(price~speed, data=internet, pch=symbol,
main="Model 2")
par(col=red, lwd=2)
points(predict(model3)~speed, data=internet,
type="o", subset=limit=="100GB")
points(predict(model3)~speed, data=internet,
type="o", subset=limit=="200GB")
points(predict(model3)~speed, data=internet,
type="o", subset=limit=="300GB")
points(predict(model3)~speed, data=internet,
type="o", subset=limit=="Limitsiz")
The speed coefficient shows the average slope, discounting
any influence from the limit variable. It is clear that, in
this model, the main factor affecting the price is the
limit, since the speed contribution is only
round(coef(model3)["speed"],2)
₺ for each Mbps.
This model assumes that the increase of price when speed increases is the same for each level of limit. This hypothesis should be tested.
Model with Independent slopes
Finally, we can build a model with different speed slopes
for each level of limit. In other words, we need to specify
that the speed coefficient should interact
with limit. We can do that by including an interaction term. In
R we specify this interaction with the formula
price ~ speed:limit + limit + 0
. This literally means
“price depends on speed combined with limit,
plus something depending on the limit, and no intercept”. This
will be model 4.
In Excel we need to insert more columns, one for each level of
limit. Then the cells contain the value of speed only
in the column corresponding to the value of limit. This can be
easily done by multiplying each speed value with the columns
representing limit. You can see how this is done in sheet
model 4 of the online document. Alternatively, you can use the
model.matix()
function of R, with the corresponding
formula.
We fit model 4 to the data as usual,
<- lm(price ~ speed:limit + limit + 0, data=internet) model4
and we get the following results
summary(model4)
Call:
lm(formula = price ~ speed:limit + limit + 0, data = internet)
Residuals:
Min 1Q Median 3Q Max
-9.3569 -1.2862 -0.0709 0.5093 16.1924
Coefficients:
Estimate Std. Error t value Pr(>|t|)
limit100GB 131.10500 3.92435 33.408 3.15e-16 ***
limit200GB 154.99689 3.92435 39.496 < 2e-16 ***
limit300GB 179.98375 4.80998 37.419 < 2e-16 ***
limitLimitsiz 248.25837 3.46430 71.662 < 2e-16 ***
speed:limit100GB 0.27197 0.07950 3.421 0.0035 **
speed:limit200GB 0.21467 0.07950 2.700 0.0158 *
speed:limit300GB 0.20036 0.08914 2.248 0.0391 *
speed:limitLimitsiz 0.91099 0.07543 12.077 1.87e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.935 on 16 degrees of freedom
Multiple R-squared: 0.9994, Adjusted R-squared: 0.9992
F-statistic: 3600 on 8 and 16 DF, p-value: < 2.2e-16
As expected, R² is better than in the previous models. The coefficients for each limit’s level have narrower intervals. Most interestingly, the speed slope for “Limitsiz” is much higher than the slope for other levels of limit. Therefore the hypothesis used in model 3 was not valid.
The plot of this model is this.
plot(price~speed, internet, pch=symbol,
main="Model 4")
par(col=red, lwd=2)
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="100GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="200GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="300GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="Limitsiz")
In theory we can find the same results if we do independent models for each level of limit. The advantage of the combined model is to use more samples to estimate the standard error, and therefore find narrower confidence intervals for each coefficient and prediction. The big assumption here is that the “noise” source is the same for each case.
Exercise
To simplify the explanation, here we modeled the raw values. Nevertheless, it seems that using a log-log scale produces better alignments. Indeed, speed grows exponentially in our data, therefor it may be better to use
log(speed)
as independent variable. Naturally, we cannot take logarithm of a factor, but we can considerlog(price)
.Your challenge is to repeat this analysis but using
log(speed)
andlimit
to predictlog(price)
. You can choose to skip model 1 or model 2, but not both.Then —and this is the important part— you should write a report explaining the meaning of the results. The mathematics of this exercise is easy, and you can copy-and-paste the R code. The real value —the science— is in the interpretation of the results.