Blog of Andrés Aravena
Methodology of Scientific Research:

Uncertainty in Linear Models

01 April 2020

In the previous post we spoke about the uncertainty of a fixed value, and how taking averages can increase precision. We can even estimate the uncertainty of the average.

The ideal case is when we measure a value that does not change. In the case of the temperature we had the extra complication that it changed during the experiment. We had two components of variability: the noise and the change in temperature. So we had a trade-off between increasing the sample size to reduce the noise, and decreasing the sample size to reduce the change in the real value. (This trade-off can be described as an uncertainty principle: increasing the sample size tells us what the temperature was, but not when it happened. We cannot know the value and the time of a temperature with high precision at the same time.)

A conclusion of this analysis is that every sampling event should include several measurements. In other words, instead of measuring the temperature once at each sampling time, we can measure it twenty times. If the twenty measurements are taken in a short time —let’s say, less than a second— we get several independent measurements of a value that does not change —we assume— during that short period of time. Then we can average the twenty samples and get a temperature value with smaller uncertainty. We also get an estimate of that uncertainty.
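As an illustration, here is a minimal sketch in R; the true temperature and the noise level are made-up values for this example:

set.seed(1)
readings <- 25 + rnorm(20, mean = 0, sd = 0.5)  # twenty noisy readings of a constant 25°C
mean(readings)                                  # best estimate of the temperature
sd(readings)/sqrt(20)                           # standard error: uncertainty of the average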

Linear models

For some events we can use another hypothesis. Sometimes, on theoretical grounds, we can assume that the real value lies on a straight line. For example, the following graph shows the logarithm of the number of COVID-19 cases in Chile until March 23, 2020. (If you know a good source of data for other countries, I will be glad to know.)

Figure 1. COVID-19 cases in Chile between March 3 and March 23.

Theory says that, in this phase and without major health-care interventions, the logarithm of the number of cases should follow a straight line. (Later we will study the models that justify this theory.) Using this hypothesis, we can describe the relationship between cases and time with the formula \[\log(\text{cases}_i)=a+b\cdot\text{time}_i + \varepsilon_i\qquad\text{for each }i\]

The coefficients \(a\) and \(b\) are what we are looking for. The values \(\varepsilon_i\) represent the noise, that is, the variability due to randomness. As in the case of the moving average, we assume that the noise values are random, have mean zero and finite variance, and are independent. In this case we use linear regression —also known as least squares— to find the best values for \(a\) and \(b\).
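To make these assumptions concrete, here is a small simulation. The coefficients and the noise level below are arbitrary choices for illustration, not values taken from the COVID-19 data:

set.seed(123)
day       <- 1:21
epsilon   <- rnorm(21, mean = 0, sd = 0.1)  # independent noise: mean zero, finite variance
log_cases <- 0.3 + 0.32*day + epsilon       # a = 0.3 and b = 0.32 are made-up values
plot(day, log_cases, pch = 16)              # the points scatter around a straight line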

We can do the regression in Google Sheets or Excel by drawing a plot of our data and then asking for the trend line. Even better, we can use the function =LINEST() to calculate the linear estimator. (Be careful with this function, since the answer uses several cells, not only the one where we write the formula. This is one of the array functions of Google Sheets, and they do not work like the regular functions.) In the simple case, we write =LINEST(known_data_y, known_data_x). By default, this function returns the coefficients \(b\) and \(a\). We can then use the formula to predict new values.

In R we perform the regression using the function lm(), which returns a linear model.

model <- lm(log(cases) ~ day, data=covid)
model
Call:
lm(formula = log(cases) ~ day, data = covid)

Coefficients:
(Intercept)          day  
     0.3008       0.3176  

Then we use the function coef() to find the coefficients of the model. The coefficient \(a\) is named (Intercept), and the coefficient \(b\) is named day, since it multiplies the variable with the same name.

coef(model)
(Intercept)         day 
  0.3007812   0.3176057 
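One way to read the slope: since \(b\) multiplies the day on the logarithmic scale, \(e^b\) is the estimated daily growth factor. With the coefficients above:

exp(coef(model)["day"])   # about 1.37: the cases grow by roughly 37% per day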

Finally, we can use the model to predict new values using the function predict().

plot(log(cases)~day, data=covid, pch=16)
lines(predict(model) ~ day, data=covid)
Figure 2. Best fitting linear model for log(cases).

There are two basic questions to be asked about the regression: what is the range of plausible values for the coefficients, and what is the range of plausible values for the predictions?

In the case of COVID-19, these questions can be rewritten as: what is the real growth rate of the epidemic, and how many cases can we expect in the coming days?

A correct evaluation of these intervals will allow us to evaluate whether the health-care decisions are effective or not. This saves lives.

Calculating the intervals

It should not come as a surprise that the width of all these intervals depends on the noise in the data. More specifically, it depends on the variance of \(\varepsilon_i.\) As in the previous case, we do not really know this value but we can estimate it from the data. Using this estimated variance, and some formulas that we will discuss somewhere else, we can calculate the standard error of each coefficient.

In Google Sheets we can get the detailed regression using the same function =LINEST(), but this time using the value TRUE for the option verbose. We write =LINEST(known_data_y, known_data_x, TRUE, TRUE). (The meaning of the third parameter of LINEST() will be discussed later.) The output has several components. For now we care only about the first two rows. They are the coefficients \(b\) and \(a,\) and their standard errors \(s_b\) and \(s_a\). In the fourth row we also have the value of the degrees of freedom. In this case, since we have 21 rows of data and we calculated 2 coefficients, we have 21-2=19 degrees of freedom.
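We can check the same counts in R with the model we fitted before:

nobs(model)          # 21 observations
df.residual(model)   # 21 - 2 = 19 degrees of freedom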

The intervals are always between the extremes \(\text{mean} \pm k\cdot\text{stderr}\). The value of \(k\) depends on the confidence level. We calculate the value of \(k\) using =T.INV(1-alpha, df), where alpha is (1-confidence)/2 and df is the number of degrees of freedom.
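The R equivalent of =T.INV() is the function qt(). For a 99% confidence level and 19 degrees of freedom:

confidence <- 0.99
alpha <- (1 - confidence)/2
qt(1 - alpha, df = 19)   # k is about 2.86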

In R it is easier. We just use the following code:

confint(model, level=0.99)
                 0.5 %    99.5 %
(Intercept) 0.03487745 0.5666849
day         0.29642922 0.3387823

Therefore we have a range that contains the real growth factor with probability 99%, given the data and the assumptions. (This will be important in other applications, such as differential gene expression analysis.)

The meaning of these intervals is the following. Any value of the coefficients in the interval is compatible with the data. The calculated coefficients are just the center of the interval. Therefore, with 99% confidence, the number of cases may be equally described by any equation like this \[ \log(\text{cases}_i) =(a+k_a s_a)+(b+k_b s_b)\cdot\text{time}_i + \varepsilon_i \] as long as \(k_a\) and \(k_b\) are between -3 and 3. More precisely, between -T.INV(0.995, 19) and T.INV(0.995, 19). (You can see an illustration on the sheet “pred1”.)
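As a sanity check, we can rebuild these intervals by hand from the standard errors reported by summary(). This sketch should reproduce the confint() output shown above:

est <- coef(model)
se  <- summary(model)$coefficients[, "Std. Error"]
k   <- qt(0.995, df = df.residual(model))
cbind(lower = est - k*se, upper = est + k*se)   # matches confint(model, level=0.99)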

In R it is even easier. If we use the function predict() with the option interval="confidence", it will give us a 3-column matrix with the predicted center, lower, and upper values for each day. The following plot shows the idea:

valid_models <- predict(model, interval = "confidence", level = 0.99)
plot(log(cases)~day, covid, pch=16)
lines(valid_models[,"fit"])
lines(valid_models[,"lwr"])
lines(valid_models[,"upr"])
Figure 3. Average, lower and upper limits for linear model 99% confidence interval. Values are log(cases).

It may be easier to understand the result if we “undo the logarithm”, that is, use the exponential function. (We can see that the last value does not fit the model. Why?)

plot(cases ~ day, covid, pch=16, ylim=range(exp(valid_models)))
lines(exp(valid_models[,"fit"]))
lines(exp(valid_models[,"lwr"]))
lines(exp(valid_models[,"upr"]))
Figure 4. Average, lower and upper limits for linear model 99% confidence interval. Values are number of cases.

Uncertainty in the prediction

We have just shown that the noise in the data results in a range of possible coefficients, and therefore in a set of linear models that are compatible with the data. Now we want to predict the number of cases for new conditions.

In general, linear models are good at interpolating, that is, finding intermediate values between the data points. In this case we are interested in extrapolating, that is, estimating the number of cases beyond the data points, in the future, and in evaluating the uncertainty of that extrapolation.

One part of the uncertainty comes from the coefficients’ uncertainty, as described in the previous section. Now we have another component of the uncertainty: the noise.

Each point is random. The linear equation only predicts the center of the interval where the points can be. The actual width depends on the standard error of the residuals. (The residuals are the differences between the fitted model and the real data. Under our hypotheses, these residuals are the \(\varepsilon_i\).) Thus, the prediction interval has to combine both uncertainties.
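For simple linear regression this combination has a well-known closed form. If \(s\) is the standard error of the residuals, \(n\) the number of data points, and \(x_0\) the day where we predict, then \[ s_{\text{pred}}(x_0) = s\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}} \] The constant 1 under the root comes from the noise of the new point; the other two terms come from the coefficients’ uncertainty.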

I do not know if this can be done easily in Excel or Google Sheets. In R it is easy. We need to create a data frame with a column named day containing the values where we want to predict; this will be the newdata option for the predict() function. We also use the option interval = "prediction". As before, we get a 3-column matrix with the center, lower, and upper limits of the prediction interval.

to_predict <- data.frame(day=0:25)
prediction <- predict(model, newdata = to_predict,
                      interval = "prediction", level = 0.99)

plot(log(cases)~day, covid, pch=16,
     xlim=range(to_predict), ylim=range(prediction))
lines(prediction[,"fit"], lty=1)
lines(prediction[,"lwr"], lty=2)
lines(prediction[,"upr"], lty=2)
Figure 5. Average, lower and upper limits for predicted values. Values are log(cases).

Again, it is easier to understand the result if we use the exponential function to calculate the number of cases. (Now the last value does fit the model. Why?)

plot(cases~day, covid, pch=16,
     xlim=range(to_predict), ylim=range(exp(prediction)))
lines(exp(prediction[,"fit"]), lty=1)
lines(exp(prediction[,"lwr"]), lty=2)
lines(exp(prediction[,"upr"]), lty=2)
Figure 6. Average, lower and upper limits for predicted values. Values are number of cases.

Application

We can use a linear model to determine a confidence interval for the real delay between samples in the Temperature test experiment.

Figure 7. Comparison of expected time versus real time of sampling.

Can you do this analysis? At least, describe how to do it.
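As a hint, here is a minimal sketch in R. It assumes a hypothetical data frame temps with a column sample (the index of each sampling event) and a column time (the measured time in seconds); the slope of the model is then the real delay between consecutive samples:

delay_model <- lm(time ~ sample, data = temps)  # temps is a hypothetical data frame
confint(delay_model, level = 0.99)              # interval for the real delay (row "sample")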