2 Lab 3 — September 29

2.1 Some modern data science packages

The tidyverse is a collection of packages for modern data science. Tidymodels is another collection of packages that extends the tidyverse to modelling. For these labs, I won't be using the entire tidyverse and tidymodels. For now, I recommend only installing a few of their component packages.

These packages are tibble, dplyr, ggplot2, and broom. They can be installed using:
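
install.packages(c("tibble", "dplyr", "ggplot2", "broom"))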

After successfully installing these packages, you can load them using:
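
library(tibble)
library(dplyr)
library(ggplot2)
library(broom)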

Additionally, ggplot2 comes with various plotting themes (examples of which can be found here). By default, the plotting background is grey but I prefer a white background so I will also call:
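
theme_set(theme_bw())  # theme_bw() is one common white-background theme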

The tibble package provides us with "tibbles", an extension of the base-R data frame. Of its many features, we will mostly be using tibbles for their pretty printing and for how well they integrate with the other packages.
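
For example (a minimal sketch using the built-in mtcars data frame; the name tbl is chosen here just for illustration):

tbl <- as_tibble(mtcars)  # convert a base-R data frame to a tibble
tbl                       # prints only the first 10 rows, along with column types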

The dplyr package provides us with tools for data wrangling where base-R equivalents can feel a bit clunky. Examples of data wrangling include: keeping/removing observations based on the presence/absence of one or more conditions, creating additional variables in a data set, and condensing existing data into summaries.
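
As a small illustration of each of these tasks (a sketch using the built-in mtcars data; big_engines is a name chosen here for illustration):

# Keep observations that satisfy a condition
big_engines <- filter(mtcars, cyl >= 6)
# Create an additional variable
big_engines <- mutate(big_engines, hp_per_wt = hp / wt)
# Condense the data into a summary
summarise(big_engines, mean_mpg = mean(mpg))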

The ggplot2 package is an alternative to base-R graphics and has functions that are "smarter" than some that exist in base-R. For example, in the previous lab, I mentioned that when plotting lines in base-R, points needed to be sorted from left to right since the lines() function joins points in the order that they appear. Without sorting, the resulting line would appear jagged or unevenly coloured in certain areas. With ggplot2, we have separate functions for joining points as they appear in the data and for joining points in order from left to right.
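
These two functions are geom_path(), which joins points in the order they appear in the data, and geom_line(), which joins points in order of the variable on the x-axis. A minimal sketch with deliberately unsorted data (the data frame unsorted is made up for illustration):

unsorted <- data.frame(x = c(3, 1, 2), y = c(9, 1, 4))
ggplot(unsorted, aes(x = x, y = y)) + geom_path()  # joins points in data order
ggplot(unsorted, aes(x = x, y = y)) + geom_line()  # joins points from left to right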

The broom package contains tools for cleaning up model output and extracting model data with diagnostic information that is ready for visualisation with ggplot2.
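
For instance, broom::tidy() returns a model's coefficient table as a tibble (a quick sketch using a throwaway model fit to mtcars):

fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit)  # one row per term: estimate, std.error, statistic, p.value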

2.2 Back to simple linear regression

2.2.1 The mpg data set

Consider the fuel economy data set found in the ggplot2 package. We can load this data set by calling:

data(mpg, package="ggplot2")

Suppose we are interested in fitting a simple linear regression model with highway miles per gallon (hwy) as the response variable and city miles per gallon (cty) as the predictor variable. Before fitting the model, we should check to see if there is a linear relationship between our chosen variables.

2.2.2 Check for a linear relationship

To check, we create a scatterplot of hwy against cty using the ggplot() function.

ggplot(mpg, aes(x=cty, y=hwy))+
  geom_point()

  • The call to ggplot can be thought of as the initialisation of the drawing canvas
  • The first argument of ggplot is the data set, mpg
  • The second argument of ggplot is the mapping, i.e. "which variables contain the data that I am interested in plotting?"
  • The mapping is created by supplying the names of the variables found within your data set to the aesthetics function, aes().
  • Finally, we can add (with a "plus" symbol) a layer of points with geom_point() that will inherit any aesthetics supplied in the initialisation of the canvas
  • In other words, while not specified within geom_point(), the points are drawn at positions x=cty and y=hwy

From the plot above, there appears to be a linear relationship between the two variables. We can begin fitting our linear model.

2.2.3 Fitting the linear model

lm_miles <- lm(hwy ~ cty, data=mpg)

summary(lm_miles)
## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

From the regression output, we have that \(\widehat{\beta}_{0} = 0.892\) and \(\widehat{\beta}_{1} = 1.337\). The equation of the fitted line is:

\[\widehat{Y}_{i} \,=\, 0.892 \,+\, 1.337X_{i}\]

We can interpret \(\widehat{\beta}_{0} = 0.892\) as the mean highway miles per gallon when the city miles per gallon has a value of zero. However, this doesn't really make sense in the context of our data since you can't have a car that travels zero city miles on a gallon of gas.

We can interpret \(\widehat{\beta}_{1} = 1.337\) as the mean change in highway miles per gallon per unit increase in city miles per gallon. As such, for a 1 mpg increase in city mpg, we expect a 1.337 mpg increase in highway mpg.
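
For example, using the rounded coefficients, the fitted mean at \(X = 20\) is \(0.892 + 1.337(20) = 27.632\), while at \(X = 21\) it is \(0.892 + 1.337(21) = 28.969\), a difference of exactly \(\widehat{\beta}_{1} = 1.337\).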

2.2.4 Visualising the fit

Here is where ggplot2 and broom shine! We first use broom::augment() on our linear model to extract the data that was used to fit the model, along with additional diagnostics (such as the residuals). Note that broom::augment() is a generic function. When we pass our lm object to broom::augment(), under the hood, it actually calls the more specific broom::augment.lm() function.

lm_miles_aug <- augment(lm_miles)

lm_miles_aug
## # A tibble: 234 x 8
##      hwy   cty .fitted .resid    .hat .sigma     .cooksd .std.resid
##    <int> <int>   <dbl>  <dbl>   <dbl>  <dbl>       <dbl>      <dbl>
##  1    29    18    25.0 4.03   0.00458   1.74 0.0123          2.31  
##  2    29    21    29.0 0.0214 0.00834   1.76 0.000000632     0.0123
##  3    31    20    27.6 3.36   0.00661   1.74 0.0123          1.92  
##  4    30    21    29.0 1.02   0.00834   1.75 0.00144         0.585 
##  5    26    16    22.3 3.71   0.00445   1.74 0.0101          2.12  
##  6    26    18    25.0 1.03   0.00458   1.75 0.000805        0.591 
##  7    27    18    25.0 2.03   0.00458   1.75 0.00311         1.16  
##  8    26    18    25.0 1.03   0.00458   1.75 0.000805        0.591 
##  9    25    16    22.3 2.71   0.00445   1.75 0.00536         1.55  
## 10    28    20    27.6 0.359  0.00661   1.76 0.000140        0.205 
## # ... with 224 more rows

In the first two columns of lm_miles_aug, we have recovered the data that we initially passed into lm(). The .fitted column contains the fitted values of our model and the .resid column contains the raw residuals.
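
As a quick sanity check, we can reproduce the first row by hand using the coefficients from the regression output. The first vehicle has cty = 18 and hwy = 29, so:

0.89204 + 1.33746 * 18         # fitted value: 24.96632 (shown as 25.0 above)
29 - (0.89204 + 1.33746 * 18)  # raw residual: 4.03368 (shown as 4.03 above)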

The plot that we wish to build will use the newly created lm_miles_aug data set and will have the following features:

  • The base plot is a scatterplot with points located at x=cty and y=hwy
  • The fitted line will be drawn on top with points located at x=cty and y=.fitted

Since the points and lines that we wish to draw no longer share a common y-aesthetic, we may not want to declare a y-aesthetic in the initialisation of the canvas, and instead, declare the y-aesthetic in the individual layers.

ggplot(lm_miles_aug, aes(x=cty))+
  geom_point(aes(y=hwy))+
  geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)

Notice that in building this plot, I did not sort any of the points! By default, geom_line() connects points from left to right. I have added some additional arguments to geom_line() outside of the aesthetics since these values do not depend on the data:

  • colour controls the colour of the line (can also use color)
  • lwd controls the width of the line (can also use linewidth)
  • alpha controls the transparency of the line

To see the construction of the above plot as "an addition of layers", we can run

ggplot(lm_miles_aug, aes(x=cty))

followed by

ggplot(lm_miles_aug, aes(x=cty))+
  geom_point(aes(y=hwy))

followed by

ggplot(lm_miles_aug, aes(x=cty))+
  geom_point(aes(y=hwy))+
  geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)

2.2.5 Confidence and prediction intervals

2.2.5.1 Confidence intervals

Suppose we are interested in finding a point estimate and a 95% confidence interval for the mean highway miles per gallon for vehicles with a city miles per gallon value of 20. We can obtain these values by using the predict() function. predict() is another example of a generic function. When we pass our lm object to predict(), it actually calls the more specific predict.lm() function. To see what we need to get our point estimate and confidence interval, let's pull up the associated documentation.

?predict.lm

We will need to supply to predict():

  • A linear model object
  • A data frame containing variables common to the linear model, and values upon which to predict
    • Note that if our model has predictor variable cty, we cannot supply newdata = data.frame(x=20), we must supply newdata = data.frame(cty=20)
  • If we want a confidence interval, we should specify interval = "confidence"
  • The default confidence level is 0.95

predict(
  lm_miles,
  newdata = data.frame(cty=20),
  interval = "confidence"
)
##        fit      lwr      upr
## 1 27.64115 27.36044 27.92187

From the above output, the point estimate is 27.641. The lower bound of the 95% confidence interval is 27.360 and the upper bound is 27.922. This says that with 95% confidence, for vehicles with a city miles per gallon value of 20, the mean highway miles per gallon will range between 27.360 and 27.922.

2.2.5.2 Prediction intervals

Suppose we are now interested in finding a point estimate and a 95% prediction interval for the highway miles per gallon of a vehicle that has a city miles per gallon of 20. We can obtain these values in a manner similar to the previous example:

predict(
  lm_miles,
  newdata = data.frame(cty=20),
  interval = "prediction"
)
##        fit      lwr      upr
## 1 27.64115 24.17733 31.10498

From the above output, the point estimate is 27.641. The lower bound of the 95% prediction interval is 24.177 and the upper bound is 31.105. This says that with 95% confidence, a vehicle with a city miles per gallon value of 20 will have a highway miles per gallon value between 24.177 and 31.105.

2.2.5.3 Spot the differences

The point estimates in both cases were identical. However, the lower and upper bounds of the confidence and prediction intervals were different. This is unsurprising since the formulas used to compute the bounds of the two intervals are different.

It should be noted that, for a fixed confidence level, the prediction interval will always be wider than its corresponding confidence interval.
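
For reference, the standard simple linear regression formulas for the two intervals at a given predictor value \(X_{h}\) are:

\[\widehat{Y}_{h} \,\pm\, t_{1-\alpha/2,\, n-2}\, s\sqrt{\frac{1}{n} \,+\, \frac{(X_{h} - \bar{X})^{2}}{\sum_{i}(X_{i} - \bar{X})^{2}}} \quad \text{(confidence)}\]

\[\widehat{Y}_{h} \,\pm\, t_{1-\alpha/2,\, n-2}\, s\sqrt{1 \,+\, \frac{1}{n} \,+\, \frac{(X_{h} - \bar{X})^{2}}{\sum_{i}(X_{i} - \bar{X})^{2}}} \quad \text{(prediction)}\]

where \(s\) is the residual standard error. The extra 1 under the square root of the prediction interval accounts for the variability of a single new observation about its mean, which is why the prediction interval is always the wider of the two.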

2.2.6 Correlations

Using the previously obtained augmented model data (lm_miles_aug), we can easily obtain the correlations of

  • \(X_{i}\) and \(Y_{i}\),
  • \(Y_{i}\) and \(\widehat{Y}_{i}\),
  • \(X_{i}\) and \(\widehat{Y}_{i}\),

compare them with one another, and compare them to the model's coefficient of determination, \(R^{2}\). Correlations are obtained using the cor() function. We will also use the summarise() function from the dplyr package to help us obtain the correlations while staying in the context of our augmented data so that we can reference our variables using bare names (i.e. we don't need to use $ to reference our variables).

summarise(
  lm_miles_aug,
  corr_x_y = cor(cty, hwy),
  corr_y_fitted = cor(hwy, .fitted),
  corr_x_fitted = cor(cty, .fitted)
)
## # A tibble: 1 x 3
##   corr_x_y corr_y_fitted corr_x_fitted
##      <dbl>         <dbl>         <dbl>
## 1    0.956         0.956             1

Note also that:

summarise(
  lm_miles_aug,
  corr_x_y_squared = cor(cty, hwy)^2,
  corr_y_fitted_squared = cor(hwy, .fitted)^2,
  corr_x_fitted_squared = cor(cty, .fitted)^2,
  r.squared = summary(lm_miles)$r.squared
)
## # A tibble: 1 x 4
##   corr_x_y_squared corr_y_fitted_squared corr_x_fitted_squared r.squared
##              <dbl>                 <dbl>                 <dbl>     <dbl>
## 1            0.914                 0.914                     1     0.914

Are these values mathematically related, or is this a coincidence? 😲 (See assignment 2 question 3.)

2.3 Simple linear regression with transformations

When applying transformations to variables in linear models, there are two options:

  • Create a new variable in the data set that applies the transformation, or
  • Apply the transformation in the formula specification

To observe the differences, we will build two models using both methods. Using the mpg data set again, let us fit a model where the response will be the engine displacement and the predictor will be the (natural) log of the highway miles per gallon.

2.3.1 Creating a new variable in your data set

With dplyr loaded, we can easily create new variables in our data set using the mutate() function. The first argument to mutate() is your data set. Additional arguments are name-value pairs for the variables you want to create.

mpg2 <- mutate(
  mpg,
  log_hwy = log(hwy)
)

The model is created as usual:

lm_displ1 <- lm(displ ~ log_hwy, data=mpg2)

2.3.2 Applying the transformation in the formula specification

We can apply transformations in the formula specification if we do not wish to create a new variable ahead of time. The model is created using:

lm_displ2 <- lm(displ ~ log(hwy), data=mpg2)

2.3.3 Comparing our two models

summary(lm_displ1)
## 
## Call:
## lm(formula = displ ~ log_hwy, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2992 -0.5548 -0.1148  0.4252  3.7446 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.4145     0.6496   23.73   <2e-16 ***
## log_hwy      -3.8260     0.2074  -18.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8243 on 232 degrees of freedom
## Multiple R-squared:  0.5946, Adjusted R-squared:  0.5929 
## F-statistic: 340.3 on 1 and 232 DF,  p-value: < 2.2e-16
summary(lm_displ2)
## 
## Call:
## lm(formula = displ ~ log(hwy), data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2992 -0.5548 -0.1148  0.4252  3.7446 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.4145     0.6496   23.73   <2e-16 ***
## log(hwy)     -3.8260     0.2074  -18.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8243 on 232 degrees of freedom
## Multiple R-squared:  0.5946, Adjusted R-squared:  0.5929 
## F-statistic: 340.3 on 1 and 232 DF,  p-value: < 2.2e-16

The models are identical! Let's try to predict the engine displacement for a vehicle that has a highway miles per gallon value of 29.

predict(
  lm_displ1,
  newdata = data.frame(log_hwy=29)
)
##         1 
## -95.53829
predict(
  lm_displ2,
  newdata = data.frame(hwy=29)
)
##       1 
## 2.53138

Our predicted values are very different! Why?

For models that use variables created ahead of time, the values supplied in newdata must be on the same scale as the predictor values that were used in the fitting of the model. In other words, by supplying a value of 29 to the first model, we are predicting engine displacement for a vehicle with a log highway miles per gallon value of 29! In order to properly predict the engine displacement, we must instead supply the value of log(29).

predict(
  lm_displ1,
  newdata = data.frame(log_hwy=log(29))
)
##       1 
## 2.53138

For models where transformations are applied in the formula specification, we can supply values on the original scale and the transformation will be applied for us in calculating the predicted value.

2.3.4 Warning ❗

Not all operations applied in the formula specification are treated as mathematical operations. For example, a + in the formula specification means to add a variable and, as we saw in section 1.2.8, - means to remove a variable. Similarly, the ^ symbol that we usually use for exponentiation has a different meaning in a formula specification: it denotes the crossing of terms (e.g. (a + b)^2 expands to a + b + a:b), so hwy^2 simplifies to just hwy. If we wanted to fit a model using the square of highway miles per gallon as the predictor, we cannot write:

lm(displ ~ hwy^2, data=mpg2)

Instead, we must either create this variable ahead of time or wrap our exponentiation with I():

lm(displ ~ I(hwy^2), data=mpg2)

When an expression is wrapped in I() within a formula specification, the operators inside it are treated as literal arithmetic operators rather than formula operators.
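
As a quick check, since hwy^2 in a formula simplifies to just hwy, the following two calls produce the same fitted model:

coef(lm(displ ~ hwy^2, data = mpg2))
coef(lm(displ ~ hwy, data = mpg2))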

2.4 Another example

For the sake of illustration, suppose we wish to fit a model where the response variable is the log of the highway miles per gallon plus the log of the city miles per gallon and the predictor variable is the square of the engine displacement. I will create my response variable ahead of time but I will apply the transformation to my predictor in the formula specification.

mpg2 <- mutate(
  mpg2,
  mpg_sum = log(hwy) + log(cty)
)

The model is fitted using:

lm_log_miles <- lm(mpg_sum ~ I(displ^2), data=mpg2)

summary(lm_log_miles)
## 
## Call:
## lm(formula = mpg_sum ~ I(displ^2), data = mpg2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91040 -0.18887  0.00019  0.16808  1.33149 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.444537   0.037335  172.61   <2e-16 ***
## I(displ^2)  -0.038570   0.002214  -17.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3324 on 232 degrees of freedom
## Multiple R-squared:  0.5668, Adjusted R-squared:  0.5649 
## F-statistic: 303.6 on 1 and 232 DF,  p-value: < 2.2e-16

The equation of the fitted line is:

\[\widehat{Y}_{i} \,=\, 6.445 \,-\, 0.039X_{i}^{2}\]

where \(\widehat{Y}_{i}\) is the sum of the log highway miles per gallon and the log city miles per gallon. Due to how the model was constructed, if I wanted to predict this sum for a particular vehicle, I would supply displacement values as-is and the model would square them for me.

2.4.1 Visualise the fit

We can visualise the fit using the same procedure as before by first augmenting our linear model.

lm_log_miles_aug <- augment(lm_log_miles)

lm_log_miles_aug
## # A tibble: 234 x 7
##    mpg_sum `I(displ^2)` .fitted    .hat .sigma    .cooksd .std.resid
##      <dbl>     <I<dbl>>   <dbl>   <dbl>  <dbl>      <dbl>      <dbl>
##  1    6.26         3.24    6.32 0.00914  0.333 0.000162      -0.187 
##  2    6.41         3.24    6.32 0.00914  0.333 0.000359       0.279 
##  3    6.43         4       6.29 0.00846  0.333 0.000758       0.421 
##  4    6.45         4       6.29 0.00846  0.333 0.000941       0.470 
##  5    6.03         7.84    6.14 0.00580  0.333 0.000330      -0.336 
##  6    6.15         7.84    6.14 0.00580  0.333 0.00000106     0.0191
##  7    6.19         9.61    6.07 0.00502  0.333 0.000290       0.339 
##  8    6.15         3.24    6.32 0.00914  0.333 0.00123       -0.517 
##  9    5.99         3.24    6.32 0.00914  0.332 0.00454       -0.992 
## 10    6.33         4       6.29 0.00846  0.333 0.0000553      0.114 
## # ... with 224 more rows

There are two things to note here:

  • The name of the column of the predictor values violates the usual naming conventions by containing brackets and a caret, so its name is wrapped in backticks
  • The values in this column are the squared displacement values
    • This can be verified by comparing the values of lm_log_miles_aug with those of mpg2; the rows have not been reordered in any way so, for example, the first row of lm_log_miles_aug corresponds with the first row of mpg2 (see the check below)
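
For example, the first vehicle in mpg2 has displ = 1.8, and squaring it recovers the first entry of the `I(displ^2)` column:

mpg2$displ[1]^2  # 1.8^2 = 3.24, matching the first row above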

We can visualise the fit by passing our augmented model data to ggplot():

ggplot(lm_log_miles_aug, aes(x=`I(displ^2)`))+
  geom_point(aes(y=mpg_sum))+
  geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)

Obviously, this isn't a great fit but that was not our goal with this example!