2 Lab 3 — September 29
2.1 Some modern data science packages
The tidyverse is a collection of packages for modern data science. Tidymodels is another collection of packages that extends the tidyverse to modelling. For these labs, I won't be using the entire tidyverse and tidymodels. For now, I recommend only installing a few of their component packages.
They are tibble, dplyr, ggplot2, and broom. These packages can be installed using:
install.packages("tibble")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("broom")
After successfully installing these packages, you can load them using:
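library(tibble)
library(dplyr)
library(ggplot2)
library(broom)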
Additionally, ggplot2 comes with various plotting themes (examples of which can be found in the ggplot2 documentation). By default, the plotting background is grey but I prefer a white background, so I will also set a white-background theme (here, theme_bw(); other white-background themes such as theme_minimal() would work just as well):
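theme_set(theme_bw()) # apply a white-background theme to all subsequent plots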
The tibble package provides "tibbles", which are an extension of the base-R data frame. Of their many features, we will mostly be using tibbles for their pretty printing and the fact that they integrate well with the other packages.
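As a small illustration (a toy tibble, not part of the lab data), printing a tibble shows its dimensions and the type of each column:
tbl <- tibble(x = 1:5, y = x^2) # later columns can refer to earlier ones within the same call
tbl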
The dplyr package provides us with tools for data wrangling where base-R equivalents can feel a bit clunky. Examples of data wrangling include: keeping/removing observations based on the presence/absence of one or more conditions, creating additional variables in a data set, and condensing existing data into summaries.
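For example, these three tasks correspond to dplyr's filter(), mutate(), and summarise() functions (shown here on a toy data set d, not part of the lab data):
d <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))
filter(d, x > 1)               # keep observations satisfying a condition
mutate(d, x2 = x^2)            # create an additional variable
summarise(d, mean_x = mean(x)) # condense the data into a summary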
The ggplot2 package is an alternative to base-R graphics and has functions that are "smarter" than some that exist in base-R. For example, in the previous lab, I mentioned that when plotting lines in base-R, points needed to be sorted from left to right since the lines() function joins points in the order that they appear. Without sorting, the resulting line would appear jagged or unevenly coloured in certain areas. With ggplot2, we have separate functions for joining points as they appear in the data and joining points in order from left to right.
The broom package contains tools for cleaning up model output and extracting model data with diagnostic information that is ready for visualisation with ggplot2.
2.2 Back to simple linear regression
2.2.1 The mpg data set
Consider the fuel economy data set found in the ggplot2 package. We can load this data set by calling:
data(mpg, package="ggplot2")
Suppose we are interested in fitting a simple linear regression model with highway miles per gallon (hwy) as the response variable and city miles per gallon (cty) as the predictor variable.
Before fitting the model, we should check to see if there is a linear relationship between our
chosen variables.
2.2.2 Check for a linear relationship
To visualise the relationship, we create a scatterplot using the ggplot() function.
ggplot(mpg, aes(x=cty, y=hwy))+
geom_point()
- The call to ggplot can be thought of as the initialisation of the drawing canvas
- The first argument of ggplot is the data set, mpg
- The second argument of ggplot is the mapping, i.e. "which variables contain the data that I am interested in plotting?"
  - The mapping is created by supplying the names of the variables found within your data set to the aesthetics function, aes()
- Finally, we can add (with a "plus" symbol) a layer of points with geom_point() that will inherit any aesthetics supplied in the initialisation of the canvas
  - In other words, while not specified within geom_point(), the points are drawn at positions x=cty and y=hwy
From the plot above, there appears to be a linear relationship between the two variables. We can begin fitting our linear model.
2.2.3 Fitting the linear model
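The model is fitted and summarised using:
lm_miles <- lm(hwy ~ cty, data=mpg)
summary(lm_miles)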
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3408 -1.2790 0.0214 1.0338 4.0461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89204 0.46895 1.902 0.0584 .
## cty 1.33746 0.02697 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
From the regression output, we have that \(\widehat{\beta}_{0} = 0.892\) and \(\widehat{\beta}_{1} = 1.337\). The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 0.892 \,+\, 1.337X_{i}\]
We can interpret \(\widehat{\beta}_{0} = 0.892\) as the mean highway miles per gallon when the city miles per gallon has a value of zero. However, this doesn't really make sense in the context of our data since you can't have a car that travels zero city miles on a gallon of gas.
We can interpret \(\widehat{\beta}_{1} = 1.337\) as the mean change in highway miles per gallon per unit increase in city miles per gallon. As such, for a 1 mpg increase in city mpg, we expect a 1.337 mpg increase in highway mpg.
2.2.4 Visualising the fit
Here is where ggplot2 and broom shine! We first use broom::augment() on our linear model to extract the data that was used to fit the model, along with additional diagnostics (such as the residuals). Note that broom::augment() is a generic function. When we pass our lm object to broom::augment(), under the hood, it actually calls the more specific broom::augment.lm() function.
lm_miles_aug <- augment(lm_miles)
lm_miles_aug
## # A tibble: 234 x 8
## hwy cty .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 29 18 25.0 4.03 0.00458 1.74 0.0123 2.31
## 2 29 21 29.0 0.0214 0.00834 1.76 0.000000632 0.0123
## 3 31 20 27.6 3.36 0.00661 1.74 0.0123 1.92
## 4 30 21 29.0 1.02 0.00834 1.75 0.00144 0.585
## 5 26 16 22.3 3.71 0.00445 1.74 0.0101 2.12
## 6 26 18 25.0 1.03 0.00458 1.75 0.000805 0.591
## 7 27 18 25.0 2.03 0.00458 1.75 0.00311 1.16
## 8 26 18 25.0 1.03 0.00458 1.75 0.000805 0.591
## 9 25 16 22.3 2.71 0.00445 1.75 0.00536 1.55
## 10 28 20 27.6 0.359 0.00661 1.76 0.000140 0.205
## # ... with 224 more rows
In the first two columns of lm_miles_aug, we have recovered the data that we initially passed into lm(). The .fitted column contains the fitted values of our model and the .resid column contains the raw residuals.
The plot that we wish to build will use the newly created lm_miles_aug data set and will have the following features:
- The base plot is a scatterplot with points located at x=cty and y=hwy
- The fitted line will be drawn on top with points located at x=cty and y=.fitted
Since the points and lines that we wish to draw no longer share a common y-aesthetic, we may not want to declare a y-aesthetic in the initialisation of the canvas, and instead, declare the y-aesthetic in the individual layers.
ggplot(lm_miles_aug, aes(x=cty))+
geom_point(aes(y=hwy))+
geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)
Notice that in building this plot, I did not sort any of the points! By default, geom_line() connects points from left to right. I have added some additional arguments to geom_line() outside of the aesthetics since these values do not depend on the data:
- colour controls the colour of the line (can also use color)
- lwd controls the width of the line (can also use linewidth)
- alpha controls the transparency of the line
To see the construction of the above plot as "an addition of layers", we can run:
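ggplot(lm_miles_aug, aes(x=cty))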
followed by
ggplot(lm_miles_aug, aes(x=cty))+
geom_point(aes(y=hwy))
followed by
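ggplot(lm_miles_aug, aes(x=cty))+
geom_point(aes(y=hwy))+
geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)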
2.2.5 Confidence and prediction intervals
2.2.5.1 Confidence intervals
Suppose we are interested in finding a point estimate and a 95% confidence interval for the mean highway miles per gallon for vehicles with a city miles per gallon value of 20. We can obtain these values by using the predict() function. predict() is another example of a generic function. When we pass our lm object to predict(), it actually calls the more specific predict.lm() function.
To see what we need to get our point estimate and confidence interval, let's pull up the associated
documentation.
?predict.lm
We will need to supply to predict():
- A linear model object
- A data frame containing variables common to the linear model, and values upon which to predict
  - Note that if our model has predictor variable cty, we cannot supply newdata = data.frame(x=20); we must supply newdata = data.frame(cty=20)
- If we want a confidence interval, we should specify interval = "confidence"
  - The default confidence level is 0.95
predict(
lm_miles,
newdata = data.frame(cty=20),
interval = "confidence"
)
## fit lwr upr
## 1 27.64115 27.36044 27.92187
From the above output, the point estimate is 27.641. The lower bound of the 95% confidence interval is 27.360 and the upper bound is 27.922. This says that with 95% confidence, for vehicles with a city miles per gallon value of 20, the mean highway miles per gallon will range between 27.360 and 27.922.
2.2.5.2 Prediction intervals
Suppose we are now interested in finding a point estimate and a 95% prediction interval for the highway miles per gallon of a vehicle that has a city miles per gallon of 20. We can obtain these values in a manner similar to the previous example:
predict(
lm_miles,
newdata = data.frame(cty=20),
interval = "prediction"
)
## fit lwr upr
## 1 27.64115 24.17733 31.10498
From the above output, the point estimate is 27.641. The lower bound of the 95% prediction interval is 24.177 and the upper bound is 31.105. This says that with 95% confidence, a vehicle with a city miles per gallon value of 20 will have a highway miles per gallon value between 24.177 and 31.105.
2.2.5.3 Spot the differences
The point estimates in both cases were identical. However, the lower and upper bounds of the confidence and prediction intervals were different. This is unsurprising since the formulas used to compute the bounds of the two intervals are different.
It should be noted that, in general, for a fixed confidence level, the prediction interval will be wider than its corresponding confidence interval: it accounts not only for the uncertainty in estimating the mean response, but also for the variability of an individual observation around that mean.
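For reference (these are the standard simple linear regression formulas, written in the notation of this lab), the two intervals at a new predictor value \(X_{h}\) are
\[\widehat{Y}_{h} \,\pm\, t_{1-\alpha/2,\,n-2}\,s\sqrt{\frac{1}{n} \,+\, \frac{(X_{h}-\overline{X})^{2}}{\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}}} \quad \text{(confidence interval)}\]
\[\widehat{Y}_{h} \,\pm\, t_{1-\alpha/2,\,n-2}\,s\sqrt{1 \,+\, \frac{1}{n} \,+\, \frac{(X_{h}-\overline{X})^{2}}{\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}}} \quad \text{(prediction interval)}\]
where \(s\) is the residual standard error. The extra 1 under the square root of the prediction interval is what makes it wider.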
2.2.6 Correlations
Using the previously obtained augmented model data (lm_miles_aug), we can easily obtain the correlations of
- \(X_{i}\) and \(Y_{i}\),
- \(Y_{i}\) and \(\widehat{Y}_{i}\),
- \(X_{i}\) and \(\widehat{Y}_{i}\),
compare them among each other, and compare them to the model's coefficient of determination, \(R^{2}\). Correlations are obtained using the cor() function. We will also use the summarise() function from the dplyr package to help us obtain the correlations while staying in the context of our augmented data, so that we can reference our variables using bare names (i.e. we don't need to use $ to reference our variables).
summarise(
lm_miles_aug,
corr_x_y = cor(cty, hwy),
corr_y_fitted = cor(hwy, .fitted),
corr_x_fitted = cor(cty, .fitted)
)
## # A tibble: 1 x 3
## corr_x_y corr_y_fitted corr_x_fitted
## <dbl> <dbl> <dbl>
## 1 0.956 0.956 1
Note also that:
summarise(
lm_miles_aug,
corr_x_y_squared = cor(cty, hwy)^2,
corr_y_fitted_squared = cor(hwy, .fitted)^2,
corr_x_fitted_squared = cor(cty, .fitted)^2,
r.squared = summary(lm_miles)$r.squared
)
## # A tibble: 1 x 4
## corr_x_y_squared corr_y_fitted_squared corr_x_fitted_squared r.squared
## <dbl> <dbl> <dbl> <dbl>
## 1 0.914 0.914 1 0.914
Are these values mathematically related or is this coincidence? 😲 (See assignment 2 question 3).
2.3 Simple linear regression with transformations
When applying transformations to variables in linear models, there are two options:
- Create a new variable in the data set that applies the transformation, or
- Apply the transformation in the formula specification
To observe the differences, we will build two models using both methods. Using the mpg data set again, let us fit a model where the response will be the engine displacement and the predictor will be the (natural) log of the highway miles per gallon.
2.3.1 Creating a new variable in your data set
With dplyr loaded, we can easily create new variables in our data set using the mutate() function. The first argument to mutate() is your data set. Additional arguments are name-value pairs for the variables you want to create.
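Here, we create a new data set mpg2 containing the log-transformed variable:
mpg2 <- mutate(mpg, log_hwy = log(hwy))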
The model is created as usual:
lm_displ1 <- lm(displ ~ log_hwy, data=mpg2)
2.3.2 Applying the transformation in the formula specification
We can apply transformations in the formula specification if we do not wish to create a new variable ahead of time. The model is created using:
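lm_displ2 <- lm(displ ~ log(hwy), data=mpg2)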
2.3.3 Comparing our two models
summary(lm_displ1)
##
## Call:
## lm(formula = displ ~ log_hwy, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2992 -0.5548 -0.1148 0.4252 3.7446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4145 0.6496 23.73 <2e-16 ***
## log_hwy -3.8260 0.2074 -18.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8243 on 232 degrees of freedom
## Multiple R-squared: 0.5946, Adjusted R-squared: 0.5929
## F-statistic: 340.3 on 1 and 232 DF, p-value: < 2.2e-16
summary(lm_displ2)
##
## Call:
## lm(formula = displ ~ log(hwy), data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2992 -0.5548 -0.1148 0.4252 3.7446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4145 0.6496 23.73 <2e-16 ***
## log(hwy) -3.8260 0.2074 -18.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8243 on 232 degrees of freedom
## Multiple R-squared: 0.5946, Adjusted R-squared: 0.5929
## F-statistic: 340.3 on 1 and 232 DF, p-value: < 2.2e-16
The models are identical! Let's try to predict the engine displacement for a vehicle that has a highway miles per gallon value of 29.
predict(
lm_displ1,
newdata = data.frame(log_hwy=29)
)
## 1
## -95.53829
predict(
lm_displ2,
newdata = data.frame(hwy=29)
)
## 1
## 2.53138
Our predicted values are very different! Why?
For models that use variables created ahead of time, the value(s) that are supplied to predict upon must be on the same scale as the other predictor values that were used in the fitting of the model. In other words, by supplying a value of 29 to the first model, we are predicting engine displacement for a vehicle with a log highway miles per gallon value of 29! In order to properly predict the engine displacement, we must instead supply the value of log(29).
predict(
lm_displ1,
newdata = data.frame(log_hwy=log(29))
)
## 1
## 2.53138
For models where transformations are applied in the formula specification, we can supply values on the original scale and the transformation will be applied for us in calculating the predicted value.
2.3.4 Warning ❗
Not all operations applied in the formula specification are treated as mathematical operations. For example, a + in the formula specification means to add a variable and, as we saw in section 1.2.8, - means to remove a variable. In addition, the ^ symbol that we usually use for exponentiation has a different meaning when used in a formula specification. For example, if we wanted to fit a model using the square of highway miles per gallon as the predictor, we cannot write:
lm(displ ~ hwy^2, data=mpg2)
Instead, we must either create this variable ahead of time or wrap our exponentiation with I():
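lm(displ ~ I(hwy^2), data=mpg2)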
When an operation is wrapped by I() in a formula specification, it means to treat it as a literal mathematical operator rather than a formula operator.
2.4 Another example
For the sake of illustration, suppose we wish to fit a model where the response variable is the log of the highway miles per gallon plus the log of the city miles per gallon and the predictor variable is the square of the engine displacement. I will create my response variable ahead of time but I will apply the transformation to my predictor in the formula specification.
The model is fitted using:
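mpg2 <- mutate(mpg2, mpg_sum = log(hwy) + log(cty)) # response created ahead of time
lm_log_miles <- lm(mpg_sum ~ I(displ^2), data=mpg2)
summary(lm_log_miles)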
##
## Call:
## lm(formula = mpg_sum ~ I(displ^2), data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91040 -0.18887 0.00019 0.16808 1.33149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.444537 0.037335 172.61 <2e-16 ***
## I(displ^2) -0.038570 0.002214 -17.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3324 on 232 degrees of freedom
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5649
## F-statistic: 303.6 on 1 and 232 DF, p-value: < 2.2e-16
The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 6.445 \,-\, 0.039X_{i}^{2}\]
where \(\widehat{Y}_{i}\) is the sum of the log highway miles per gallon and the log city miles per gallon. Due to how the model was constructed, if I wanted to predict the sum of the log highway miles per gallon and the log city miles per gallon for a particular vehicle, I would supply displacement values as-is and the model would square them for me.
2.4.1 Visualise the fit
We can visualise the fit using the same procedure as before by first augmenting our linear model.
lm_log_miles_aug <- augment(lm_log_miles)
lm_log_miles_aug
## # A tibble: 234 x 7
## mpg_sum `I(displ^2)` .fitted .hat .sigma .cooksd .std.resid
## <dbl> <I<dbl>> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.26 3.24 6.32 0.00914 0.333 0.000162 -0.187
## 2 6.41 3.24 6.32 0.00914 0.333 0.000359 0.279
## 3 6.43 4 6.29 0.00846 0.333 0.000758 0.421
## 4 6.45 4 6.29 0.00846 0.333 0.000941 0.470
## 5 6.03 7.84 6.14 0.00580 0.333 0.000330 -0.336
## 6 6.15 7.84 6.14 0.00580 0.333 0.00000106 0.0191
## 7 6.19 9.61 6.07 0.00502 0.333 0.000290 0.339
## 8 6.15 3.24 6.32 0.00914 0.333 0.00123 -0.517
## 9 5.99 3.24 6.32 0.00914 0.332 0.00454 -0.992
## 10 6.33 4 6.29 0.00846 0.333 0.0000553 0.114
## # ... with 224 more rows
There are two things to note here:
- The name of the column of the predictor values violates the usual naming conventions by containing brackets and a caret, so its name is wrapped in backticks
- The values in this column are the squared displacement values
  - This can be verified by comparing the values of lm_log_miles_aug with the values of mpg2; the rows have not been reordered in any way so, for example, the first row of lm_log_miles_aug corresponds with the first row of mpg2
We can visualise the fit by passing our augmented model data to ggplot():
ggplot(lm_log_miles_aug, aes(x=`I(displ^2)`))+
geom_point(aes(y=mpg_sum))+
geom_line(aes(y=.fitted), colour="#3366FF", lwd=1.5, alpha=0.6)
Obviously, this isn't a great fit but that was not our goal with this example!