
Lab 7 — November 15

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(GGally)
library(broom)
theme_set(theme_bw())
1 / 20

Additional packages — easystats

easystats is a collection of packages, like the tidyverse. easystats is not currently on CRAN, but it can be installed using:

install.packages("easystats", repos="https://easystats.r-universe.dev")

However, its component packages and their dependencies can be installed from CRAN as usual:

install.packages("performance")
install.packages("see")
install.packages("patchwork")

Once installed, load it with:

library(performance)
# or
# library(easystats)
2 / 20

Additional packages — tidymodels

tidymodels is an extension of the tidyverse that focuses on modelling. (Recall that broom is actually a package from tidymodels, not the tidyverse!)

You can install tidymodels using the usual methods. You can also install just its component packages. In this lab, I introduce two additional packages from the tidymodels collection:

  • rsample: for producing data partitions and data resamples
  • yardstick: for calculating model performance metrics
library(rsample)
library(yardstick)
3 / 20

Load the data

t7 <- read_csv("./data/tutorial7.csv")
## Rows: 1024 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (7): Y, X1, X2, X3, X4, X5, X6
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
t7
## # A tibble: 1,024 x 7
## Y X1 X2 X3 X4 X5 X6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -4078113. 0.0727 2.91 1.35 0.224 357. 1
## 2 -3100124. 0.181 8.08 2.57 0.848 311. 1
## 3 -2282073. 0.659 115. 77.4 1.06 269. 0
## 4 -1993606. 0.807 47.4 42.3 1.35 251. 0
## 5 -1569783. 1.24 0.785 -1.61 1.36 222. 1
## 6 -1467620. 2.46 103. 258. 1.55 216. 1
## 7 -1248135. 2.50 22.4 67.7 1.59 198. 1
## 8 -1223299. 2.56 104. 262. 1.60 198. 0
## 9 -1221486. 2.74 89.7 254. 1.95 197. 1
## 10 -1178091. 3.29 81.8 269. 2.30 194. 0
## # ... with 1,014 more rows
4 / 20

Make a scatterplot matrix

ggpairs(t7)

5 / 20

Fit a linear model using all main effects

main_model <- lm(Y ~ ., data=t7)
summary(main_model)
##
## Call:
## lm(formula = Y ~ ., data = t7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1665731 -72925 22740 95651 308517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.273e+05 3.501e+04 9.350 < 2e-16 ***
## X1 3.291e+04 3.657e+03 8.999 < 2e-16 ***
## X2 -2.506e+02 3.055e+02 -0.821 0.41212
## X3 6.365e+00 2.014e+00 3.160 0.00162 **
## X4 -3.312e+04 3.666e+03 -9.033 < 2e-16 ***
## X5 -7.394e+03 1.300e+02 -56.895 < 2e-16 ***
## X6 -1.696e+05 1.148e+04 -14.771 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 179700 on 1017 degrees of freedom
## Multiple R-squared: 0.8604, Adjusted R-squared: 0.8596
## F-statistic: 1045 on 6 and 1017 DF, p-value: < 2.2e-16
6 / 20

Report the AIC of the fitted model

AIC(main_model)
## [1] 27693.49

AIC is also reported by broom::glance.lm():

glance(main_model)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.860 0.860 179672. 1045. 0 6 -13839. 27693. 27733. 3.28e13 1017 1024
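
Since glance() returns a one-row tibble, we can select just the information criteria from it (a small convenience sketch using the model fitted above):

```r
# Pull only the information criteria from the one-row glance() tibble
glance(main_model) %>%
  select(AIC, BIC)
```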
7 / 20

Check model assumptions via diagnostic plots

par(mfrow = c(2, 2)) # Create 2x2 grid
plot(main_model, which=c(1, 2, 3, 5))

8 / 20

Check model assumptions via diagnostic plots

performance::check_model() makes the diagnostic plots easier to read and interpret.

  • Plots are created using ggplot2

  • Plots are labelled to tell you what to look for

check_model(main_model, check=c("vif", "qq", "normality", "linearity", "homogeneity", "outliers"))
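
The performance package also exposes the individual diagnostics as standalone functions; for example, a sketch of checking collinearity on its own:

```r
# Variance inflation factors only, as a standalone check
check_collinearity(main_model)
```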
9 / 20

10 / 20

Adding residuals to the scatterplot matrix

main_model %>%
  augment() %>%
  select(Y, contains("X"), .resid) %>%
  ggpairs()

11 / 20

Consider adding interaction effects

We wish to consider interactions between all pairs of main effects. For example, consider the interaction between variables X1 and X2.

model_X1X2 <- lm(Y ~ . + X1:X2, data=t7)
anova(main_model, model_X1X2)
## Analysis of Variance Table
##
## Model 1: Y ~ X1 + X2 + X3 + X4 + X5 + X6
## Model 2: Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X1:X2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1017 3.2831e+13
## 2 1016 3.2800e+13 1 3.0461e+10 0.9435 0.3316

This interaction is not significant, so we will not add it to our model. It remains to check all the other two-way interactions, but we do not want to build each of these models manually and pass each one to anova() to compare with our main model.

12 / 20

Consider adding interaction effects

Instead, we will use the add1() function, which adds single terms to the main model and reports the usual F-test from comparing the two models with anova().

add1(
  main_model,
  ~ .^2,
  test = "F"
)
## Single term additions
##
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3.2831e+13 24786
## X1:X2 1 3.0461e+10 3.2800e+13 24787 0.9435 0.33160
## X1:X3 1 7.8963e+10 3.2752e+13 24785 2.4495 0.11787
## X1:X4 1 2.7107e+12 3.0120e+13 24699 91.4374 < 2.2e-16 ***
## X1:X5 1 4.7530e+12 2.8078e+13 24627 171.9883 < 2.2e-16 ***
## X1:X6 1 4.0214e+12 2.8809e+13 24654 141.8184 < 2.2e-16 ***
## X2:X3 1 1.0099e+12 3.1821e+13 24756 32.2458 1.773e-08 ***
## X2:X4 1 5.3209e+10 3.2778e+13 24786 1.6493 0.19935
## X2:X5 1 1.2772e+12 3.1554e+13 24747 41.1238 2.185e-10 ***
## X2:X6 1 2.6245e+12 3.0206e+13 24702 88.2748 < 2.2e-16 ***
## X3:X4 1 9.8325e+10 3.2733e+13 24784 3.0520 0.08094 .
## X3:X5 1 7.3403e+12 2.5491e+13 24528 292.5696 < 2.2e-16 ***
## X3:X6 1 4.6462e+12 2.8185e+13 24631 167.4842 < 2.2e-16 ***
## X4:X5 1 4.8446e+12 2.7986e+13 24624 175.8765 < 2.2e-16 ***
## X4:X6 1 4.0482e+12 2.8783e+13 24653 142.8979 < 2.2e-16 ***
## X5:X6 1 4.7672e+12 2.8064e+13 24627 172.5890 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
13 / 20

Consider adding interaction effects

Using a significance level of α=0.05, the interactions that are not significant are

  • X1 with X2
  • X1 with X3
  • X2 with X4
  • X3 with X4

For this example, we will be adding all significant interactions into our main model.

13 / 20

Consider quadratic effects

Let us also consider quadratic effects, obtained by squaring each main predictor (this can also be thought of as a variable's interaction with itself). We will not consider the square of X6: since X6 is a binary (0-1) predictor, squaring it would produce an identical variable.

add1(
  main_model,
  ~ . + I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2),
  test = "F"
)

Note that we need to wrap each squared term in I() to mean literal exponentiation, since ^ has a special meaning inside a formula. See ?formula for details.

## Single term additions
##
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3.2831e+13 24786
## I(X1^2) 1 2.9689e+12 2.9862e+13 24690 101.012 < 2.2e-16 ***
## I(X2^2) 1 6.6248e+11 3.2168e+13 24767 20.924 5.369e-06 ***
## I(X3^2) 1 1.2554e+12 3.1575e+13 24748 40.395 3.128e-10 ***
## I(X4^2) 1 2.4475e+12 3.0383e+13 24708 81.844 < 2.2e-16 ***
## I(X5^2) 1 2.9071e+13 3.7596e+12 22568 7856.178 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
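
To see the difference I() makes, compare the columns of the design matrices the two notations generate (an illustrative sketch):

```r
# ^ inside a formula means crossing: (X1 + X2)^2 expands to X1 + X2 + X1:X2
colnames(model.matrix(~ (X1 + X2)^2, data = t7))

# I() protects the arithmetic meaning: this literally squares X1
colnames(model.matrix(~ X1 + I(X1^2), data = t7))
```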
14 / 20

Consider quadratic effects

Using a significance level of α=0.05, all quadratic terms are significant.

For this example, we will be adding all quadratic terms into our main model.

14 / 20

Fit the new model

new_formula <- Y ~ .^2 - X1:X2 - X1:X3 - X2:X4 - X3:X4 +
  I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2)
new_model <- lm(new_formula, data=t7)
Let's use tidy() to print just the coefficient table.

tidy(new_model) %>%
  print(n=23)

## # A tibble: 23 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 177. 64.5 2.75 6.15e- 3
## 2 X1 128. 4.36 29.3 1.25e-136
## 3 X2 32.8 0.907 36.2 8.09e-184
## 4 X3 7.00 0.00777 901. 0
## 5 X4 -49.6 4.28 -11.6 2.61e- 29
## 6 X5 2.69 0.479 5.62 2.54e- 8
## 7 X6 8.49 31.3 0.271 7.86e- 1
## 8 I(X1^2) 0.744 0.802 0.927 3.54e- 1
## 9 I(X2^2) -0.00381 0.00509 -0.748 4.55e- 1
## 10 I(X3^2) -0.000000192 0.000000166 -1.16 2.46e- 1
## 11 I(X4^2) 0.282 0.804 0.351 7.26e- 1
## 12 I(X5^2) -32.0 0.000936 -34168. 0
## 13 X1:X4 -1.02 1.60 -0.636 5.25e- 1
## 14 X1:X5 0.0767 0.0471 1.63 1.04e- 1
## 15 X1:X6 1.56 3.40 0.458 6.47e- 1
## 16 X2:X3 0.0000598 0.0000597 1.00 3.17e- 1
## 17 X2:X5 1.01 0.00280 359. 0
## 18 X2:X6 -0.0406 0.275 -0.147 8.83e- 1
## 19 X3:X5 -0.0000240 0.0000224 -1.07 2.84e- 1
## 20 X3:X6 -8.00 0.00180 -4436. 0
## 21 X4:X5 -0.0760 0.0472 -1.61 1.08e- 1
## 22 X4:X6 -1.40 3.41 -0.411 6.81e- 1
## 23 X5:X6 -0.0421 0.119 -0.355 7.23e- 1
15 / 20

Produce 95% confidence intervals for the betas

We can obtain confidence intervals for the betas using confint(), or as a tibble using tidy() with conf.int=TRUE.

# confint(new_model)
tidy(new_model, conf.int=TRUE) %>%
  select(term, contains("conf")) %>%
  print(n=23)
## # A tibble: 23 x 3
## term conf.low conf.high
## <chr> <dbl> <dbl>
## 1 (Intercept) 50.5 303.
## 2 X1 119. 136.
## 3 X2 31.0 34.6
## 4 X3 6.99 7.02
## 5 X4 -58.0 -41.2
## 6 X5 1.75 3.63
## 7 X6 -52.9 69.9
## 8 I(X1^2) -0.830 2.32
## 9 I(X2^2) -0.0138 0.00619
## 10 I(X3^2) -0.000000518 0.000000133
## 11 I(X4^2) -1.29 1.86
## 12 I(X5^2) -32.0 -32.0
## 13 X1:X4 -4.17 2.13
## 14 X1:X5 -0.0158 0.169
## 15 X1:X6 -5.11 8.23
## 16 X2:X3 -0.0000574 0.000177
## 17 X2:X5 1.00 1.01
## 18 X2:X6 -0.581 0.500
## 19 X3:X5 -0.0000679 0.0000199
## 20 X3:X6 -8.00 -8.00
## 21 X4:X5 -0.169 0.0166
## 22 X4:X6 -8.09 5.29
## 23 X5:X6 -0.275 0.191
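
Since tidy() returns a tibble, we can also pipe the intervals straight into ggplot2 for a quick coefficient plot (a sketch; the filter simply drops the intercept so the scale stays readable):

```r
tidy(new_model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high))
```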
16 / 20

Creating the variables beforehand

An advantage of creating the variables beforehand is that it simplifies the formula specification, and your model output may look a bit cleaner. Consider this simple example:

demo_data1 <- t7 %>%
  select(Y, X3) %>%
  mutate(X3sq = X3^2)
demo_model1 <- lm(Y ~ ., data=demo_data1)
summary(demo_model1)
##
## Call:
## lm(formula = Y ~ ., data = demo_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3736905 -51045 85325 169189 971743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.413e+05 2.515e+04 -13.57 <2e-16 ***
## X3 3.388e+01 2.506e+00 13.52 <2e-16 ***
## X3sq -9.323e-04 4.874e-05 -19.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 381900 on 1021 degrees of freedom
## Multiple R-squared: 0.367, Adjusted R-squared: 0.3658
## F-statistic: 296 on 2 and 1021 DF, p-value: < 2.2e-16
17 / 20

Creating the variables beforehand

The disadvantage appears when making predictions. When new variables are created beforehand, the relationships between them are not retained. To make a prediction with the model where we squared X3 beforehand, we need to supply values for both X3 and X3sq.

predict(
  demo_model1,
  newdata = data.frame(X3=100, X3sq=100^2)
)
## 1
## -337875.1

Note that we could have used broom::augment() to obtain our predictions if we wanted our output to be a tibble.

18 / 20

Creating the variables beforehand

However, if we create the variables in the formula specification, the relationships between them are retained. When predicting, we only need to supply a value for X3; the value of X3^2 is calculated automatically by squaring it.

demo_model2 <- lm(Y ~ X3 + I(X3^2), data=demo_data1)
predict(
  demo_model2,
  newdata = data.frame(X3=100)
)
## 1
## -337875.1


19 / 20

Creating training/testing splits

Oftentimes, we may want to train our model on a portion of the available data, and evaluate its performance on the remaining unseen data. This can be accomplished in a tidy manner using the rsample and yardstick packages from the tidymodels collection.

An example workflow:

set.seed(100) # For reproducibility

# Create an 80/20 split
t7_split <- t7 %>%
  initial_split(prop=0.8)
t7_train <- t7_split %>%
  training()
t7_test <- t7_split %>%
  testing()

main_model <- lm(Y ~ ., data=t7_train)
main_model %>%
  augment(newdata = t7_test) %>%
  metrics(truth=Y, estimate=.fitted)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 197221.
## 2 rsq standard 0.860
## 3 mae standard 126751.
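
yardstick also lets us compute a single metric rather than the default set; for example, RMSE alone on the same augmented test data:

```r
# RMSE only, using yardstick's data-frame interface
main_model %>%
  augment(newdata = t7_test) %>%
  rmse(truth = Y, estimate = .fitted)
```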
20 / 20
