library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(GGally)
library(broom)
theme_set(theme_bw())
easystats is a collection of packages, like the tidyverse. easystats is not currently on CRAN, but it can be installed using:
install.packages("easystats", repos="https://easystats.r-universe.dev")
However, its component packages and dependencies can be installed from CRAN normally:
install.packages("performance")
install.packages("see")
install.packages("patchwork")
Once installed,
library(performance)
# or
# library(easystats)
tidymodels is an extension of the tidyverse that focuses on modelling. (Recall that broom is actually a package from tidymodels, not tidyverse!)
You can install tidymodels using the usual methods. You can also install just its component packages. In this lab, I introduce two additional packages from the tidymodels collection:
library(rsample)
library(yardstick)
t7 <- read_csv("./data/tutorial7.csv")
## Rows: 1024 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (7): Y, X1, X2, X3, X4, X5, X6
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
t7
## # A tibble: 1,024 x 7
##            Y     X1      X2     X3    X4    X5    X6
##        <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 -4078113. 0.0727   2.91    1.35 0.224  357.     1
##  2 -3100124. 0.181    8.08    2.57 0.848  311.     1
##  3 -2282073. 0.659  115.     77.4  1.06   269.     0
##  4 -1993606. 0.807   47.4    42.3  1.35   251.     0
##  5 -1569783. 1.24     0.785  -1.61 1.36   222.     1
##  6 -1467620. 2.46   103.    258.   1.55   216.     1
##  7 -1248135. 2.50    22.4    67.7  1.59   198.     1
##  8 -1223299. 2.56   104.    262.   1.60   198.     0
##  9 -1221486. 2.74    89.7   254.   1.95   197.     1
## 10 -1178091. 3.29    81.8   269.   2.30   194.     0
## # ... with 1,014 more rows
ggpairs(t7)
main_model <- lm(Y ~ ., data=t7)
summary(main_model)
## 
## Call:
## lm(formula = Y ~ ., data = t7)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1665731   -72925    22740    95651   308517 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.273e+05  3.501e+04   9.350  < 2e-16 ***
## X1           3.291e+04  3.657e+03   8.999  < 2e-16 ***
## X2          -2.506e+02  3.055e+02  -0.821  0.41212    
## X3           6.365e+00  2.014e+00   3.160  0.00162 ** 
## X4          -3.312e+04  3.666e+03  -9.033  < 2e-16 ***
## X5          -7.394e+03  1.300e+02 -56.895  < 2e-16 ***
## X6          -1.696e+05  1.148e+04 -14.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 179700 on 1017 degrees of freedom
## Multiple R-squared:  0.8604, Adjusted R-squared:  0.8596 
## F-statistic:  1045 on 6 and 1017 DF,  p-value: < 2.2e-16
AIC(main_model)
## [1] 27693.49
AIC is also reported by broom::glance():
glance(main_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared   sigma statistic p.value    df  logLik    AIC    BIC deviance df.residual  nobs
##       <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1     0.860         0.860 179672.     1045.       0     6 -13839. 27693. 27733.  3.28e13        1017  1024
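If you only want the information criteria, a minimal sketch of pulling them out of glance()'s tibble:

# A sketch: select just the information criteria from glance()'s output
glance(main_model) %>%
  select(AIC, BIC)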
par(mfrow = c(2, 2)) # Create 2x2 grid
plot(main_model, which=c(1, 2, 3, 5))
performance::check_model() makes it easier to read and interpret the diagnostic plots. The plots are created using ggplot2, and each one is labelled to tell you what to look for.
check_model(main_model, check=c("vif", "qq", "normality", "linearity", "homogeneity", "outliers"))
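Beyond the plots, the performance package also offers numeric versions of these checks; a quick sketch:

check_collinearity(main_model)       # variance inflation factors
check_normality(main_model)          # test of residual normality
check_heteroscedasticity(main_model) # test of constant error variance
check_outliers(main_model)           # flags influential observations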
main_model %>%
  augment() %>%
  select(Y, contains("X"), .resid) %>%
  ggpairs()
We wish to consider interactions between all the main effects. For example, consider the interaction between variables X1 and X2.
model_X1X2 <- lm(Y ~ . + X1:X2, data=t7)
anova(main_model, model_X1X2)
## Analysis of Variance Table
## 
## Model 1: Y ~ X1 + X2 + X3 + X4 + X5 + X6
## Model 2: Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X1:X2
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1017 3.2831e+13                            
## 2   1016 3.2800e+13  1 3.0461e+10 0.9435 0.3316
This interaction is not significant, so we will not add it to our model. It remains to check all the other two-way interactions. We do not want to build all these models manually and pass each one into anova() to compare with our main model.
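For illustration, a manual version might look like the sketch below (pair_terms is our own hypothetical name, and only the first few of the 15 pairs are listed):

# Fit each interaction model, then F-test it against the main model
pair_terms <- c("X1:X2", "X1:X3", "X1:X4")  # ...and so on
fits  <- map(pair_terms,
             ~ update(main_model, as.formula(paste(". ~ . +", .x))))
tests <- map(fits, ~ anova(main_model, .x))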
Instead, we will use the add1() function, which adds single terms to the main model and reports the usual F-test from comparing the two models with anova().
add1(main_model, ~ .^2, test = "F")
## Single term additions
## 
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
##        Df  Sum of Sq        RSS   AIC  F value    Pr(>F)    
## <none>               3.2831e+13 24786                       
## X1:X2   1 3.0461e+10 3.2800e+13 24787   0.9435   0.33160    
## X1:X3   1 7.8963e+10 3.2752e+13 24785   2.4495   0.11787    
## X1:X4   1 2.7107e+12 3.0120e+13 24699  91.4374 < 2.2e-16 ***
## X1:X5   1 4.7530e+12 2.8078e+13 24627 171.9883 < 2.2e-16 ***
## X1:X6   1 4.0214e+12 2.8809e+13 24654 141.8184 < 2.2e-16 ***
## X2:X3   1 1.0099e+12 3.1821e+13 24756  32.2458 1.773e-08 ***
## X2:X4   1 5.3209e+10 3.2778e+13 24786   1.6493   0.19935    
## X2:X5   1 1.2772e+12 3.1554e+13 24747  41.1238 2.185e-10 ***
## X2:X6   1 2.6245e+12 3.0206e+13 24702  88.2748 < 2.2e-16 ***
## X3:X4   1 9.8325e+10 3.2733e+13 24784   3.0520   0.08094 .  
## X3:X5   1 7.3403e+12 2.5491e+13 24528 292.5696 < 2.2e-16 ***
## X3:X6   1 4.6462e+12 2.8185e+13 24631 167.4842 < 2.2e-16 ***
## X4:X5   1 4.8446e+12 2.7986e+13 24624 175.8765 < 2.2e-16 ***
## X4:X6   1 4.0482e+12 2.8783e+13 24653 142.8979 < 2.2e-16 ***
## X5:X6   1 4.7672e+12 2.8064e+13 24627 172.5890 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a significance level of α = 0.05, the interactions that are not significant are X1:X2, X1:X3, X2:X4, and X3:X4.
For this example, we will add all significant interactions to our main model.
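If you prefer to extract these terms programmatically rather than by eye, a minimal sketch (the <none> row has NA for Pr(>F) and is dropped by the filter):

# A sketch: filter the add1() table for the significant interactions
add1(main_model, ~ .^2, test = "F") %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  filter(`Pr(>F)` < 0.05)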
Let us also consider quadratic effects, obtained by squaring the main predictors (a quadratic term can also be thought of as a variable's interaction with itself). We will not consider the square of X6: since X6 is a binary (0-1) predictor, squaring it would produce an identical variable.
add1(main_model, ~ . + I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2), test = "F")
We also need to wrap the squaring in I() to mean literal exponentiation, as ^ has a different meaning when used in a formula. See ?formula for details.
## Single term additions
## 
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
##         Df  Sum of Sq        RSS   AIC  F value    Pr(>F)    
## <none>                3.2831e+13 24786                       
## I(X1^2)  1 2.9689e+12 2.9862e+13 24690  101.012 < 2.2e-16 ***
## I(X2^2)  1 6.6248e+11 3.2168e+13 24767   20.924 5.369e-06 ***
## I(X3^2)  1 1.2554e+12 3.1575e+13 24748   40.395 3.128e-10 ***
## I(X4^2)  1 2.4475e+12 3.0383e+13 24708   81.844 < 2.2e-16 ***
## I(X5^2)  1 2.9071e+13 3.7596e+12 22568 7856.178 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a significance level of α = 0.05, all quadratic terms are significant. For this example, we will add all of them to our main model.
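As a quick aside on the I() point above, a minimal sketch of what ^ does in a formula without it:

# Without I(), ^ means formula crossing, not arithmetic:
# X3^2 expands to X3 * X3, which simplifies to just X3.
coef(lm(Y ~ X3^2, data = t7))    # same fit as Y ~ X3
coef(lm(Y ~ I(X3^2), data = t7)) # a genuine squared term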
new_formula <- Y ~ .^2 - X1:X2 - X1:X3 - X2:X4 - X3:X4 +
  I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2)
new_model <- lm(new_formula, data=t7)

Let's use tidy() to print just the coefficient table.

tidy(new_model) %>% print(n=23)
## # A tibble: 23 x 5
##    term             estimate    std.error  statistic   p.value
##    <chr>               <dbl>        <dbl>      <dbl>     <dbl>
##  1 (Intercept)  177.          64.5             2.75  6.15e-  3
##  2 X1           128.           4.36           29.3   1.25e-136
##  3 X2            32.8          0.907          36.2   8.09e-184
##  4 X3             7.00         0.00777       901.    0        
##  5 X4           -49.6          4.28          -11.6   2.61e- 29
##  6 X5             2.69         0.479           5.62  2.54e-  8
##  7 X6             8.49        31.3             0.271 7.86e-  1
##  8 I(X1^2)        0.744        0.802           0.927 3.54e-  1
##  9 I(X2^2)       -0.00381      0.00509        -0.748 4.55e-  1
## 10 I(X3^2)       -0.000000192  0.000000166    -1.16  2.46e-  1
## 11 I(X4^2)        0.282        0.804           0.351 7.26e-  1
## 12 I(X5^2)      -32.0          0.000936   -34168.    0        
## 13 X1:X4         -1.02         1.60           -0.636 5.25e-  1
## 14 X1:X5          0.0767       0.0471          1.63  1.04e-  1
## 15 X1:X6          1.56         3.40            0.458 6.47e-  1
## 16 X2:X3          0.0000598    0.0000597       1.00  3.17e-  1
## 17 X2:X5          1.01         0.00280       359.    0        
## 18 X2:X6         -0.0406       0.275          -0.147 8.83e-  1
## 19 X3:X5         -0.0000240    0.0000224      -1.07  2.84e-  1
## 20 X3:X6         -8.00         0.00180     -4436.    0        
## 21 X4:X5         -0.0760       0.0472         -1.61  1.08e-  1
## 22 X4:X6         -1.40         3.41           -0.411 6.81e-  1
## 23 X5:X6         -0.0421       0.119          -0.355 7.23e-  1
We can obtain confidence intervals for the betas using confint(), or as a tibble by calling tidy() with conf.int=TRUE:
# confint(new_model)
tidy(new_model, conf.int=TRUE) %>%
  select(term, contains("conf")) %>%
  print(n=23)
## # A tibble: 23 x 3
##    term            conf.low    conf.high
##    <chr>              <dbl>        <dbl>
##  1 (Intercept)  50.5        303.        
##  2 X1          119.         136.        
##  3 X2           31.0         34.6       
##  4 X3            6.99         7.02      
##  5 X4          -58.0        -41.2       
##  6 X5            1.75         3.63      
##  7 X6          -52.9         69.9       
##  8 I(X1^2)      -0.830        2.32      
##  9 I(X2^2)      -0.0138       0.00619   
## 10 I(X3^2)      -0.000000518  0.000000133
## 11 I(X4^2)      -1.29         1.86      
## 12 I(X5^2)     -32.0        -32.0       
## 13 X1:X4        -4.17         2.13      
## 14 X1:X5        -0.0158       0.169     
## 15 X1:X6        -5.11         8.23      
## 16 X2:X3        -0.0000574    0.000177  
## 17 X2:X5         1.00         1.01      
## 18 X2:X6        -0.581        0.500     
## 19 X3:X5        -0.0000679    0.0000199 
## 20 X3:X6        -8.00        -8.00      
## 21 X4:X5        -0.169        0.0166    
## 22 X4:X6        -8.09         5.29      
## 23 X5:X6        -0.275        0.191
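A sketch of one way to visualise these intervals (the intercept is dropped here only because its scale would swamp the plot):

# A sketch: plot each estimate with its confidence interval
tidy(new_model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high))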
An advantage of creating the variables beforehand is that it simplifies the formula specification, and your model output may look a bit cleaner. Consider this simple example:
demo_data1 <- t7 %>%
  select(Y, X3) %>%
  mutate(X3sq = X3^2)
demo_model1 <- lm(Y ~ ., data=demo_data1)
summary(demo_model1)
## 
## Call:
## lm(formula = Y ~ ., data = demo_data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3736905   -51045    85325   169189   971743 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.413e+05  2.515e+04  -13.57   <2e-16 ***
## X3           3.388e+01  2.506e+00   13.52   <2e-16 ***
## X3sq        -9.323e-04  4.874e-05  -19.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 381900 on 1021 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.3658 
## F-statistic:   296 on 2 and 1021 DF,  p-value: < 2.2e-16
The disadvantage appears when making predictions. When new variables are created beforehand, the relationships between them are not retained. To make a prediction with the model where we squared X3 beforehand, we need to supply values for both X3 and X3sq.
predict(demo_model1, newdata = data.frame(X3=100, X3sq=100^2))
##         1 
## -337875.1
Note that we could have used broom::augment() to obtain our predictions if we wanted the output to be a tibble.
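For example, a minimal sketch of that augment() route:

# A sketch: augment() returns the new data plus a .fitted column
demo_model1 %>%
  augment(newdata = tibble(X3 = 100, X3sq = 100^2))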
However, if we create the variables in the formula specification, the relationships between the variables are retained. When predicting, we only need to supply a value for X3; the value of X3^2 is calculated automatically by squaring the supplied X3.
demo_model2 <- lm(Y ~ X3 + I(X3^2), data=demo_data1)
predict(demo_model2, newdata = data.frame(X3=100))
##         1 
## -337875.1
Oftentimes, we may want to train our model on a portion of the available data, and evaluate its performance on the remaining unseen data. This can be accomplished in a tidy manner using the rsample and yardstick packages from the tidymodels collection.
An example workflow:
set.seed(100) # For reproducibility

# Create an 80/20 split
t7_split <- t7 %>% initial_split(prop=0.8)
t7_train <- t7_split %>% training()
t7_test  <- t7_split %>% testing()

main_model <- lm(Y ~ ., data=t7_train)

main_model %>%
  augment(newdata = t7_test) %>%
  metrics(truth=Y, estimate=.fitted)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     197221.
## 2 rsq     standard       0.860
## 3 mae     standard     126751.
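yardstick also exposes each metric on its own; a small sketch computing just the RMSE:

# A sketch: a single yardstick metric instead of the metrics() bundle
main_model %>%
  augment(newdata = t7_test) %>%
  rmse(truth = Y, estimate = .fitted)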