library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(GGally)
library(broom)
theme_set(theme_bw())
easystats is a collection of packages, like the tidyverse. easystats is not currently on CRAN, but it can be installed using:
install.packages("easystats", repos="https://easystats.r-universe.dev")
However, its component packages and dependencies can be installed from CRAN normally:
install.packages("performance")
install.packages("see")
install.packages("patchwork")
Once installed,
library(performance)
# or
# library(easystats)
tidymodels is an extension of the tidyverse that focuses on modelling. (Recall that broom is actually a package from tidymodels, not tidyverse!)
You can install tidymodels using the usual methods. You can also install just its component packages. In this lab, I introduce two additional packages from the tidymodels collection:
library(rsample)
library(yardstick)
t7 <- read_csv("./data/tutorial7.csv")
## Rows: 1024 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (7): Y, X1, X2, X3, X4, X5, X6
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
t7
## # A tibble: 1,024 x 7
##            Y     X1      X2     X3    X4    X5    X6
##        <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 -4078113. 0.0727   2.91    1.35 0.224  357.     1
##  2 -3100124. 0.181    8.08    2.57 0.848  311.     1
##  3 -2282073. 0.659  115.     77.4  1.06   269.     0
##  4 -1993606. 0.807   47.4    42.3  1.35   251.     0
##  5 -1569783. 1.24     0.785  -1.61 1.36   222.     1
##  6 -1467620. 2.46   103.    258.   1.55   216.     1
##  7 -1248135. 2.50    22.4    67.7  1.59   198.     1
##  8 -1223299. 2.56   104.    262.   1.60   198.     0
##  9 -1221486. 2.74    89.7   254.   1.95   197.     1
## 10 -1178091. 3.29    81.8   269.   2.30   194.     0
## # ... with 1,014 more rows
ggpairs(t7)
main_model <- lm(Y ~ ., data=t7)
summary(main_model)
## 
## Call:
## lm(formula = Y ~ ., data = t7)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1665731   -72925    22740    95651   308517 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.273e+05  3.501e+04   9.350  < 2e-16 ***
## X1           3.291e+04  3.657e+03   8.999  < 2e-16 ***
## X2          -2.506e+02  3.055e+02  -0.821  0.41212    
## X3           6.365e+00  2.014e+00   3.160  0.00162 ** 
## X4          -3.312e+04  3.666e+03  -9.033  < 2e-16 ***
## X5          -7.394e+03  1.300e+02 -56.895  < 2e-16 ***
## X6          -1.696e+05  1.148e+04 -14.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 179700 on 1017 degrees of freedom
## Multiple R-squared:  0.8604, Adjusted R-squared:  0.8596 
## F-statistic:  1045 on 6 and 1017 DF,  p-value: < 2.2e-16
AIC(main_model)
## [1] 27693.49
AIC is also reported by broom::glance():
glance(main_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared   sigma statistic p.value    df  logLik    AIC    BIC deviance df.residual  nobs
##       <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1     0.860         0.860 179672.     1045.       0     6 -13839. 27693. 27733.  3.28e13        1017  1024
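If you only want the information criteria, a minimal sketch of pulling them out of glance()'s tibble:

# A sketch: select just the information criteria from glance()'s output
glance(main_model) %>%
  select(AIC, BIC)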
par(mfrow = c(2, 2)) # Create 2x2 grid
plot(main_model, which=c(1, 2, 3, 5))
performance::check_model() makes it easier to read and interpret the diagnostic plots. The plots are created using ggplot2, and each one is labelled to tell you what to look for.
check_model(main_model, check=c("vif", "qq", "normality", "linearity", "homogeneity", "outliers"))
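Beyond the plots, the performance package also offers numeric versions of these checks; a quick sketch:

check_collinearity(main_model)       # variance inflation factors
check_normality(main_model)          # test of residual normality
check_heteroscedasticity(main_model) # test of constant error variance
check_outliers(main_model)           # flags influential observations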
main_model %>%
  augment() %>%
  select(Y, contains("X"), .resid) %>%
  ggpairs()
We wish to consider interactions between all the main effects. For example, consider the interaction between variables X1 and X2.
model_X1X2 <- lm(Y ~ . + X1:X2, data=t7)
anova(main_model, model_X1X2)
## Analysis of Variance Table
## 
## Model 1: Y ~ X1 + X2 + X3 + X4 + X5 + X6
## Model 2: Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X1:X2
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1017 3.2831e+13                            
## 2   1016 3.2800e+13  1 3.0461e+10 0.9435 0.3316
This interaction is not significant, so we will not add it to our model. It remains to check all the other two-way interactions. We do not want to build all these models manually and pass each one into anova() to compare with our main model.
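For illustration, a manual version might look like the sketch below (pair_terms is our own hypothetical name, and only the first few of the 15 pairs are listed):

# Fit each interaction model, then F-test it against the main model
pair_terms <- c("X1:X2", "X1:X3", "X1:X4")  # ...and so on
fits  <- map(pair_terms,
             ~ update(main_model, as.formula(paste(". ~ . +", .x))))
tests <- map(fits, ~ anova(main_model, .x))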
Instead, we will use the add1() function, which adds single terms to the main model and reports the usual F-test from comparing the two models with anova().
add1(main_model, ~ .^2, test = "F")
## Single term additions
## 
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
##        Df  Sum of Sq        RSS   AIC  F value    Pr(>F)    
## <none>               3.2831e+13 24786                       
## X1:X2   1 3.0461e+10 3.2800e+13 24787   0.9435   0.33160    
## X1:X3   1 7.8963e+10 3.2752e+13 24785   2.4495   0.11787    
## X1:X4   1 2.7107e+12 3.0120e+13 24699  91.4374 < 2.2e-16 ***
## X1:X5   1 4.7530e+12 2.8078e+13 24627 171.9883 < 2.2e-16 ***
## X1:X6   1 4.0214e+12 2.8809e+13 24654 141.8184 < 2.2e-16 ***
## X2:X3   1 1.0099e+12 3.1821e+13 24756  32.2458 1.773e-08 ***
## X2:X4   1 5.3209e+10 3.2778e+13 24786   1.6493   0.19935    
## X2:X5   1 1.2772e+12 3.1554e+13 24747  41.1238 2.185e-10 ***
## X2:X6   1 2.6245e+12 3.0206e+13 24702  88.2748 < 2.2e-16 ***
## X3:X4   1 9.8325e+10 3.2733e+13 24784   3.0520   0.08094 .  
## X3:X5   1 7.3403e+12 2.5491e+13 24528 292.5696 < 2.2e-16 ***
## X3:X6   1 4.6462e+12 2.8185e+13 24631 167.4842 < 2.2e-16 ***
## X4:X5   1 4.8446e+12 2.7986e+13 24624 175.8765 < 2.2e-16 ***
## X4:X6   1 4.0482e+12 2.8783e+13 24653 142.8979 < 2.2e-16 ***
## X5:X6   1 4.7672e+12 2.8064e+13 24627 172.5890 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a significance level of α = 0.05, the interactions that are not significant are X1:X2, X1:X3, X2:X4, and X3:X4.
For this example, we will add all significant interactions to our main model.
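If you prefer to extract these terms programmatically rather than by eye, a minimal sketch (the <none> row has NA for Pr(>F) and is dropped by the filter):

# A sketch: filter the add1() table for the significant interactions
add1(main_model, ~ .^2, test = "F") %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  filter(`Pr(>F)` < 0.05)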
Let us also consider quadratic effects, obtained by squaring the main predictors (a quadratic term can also be thought of as a variable's interaction with itself). We will not consider the square of X6: since X6 is a binary (0-1) predictor, squaring it would produce an identical variable.
add1(main_model, ~ . + I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2), test = "F")
We also need to wrap the squaring in I() to mean literal exponentiation, as ^ has a different meaning when used in a formula. See ?formula for details.
## Single term additions
## 
## Model:
## Y ~ X1 + X2 + X3 + X4 + X5 + X6
##         Df  Sum of Sq        RSS   AIC  F value    Pr(>F)    
## <none>                3.2831e+13 24786                       
## I(X1^2)  1 2.9689e+12 2.9862e+13 24690  101.012 < 2.2e-16 ***
## I(X2^2)  1 6.6248e+11 3.2168e+13 24767   20.924 5.369e-06 ***
## I(X3^2)  1 1.2554e+12 3.1575e+13 24748   40.395 3.128e-10 ***
## I(X4^2)  1 2.4475e+12 3.0383e+13 24708   81.844 < 2.2e-16 ***
## I(X5^2)  1 2.9071e+13 3.7596e+12 22568 7856.178 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a significance level of α = 0.05, all quadratic terms are significant. For this example, we will add all of them to our main model.
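As a quick aside on the I() point above, a minimal sketch of what ^ does in a formula without it:

# Without I(), ^ means formula crossing, not arithmetic:
# X3^2 expands to X3 * X3, which simplifies to just X3.
coef(lm(Y ~ X3^2, data = t7))    # same fit as Y ~ X3
coef(lm(Y ~ I(X3^2), data = t7)) # a genuine squared term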
new_formula <- Y ~ .^2 - X1:X2 - X1:X3 - X2:X4 - X3:X4 +
  I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2)
new_model <- lm(new_formula, data=t7)

Let's use tidy() to print just the coefficient table.

tidy(new_model) %>% print(n=23)
## # A tibble: 23 x 5
##    term             estimate    std.error  statistic   p.value
##    <chr>               <dbl>        <dbl>      <dbl>     <dbl>
##  1 (Intercept)  177.          64.5             2.75  6.15e-  3
##  2 X1           128.           4.36           29.3   1.25e-136
##  3 X2            32.8          0.907          36.2   8.09e-184
##  4 X3             7.00         0.00777       901.    0        
##  5 X4           -49.6          4.28          -11.6   2.61e- 29
##  6 X5             2.69         0.479           5.62  2.54e-  8
##  7 X6             8.49        31.3             0.271 7.86e-  1
##  8 I(X1^2)        0.744        0.802           0.927 3.54e-  1
##  9 I(X2^2)       -0.00381      0.00509        -0.748 4.55e-  1
## 10 I(X3^2)       -0.000000192  0.000000166    -1.16  2.46e-  1
## 11 I(X4^2)        0.282        0.804           0.351 7.26e-  1
## 12 I(X5^2)      -32.0          0.000936   -34168.    0        
## 13 X1:X4         -1.02         1.60           -0.636 5.25e-  1
## 14 X1:X5          0.0767       0.0471          1.63  1.04e-  1
## 15 X1:X6          1.56         3.40            0.458 6.47e-  1
## 16 X2:X3          0.0000598    0.0000597       1.00  3.17e-  1
## 17 X2:X5          1.01         0.00280       359.    0        
## 18 X2:X6         -0.0406       0.275          -0.147 8.83e-  1
## 19 X3:X5         -0.0000240    0.0000224      -1.07  2.84e-  1
## 20 X3:X6         -8.00         0.00180     -4436.    0        
## 21 X4:X5         -0.0760       0.0472         -1.61  1.08e-  1
## 22 X4:X6         -1.40         3.41           -0.411 6.81e-  1
## 23 X5:X6         -0.0421       0.119          -0.355 7.23e-  1
We can obtain confidence intervals for the betas using confint(), or as a tibble by calling tidy() with conf.int=TRUE:
# confint(new_model)
tidy(new_model, conf.int=TRUE) %>%
  select(term, contains("conf")) %>%
  print(n=23)
## # A tibble: 23 x 3
##    term            conf.low    conf.high
##    <chr>              <dbl>        <dbl>
##  1 (Intercept)  50.5        303.        
##  2 X1          119.         136.        
##  3 X2           31.0         34.6       
##  4 X3            6.99         7.02      
##  5 X4          -58.0        -41.2       
##  6 X5            1.75         3.63      
##  7 X6          -52.9         69.9       
##  8 I(X1^2)      -0.830        2.32      
##  9 I(X2^2)      -0.0138       0.00619   
## 10 I(X3^2)      -0.000000518  0.000000133
## 11 I(X4^2)      -1.29         1.86      
## 12 I(X5^2)     -32.0        -32.0       
## 13 X1:X4        -4.17         2.13      
## 14 X1:X5        -0.0158       0.169     
## 15 X1:X6        -5.11         8.23      
## 16 X2:X3        -0.0000574    0.000177  
## 17 X2:X5         1.00         1.01      
## 18 X2:X6        -0.581        0.500     
## 19 X3:X5        -0.0000679    0.0000199 
## 20 X3:X6        -8.00        -8.00      
## 21 X4:X5        -0.169        0.0166    
## 22 X4:X6        -8.09         5.29      
## 23 X5:X6        -0.275        0.191
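A sketch of one way to visualise these intervals (the intercept is dropped here only because its scale would swamp the plot):

# A sketch: plot each estimate with its confidence interval
tidy(new_model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high))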
An advantage of creating the variables beforehand is that it simplifies the formula specification, and your model output may look a bit cleaner. Consider this simple example:
demo_data1 <- t7 %>%
  select(Y, X3) %>%
  mutate(X3sq = X3^2)
demo_model1 <- lm(Y ~ ., data=demo_data1)
summary(demo_model1)
## 
## Call:
## lm(formula = Y ~ ., data = demo_data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3736905   -51045    85325   169189   971743 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.413e+05  2.515e+04  -13.57   <2e-16 ***
## X3           3.388e+01  2.506e+00   13.52   <2e-16 ***
## X3sq        -9.323e-04  4.874e-05  -19.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 381900 on 1021 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.3658 
## F-statistic:   296 on 2 and 1021 DF,  p-value: < 2.2e-16
The disadvantage appears when making predictions. When new variables are created beforehand, the relationships between them are not retained. To make a prediction with the model where we squared X3 beforehand, we need to supply values for both X3 and X3sq.
predict(demo_model1, newdata = data.frame(X3=100, X3sq=100^2))
##         1 
## -337875.1
Note that we could have used broom::augment() to obtain our predictions if we wanted the output to be a tibble.
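For example, a minimal sketch of that augment() route:

# A sketch: augment() returns the new data plus a .fitted column
demo_model1 %>%
  augment(newdata = tibble(X3 = 100, X3sq = 100^2))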
However, if we create the variables in the formula specification, the relationships between the variables are retained. When predicting, we only need to supply a value for X3; the value of X3^2 is calculated automatically by squaring the supplied X3.
demo_model2 <- lm(Y ~ X3 + I(X3^2), data=demo_data1)
predict(demo_model2, newdata = data.frame(X3=100))
##         1 
## -337875.1
Oftentimes, we may want to train our model on a portion of the available data, and evaluate its performance on the remaining unseen data. This can be accomplished in a tidy manner using the rsample and yardstick packages from the tidymodels collection.
An example workflow:
set.seed(100) # For reproducibility

# Create an 80/20 split
t7_split <- t7 %>% initial_split(prop=0.8)
t7_train <- t7_split %>% training()
t7_test  <- t7_split %>% testing()

main_model <- lm(Y ~ ., data=t7_train)

main_model %>%
  augment(newdata = t7_test) %>%
  metrics(truth=Y, estimate=.fitted)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     197221.
## 2 rsq     standard       0.860
## 3 mae     standard     126751.
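yardstick also exposes each metric on its own; a small sketch computing just the RMSE:

# A sketch: a single yardstick metric instead of the metrics() bundle
main_model %>%
  augment(newdata = t7_test) %>%
  rmse(truth = Y, estimate = .fitted)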