class: language-r layout: true --- # Lab 8 — November 29 ```r library(tidyverse) ``` <PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Attaching packages</span> --------------------------------------- tidyverse 1.3.1 -- </CODE></PRE><PRE class="fansi fansi-message"><CODE>## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>ggplot2</span> 3.3.6 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>purrr </span> 0.3.4 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tibble </span> 3.1.6 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>dplyr </span> 1.0.9 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tidyr </span> 1.2.0 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>stringr</span> 1.4.0 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>readr </span> 2.1.2 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>forcats</span> 0.5.1 </CODE></PRE><PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Conflicts</span> ------------------------------------------ tidyverse_conflicts() -- ## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>filter()</span> masks <span style='color: #0000BB;'>stats</span>::filter() ## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>lag()</span> masks <span style='color: #0000BB;'>stats</span>::lag() </CODE></PRE> ```r library(GGally) library(broom) library(performance) theme_set(theme_bw()) ``` --- # The `state` data set ```r ?state ``` - The `state` data set is divided into seven parts - The part that we are interested in is `state.x77` ```r data(state) ``` --- # The `state` data set ```r head(state.x77) ``` ``` ## Population Income Illiteracy Life Exp Murder HS Grad Frost Area ## Alabama 3615 3624 2.1 69.05 
15.1 41.3 20 50708 ## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 ## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 ## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 ## California 21198 5114 1.1 71.71 10.3 62.6 20 156361 ## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766 ``` ```r class(state.x77) ``` ``` ## [1] "matrix" "array" ``` - This data is in matrix format; we wish to convert it to a tibble before proceeding further - Note that the state names are the row names of the matrix; they aren't actually in a column --- # Converting the matrix to a tibble - Though not shown in the documentation for [as_tibble()](https://tibble.tidyverse.org/reference/as_tibble.html), you can actually supply the `rownames` argument to the `as_tibble()` method for matrices (documented [here](https://github.com/tidyverse/tibble/issues/288#issuecomment-334244077)) - The value you supply to the `rownames` argument is the name of the column that should contain the row names - Since the row names are the states, let's use `state` as the name of the new column ```r state2 <- state.x77 %>% as_tibble(rownames="state") state2 ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 50 x 9</span> ## state Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> Alabama <span style='text-decoration: underline;'>3</span>615 <span 
style='text-decoration: underline;'>3</span>624 2.1 69.0 15.1 41.3 20 <span style='text-decoration: underline;'>50</span>708 ## <span style='color: #555555;'> 2</span> Alaska 365 <span style='text-decoration: underline;'>6</span>315 1.5 69.3 11.3 66.7 152 <span style='text-decoration: underline;'>566</span>432 ## <span style='color: #555555;'> 3</span> Arizona <span style='text-decoration: underline;'>2</span>212 <span style='text-decoration: underline;'>4</span>530 1.8 70.6 7.8 58.1 15 <span style='text-decoration: underline;'>113</span>417 ## <span style='color: #555555;'> 4</span> Arkans~ <span style='text-decoration: underline;'>2</span>110 <span style='text-decoration: underline;'>3</span>378 1.9 70.7 10.1 39.9 65 <span style='text-decoration: underline;'>51</span>945 ## <span style='color: #555555;'> 5</span> Califo~ <span style='text-decoration: underline;'>21</span>198 <span style='text-decoration: underline;'>5</span>114 1.1 71.7 10.3 62.6 20 <span style='text-decoration: underline;'>156</span>361 ## <span style='color: #555555;'> 6</span> Colora~ <span style='text-decoration: underline;'>2</span>541 <span style='text-decoration: underline;'>4</span>884 0.7 72.1 6.8 63.9 166 <span style='text-decoration: underline;'>103</span>766 ## <span style='color: #555555;'> 7</span> Connec~ <span style='text-decoration: underline;'>3</span>100 <span style='text-decoration: underline;'>5</span>348 1.1 72.5 3.1 56 139 <span style='text-decoration: underline;'>4</span>862 ## <span style='color: #555555;'> 8</span> Delawa~ 579 <span style='text-decoration: underline;'>4</span>809 0.9 70.1 6.2 54.6 103 <span style='text-decoration: underline;'>1</span>982 ## <span style='color: #555555;'> 9</span> Florida <span style='text-decoration: underline;'>8</span>277 <span style='text-decoration: underline;'>4</span>815 1.3 70.7 10.7 52.6 11 <span style='text-decoration: underline;'>54</span>090 ## <span style='color: #555555;'>10</span> Georgia <span style='text-decoration: 
underline;'>4</span>931 <span style='text-decoration: underline;'>4</span>091 2 68.5 13.9 40.6 60 <span style='text-decoration: underline;'>58</span>073 ## <span style='color: #555555;'># ... with 40 more rows</span> </CODE></PRE> --- # Extra beautification - Let's also rename some of the columns... ```r state2 <- state2 %>% rename_with(~str_to_lower(str_replace_all(.x, pattern=" ", replacement="_"))) state2 ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 50 x 9</span> ## state population income illiteracy life_exp murder hs_grad frost area ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> Alabama <span style='text-decoration: underline;'>3</span>615 <span style='text-decoration: underline;'>3</span>624 2.1 69.0 15.1 41.3 20 <span style='text-decoration: underline;'>50</span>708 ## <span style='color: #555555;'> 2</span> Alaska 365 <span style='text-decoration: underline;'>6</span>315 1.5 69.3 11.3 66.7 152 <span style='text-decoration: underline;'>566</span>432 ## <span style='color: #555555;'> 3</span> Arizona <span style='text-decoration: underline;'>2</span>212 <span style='text-decoration: underline;'>4</span>530 1.8 70.6 7.8 58.1 15 <span style='text-decoration: underline;'>113</span>417 ## <span style='color: #555555;'> 4</span> Arkansas <span style='text-decoration: underline;'>2</span>110 <span style='text-decoration: underline;'>3</span>378 1.9 70.7 10.1 39.9 65 
<span style='text-decoration: underline;'>51</span>945 ## <span style='color: #555555;'> 5</span> California <span style='text-decoration: underline;'>21</span>198 <span style='text-decoration: underline;'>5</span>114 1.1 71.7 10.3 62.6 20 <span style='text-decoration: underline;'>156</span>361 ## <span style='color: #555555;'> 6</span> Colorado <span style='text-decoration: underline;'>2</span>541 <span style='text-decoration: underline;'>4</span>884 0.7 72.1 6.8 63.9 166 <span style='text-decoration: underline;'>103</span>766 ## <span style='color: #555555;'> 7</span> Connecticut <span style='text-decoration: underline;'>3</span>100 <span style='text-decoration: underline;'>5</span>348 1.1 72.5 3.1 56 139 <span style='text-decoration: underline;'>4</span>862 ## <span style='color: #555555;'> 8</span> Delaware 579 <span style='text-decoration: underline;'>4</span>809 0.9 70.1 6.2 54.6 103 <span style='text-decoration: underline;'>1</span>982 ## <span style='color: #555555;'> 9</span> Florida <span style='text-decoration: underline;'>8</span>277 <span style='text-decoration: underline;'>4</span>815 1.3 70.7 10.7 52.6 11 <span style='text-decoration: underline;'>54</span>090 ## <span style='color: #555555;'>10</span> Georgia <span style='text-decoration: underline;'>4</span>931 <span style='text-decoration: underline;'>4</span>091 2 68.5 13.9 40.6 60 <span style='text-decoration: underline;'>58</span>073 ## <span style='color: #555555;'># ... with 40 more rows</span> </CODE></PRE> --- # Fit a basic OLS model - We wish to fit a basic OLS model with life expectancy as the response and all other variables except `state` as predictors ```r full_model <- lm(life_exp ~ . - state, data=state2) summary(full_model) ``` ``` ## ## Call: ## lm(formula = life_exp ~ . - state, data = state2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.48895 -0.51232 -0.02747 0.57002 1.49447 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 *** ## population 5.180e-05 2.919e-05 1.775 0.0832 . ## income -2.180e-05 2.444e-04 -0.089 0.9293 ## illiteracy 3.382e-02 3.663e-01 0.092 0.9269 ## murder -3.011e-01 4.662e-02 -6.459 8.68e-08 *** ## hs_grad 4.893e-02 2.332e-02 2.098 0.0420 * ## frost -5.735e-03 3.143e-03 -1.825 0.0752 . ## area -7.383e-08 1.668e-06 -0.044 0.9649 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.7448 on 42 degrees of freedom ## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922 ## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10 ``` --- # Perform backward stepwise selection using AIC - You can use the `step()` function to perform stepwise selection automatically - However, I prefer to alternate between `add1()` and `drop1()` in order to see the effect of adding or removing each predictor and to keep track of what I'm doing at the current step - Starting with the full model, we look at single-term deletions ```r drop1(full_model, ~ .) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> area 1 0.001<span style='text-decoration: underline;'>09</span> 23.3 -<span style='color: #BB0000;'>24.2</span> </CODE></PRE> - Remember: lower AIC is better, hence the `slice_min()` - We drop `area` from our model ```r step1 <- update(full_model, . ~ . - area) ``` --- # Perform backward stepwise selection using AIC - Can we add anything back? ```r add1(step1, ~ . 
+ area) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> <none> <span style='color: #BB0000;'>NA</span> <span style='color: #BB0000;'>NA</span> 23.3 -<span style='color: #BB0000;'>24.2</span> </CODE></PRE> - The lowest AIC belongs to `<none>`; adding `area` back would not improve the model, so we continue dropping predictors ```r drop1(step1, ~ .) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> illiteracy 1 0.003<span style='text-decoration: underline;'>76</span> 23.3 -<span style='color: #BB0000;'>26.2</span> </CODE></PRE> - Let's drop `illiteracy` from our model ```r step2 <- update(step1, . ~ . - illiteracy) ``` --- # Perform backward stepwise selection using AIC - Can we add anything back? ```r add1(step2, ~ . 
+ area + illiteracy) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> <none> <span style='color: #BB0000;'>NA</span> <span style='color: #BB0000;'>NA</span> 23.3 -<span style='color: #BB0000;'>26.2</span> </CODE></PRE> - We cannot, so we continue with dropping ```r drop1(step2, ~ .) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> income 1 0.006<span style='text-decoration: underline;'>06</span> 23.3 -<span style='color: #BB0000;'>28.2</span> </CODE></PRE> ```r step3 <- update(step2, . ~ . - income) ``` --- # Perform backward stepwise selection using AIC - Can we add anything back? ```r add1(step3, ~ . 
+ area + illiteracy + income) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> <none> <span style='color: #BB0000;'>NA</span> <span style='color: #BB0000;'>NA</span> 23.3 -<span style='color: #BB0000;'>28.2</span> </CODE></PRE> - We cannot, so we continue with dropping ```r drop1(step3, ~ .) %>% tidy() %>% slice_min(AIC) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term df sumsq rss AIC ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> <none> <span style='color: #BB0000;'>NA</span> <span style='color: #BB0000;'>NA</span> 23.3 -<span style='color: #BB0000;'>28.2</span> </CODE></PRE> - The model with the lowest AIC is the one that does not drop any more terms; `step3` is the final model --- # Perform backward stepwise selection using AIC Our final model: ```r summary(step3) ``` ``` ## ## Call: ## lm(formula = life_exp ~ population + murder + hs_grad + frost, ## data = state2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.47095 -0.53464 -0.03701 0.57621 1.50683 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 *** ## population 5.014e-05 2.512e-05 1.996 0.05201 . 
## murder -3.001e-01 3.661e-02 -8.199 1.77e-10 *** ## hs_grad 4.658e-02 1.483e-02 3.142 0.00297 ** ## frost -5.943e-03 2.421e-03 -2.455 0.01802 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.7197 on 45 degrees of freedom ## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126 ## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12 ``` --- # Check the linear model assumptions Recall that we can check the assumptions using ```r par(mfrow = c(2, 2)) plot(step3, which=c(1, 2, 3, 5)) ``` or by using `performance::check_model()` ```r check_model(step3, check=c("vif", "qq", "normality", "linearity", "homogeneity", "outliers")) ``` --- # Check the linear model assumptions <img src="index_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" /> --- # Check the linear model assumptions - There appear to be some issues with the linearity assumption - The reference line is not very flat: there is a slight dip around `\(x \approx 71\)` - There also seem to be some issues with the constant variance assumption - The reference line in that panel likewise dips near `\(x \approx 70.5\)` --- # Make a scatterplot matrix - Let's make a scatterplot matrix containing the fitted values, the predictors, and the residuals ```r step3 %>% augment() %>% select(.fitted, population, murder, hs_grad, frost, .resid) %>% ggpairs() ``` --- # Make a scatterplot matrix - Let's make a scatterplot matrix containing the fitted values, the predictors, and the residuals <img src="index_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" /> --- # Make a scatterplot matrix - The residual vs fitted plot (row 6, column 1) and the residual vs murder plot (row 6, column 3) show similar patterns - Therefore, let us regress the absolute value of the residuals on `murder` --- # Linear model of residuals against murder ```r resid_murder_data <- step3 %>% augment() %>% select(.resid, murder) resid_murder_model <- lm(abs(.resid) ~ murder, 
data=resid_murder_data) ``` --- # Perform weighted least squares regression - To address the non-constant error variance in the model obtained via backward stepwise selection, we perform weighted least squares regression - The previous regression of the absolute residuals on the predictor `murder` provides an estimate of the error standard deviation `\(\sigma_{i}\)` for each observation - The weights are the reciprocals of the squared fitted values from this model, `\(w_i = 1 / \hat{\sigma}_i^2\)` ```r w <- 1 / predict(resid_murder_model)^2 head(w) ``` ``` ## 1 2 3 4 5 6 ## 4.328989 3.630323 3.126332 3.444607 3.474556 3.001557 ``` --- # Perform weighted least squares regression - We can fit a weighted least squares model using the `lm()` function and supplying the weights via the `weights` argument ```r weighted_model <- lm(life_exp ~ population + murder + hs_grad + frost, data=state2, weights=w) summary(weighted_model) ``` ``` ## ## Call: ## lm(formula = life_exp ~ population + murder + hs_grad + frost, ## data = state2, weights = w) ## ## Weighted Residuals: ## Min 1Q Median 3Q Max ## -2.32866 -0.93775 -0.08347 1.04909 2.58497 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.109e+01 9.254e-01 76.824 < 2e-16 *** ## population 5.337e-05 2.400e-05 2.224 0.0312 * ## murder -3.012e-01 3.594e-02 -8.383 9.62e-11 *** ## hs_grad 4.554e-02 1.451e-02 3.137 0.0030 ** ## frost -6.095e-03 2.406e-03 -2.533 0.0149 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 1.26 on 45 degrees of freedom ## Multiple R-squared: 0.7463, Adjusted R-squared: 0.7237 ## F-statistic: 33.09 on 4 and 45 DF, p-value: 7.067e-13 ``` --- # Comparing the coefficients ```r list( OLS = tidy(step3), WLS = tidy(weighted_model) ) ``` <PRE class="fansi fansi-output"><CODE>## $OLS ## <span style='color: #555555;'># A tibble: 5 x 5</span> ## term estimate std.error statistic p.value ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> (Intercept) 71.0 0.953 74.5 8.61<span style='color: #555555;'>e</span><span style='color: #BB0000;'>-49</span> ## <span style='color: #555555;'>2</span> population 0.000<span style='text-decoration: underline;'>050</span>1 0.000<span style='text-decoration: underline;'>025</span>1 2.00 5.20<span style='color: #555555;'>e</span><span style='color: #BB0000;'>- 2</span> ## <span style='color: #555555;'>3</span> murder -<span style='color: #BB0000;'>0.300</span> 0.036<span style='text-decoration: underline;'>6</span> -<span style='color: #BB0000;'>8.20</span> 1.77<span style='color: #555555;'>e</span><span style='color: #BB0000;'>-10</span> ## <span style='color: #555555;'>4</span> hs_grad 0.046<span style='text-decoration: underline;'>6</span> 0.014<span style='text-decoration: underline;'>8</span> 3.14 2.97<span style='color: #555555;'>e</span><span style='color: #BB0000;'>- 3</span> ## <span style='color: #555555;'>5</span> frost -<span style='color: #BB0000;'>0.005</span><span style='color: #BB0000; text-decoration: underline;'>94</span> 0.002<span style='text-decoration: underline;'>42</span> -<span style='color: #BB0000;'>2.46</span> 1.80<span style='color: #555555;'>e</span><span 
style='color: #BB0000;'>- 2</span> ## ## $WLS ## <span style='color: #555555;'># A tibble: 5 x 5</span> ## term estimate std.error statistic p.value ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> (Intercept) 71.1 0.925 76.8 2.24<span style='color: #555555;'>e</span><span style='color: #BB0000;'>-49</span> ## <span style='color: #555555;'>2</span> population 0.000<span style='text-decoration: underline;'>053</span>4 0.000<span style='text-decoration: underline;'>024</span>0 2.22 3.12<span style='color: #555555;'>e</span><span style='color: #BB0000;'>- 2</span> ## <span style='color: #555555;'>3</span> murder -<span style='color: #BB0000;'>0.301</span> 0.035<span style='text-decoration: underline;'>9</span> -<span style='color: #BB0000;'>8.38</span> 9.62<span style='color: #555555;'>e</span><span style='color: #BB0000;'>-11</span> ## <span style='color: #555555;'>4</span> hs_grad 0.045<span style='text-decoration: underline;'>5</span> 0.014<span style='text-decoration: underline;'>5</span> 3.14 3.00<span style='color: #555555;'>e</span><span style='color: #BB0000;'>- 3</span> ## <span style='color: #555555;'>5</span> frost -<span style='color: #BB0000;'>0.006</span><span style='color: #BB0000; text-decoration: underline;'>10</span> 0.002<span style='text-decoration: underline;'>41</span> -<span style='color: #BB0000;'>2.53</span> 1.49<span style='color: #555555;'>e</span><span style='color: #BB0000;'>- 2</span> </CODE></PRE> --- # Comparing the coefficients Computing the ratio of the coefficients between the two models: ```r coef(step3) / coef(weighted_model) ``` ``` ## (Intercept) population murder hs_grad frost ## 0.9990814 0.9394284 0.9963515 1.0229026 
0.9750663 ``` - Since the ratios of the estimated coefficients do not differ substantially from 1, we do not need to consider iteratively reweighted least squares - From page 426 of the textbook: > *If the estimated coefficients differ substantially from the estimated regression coefficients obtained by ordinary least squares, it is usually advisable to iterate the weighted least squares process by using the residuals from the weighted least squares fit to re-estimate the variance or standard deviation function and then obtain revised weights. Often one or two iterations are sufficient to stabilize the estimated regression coefficients. This iteration process is often called iteratively reweighted least squares.*
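
---

# Sketch: one round of iteratively reweighted least squares

- The procedure quoted above can be sketched directly from the objects built in this lab; the code below is an illustrative sketch only (not needed here, since the OLS and WLS coefficients already agree)

```r
library(tidyverse)

# Rebuild the data and the final OLS model from earlier in the lab
state2 <- state.x77 %>%
  as_tibble(rownames = "state") %>%
  rename_with(~str_to_lower(str_replace_all(.x, pattern = " ", replacement = "_")))

step3 <- lm(life_exp ~ population + murder + hs_grad + frost, data = state2)

# Initial weights: regress |OLS residuals| on murder, then take the
# reciprocal of the squared fitted values
resid_murder_model <- lm(abs(resid(step3)) ~ murder, data = state2)
w <- 1 / predict(resid_murder_model)^2
weighted_model <- lm(life_exp ~ population + murder + hs_grad + frost,
                     data = state2, weights = w)

# One IRLS iteration: re-estimate the standard deviation function from
# the WLS residuals, rebuild the weights, and refit
resid_model2 <- lm(abs(resid(weighted_model)) ~ murder, data = state2)
w2 <- 1 / predict(resid_model2)^2
weighted_model2 <- update(weighted_model, weights = w2)

# Ratios near 1 indicate the coefficients have stabilized, so no
# further iterations would be required
coef(weighted_model) / coef(weighted_model2)
```

- In practice you would repeat the last block until the coefficient ratios stop changing, which the textbook notes usually takes only one or two iterations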