4 Lab 7 — November 10
4.1 Packages
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(broom)
library(leaps)
theme_set(theme_bw())
readr is another package from the tidyverse for reading in various types of data. Many of its functions work similarly to their base-R equivalents but are a bit more robust and convenient, and they read data in as a tibble.
We will be using the leaps package to aid us in computing all possible models from a set of predictors and obtaining the corresponding Mallows' \(C_{p}\) values for all of these models. The leaps package can also be used for best subset selection, forward selection, and backward selection. In this lab, while we will be covering an example of backward selection, we will be doing it manually to thoroughly illustrate the process.
4.2 Weighted linear regression (via lm())
Last time, we saw how to compute the weighted least squares estimates for \(\beta\) using matrices. Now, let's take a look at how to do it via the lm() function. The data we will use for this example is the data from Table 2.1 (page 51).
I entered the data manually myself, so let's double-check that my values are correct. The resulting regression equation should be \(\widehat{Y}_{i} = 1.426 + 0.316X_{i}\) (page 51).
table2.1 <- read.table("./data/table2.1_DS.txt", header=TRUE) %>%
as_tibble()
lm(Y ~ X, data=table2.1) %>%
coef()
## (Intercept) X
## 1.4256449 0.3157857
Looks good!
4.2.1 Problem (i)
Suppose we are told that the 16th observation has variance \(16\sigma^{2}\) rather than \(\sigma^{2}\). Obtain the parameter estimates using weighted least squares.
4.2.2 Solution
If the 16th observation has variance \(16\sigma^{2}\), then
\[\textbf{Var}(\varepsilon) \,=\, \sigma^{2}V \,=\, \sigma^{2}\begin{bmatrix}1\\ &1\\ & &\ddots\\ & & &16\\ & & & &\ddots\\ & & & & &1\\ & & & & & &1 \end{bmatrix}\]
The weight matrix, \(V^{-1}\), is calculated as the inverse of \(V\). We know that the inverse of a diagonal matrix is the same matrix but with each diagonal element replaced by its reciprocal. Thus we have
\[V^{-1} \,=\, \begin{bmatrix}1\\ &1\\ & &\ddots\\ & & &1/16\\ & & & &\ddots\\ & & & & &1\\ & & & & & &1 \end{bmatrix}\]
The weights we will supply to lm() will be a vector of 15 ones, followed by the value \(1/16\), followed by 7 ones. Instead of typing the number 1 fifteen times, we can use the rep() function, which repeats a value \(n\) times.
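Putting this together (the resulting vector is named w1, matching the checks below):
w1 <- c(rep(1, 15), 1/16, rep(1, 7))
w1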
## [1] 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
## [11] 1.0000 1.0000 1.0000 1.0000 1.0000 0.0625 1.0000 1.0000 1.0000 1.0000
## [21] 1.0000 1.0000 1.0000
# Check that it has the same length as the data (n=23)
length(w1)
## [1] 23
# Check that the value at position 16 is correct
w1[16]
## [1] 0.0625
Now we can proceed with building the model.
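We supply the weights through lm()'s weights argument:
lm(Y ~ X, data=table2.1, weights=w1) %>%
summary()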
##
## Call:
## lm(formula = Y ~ X, data = table2.1, weights = w1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.18059 -0.51519 -0.07287 0.25786 2.39631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4421 0.5112 2.821 0.0102 *
## X 0.3077 0.1155 2.663 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8485 on 21 degrees of freedom
## Multiple R-squared: 0.2525, Adjusted R-squared: 0.2169
## F-statistic: 7.094 on 1 and 21 DF, p-value: 0.01454
The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 1.4421 \,+\, 0.3077X_{i}\]
4.2.3 Problem (ii)
Suppose we are now told that the 16th observation has variance \(25\sigma^{2}\) rather than \(\sigma^{2}\). Obtain the parameter estimates using weighted least squares.
4.2.4 Solution
We repeat the process from before. Since the variance for the 16th observation has been scaled by a factor of 25, the weight will be \(1/25\).
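The new weight vector is constructed the same way:
w2 <- c(rep(1, 15), 1/25, rep(1, 7))
w2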
## [1] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## [16] 0.04 1.00 1.00 1.00 1.00 1.00 1.00 1.00
# Check that it has the same length as the data (n=23)
length(w2)
## [1] 23
# Check that the value at position 16 is correct
w2[16]
## [1] 0.04
Building a second model, we again supply the weights through the weights argument:
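lm(Y ~ X, data=table2.1, weights=w2) %>%
summary()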
##
## Call:
## lm(formula = Y ~ X, data = table2.1, weights = w2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.17998 -0.51499 -0.07274 0.25852 2.39727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4425 0.5112 2.822 0.0102 *
## X 0.3075 0.1155 2.661 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8483 on 21 degrees of freedom
## Multiple R-squared: 0.2522, Adjusted R-squared: 0.2166
## F-statistic: 7.082 on 1 and 21 DF, p-value: 0.01461
The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 1.4425 \,+\, 0.3075X_{i}\]
We can see that in changing our weight for the 16th observation from \(1/16\) to \(1/25\):
- The value of \(\widehat{\beta}_{0}\) has increased (not by much)
- The value of \(\widehat{\beta}_{1}\) has decreased (not by much)
The differences are more apparent when adjusting the weight of the 23rd observation of this data set. Hint: make a scatterplot!
First, let's add a column of indices to the data.
table2.1 <- table2.1 %>%
mutate(obs = 1:n(), .before = everything())
table2.1
## # A tibble: 23 x 3
## obs Y X
## <int> <dbl> <dbl>
## 1 1 2.3 1.3
## 2 2 1.8 1.3
## 3 3 2.8 2
## 4 4 1.5 2
## 5 5 2.2 2.7
## 6 6 3.8 3.3
## 7 7 1.8 3.3
## 8 8 3.7 3.7
## 9 9 1.7 3.7
## 10 10 2.8 4
## # ... with 13 more rows
Then we can make a scatterplot and label each point with its observation number.
ggplot(table2.1, aes(x=X, y=Y, label=obs))+
geom_point(alpha=0.4)+
geom_text(hjust="outward", vjust="outward")
Observation 23 definitely stands out from the rest of the data!
4.3 All possible models
To be consistent with the textbook, let us define \(p\) as the total number of parameters in a model, including the intercept. Suppose we are interested in plotting Mallows' \(C_{p}\) against \(p\) for all possible models.
(Alternatively, we could consider plotting \(C_{p}\) against \(p+1\) where \(p\) is the number of non-intercept parameters).
Returning to the rock data set, an outline of our plan is as follows:
1. To compute all the possible models in a single call, we employ leaps::regsubsets(). In the formula argument (x), we specify the formula area ~ .. This fixes area as the response variable and allows any other variable appearing in the data set to be added to the candidate model (this is so that we don't need to manually type the names of the variables in our data set). We also set nbest=3 to force computation of all models when \(p < 4\).
2. We then pass the result through a custom function that is derived from broom::tidy.regsubsets() to obtain the required diagnostic information (including Mallows' \(C_{p}\)) as a tidy tibble. We need a custom function since we also want the \(SSE_{p}\) of each model in order to calculate the \(MSE_{p}\) (or equivalently, the \(S^{2}_{p}\)).
3. We then create a column, \(p\), that counts the number of parameters included in each model (including the intercept). The code for this is a bit more involved, but it essentially sums the number of TRUEs appearing in each row (i.e. a horizontal sum rather than the usual vertical sum).
4. We then compute the mean square error, \(S^{2}_{p}\), for each model.
5. We then drop the columns that we won't be using.
6. Finally, we can plot our results.
4.3.1 A custom tidy function
We can access the source code used to tidy objects of class regsubsets by calling broom:::tidy.regsubsets (three colons).
broom:::tidy.regsubsets
## function (x, ...)
## {
## s <- summary(x)
## inclusions <- as_tibble(s$which)
## metrics <- with(s, tibble(r.squared = rsq, adj.r.squared = adjr2,
## BIC = bic, mallows_cp = cp))
## bind_cols(inclusions, metrics)
## }
## <bytecode: 0x00000000311a77b8>
## <environment: namespace:broom>
Consulting the documentation of ?leaps::regsubsets(), we can create a custom function that also includes the SSE of each model.
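Below is one way to write such a function: it mirrors the broom source shown above, but additionally keeps the rss component returned by summary() (the SSE of each model) as a column named SSE_p.
tidy.regsubsets2 <- function(x, ...) {
  s <- summary(x)
  inclusions <- as_tibble(s$which)
  metrics <- with(s, tibble(
    r.squared = rsq,
    adj.r.squared = adjr2,
    BIC = bic,
    mallows_cp = cp,
    SSE_p = rss  # the extra piece: the SSE of each model
  ))
  bind_cols(inclusions, metrics)
}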
4.3.2 Getting the results
Recall that
\[\widehat{\sigma}^{2}_{p} \,=\, S^{2}_{p} \,=\, \frac{SSE_{p}}{n-p}\]
where \(n\) is the number of observations in your data, and \(p\) is the number of parameters including the intercept.
n <- nrow(rock)
all_models <- regsubsets(area ~ ., data=rock, method="exhaustive", nbest=3) %>%
tidy.regsubsets2() %>%
rowwise() %>%
mutate(p = sum(c_across(where(is.logical)))) %>%
ungroup() %>%
mutate(s.squared_p = SSE_p / (n - p)) %>%
select(-c(r.squared, BIC, SSE_p))
all_models
## # A tibble: 7 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE TRUE FALSE FALSE 0.669 20.8 2 2380718.
## 2 TRUE FALSE FALSE TRUE 0.139 125. 2 6201808.
## 3 TRUE FALSE TRUE FALSE 0.0122 150. 2 7115420.
## 4 TRUE TRUE FALSE TRUE 0.764 3.20 3 1696625.
## 5 TRUE TRUE TRUE FALSE 0.701 15.4 3 2152972.
## 6 TRUE FALSE TRUE TRUE 0.122 126. 3 6323336.
## 7 TRUE TRUE TRUE TRUE 0.765 4 4 1689242.
In choosing a model, we want one that has a high adjusted \(R^{2}\), \(C_{p} \approx p\), a low \(S^{2}_{p}\), and as few predictors as possible. From the results above, models 4 and 7 are quite similar, with model 7 (the full model) being only slightly better on adjusted \(R^{2}\) and \(S^{2}_{p}\).
4.3.3 Obtaining the values for the intercept-only model
We first fit the intercept-only model.
rock_intercept_lm <- lm(area ~ 1, data=rock)
The formula for Mallows' \(C_{p}\) is given on page 332 of the textbook:
\[C_{p} \,=\, \frac{SSE_{p}}{S^{2}} \,-\, (n - 2p)\]
where \(S^{2}\) is the estimate of the error term variance for the maximal model \((p=4)\). Note that the value for \(n\) has already been initialised as a variable in our workspace. We can extract the value of \(S^{2}_{4}\) from our previous results using:
s.squared_4 <- all_models %>%
slice_tail(n=1) %>%
pull(s.squared_p)
We have the relationship that
\[S^{2}_{1} \,=\, MSE_{1} \,=\, \frac{SSE_{1}}{\text{df}_{1}}\]
Rearranging gives
\[SSE_{1} \,=\, \text{df}_{1} \cdot S^{2}_{1}\]
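As a quick check of this identity, deviance() returns the residual sum of squares of an lm fit, so \(SSE_{1}\) can also be pulled out directly:
# both of these return the same value, SSE_1
deviance(rock_intercept_lm)
summary(rock_intercept_lm)$df[2] * summary(rock_intercept_lm)$sigma^2  # df_1 * S^2_1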
Mimicking the structure of all_models, the corresponding values for our intercept-only model are:
intercept_lm_results <- tibble(
`(Intercept)` = TRUE,
peri = FALSE,
shape = FALSE,
perm = FALSE,
adj.r.squared = summary(rock_intercept_lm)$adj.r.squared,
s.squared_p = summary(rock_intercept_lm)$sigma^2,
p = summary(rock_intercept_lm)$df[1],
df = summary(rock_intercept_lm)$df[2],
mallows_cp = (s.squared_p * df) / s.squared_4 - (n - 2 * p)
) %>%
select(where(is.logical), adj.r.squared, mallows_cp, p, s.squared_p)
intercept_lm_results
## # A tibble: 1 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE FALSE FALSE FALSE 0 154. 1 7203045.
Binding this to the main output:
all_models <- bind_rows(intercept_lm_results, all_models)
all_models
## # A tibble: 8 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE FALSE FALSE FALSE 0 154. 1 7203045.
## 2 TRUE TRUE FALSE FALSE 0.669 20.8 2 2380718.
## 3 TRUE FALSE FALSE TRUE 0.139 125. 2 6201808.
## 4 TRUE FALSE TRUE FALSE 0.0122 150. 2 7115420.
## 5 TRUE TRUE FALSE TRUE 0.764 3.20 3 1696625.
## 6 TRUE TRUE TRUE FALSE 0.701 15.4 3 2152972.
## 7 TRUE FALSE TRUE TRUE 0.122 126. 3 6323336.
## 8 TRUE TRUE TRUE TRUE 0.765 4 4 1689242.
4.3.4 Plotting the results
ggplot(all_models, aes(x=p, y=mallows_cp))+
geom_point()+
labs(caption="p is the number of parameter estimates including the intercept")
4.4 Stepwise selection
Let's revisit the penguins data from palmerpenguins, specifically the subset of Adelie penguins from Biscoe Island.
adeliebiscoe <- palmerpenguins::penguins %>%
drop_na() %>%
filter(species == "Adelie", island == "Biscoe") %>%
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g,
) %>%
select(body_mass, bill_length, bill_depth, flipper_length)
For the following, we will use:
\[\alpha_{\text{entry}} = 0.05, \quad \alpha_{\text{exit}} = 0.05\]
4.4.1 Building the base model
We are interested in performing stepwise selection using body_mass as the response. As recommended in the lecture notes, we commence the model-building process by including the predictor that is most correlated with body_mass.
cor(adeliebiscoe)
## body_mass bill_length bill_depth flipper_length
## body_mass 1.0000000 0.6477493 0.6516595 0.5261972
## bill_length 0.6477493 1.0000000 0.4681935 0.3366548
## bill_depth 0.6516595 0.4681935 1.0000000 0.2727782
## flipper_length 0.5261972 0.3366548 0.2727782 1.0000000
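If we want R to rank the candidates for us, a small convenience snippet (our own addition) sorts the predictors by absolute correlation with the response:
# correlations of each predictor with body_mass, excluding body_mass itself
cors <- cor(adeliebiscoe)[-1, "body_mass"]
sort(abs(cors), decreasing=TRUE)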
bill_depth is the most correlated with body_mass (remember to look at the absolute value of the correlation).
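We fit this base model:
step1 <- lm(body_mass ~ bill_depth, data=adeliebiscoe)
summary(step1)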
##
## Call:
## lm(formula = body_mass ~ bill_depth, data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -846.03 -227.10 -27.46 275.68 897.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1201.76 883.92 -1.360 0.181
## bill_depth 267.35 48.02 5.568 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 374.3 on 42 degrees of freedom
## Multiple R-squared: 0.4247, Adjusted R-squared: 0.411
## F-statistic: 31 on 1 and 42 DF, p-value: 1.658e-06
From the summary output, the \(t\)-test of bill_depth is significant, so we can continue considering predictors to add to the model.
4.4.2 Adding a predictor, round 1
Note that we can perform the partial \(F\)-tests of single predictor additions using the usual method of building the candidate model and then calling anova() on the reduced and full models.
step2a <- lm(body_mass ~ bill_depth + bill_length, data=adeliebiscoe)
step2b <- lm(body_mass ~ bill_depth + flipper_length, data=adeliebiscoe)
anova(step1, step2a)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_depth
## Model 2: body_mass ~ bill_depth + bill_length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 5885163
## 2 41 4347044 1 1538118 14.507 0.0004594 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step1, step2b)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_depth
## Model 2: body_mass ~ bill_depth + flipper_length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 5885163
## 2 41 4543427 1 1341736 12.108 0.001205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Building each candidate model by hand becomes tedious when there are many predictors to consider at each step. Instead, we can perform the partial \(F\)-tests of single predictor additions using the add1() function.
add1(step1, ~ . + bill_length + flipper_length, test="F")
## Single term additions
##
## Model:
## body_mass ~ bill_depth
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 5885163 523.37
## bill_length 1 1538118 4347044 512.04 14.507 0.0004594 ***
## flipper_length 1 1341736 4543427 513.98 12.108 0.0012047 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first argument to add1() is the base model to which we are considering adding predictors. The second argument is the scope, where we list all predictors (using formula syntax) that are eligible to be added to the model. Notice that the results of the partial \(F\)-tests are identical to the previous results, where we had constructed the models individually (step2a and step2b).
The predictor that gets added to our model is the one associated with the highest \(F\)-value. Since each of these \(F\)-values has the same numerator and denominator degrees of freedom, this is equivalent to adding the predictor whose partial \(F\)-test has the smallest \(p\)-value (so long as it is less than \(\alpha_{\text{entry}}\)). Since the \(F\)-test for adding bill_length to the base model yields the smallest \(p\)-value, and that \(p\)-value is less than \(\alpha_{\text{entry}}\), bill_length is added to the base model.
step2 <- update(step1, . ~ . + bill_length)
4.4.3 Dropping a predictor, round 1
Now we check whether there are any predictors that can be dropped. As before, we could construct multiple models that differ by single term deletions and perform the partial \(F\)-tests by calling anova() on the reduced and full models. We can also use the equivalent of add1() for single term deletions: drop1().
drop1(step2, ~ ., test="F")
## Single term deletions
##
## Model:
## body_mass ~ bill_depth + bill_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 4347044 512.04
## bill_depth 1 1590092 5937136 523.75 14.997 0.0003796 ***
## bill_length 1 1538118 5885163 523.37 14.507 0.0004594 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first argument to drop1() is the model from which we are considering dropping predictors. The second argument is the scope (similar to add1()), where we list all predictors (using formula syntax) that are eligible to be dropped from the model. Here, all predictors currently in the model under consideration are eligible to be dropped.
Recall that the hypotheses associated with dropping a single predictor are:
\[H_{0}: \beta_{1} = 0, \quad H_{A}: \beta_{1} \neq 0\] \[H_{0}: \beta_{2} = 0, \quad H_{A}: \beta_{2} \neq 0\]
For a predictor to be dropped from the model, we need to fail to reject the null hypothesis. So we are looking for small \(F\)-values, or equivalently, large \(p\)-values that are greater than \(\alpha_{\text{exit}}\). Since we do not have any here, we do not drop any of the predictors from the current model.
4.4.4 Adding a predictor, round 2
At this stage, we only have one more variable left in our data set that can be added to the model.
add1(step2, ~ . + flipper_length, test="F")
## Single term additions
##
## Model:
## body_mass ~ bill_depth + bill_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 4347044 512.04
## flipper_length 1 775322 3571722 505.39 8.6829 0.005336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the \(p\)-value for this \(F\)-test is less than \(\alpha_{\text{entry}}\), we add flipper_length to the model.
step3 <- update(step2, . ~ . + flipper_length)
4.4.5 Dropping a predictor, round 2
drop1(step3, ~ ., test="F")
## Single term deletions
##
## Model:
## body_mass ~ bill_depth + bill_length + flipper_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3571722 505.39
## bill_depth 1 1270094 4841816 516.78 14.2239 0.0005257 ***
## bill_length 1 971705 4543427 513.98 10.8822 0.0020456 **
## flipper_length 1 775322 4347044 512.04 8.6829 0.0053359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, predictors that should be considered for dropping are those that have a \(p\)-value greater than \(\alpha_{\text{exit}}\). Since we do not have any, we do not drop any of the predictors in our model.
4.4.6 The final model
summary(step3)
##
## Call:
## lm(formula = body_mass ~ bill_depth + bill_length + flipper_length,
## data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -836.50 -186.29 19.01 183.14 486.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6122.000 1341.697 -4.563 4.71e-05 ***
## bill_depth 165.196 43.802 3.771 0.000526 ***
## bill_length 70.743 21.445 3.299 0.002046 **
## flipper_length 21.397 7.262 2.947 0.005336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 298.8 on 40 degrees of freedom
## Multiple R-squared: 0.6508, Adjusted R-squared: 0.6246
## F-statistic: 24.85 on 3 and 40 DF, p-value: 3.048e-09
The final model is:
\[\widehat{Y}_{i} \,=\, -6122.000 \,+\, 165.196X_{1,\,i} + 70.743X_{2,\,i} \,+\, 21.397X_{3,\,i}\]
Note that our adjusted R-squared value is 0.6246, which is not terrible, but not great either. Recall, however, that in the last lab, removing the intercept from the model resulted in a jump in the adjusted R-squared from 0.6246 to 0.9905 (keeping in mind that R-squared is computed differently for models without an intercept, so the two values are not directly comparable).
4.5 Backward selection
In backward selection, we start with the full model and, at each step, consider dropping predictors. As we saw with the Adelie Biscoe data, when we performed the \(F\)-tests for single term deletions on the model containing all predictors, none were eligible to be dropped. Therefore, for this example, let us consider another data set: the Scooby Doo data set, which has an abundance of variables.
scoobydoo <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-13/scoobydoo.csv")
names(scoobydoo)
## [1] "index" "series_name"
## [3] "network" "season"
## [5] "title" "imdb"
## [7] "engagement" "date_aired"
## [9] "run_time" "format"
## [11] "monster_name" "monster_gender"
## [13] "monster_type" "monster_subtype"
## [15] "monster_species" "monster_real"
## [17] "monster_amount" "caught_fred"
## [19] "caught_daphnie" "caught_velma"
## [21] "caught_shaggy" "caught_scooby"
## [23] "captured_fred" "captured_daphnie"
## [25] "captured_velma" "captured_shaggy"
## [27] "captured_scooby" "unmask_fred"
## [29] "unmask_daphnie" "unmask_velma"
## [31] "unmask_shaggy" "unmask_scooby"
## [33] "snack_fred" "snack_daphnie"
## [35] "snack_velma" "snack_shaggy"
## [37] "snack_scooby" "unmask_other"
## [39] "caught_other" "caught_not"
## [41] "trap_work_first" "setting_terrain"
## [43] "setting_country_state" "suspects_amount"
## [45] "non_suspect" "arrested"
## [47] "culprit_name" "culprit_gender"
## [49] "culprit_amount" "motive"
## [51] "if_it_wasnt_for" "and_that"
## [53] "door_gag" "number_of_snacks"
## [55] "split_up" "another_mystery"
## [57] "set_a_trap" "jeepers"
## [59] "jinkies" "my_glasses"
## [61] "just_about_wrapped_up" "zoinks"
## [63] "groovy" "scooby_doo_where_are_you"
## [65] "rooby_rooby_roo" "batman"
## [67] "scooby_dum" "scrappy_doo"
## [69] "hex_girls" "blue_falcon"
## [71] "fred_va" "daphnie_va"
## [73] "velma_va" "shaggy_va"
## [75] "scooby_va"
4.5.1 Data prep
For this example, we will only keep a subset of the variables for the sake of brevity. In addition, some of the variables that should be of type numeric were read in as strings due to the presence of the string "NULL" (note that the string "NULL" is not the same as NULL, which in turn is not the same as NA). For example:
scoobydoo %>%
count(jinkies)
## # A tibble: 14 x 2
## jinkies n
## <chr> <int>
## 1 0 176
## 2 1 92
## 3 10 1
## 4 11 1
## 5 13 1
## 6 2 53
## 7 3 27
## 8 4 11
## 9 5 5
## 10 6 10
## 11 7 4
## 12 8 3
## 13 9 1
## 14 NULL 218
Therefore, we want to remove all episodes that contain "NULL" values in the selected columns and coerce the remaining values to numeric.
scoobydoo <- scoobydoo %>%
select(
imdb, engagement, run_time, monster_amount, suspects_amount, culprit_amount,
jeepers, jinkies, my_glasses, zoinks
) %>%
filter(across(where(is.character), ~ . != "NULL")) %>%
mutate(across(where(is.character), as.numeric))
In the line containing filter(), we filter over all character-type columns to retain only those episodes that do not contain any "NULL" values. Specifying the filtering condition this way saves typing; otherwise, we would need to spell out every variable of interest:
filter(imdb != "NULL", jeepers != "NULL", jinkies != "NULL", etc.)
Note that the portion ~ . != "NULL" is a lambda (anonymous) function, not a mathematical formula, and is the short-hand equivalent of:
function(x) {
x != "NULL"
}
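As an aside, newer versions of dplyr favour if_all() over across() inside filter(); the filter line above could equivalently be written as:
filter(if_all(where(is.character), ~ . != "NULL"))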
Previewing our data:
scoobydoo
## # A tibble: 369 x 10
## imdb engagement run_time monster_amount suspects_amount culprit_amount jeepers jinkies my_glasses zoinks
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.1 556 21 1 2 1 0 0 1 1
## 2 8.1 479 22 1 2 1 0 0 0 3
## 3 8 455 21 1 0 1 0 0 0 1
## 4 7.8 426 21 1 2 1 0 0 0 2
## 5 7.5 391 21 1 1 1 0 0 1 0
## 6 8.4 384 21 1 2 1 1 0 0 2
## 7 7.6 358 21 1 1 1 0 0 0 1
## 8 8.2 358 21 1 2 1 0 0 1 0
## 9 8.1 371 21 1 1 1 0 0 0 0
## 10 8 346 21 1 1 1 0 0 0 0
## # ... with 359 more rows
Now we are ready to start building models. We are interested in creating a model that predicts the IMDB rating. As we are performing backward selection, we start with the model containing all predictors. For the following example, we will use \(\alpha_{\text{exit}} = 0.05\).
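Fitting the full model:
back1 <- lm(imdb ~ ., data=scoobydoo)
summary(back1)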
##
## Call:
## lm(formula = imdb ~ ., data = scoobydoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84731 -0.30812 -0.02295 0.33015 2.48946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.221e+00 7.074e-02 116.212 < 2e-16 ***
## engagement -6.610e-06 5.797e-06 -1.140 0.254938
## run_time -1.040e-02 2.293e-03 -4.536 7.84e-06 ***
## monster_amount -4.792e-02 2.054e-02 -2.333 0.020207 *
## suspects_amount -6.519e-02 1.489e-02 -4.377 1.58e-05 ***
## culprit_amount 1.304e-02 2.813e-02 0.464 0.643254
## jeepers -2.581e-02 2.384e-02 -1.083 0.279719
## jinkies -5.536e-02 1.573e-02 -3.519 0.000488 ***
## my_glasses -1.156e-01 9.429e-02 -1.226 0.221045
## zoinks 8.496e-03 1.167e-02 0.728 0.467106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5547 on 359 degrees of freedom
## Multiple R-squared: 0.3503, Adjusted R-squared: 0.334
## F-statistic: 21.51 on 9 and 359 DF, p-value: < 2.2e-16
4.5.2 Dropping a predictor, round 1
drop1(back1, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## culprit_amount + jeepers + jinkies + my_glasses + zoinks
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.48 -425.00
## engagement 1 0.4001 110.88 -425.66 1.3002 0.2549380
## run_time 1 6.3308 116.81 -406.43 20.5713 7.843e-06 ***
## monster_amount 1 1.6748 112.16 -421.44 5.4422 0.0202074 *
## suspects_amount 1 5.8966 116.38 -407.81 19.1605 1.578e-05 ***
## culprit_amount 1 0.0661 110.55 -426.77 0.2149 0.6432544
## jeepers 1 0.3607 110.84 -425.79 1.1720 0.2797190
## jinkies 1 3.8118 114.29 -414.48 12.3861 0.0004882 ***
## my_glasses 1 0.4625 110.94 -425.45 1.5028 0.2210448
## zoinks 1 0.1631 110.64 -426.45 0.5299 0.4671058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Remember that when we are dropping predictors, we want to fail to reject the null hypothesis in the associated test, so we are looking for \(p\)-values greater than \(\alpha_{\text{exit}} = 0.05\). From the output above, many predictors are eligible to be dropped. We select the one with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\): culprit_amount, with a \(p\)-value of 0.643.
back2 <- update(back1, . ~ . - culprit_amount)
4.5.3 Dropping a predictor, round 2
drop1(back2, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jeepers + jinkies + my_glasses + zoinks
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.55 -426.77
## engagement 1 0.4102 110.96 -427.41 1.3359 0.24853
## run_time 1 6.2755 116.82 -408.40 20.4364 8.377e-06 ***
## monster_amount 1 1.6124 112.16 -423.43 5.2509 0.02251 *
## suspects_amount 1 5.8491 116.40 -409.75 19.0476 1.668e-05 ***
## jeepers 1 0.3504 110.90 -427.61 1.1409 0.28617
## jinkies 1 3.8516 114.40 -416.14 12.5429 0.00045 ***
## my_glasses 1 0.4734 111.02 -427.20 1.5416 0.21518
## zoinks 1 0.1479 110.69 -428.28 0.4816 0.48816
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is zoinks.
back3 <- update(back2, . ~ . - zoinks)
4.5.4 Dropping a predictor, round 3
drop1(back3, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jeepers + jinkies + my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.69 -428.28
## engagement 1 0.3912 111.09 -428.98 1.2758 0.2594306
## run_time 1 6.2058 116.90 -410.15 20.2383 9.231e-06 ***
## monster_amount 1 1.5627 112.26 -425.11 5.0961 0.0245764 *
## suspects_amount 1 5.8494 116.55 -411.28 19.0762 1.643e-05 ***
## jeepers 1 0.2204 110.92 -429.55 0.7188 0.3970933
## jinkies 1 3.7181 114.41 -418.09 12.1255 0.0005584 ***
## my_glasses 1 0.4628 111.16 -428.74 1.5093 0.2200451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is jeepers.
back4 <- update(back3, . ~ . - jeepers)
4.5.5 Dropping a predictor, round 4
drop1(back4, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jinkies + my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.92 -429.55
## engagement 1 0.3585 111.27 -430.36 1.1702 0.2800874
## run_time 1 6.6587 117.57 -410.03 21.7322 4.414e-06 ***
## monster_amount 1 1.5324 112.45 -426.48 5.0015 0.0259342 *
## suspects_amount 1 5.8929 116.81 -412.45 19.2328 1.519e-05 ***
## jinkies 1 3.7525 114.67 -419.27 12.2471 0.0005241 ***
## my_glasses 1 0.4743 111.39 -429.97 1.5479 0.2142514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is engagement.
back5 <- update(back4, . ~ . - engagement)
4.5.6 Dropping a predictor, round 5
drop1(back5, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ run_time + monster_amount + suspects_amount + jinkies +
## my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 111.27 -430.36
## run_time 1 7.2140 118.49 -409.18 23.5337 1.825e-06 ***
## monster_amount 1 2.3265 113.60 -424.72 7.5896 0.0061662 **
## suspects_amount 1 6.0337 117.31 -412.87 19.6833 1.214e-05 ***
## jinkies 1 4.1226 115.40 -418.93 13.4489 0.0002818 ***
## my_glasses 1 0.8231 112.10 -429.64 2.6852 0.1021524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is my_glasses.
back6 <- update(back5, . ~ . - my_glasses)
4.5.7 Dropping a predictor, round 6
drop1(back6, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ run_time + monster_amount + suspects_amount + jinkies
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 112.10 -429.64
## run_time 1 8.2939 120.39 -405.30 26.9316 3.51e-07 ***
## monster_amount 1 2.4078 114.50 -423.79 7.8187 0.0054452 **
## suspects_amount 1 5.9769 118.07 -412.47 19.4080 1.39e-05 ***
## jinkies 1 4.3332 116.43 -417.64 14.0708 0.0002048 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The remaining single term deletions have \(p\)-values that are less than \(\alpha_{\text{exit}} = 0.05\). Therefore, we stop here.
4.5.8 The final model
summary(back6)
##
## Call:
## lm(formula = imdb ~ run_time + monster_amount + suspects_amount +
## jinkies, data = scoobydoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89720 -0.30625 -0.03038 0.33928 2.42649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.264472 0.062373 132.501 < 2e-16 ***
## run_time -0.011483 0.002213 -5.190 3.51e-07 ***
## monster_amount -0.052395 0.018738 -2.796 0.005445 **
## suspects_amount -0.064694 0.014685 -4.405 1.39e-05 ***
## jinkies -0.057869 0.015427 -3.751 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5549 on 364 degrees of freedom
## Multiple R-squared: 0.3408, Adjusted R-squared: 0.3336
## F-statistic: 47.04 on 4 and 364 DF, p-value: < 2.2e-16
The final model is:
\[\widehat{Y}_{i} \,=\, 8.2645 \,-\, 0.0114X_{1,\,i} \,-\, 0.0524X_{2,\,i} - 0.0647X_{3,\,i} - 0.0579X_{4,\,i}\]
Interestingly, this says that the number of times "jinkies" is exclaimed in an episode has a (small) negative association with the IMDB rating 😂!
While all the remaining predictors in our model are significant, we have a rather poor adjusted R-squared value of 0.3336; the model explains only about a third of the variation in IMDB ratings (multiple R-squared = 0.3408). In fact, our original model with all the predictors had an adjusted R-squared value of 0.3340, so the dropped predictors were contributing essentially nothing. As such, the final model still has room for improvement (though do keep in mind that only a small subset of variables was included in this example for the sake of brevity).