4 Lab 7 — November 10
4.1 Packages
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(broom)
library(leaps)
theme_set(theme_bw())
readr is another package from the tidyverse for reading in various types of data. Many of its functions work similarly to their base-R equivalents but are a bit more robust and convenient, and they read data in as a tibble.
We will be using the leaps package to aid us in computing all possible models from a set of predictors and obtaining the corresponding Mallows' \(C_{p}\) values for all of these models. The leaps package can also be used for best subset selection, forward selection, and backward selection. In this lab, while we will be covering an example of backward selection, we will be doing it manually to thoroughly illustrate the process.
4.2 Weighted linear regression (via lm())
Last time, we saw how to compute the weighted least squares estimates for \(\beta\) using matrices. Now, let's take a look at how to do it via the lm() function. The data we will use for this example is the data from Table 2.1 (page 51).
I entered the data manually myself, so let's double-check that my values are correct. The resulting regression equation should be \(\widehat{Y}_{i} = 1.426 + 0.316X_{i}\) (page 51).
table2.1 <- read.table("./data/table2.1_DS.txt", header=TRUE) %>%
as_tibble()
lm(Y ~ X, data=table2.1) %>%
coef()
## (Intercept) X
## 1.4256449 0.3157857
Looks good!
4.2.1 Problem (i)
Suppose we are told that the 16th observation has variance \(16\sigma^{2}\) rather than \(\sigma^{2}\). Obtain the parameter estimates using weighted least squares.
4.2.2 Solution
If the 16th observation has variance \(16\sigma^{2}\), then
\[\textbf{Var}(\varepsilon) \,=\, \sigma^{2}V \,=\, \sigma^{2}\begin{bmatrix}1\\ &1\\ & &\ddots\\ & & &16\\ & & & &\ddots\\ & & & & &1\\ & & & & & &1 \end{bmatrix}\]
The weight matrix, \(V^{-1}\), is calculated as the inverse of \(V\). We know that the inverse of a diagonal matrix is the same matrix but with each diagonal element replaced by its reciprocal. Thus we have
\[V^{-1} \,=\, \begin{bmatrix}1\\ &1\\ & &\ddots\\ & & &1/16\\ & & & &\ddots\\ & & & & &1\\ & & & & & &1 \end{bmatrix}\]
The weights we will supply to lm() will be a vector of 15 ones, followed by the value \(1/16\), followed by 7 ones. Instead of typing the number 1 fifteen times, we can use the rep() function, which repeats a value \(n\) times.
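Putting this together (the resulting vector is named w1, matching the checks below):
w1 <- c(rep(1, 15), 1/16, rep(1, 7))
w1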
## [1] 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
## [11] 1.0000 1.0000 1.0000 1.0000 1.0000 0.0625 1.0000 1.0000 1.0000 1.0000
## [21] 1.0000 1.0000 1.0000
# Check that it has the same length as the data (n=23)
length(w1)
## [1] 23
# Check that the value at position 16 is correct
w1[16]
## [1] 0.0625
Now we can proceed with building the model.
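We supply the weights through lm()'s weights argument:
lm(Y ~ X, data=table2.1, weights=w1) %>%
summary()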
##
## Call:
## lm(formula = Y ~ X, data = table2.1, weights = w1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.18059 -0.51519 -0.07287 0.25786 2.39631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4421 0.5112 2.821 0.0102 *
## X 0.3077 0.1155 2.663 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8485 on 21 degrees of freedom
## Multiple R-squared: 0.2525, Adjusted R-squared: 0.2169
## F-statistic: 7.094 on 1 and 21 DF, p-value: 0.01454
The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 1.4421 \,+\, 0.3077X_{i}\]
4.2.3 Problem (ii)
Suppose we are now told that the 16th observation has variance \(25\sigma^{2}\) rather than \(\sigma^{2}\). Obtain the parameter estimates using weighted least squares.
4.2.4 Solution
We repeat the process from before. Since the variance for the 16th observation has been scaled by a factor of 25, the weight will be \(1/25\).
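The new weight vector is constructed the same way:
w2 <- c(rep(1, 15), 1/25, rep(1, 7))
w2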
## [1] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## [16] 0.04 1.00 1.00 1.00 1.00 1.00 1.00 1.00
# Check that it has the same length as the data (n=23)
length(w2)
## [1] 23
# Check that the value at position 16 is correct
w2[16]
## [1] 0.04
Building a second model, we again supply the weights through the weights argument:
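lm(Y ~ X, data=table2.1, weights=w2) %>%
summary()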
##
## Call:
## lm(formula = Y ~ X, data = table2.1, weights = w2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.17998 -0.51499 -0.07274 0.25852 2.39727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4425 0.5112 2.822 0.0102 *
## X 0.3075 0.1155 2.661 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8483 on 21 degrees of freedom
## Multiple R-squared: 0.2522, Adjusted R-squared: 0.2166
## F-statistic: 7.082 on 1 and 21 DF, p-value: 0.01461
The equation of the fitted line is:
\[\widehat{Y}_{i} \,=\, 1.4425 \,+\, 0.3075X_{i}\]
We can see that in changing our weight for the 16th observation from \(1/16\) to \(1/25\):
- The value of \(\widehat{\beta}_{0}\) has increased (not by much)
- The value of \(\widehat{\beta}_{1}\) has decreased (not by much)
The differences are more apparent when adjusting the weight of the 23rd observation of this data set. Hint: make a scatterplot!
First, let's add a column of indices to the data.
table2.1 <- table2.1 %>%
mutate(obs = 1:n(), .before = everything())
table2.1
## # A tibble: 23 x 3
## obs Y X
## <int> <dbl> <dbl>
## 1 1 2.3 1.3
## 2 2 1.8 1.3
## 3 3 2.8 2
## 4 4 1.5 2
## 5 5 2.2 2.7
## 6 6 3.8 3.3
## 7 7 1.8 3.3
## 8 8 3.7 3.7
## 9 9 1.7 3.7
## 10 10 2.8 4
## # ... with 13 more rows
Then we can make a scatterplot and label each point with its observation number.
ggplot(table2.1, aes(x=X, y=Y, label=obs))+
geom_point(alpha=0.4)+
geom_text(hjust="outward", vjust="outward")
Observation 23 definitely stands out from the rest of the data!
4.3 All possible models
To be consistent with the textbook, let us define \(p\) as the total number of parameters in a model, including the intercept. Suppose we are interested in plotting Mallows' \(C_{p}\) against \(p\) for all possible models.
(Alternatively, we could consider plotting \(C_{p}\) against \(p+1\) where \(p\) is the number of non-intercept parameters).
Returning to the rock data set, an outline of our plan is as follows:
1. To compute all the possible models in a single call, we employ leaps::regsubsets(). In the formula argument (x), we specify the formula area ~ .. This fixes area as the response variable and allows any other variable appearing in the data set to be added to the candidate model (this is so that we don't need to manually type the names of the variables in our data set). We also set nbest=3 to force computation of all models when \(p < 4\).
2. We then pass the result through a custom function that is derived from broom::tidy.regsubsets() to obtain the required diagnostic information (including Mallows' \(C_{p}\)) as a tidy tibble. We need a custom function since we also want the \(SSE_{p}\) of each model in order to calculate the \(MSE_{p}\) (or equivalently, the \(S^{2}_{p}\)).
3. We then create a column, \(p\), that counts the number of parameters included in each model (including the intercept). The code for this is a bit more involved, but it essentially sums the number of TRUEs appearing in each row (i.e. a horizontal sum rather than the usual vertical sum).
4. We then compute the mean square error, \(S^{2}_{p}\), for each model.
5. We then drop the columns that we won't be using.
6. Finally, we can plot our results.
4.3.1 A custom tidy function
We can access the source code used to tidy objects of class regsubsets by calling broom:::tidy.regsubsets (three colons).
broom:::tidy.regsubsets
## function (x, ...)
## {
## s <- summary(x)
## inclusions <- as_tibble(s$which)
## metrics <- with(s, tibble(r.squared = rsq, adj.r.squared = adjr2,
## BIC = bic, mallows_cp = cp))
## bind_cols(inclusions, metrics)
## }
## <bytecode: 0x00000000311a77b8>
## <environment: namespace:broom>
Consulting the documentation of ?leaps::regsubsets(), we can create a custom function that also includes the SSE of each model.
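Below is one way to write such a function: it mirrors the broom source shown above, but additionally keeps the rss component returned by summary() (the SSE of each model) as a column named SSE_p.
tidy.regsubsets2 <- function(x, ...) {
  s <- summary(x)
  inclusions <- as_tibble(s$which)
  metrics <- with(s, tibble(
    r.squared = rsq,
    adj.r.squared = adjr2,
    BIC = bic,
    mallows_cp = cp,
    SSE_p = rss  # the extra piece: the SSE of each model
  ))
  bind_cols(inclusions, metrics)
}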
4.3.2 Getting the results
Recall that
\[\widehat{\sigma}^{2}_{p} \,=\, S^{2}_{p} \,=\, \frac{SSE_{p}}{n-p}\]
where \(n\) is the number of observations in your data, and \(p\) is the number of parameters including the intercept.
n <- nrow(rock)
all_models <- regsubsets(area ~ ., data=rock, method="exhaustive", nbest=3) %>%
tidy.regsubsets2() %>%
rowwise() %>%
mutate(p = sum(c_across(where(is.logical)))) %>%
ungroup() %>%
mutate(s.squared_p = SSE_p / (n - p)) %>%
select(-c(r.squared, BIC, SSE_p))
all_models
## # A tibble: 7 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE TRUE FALSE FALSE 0.669 20.8 2 2380718.
## 2 TRUE FALSE FALSE TRUE 0.139 125. 2 6201808.
## 3 TRUE FALSE TRUE FALSE 0.0122 150. 2 7115420.
## 4 TRUE TRUE FALSE TRUE 0.764 3.20 3 1696625.
## 5 TRUE TRUE TRUE FALSE 0.701 15.4 3 2152972.
## 6 TRUE FALSE TRUE TRUE 0.122 126. 3 6323336.
## 7 TRUE TRUE TRUE TRUE 0.765 4 4 1689242.
In choosing a model, we want one that has a high adjusted \(R^{2}\), \(C_{p} \approx p\), a low \(S^{2}_{p}\), and as few predictors as possible. From the results above, models 4 and 7 are quite similar, with model 7 (the full model) being only slightly better on adjusted \(R^{2}\) and \(S^{2}_{p}\).
4.3.3 Obtaining the values for the intercept-only model
We first fit the intercept-only model.
rock_intercept_lm <- lm(area ~ 1, data=rock)
The formula for Mallows' \(C_{p}\) is given on page 332 of the textbook:
\[C_{p} \,=\, \frac{SSE_{p}}{S^{2}} \,-\, (n - 2p)\]
where \(S^{2}\) is the estimate of the error term variance for the maximal model \((p=4)\). Note that the value for \(n\) has already been initialised as a variable in our workspace. We can extract the value of \(S^{2}_{4}\) from our previous results using:
s.squared_4 <- all_models %>%
slice_tail(n=1) %>%
pull(s.squared_p)
We have the relationship that
\[S^{2}_{1} \,=\, MSE_{1} \,=\, \frac{SSE_{1}}{\text{df}_{1}}\]
Rearranging gives
\[SSE_{1} \,=\, \text{df}_{1} \cdot S^{2}_{1}\]
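As a quick check of this identity, deviance() returns the residual sum of squares of an lm fit, so \(SSE_{1}\) can also be pulled out directly:
# both of these return the same value, SSE_1
deviance(rock_intercept_lm)
summary(rock_intercept_lm)$df[2] * summary(rock_intercept_lm)$sigma^2  # df_1 * S^2_1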
Mimicking the structure of all_models, the corresponding values for our intercept-only model are:
intercept_lm_results <- tibble(
`(Intercept)` = TRUE,
peri = FALSE,
shape = FALSE,
perm = FALSE,
adj.r.squared = summary(rock_intercept_lm)$adj.r.squared,
s.squared_p = summary(rock_intercept_lm)$sigma^2,
p = summary(rock_intercept_lm)$df[1],
df = summary(rock_intercept_lm)$df[2],
mallows_cp = (s.squared_p * df) / s.squared_4 - (n - 2 * p)
) %>%
select(where(is.logical), adj.r.squared, mallows_cp, p, s.squared_p)
intercept_lm_results
## # A tibble: 1 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE FALSE FALSE FALSE 0 154. 1 7203045.
Binding this to the main output:
all_models <- bind_rows(intercept_lm_results, all_models)
all_models
## # A tibble: 8 x 8
## `(Intercept)` peri shape perm adj.r.squared mallows_cp p s.squared_p
## <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl>
## 1 TRUE FALSE FALSE FALSE 0 154. 1 7203045.
## 2 TRUE TRUE FALSE FALSE 0.669 20.8 2 2380718.
## 3 TRUE FALSE FALSE TRUE 0.139 125. 2 6201808.
## 4 TRUE FALSE TRUE FALSE 0.0122 150. 2 7115420.
## 5 TRUE TRUE FALSE TRUE 0.764 3.20 3 1696625.
## 6 TRUE TRUE TRUE FALSE 0.701 15.4 3 2152972.
## 7 TRUE FALSE TRUE TRUE 0.122 126. 3 6323336.
## 8 TRUE TRUE TRUE TRUE 0.765 4 4 1689242.
4.3.4 Plotting the results
ggplot(all_models, aes(x=p, y=mallows_cp))+
geom_point()+
labs(caption="p is the number of parameter estimates including the intercept")
4.4 Stepwise selection
Let's revisit the penguins data from palmerpenguins, specifically the subset of Adelie penguins from Biscoe Island.
adeliebiscoe <- palmerpenguins::penguins %>%
drop_na() %>%
filter(species == "Adelie", island == "Biscoe") %>%
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g,
) %>%
select(body_mass, bill_length, bill_depth, flipper_length)
For the following, we will use:
\[\alpha_{\text{entry}} = 0.05, \quad \alpha_{\text{exit}} = 0.05\]
4.4.1 Building the base model
We are interested in performing stepwise selection using body_mass as the response. As recommended in the lecture notes, we commence the model-building process by including the predictor that is most correlated with body_mass.
cor(adeliebiscoe)
## body_mass bill_length bill_depth flipper_length
## body_mass 1.0000000 0.6477493 0.6516595 0.5261972
## bill_length 0.6477493 1.0000000 0.4681935 0.3366548
## bill_depth 0.6516595 0.4681935 1.0000000 0.2727782
## flipper_length 0.5261972 0.3366548 0.2727782 1.0000000
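If we want R to rank the candidates for us, a small convenience snippet (our own addition) sorts the predictors by absolute correlation with the response:
# correlations of each predictor with body_mass, excluding body_mass itself
cors <- cor(adeliebiscoe)[-1, "body_mass"]
sort(abs(cors), decreasing=TRUE)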
bill_depth is the most correlated with body_mass (remember to look at the absolute value of the correlation).
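We fit this base model:
step1 <- lm(body_mass ~ bill_depth, data=adeliebiscoe)
summary(step1)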
##
## Call:
## lm(formula = body_mass ~ bill_depth, data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -846.03 -227.10 -27.46 275.68 897.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1201.76 883.92 -1.360 0.181
## bill_depth 267.35 48.02 5.568 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 374.3 on 42 degrees of freedom
## Multiple R-squared: 0.4247, Adjusted R-squared: 0.411
## F-statistic: 31 on 1 and 42 DF, p-value: 1.658e-06
From the summary output, the \(t\)-test of bill_depth is significant, so we can continue considering predictors to add to the model.
4.4.2 Adding a predictor, round 1
Note that we can perform the partial \(F\)-tests of single predictor additions using the usual method of building the candidate model and then calling anova() on the reduced and full models.
step2a <- lm(body_mass ~ bill_depth + bill_length, data=adeliebiscoe)
step2b <- lm(body_mass ~ bill_depth + flipper_length, data=adeliebiscoe)
anova(step1, step2a)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_depth
## Model 2: body_mass ~ bill_depth + bill_length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 5885163
## 2 41 4347044 1 1538118 14.507 0.0004594 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step1, step2b)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_depth
## Model 2: body_mass ~ bill_depth + flipper_length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 5885163
## 2 41 4543427 1 1341736 12.108 0.001205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Building each candidate model by hand becomes tedious when there are many predictors to consider at each step. Instead, we can perform the partial \(F\)-tests of single predictor additions using the add1() function.
add1(step1, ~ . + bill_length + flipper_length, test="F")
## Single term additions
##
## Model:
## body_mass ~ bill_depth
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 5885163 523.37
## bill_length 1 1538118 4347044 512.04 14.507 0.0004594 ***
## flipper_length 1 1341736 4543427 513.98 12.108 0.0012047 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first argument to add1() is the base model to which we are considering adding predictors. The second argument is the scope, where we list all predictors (using formula syntax) that are eligible to be added to the model. Notice that the results of the partial \(F\)-tests are identical to the previous results, where we had constructed the models individually (step2a and step2b).
The predictor that gets added to our model is the one associated with the highest \(F\)-value. Since each of these \(F\)-values has the same numerator and denominator degrees of freedom, this is equivalent to adding the predictor whose partial \(F\)-test has the smallest \(p\)-value (so long as it is less than \(\alpha_{\text{entry}}\)). Since the \(F\)-test for adding bill_length to the base model yields the smallest \(p\)-value, and that \(p\)-value is less than \(\alpha_{\text{entry}}\), bill_length is added to the base model.
step2 <- update(step1, . ~ . + bill_length)
4.4.3 Dropping a predictor, round 1
Now we check whether there are any predictors that can be dropped. As before, we could construct multiple models that differ by single term deletions and perform the partial \(F\)-tests by calling anova() on the reduced and full models. We can also use the equivalent of add1() for single term deletions: drop1().
drop1(step2, ~ ., test="F")
## Single term deletions
##
## Model:
## body_mass ~ bill_depth + bill_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 4347044 512.04
## bill_depth 1 1590092 5937136 523.75 14.997 0.0003796 ***
## bill_length 1 1538118 5885163 523.37 14.507 0.0004594 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first argument to drop1() is the model from which we are considering dropping predictors. The second argument is the scope (similar to add1()), where we list all predictors (using formula syntax) that are eligible to be dropped from the model. Here, all predictors currently in the model under consideration are eligible to be dropped.
Recall that the hypotheses associated with dropping a single predictor are:
\[H_{0}: \beta_{1} = 0, \quad H_{A}: \beta_{1} \neq 0\] \[H_{0}: \beta_{2} = 0, \quad H_{A}: \beta_{2} \neq 0\]
For a predictor to be dropped from the model, we need to fail to reject the null hypothesis. So we are looking for small \(F\)-values, or equivalently, large \(p\)-values that are greater than \(\alpha_{\text{exit}}\). Since we do not have any here, we do not drop any of the predictors from the current model.
4.4.4 Adding a predictor, round 2
At this stage, we only have one more variable left in our data set that can be added to the model.
add1(step2, ~ . + flipper_length, test="F")
## Single term additions
##
## Model:
## body_mass ~ bill_depth + bill_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 4347044 512.04
## flipper_length 1 775322 3571722 505.39 8.6829 0.005336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the \(p\)-value for this \(F\)-test is less than \(\alpha_{\text{entry}}\), we add flipper_length to the model.
step3 <- update(step2, . ~ . + flipper_length)
4.4.5 Dropping a predictor, round 2
drop1(step3, ~ ., test="F")
## Single term deletions
##
## Model:
## body_mass ~ bill_depth + bill_length + flipper_length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3571722 505.39
## bill_depth 1 1270094 4841816 516.78 14.2239 0.0005257 ***
## bill_length 1 971705 4543427 513.98 10.8822 0.0020456 **
## flipper_length 1 775322 4347044 512.04 8.6829 0.0053359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, predictors that should be considered for dropping are those that have a \(p\)-value greater than \(\alpha_{\text{exit}}\). Since we do not have any, we do not drop any of the predictors in our model.
4.4.6 The final model
summary(step3)
##
## Call:
## lm(formula = body_mass ~ bill_depth + bill_length + flipper_length,
## data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -836.50 -186.29 19.01 183.14 486.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6122.000 1341.697 -4.563 4.71e-05 ***
## bill_depth 165.196 43.802 3.771 0.000526 ***
## bill_length 70.743 21.445 3.299 0.002046 **
## flipper_length 21.397 7.262 2.947 0.005336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 298.8 on 40 degrees of freedom
## Multiple R-squared: 0.6508, Adjusted R-squared: 0.6246
## F-statistic: 24.85 on 3 and 40 DF, p-value: 3.048e-09
The final model is:
\[\widehat{Y}_{i} \,=\, -6122.000 \,+\, 165.196X_{1,\,i} + 70.743X_{2,\,i} \,+\, 21.397X_{3,\,i}\]
Note that our adjusted R-squared value is 0.6246, which is not terrible, but not great either. Recall, however, that in the last lab, removing the intercept from the model resulted in a jump in the adjusted R-squared from 0.6246 to 0.9905 (keeping in mind that R-squared is computed differently for models without an intercept, so the two values are not directly comparable).
4.5 Backward selection
In backward selection, we start with the full model and, at each step, consider dropping predictors. As we saw with the Adelie Biscoe data, when we performed the \(F\)-tests for single term deletions on the model containing all predictors, none were eligible to be dropped. Therefore, for this example, let us consider another data set: the Scooby Doo data set, which has an abundance of variables.
scoobydoo <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-13/scoobydoo.csv")
names(scoobydoo)
## [1] "index" "series_name"
## [3] "network" "season"
## [5] "title" "imdb"
## [7] "engagement" "date_aired"
## [9] "run_time" "format"
## [11] "monster_name" "monster_gender"
## [13] "monster_type" "monster_subtype"
## [15] "monster_species" "monster_real"
## [17] "monster_amount" "caught_fred"
## [19] "caught_daphnie" "caught_velma"
## [21] "caught_shaggy" "caught_scooby"
## [23] "captured_fred" "captured_daphnie"
## [25] "captured_velma" "captured_shaggy"
## [27] "captured_scooby" "unmask_fred"
## [29] "unmask_daphnie" "unmask_velma"
## [31] "unmask_shaggy" "unmask_scooby"
## [33] "snack_fred" "snack_daphnie"
## [35] "snack_velma" "snack_shaggy"
## [37] "snack_scooby" "unmask_other"
## [39] "caught_other" "caught_not"
## [41] "trap_work_first" "setting_terrain"
## [43] "setting_country_state" "suspects_amount"
## [45] "non_suspect" "arrested"
## [47] "culprit_name" "culprit_gender"
## [49] "culprit_amount" "motive"
## [51] "if_it_wasnt_for" "and_that"
## [53] "door_gag" "number_of_snacks"
## [55] "split_up" "another_mystery"
## [57] "set_a_trap" "jeepers"
## [59] "jinkies" "my_glasses"
## [61] "just_about_wrapped_up" "zoinks"
## [63] "groovy" "scooby_doo_where_are_you"
## [65] "rooby_rooby_roo" "batman"
## [67] "scooby_dum" "scrappy_doo"
## [69] "hex_girls" "blue_falcon"
## [71] "fred_va" "daphnie_va"
## [73] "velma_va" "shaggy_va"
## [75] "scooby_va"
4.5.1 Data prep
For this example, we will only keep a subset of the variables for the sake of brevity. In addition, some of the variables that should be of type numeric were read in as strings due to the presence of the string "NULL" (note that the string "NULL" is not the same as NULL, which in turn is not the same as NA). For example:
scoobydoo %>%
count(jinkies)
## # A tibble: 14 x 2
## jinkies n
## <chr> <int>
## 1 0 176
## 2 1 92
## 3 10 1
## 4 11 1
## 5 13 1
## 6 2 53
## 7 3 27
## 8 4 11
## 9 5 5
## 10 6 10
## 11 7 4
## 12 8 3
## 13 9 1
## 14 NULL 218
Therefore, we want to remove all episodes that contain "NULL" values in the selected columns and coerce the remaining values to numeric.
scoobydoo <- scoobydoo %>%
select(
imdb, engagement, run_time, monster_amount, suspects_amount, culprit_amount,
jeepers, jinkies, my_glasses, zoinks
) %>%
filter(across(where(is.character), ~ . != "NULL")) %>%
mutate(across(where(is.character), as.numeric))
In the line containing filter(), we filter over all character-type columns to retain only those episodes that do not contain any "NULL" values. Specifying the filtering condition this way saves typing; otherwise, we would need to spell out every variable of interest:
filter(imdb != "NULL", jeepers != "NULL", jinkies != "NULL", etc.)
Note that the portion ~ . != "NULL" is a lambda (anonymous) function, not a mathematical formula, and is the short-hand equivalent of:
function(x) {
x != "NULL"
}
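As an aside, newer versions of dplyr favour if_all() over across() inside filter(); the filter line above could equivalently be written as:
filter(if_all(where(is.character), ~ . != "NULL"))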
Previewing our data:
scoobydoo
## # A tibble: 369 x 10
## imdb engagement run_time monster_amount suspects_amount culprit_amount jeepers jinkies my_glasses zoinks
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.1 556 21 1 2 1 0 0 1 1
## 2 8.1 479 22 1 2 1 0 0 0 3
## 3 8 455 21 1 0 1 0 0 0 1
## 4 7.8 426 21 1 2 1 0 0 0 2
## 5 7.5 391 21 1 1 1 0 0 1 0
## 6 8.4 384 21 1 2 1 1 0 0 2
## 7 7.6 358 21 1 1 1 0 0 0 1
## 8 8.2 358 21 1 2 1 0 0 1 0
## 9 8.1 371 21 1 1 1 0 0 0 0
## 10 8 346 21 1 1 1 0 0 0 0
## # ... with 359 more rows
Now we are ready to start building models. We are interested in creating a model that predicts the IMDB rating. As we are performing backward selection, we start with the model containing all predictors. For the following example, we will use \(\alpha_{\text{exit}} = 0.05\).
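Fitting the full model:
back1 <- lm(imdb ~ ., data=scoobydoo)
summary(back1)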
##
## Call:
## lm(formula = imdb ~ ., data = scoobydoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84731 -0.30812 -0.02295 0.33015 2.48946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.221e+00 7.074e-02 116.212 < 2e-16 ***
## engagement -6.610e-06 5.797e-06 -1.140 0.254938
## run_time -1.040e-02 2.293e-03 -4.536 7.84e-06 ***
## monster_amount -4.792e-02 2.054e-02 -2.333 0.020207 *
## suspects_amount -6.519e-02 1.489e-02 -4.377 1.58e-05 ***
## culprit_amount 1.304e-02 2.813e-02 0.464 0.643254
## jeepers -2.581e-02 2.384e-02 -1.083 0.279719
## jinkies -5.536e-02 1.573e-02 -3.519 0.000488 ***
## my_glasses -1.156e-01 9.429e-02 -1.226 0.221045
## zoinks 8.496e-03 1.167e-02 0.728 0.467106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5547 on 359 degrees of freedom
## Multiple R-squared: 0.3503, Adjusted R-squared: 0.334
## F-statistic: 21.51 on 9 and 359 DF, p-value: < 2.2e-16
4.5.2 Dropping a predictor, round 1
drop1(back1, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## culprit_amount + jeepers + jinkies + my_glasses + zoinks
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.48 -425.00
## engagement 1 0.4001 110.88 -425.66 1.3002 0.2549380
## run_time 1 6.3308 116.81 -406.43 20.5713 7.843e-06 ***
## monster_amount 1 1.6748 112.16 -421.44 5.4422 0.0202074 *
## suspects_amount 1 5.8966 116.38 -407.81 19.1605 1.578e-05 ***
## culprit_amount 1 0.0661 110.55 -426.77 0.2149 0.6432544
## jeepers 1 0.3607 110.84 -425.79 1.1720 0.2797190
## jinkies 1 3.8118 114.29 -414.48 12.3861 0.0004882 ***
## my_glasses 1 0.4625 110.94 -425.45 1.5028 0.2210448
## zoinks 1 0.1631 110.64 -426.45 0.5299 0.4671058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Remember that when we are dropping predictors, we want to fail to reject the null hypothesis in the associated test, so we are looking for \(p\)-values greater than \(\alpha_{\text{exit}} = 0.05\). From the output above, many predictors are eligible to be dropped. We select the one with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\): culprit_amount, with a \(p\)-value of 0.643.
back2 <- update(back1, . ~ . - culprit_amount)
4.5.3 Dropping a predictor, round 2
drop1(back2, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jeepers + jinkies + my_glasses + zoinks
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.55 -426.77
## engagement 1 0.4102 110.96 -427.41 1.3359 0.24853
## run_time 1 6.2755 116.82 -408.40 20.4364 8.377e-06 ***
## monster_amount 1 1.6124 112.16 -423.43 5.2509 0.02251 *
## suspects_amount 1 5.8491 116.40 -409.75 19.0476 1.668e-05 ***
## jeepers 1 0.3504 110.90 -427.61 1.1409 0.28617
## jinkies 1 3.8516 114.40 -416.14 12.5429 0.00045 ***
## my_glasses 1 0.4734 111.02 -427.20 1.5416 0.21518
## zoinks 1 0.1479 110.69 -428.28 0.4816 0.48816
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is zoinks.
back3 <- update(back2, . ~ . - zoinks)
4.5.4 Dropping a predictor, round 3
drop1(back3, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jeepers + jinkies + my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.69 -428.28
## engagement 1 0.3912 111.09 -428.98 1.2758 0.2594306
## run_time 1 6.2058 116.90 -410.15 20.2383 9.231e-06 ***
## monster_amount 1 1.5627 112.26 -425.11 5.0961 0.0245764 *
## suspects_amount 1 5.8494 116.55 -411.28 19.0762 1.643e-05 ***
## jeepers 1 0.2204 110.92 -429.55 0.7188 0.3970933
## jinkies 1 3.7181 114.41 -418.09 12.1255 0.0005584 ***
## my_glasses 1 0.4628 111.16 -428.74 1.5093 0.2200451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is jeepers.
back4 <- update(back3, . ~ . - jeepers)
4.5.5 Dropping a predictor, round 4
drop1(back4, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ engagement + run_time + monster_amount + suspects_amount +
## jinkies + my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 110.92 -429.55
## engagement 1 0.3585 111.27 -430.36 1.1702 0.2800874
## run_time 1 6.6587 117.57 -410.03 21.7322 4.414e-06 ***
## monster_amount 1 1.5324 112.45 -426.48 5.0015 0.0259342 *
## suspects_amount 1 5.8929 116.81 -412.45 19.2328 1.519e-05 ***
## jinkies 1 3.7525 114.67 -419.27 12.2471 0.0005241 ***
## my_glasses 1 0.4743 111.39 -429.97 1.5479 0.2142514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is engagement.
back5 <- update(back4, . ~ . - engagement)
4.5.6 Dropping a predictor, round 5
drop1(back5, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ run_time + monster_amount + suspects_amount + jinkies +
## my_glasses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 111.27 -430.36
## run_time 1 7.2140 118.49 -409.18 23.5337 1.825e-06 ***
## monster_amount 1 2.3265 113.60 -424.72 7.5896 0.0061662 **
## suspects_amount 1 6.0337 117.31 -412.87 19.6833 1.214e-05 ***
## jinkies 1 4.1226 115.40 -418.93 13.4489 0.0002818 ***
## my_glasses 1 0.8231 112.10 -429.64 2.6852 0.1021524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variable with the largest \(p\)-value greater than \(\alpha_{\text{exit}} = 0.05\) is my_glasses.
back6 <- update(back5, . ~ . - my_glasses)
4.5.7 Dropping a predictor, round 6
drop1(back6, ~ ., test="F")
## Single term deletions
##
## Model:
## imdb ~ run_time + monster_amount + suspects_amount + jinkies
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 112.10 -429.64
## run_time 1 8.2939 120.39 -405.30 26.9316 3.51e-07 ***
## monster_amount 1 2.4078 114.50 -423.79 7.8187 0.0054452 **
## suspects_amount 1 5.9769 118.07 -412.47 19.4080 1.39e-05 ***
## jinkies 1 4.3332 116.43 -417.64 14.0708 0.0002048 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The remaining single term deletions have \(p\)-values that are less than \(\alpha_{\text{exit}} = 0.05\). Therefore, we stop here.
4.5.8 The final model
summary(back6)
##
## Call:
## lm(formula = imdb ~ run_time + monster_amount + suspects_amount +
## jinkies, data = scoobydoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89720 -0.30625 -0.03038 0.33928 2.42649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.264472 0.062373 132.501 < 2e-16 ***
## run_time -0.011483 0.002213 -5.190 3.51e-07 ***
## monster_amount -0.052395 0.018738 -2.796 0.005445 **
## suspects_amount -0.064694 0.014685 -4.405 1.39e-05 ***
## jinkies -0.057869 0.015427 -3.751 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5549 on 364 degrees of freedom
## Multiple R-squared: 0.3408, Adjusted R-squared: 0.3336
## F-statistic: 47.04 on 4 and 364 DF, p-value: < 2.2e-16
The final model is:
\[\widehat{Y}_{i} \,=\, 8.2645 \,-\, 0.0114X_{1,\,i} \,-\, 0.0524X_{2,\,i} - 0.0647X_{3,\,i} - 0.0579X_{4,\,i}\]
Interestingly, this says that the number of times "jinkies" is exclaimed in an episode has a (small) negative association with the IMDB rating 😂!
While all the remaining predictors in our model are significant, we have a rather poor adjusted R-squared value of 0.3336; the model explains only about a third of the variation in IMDB ratings (multiple R-squared = 0.3408). In fact, our original model with all the predictors had an adjusted R-squared value of 0.3340, so the dropped predictors were contributing essentially nothing. As such, the final model still has room for improvement (though do keep in mind that only a small subset of variables was included in this example for the sake of brevity).