3 Lab 5 — October 13

3.1 Packages

The tidyr package is another component package of the tidyverse that contains functions for cleaning up and reshaping data.

The palmerpenguins package will be used for its penguins data set for the multiple linear regression example today.

We will be using the ellipse package for drawing ellipses (confidence regions). This package contains functions for drawing confidence regions associated with the estimates obtained from linear regression. Unlike the code provided in class for drawing confidence regions with the component confidence intervals, which took raw data as input, the ellipse::ellipse.lm() function allows you to pass in your linear model object instead. When using the ellipse package, however, we will need to draw the lines corresponding to the component confidence intervals ourselves (though this is not difficult to do).

3.2 The pipe

The pipe operator, %>%, can be found in the package magrittr, which should have been installed when you installed the dplyr package. When dplyr is loaded (by calling library(dplyr)), the magrittr pipe is also imported.

The pipe works by taking the result on the left and passing it to the first argument of the function on the right. The pipe allows you to write cleaner code by converting something like this:

mutate(select(filter(mydata, var1 > 5), var2, var3), var4 = var2 + sqrt(var3))

into this:

mydata %>%
  filter(var1 > 5) %>%
  select(var2, var3) %>%
  mutate(var4 = var2 + sqrt(var3))

revealing the underlying sequential logic of your workflow. Since the pipe takes the result on the left and passes it to the first argument of the function on the right, it is especially convenient when wrangling data with dplyr: all dplyr data wrangling functions take a data frame as their first argument and, in most cases, return a data frame.

In the example above, the first argument to filter(), select(), and mutate() is a data frame. After evaluation, each function also returns a data frame. Notice that as a result of piping a data frame into each of these functions, we don't need to actually specify where it's going (since it goes to the first argument by default) and we don't need to create extra variables for intermediate steps.
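
For instance, here is a minimal sketch (with a small made-up tibble) showing that piping a data frame into filter() is equivalent to supplying it as the first argument:

library(dplyr)

# A small made-up data set for illustration
mydata <- tibble(var1 = c(3, 6, 9), var2 = c(1, 2, 3), var3 = c(4, 9, 16))

# These two calls are equivalent
filter(mydata, var1 > 5)
mydata %>% filter(var1 > 5)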

3.3 Multiple linear regression — hypothesis testing

3.3.1 The penguins data

Let us load the penguins data set. Some of the variables that we are interested in working with contain missing values. Immediately after loading the data, we can pass it into the drop_na() function (from tidyr) which will drop all rows (observations) that contain missing values.

In addition, let's also rename some of our variables to shorten some of the code that we need to type. This can be done by piping our data into the rename() function and supplying name pairs in the format new_name = old_name.

penguins <- palmerpenguins::penguins %>%
  drop_na() %>%
  rename(
    bill_length = bill_length_mm,
    bill_depth = bill_depth_mm,
    flipper_length = flipper_length_mm,
    body_mass = body_mass_g
  )

penguins
## # A tibble: 333 x 8
##    species island    bill_length bill_depth flipper_length body_mass sex    year
##    <fct>   <fct>           <dbl>      <dbl>          <int>     <int> <fct> <int>
##  1 Adelie  Torgersen        39.1       18.7            181      3750 male   2007
##  2 Adelie  Torgersen        39.5       17.4            186      3800 fema~  2007
##  3 Adelie  Torgersen        40.3       18              195      3250 fema~  2007
##  4 Adelie  Torgersen        36.7       19.3            193      3450 fema~  2007
##  5 Adelie  Torgersen        39.3       20.6            190      3650 male   2007
##  6 Adelie  Torgersen        38.9       17.8            181      3625 fema~  2007
##  7 Adelie  Torgersen        39.2       19.6            195      4675 male   2007
##  8 Adelie  Torgersen        41.1       17.6            182      3200 fema~  2007
##  9 Adelie  Torgersen        38.6       21.2            191      3800 male   2007
## 10 Adelie  Torgersen        34.6       21.1            198      4400 male   2007
## # ... with 323 more rows

3.3.2 Subsetting the data

This data set contains penguins from three species across three islands:

penguins %>%
  distinct(species, island)
## # A tibble: 5 x 2
##   species   island   
##   <fct>     <fct>    
## 1 Adelie    Torgersen
## 2 Adelie    Biscoe   
## 3 Adelie    Dream    
## 4 Gentoo    Biscoe   
## 5 Chinstrap Dream

For the following example, let us focus only on the penguins of species Adelie from the island Biscoe. We can obtain this subset of the penguins data using the following code:

adeliebiscoe <- penguins %>%
  filter(species == "Adelie", island == "Biscoe")

(Note that in the code above we use two equal signs, ==, because we are making a comparison rather than assigning a value.)

3.3.3 Visual check for linear relationship

Suppose we are interested in fitting a model using body mass as the response and bill length, bill depth, and flipper length as the predictors. Before building the model, we should verify that there is a linear relationship between the response and the individual predictors.

ggplot(adeliebiscoe, aes(x=bill_length, y=body_mass))+
  geom_point()
ggplot(adeliebiscoe, aes(x=bill_depth, y=body_mass))+
  geom_point()
ggplot(adeliebiscoe, aes(x=flipper_length, y=body_mass))+
  geom_point()

The plots look okay. Let's go ahead and fit the model.

3.3.4 Fit the model

ad_lm <- lm(body_mass ~ bill_length + bill_depth + flipper_length, data=adeliebiscoe)

summary(ad_lm)
## 
## Call:
## lm(formula = body_mass ~ bill_length + bill_depth + flipper_length, 
##     data = adeliebiscoe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -836.50 -186.29   19.01  183.14  486.90 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6122.000   1341.697  -4.563 4.71e-05 ***
## bill_length       70.743     21.445   3.299 0.002046 ** 
## bill_depth       165.196     43.802   3.771 0.000526 ***
## flipper_length    21.397      7.262   2.947 0.005336 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 298.8 on 40 degrees of freedom
## Multiple R-squared:  0.6508, Adjusted R-squared:  0.6246 
## F-statistic: 24.85 on 3 and 40 DF,  p-value: 3.048e-09

From the above output, the equation of our fitted line is:

\[\widehat{y}_{i} \,=\, -6122 \,+\, 70.743x_{1,\,i} \,+\, 165.196x_{2,\,i} + 21.397x_{3,\,i}\]

where:

  • \(\widehat{y}_{i}\) is the predicted body mass
  • \(x_{1,\,i}\) is the bill length
  • \(x_{2,\,i}\) is the bill depth
  • \(x_{3,\,i}\) is the flipper length

3.3.5 Fitting a model without an intercept

By default, linear models are fitted with an intercept. We can fit a linear model without an intercept by adding a + 0 or a - 1 in the formula specification (see the Details section of ?lm). To make the hypothesis tests that we will do in the next section a bit more interesting, let us remove the intercept from the model that we just fit and overwrite it.

As shown in Lab 1, we can do this by fitting a brand new model using lm():

ad_lm <- lm(body_mass ~ bill_length + bill_depth + flipper_length - 1, data=adeliebiscoe)

or by using update():

ad_lm <- update(ad_lm, formula = . ~ . - 1)

Recall that the above line of code means that (the new) ad_lm is constructed by taking the existing ad_lm and updating its formula. The response variable remains the same (represented by the dot on the left of the tilde), the predictors remain the same (represented by the dot on the right of the tilde), and the - 1 denotes that we do not want an intercept.
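
If you want to confirm which formula the updated model ended up with, you can inspect it directly:

formula(ad_lm)  # should show body_mass ~ bill_length + bill_depth + flipper_length - 1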

Let's take a look at our new model.

summary(ad_lm)
## 
## Call:
## lm(formula = body_mass ~ bill_length + bill_depth + flipper_length - 
##     1, data = adeliebiscoe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -737.69 -296.78   41.82  236.37  769.39 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## bill_length      58.741     25.922   2.266   0.0288 *
## bill_depth      125.323     52.276   2.397   0.0212 *
## flipper_length   -4.635      5.471  -0.847   0.4018  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 363.9 on 41 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9905 
## F-statistic:  1536 on 3 and 41 DF,  p-value: < 2.2e-16

Interestingly, by removing the intercept from the model, our adjusted R-squared value went from 0.6246 to 0.9905 👀. Be careful when comparing these values: for a model without an intercept, R-squared is computed using the uncentred total sum of squares (variation about zero rather than about the mean), so it is not directly comparable to the R-squared of a model with an intercept.

3.3.6 ANOVA with multiple linear regression models

anova(ad_lm)
## Analysis of Variance Table
## 
## Response: body_mass
##                Df    Sum Sq   Mean Sq   F value  Pr(>F)    
## bill_length     1 609528281 609528281 4601.6610 < 2e-16 ***
## bill_depth      1    683988    683988    5.1638 0.02837 *  
## flipper_length  1     95065     95065    0.7177 0.40182    
## Residuals      41   5430791    132458                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For multiple linear regression, the output resulting from passing our linear model to anova() is slightly different from what we saw with simple linear regression. In simple linear regression, wrapping our linear model with anova() produced the "classic" \(F\)-table: the sum of squares for the first row was the SSR, the sum of squares for the second row was the SSE, and there was a single \(F\)-value and corresponding \(p\)-value.

In the above output, we now have three \(F\)-values and three corresponding \(p\)-values! The interpretation of each row is a little different here: the sum of squares column gives the sequential increase in SSR (or, equivalently, decrease in SSE) from adding that row's variable to the model containing only the variables listed above it. As such:

  • The sum of squares on the bill_length row is the increase in SSR (or decrease in SSE) by including only bill_length in the model (because our base model has no intercept)
  • The sum of squares on the bill_depth row is the increase in SSR (or decrease in SSE) by including bill_depth in the model that already includes bill_length
  • The sum of squares on the flipper_length row is the increase in SSR (or decrease in SSE) by including flipper_length in the model that already includes bill_length and bill_depth
  • The sum of squares on the Residuals row is the SSE of the model that includes all of bill_length, bill_depth, and flipper_length

Using only the output above, if we wished to test whether a subset of the parameters differs from zero, we would need to do some extra math. But since we are using R, let's not do math! We can perform the partial \(F\)-tests that we are interested in by constructing reduced models (and this is where the update() function becomes extremely handy).

3.3.7 Hypothesis test: three parameters

Suppose we are interested in testing the hypotheses:

\[H_{0}: \beta_{1} \,=\, \beta_{2} \,=\, \beta_{3} \,=\, 0, \quad H_{A}: \text{At least one non-zero}\]

Under this null hypothesis, the reduced model has no predictors and no intercept; that is, it is the constant \(Y \,=\, 0\) model that passes through the origin.

ad_lm_origin_only <- update(ad_lm, formula = . ~ 0)

The full model is the model that has all the predictors (ad_lm).

To test the above hypotheses, we pass our reduced and full models to anova():

anova(ad_lm_origin_only, ad_lm)
## Analysis of Variance Table
## 
## Model 1: body_mass ~ 1 - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1     44 615738125                                  
## 2     41   5430791  3 610307334 1535.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This produces an \(F\)-value of 1535.8 with 3 numerator degrees of freedom and 41 denominator degrees of freedom (note that this matches the overall \(F\)-statistic reported by summary(ad_lm) above). We don't need to look these values up in a table since a \(p\)-value is also given. Using a significance level of \(\alpha = 0.05\), since the \(p\)-value is less than 0.05, we reject the null hypothesis and conclude that at least one of the parameters is non-zero; as such, we should not consider the constant \((Y \,=\, 0)\) model. It is important to note that the result of this hypothesis test only tells us that at least one parameter is non-zero, but it doesn't tell us which.
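
As a sanity check, we can reproduce this \(F\)-value by hand from the residual sums of squares (RSS) in the table above:

# Partial F-statistic computed from the RSS values in the anova() output
rss_reduced <- 615738125   # RSS of the Y = 0 model
rss_full    <- 5430791     # RSS of the full (no-intercept) model
df_num <- 3                # number of parameters being tested
df_den <- 41               # residual degrees of freedom of the full model

f_val <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
f_val                                          # approximately 1535.8
pf(f_val, df_num, df_den, lower.tail = FALSE)  # the corresponding p-value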

A hypothesis test that has a null hypothesis where all parameters are equal to zero is often known as a test for model usefulness. If we fail to reject this null hypothesis, it suggests that the current model is not a substantial improvement over a constant \((Y \,=\, c)\) model.
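
For the with-intercept model fitted earlier, this usefulness test is exactly what the \(F\)-statistic at the bottom of summary() reports. As a quick sketch (refitting the with-intercept model under a new name, ad_lm_int, purely for illustration), we can reproduce it explicitly:

# Refit the model with an intercept and compare it to the intercept-only model
ad_lm_int <- lm(body_mass ~ bill_length + bill_depth + flipper_length, data=adeliebiscoe)
anova(update(ad_lm_int, formula = . ~ 1), ad_lm_int)  # F should match 24.85 on 3 and 40 df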

3.3.8 Hypothesis test: two parameters

Suppose we are interested in testing the hypotheses:

\[H_{0}: \beta_{2} \,=\, \beta_{3} \,=\, 0, \quad H_{A}: \text{At least one non-zero}\]

The reduced model is the model that contains only bill_length. We can construct this model using update() either by specifying the variables we want:

# Don't forget the no-intercept term!!
ad_lm1 <- update(ad_lm, formula = . ~ bill_length - 1)

or by removing the variables we do not want:

ad_lm1 <- update(ad_lm, formula = . ~ . - bill_depth - flipper_length)

To test the above hypotheses, we pass our reduced and full models to anova():

anova(ad_lm1, ad_lm)
## Analysis of Variance Table
## 
## Model 1: body_mass ~ bill_length - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     43 6209844                              
## 2     41 5430791  2    779053 2.9407 0.06405 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This produces an \(F\)-value of 2.9407 with 2 numerator degrees of freedom and 41 denominator degrees of freedom. Using the usual significance level of \(\alpha = 0.05\), since the \(p\)-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is insufficient evidence to support the claim that at least one of \(\beta_{2}\) or \(\beta_{3}\) is non-zero.

3.3.9 Hypothesis test: one parameter

Suppose we are interested in testing the hypotheses:

\[H_{0}: \beta_{3} \,=\, 0, \quad H_{A}: \beta_{3} \,\neq\, 0\]

This hypothesis can be carried out by performing a \(F\)-test or a \(t\)-test (the \(t\)-test is much simpler). If we want to perform a \(F\)-test, we take the usual steps of first constructing the reduced model. The reduced model is the model that contains bill_length and bill_depth. We can construct this model by using update() and removing flipper_length.

ad_lm12 <- update(ad_lm, formula = . ~ . - flipper_length)

To test the above hypotheses, we pass our reduced and full models to anova():

anova(ad_lm12, ad_lm)
## Analysis of Variance Table
## 
## Model 1: body_mass ~ bill_length + bill_depth - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     42 5525856                           
## 2     41 5430791  1     95065 0.7177 0.4018

This produces an \(F\)-value of 0.7177 with 1 numerator degree of freedom and 41 denominator degrees of freedom. Using the usual significance level of \(\alpha = 0.05\), since the \(p\)-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is insufficient evidence to support the claim that \(\beta_{3}\) is different from zero.

When testing whether a single parameter is non-zero, we could instead have performed a \(t\)-test, whose corresponding \(p\)-value can be read directly off the coefficient summary table (obtained here with tidy() from the broom package).

tidy(ad_lm)
## # A tibble: 3 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 bill_length       58.7      25.9      2.27   0.0288
## 2 bill_depth       125.       52.3      2.40   0.0212
## 3 flipper_length    -4.64      5.47    -0.847  0.402

The corresponding \(p\)-value for this \(t\)-test is 0.402. Since this \(p\)-value is greater than 0.05, we fail to reject the null hypothesis once again.
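
Note the connection between the two tests of a single parameter: the partial \(F\)-statistic is the square of the \(t\)-statistic, and the two \(p\)-values agree.

# The partial F-statistic is the square of the t-statistic for a single parameter
(-0.847)^2               # approximately 0.7177, the F-value from anova() above
2 * pt(-0.847, df = 41)  # two-sided p-value, approximately 0.402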

3.3.10 Caution ❗

Suppose we had output that looked like:

## # A tibble: 3 x 5
##   term  estimate std.error statistic p.value
##   <chr> <chr>    <chr>     <chr>       <dbl>
## 1 var1  ...      ...       ...        0.0001
## 2 var2  ...      ...       ...        0.497 
## 3 var3  ...      ...       ...        0.885

It is not correct to say:

The \(p\)-values for the tests of var2 different from zero and var3 different from zero are 0.497 and 0.885, respectively. Since both \(p\)-values are larger than 0.05, we fail to reject the null hypothesis in both tests. Therefore we can simultaneously remove var2 and var3 from our model.

The \(p\)-value of 0.497 corresponds to testing var2 different from zero, assuming var1 and var3 are included in the model. Similarly, the \(p\)-value of 0.885 corresponds to testing var3 different from zero, assuming that var1 and var2 are included in the model. Therefore the proper procedure using \(t\)-tests would be to test for a single parameter being non-zero, remove that variable from the model (assuming we failed to reject the null hypothesis), refit the model without this variable, and then perform the next \(t\)-test on the refitted model.
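
With the penguin model from earlier, where flipper_length was the only term with a \(p\)-value above 0.05, the first round of this procedure uses the refitted model we already constructed as ad_lm12: remove flipper_length, refit, and then re-examine the \(t\)-tests before considering any further removals.

# ad_lm12 (defined above) is the refitted model with flipper_length removed;
# inspect its t-tests before deciding whether anything else should be dropped
tidy(ad_lm12)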

3.4 Multiple linear regression — working with matrices in R

3.4.1 Creating a matrix in R

To begin, I recommend checking out the related documentation by calling ?matrix. The main arguments of interest are:

  • data: the data for your matrix
  • nrow: the number of rows your matrix should have
  • ncol: the number of columns your matrix should have
  • byrow: when set to TRUE, the supplied data are filled in horizontally (by row); otherwise the data are filled in vertically (by column). By default, the data are filled in by column.

To illustrate how the data gets filled in depending on the value of byrow:

nums <- 1:9

mat1 <- matrix(nums, nrow=3, ncol=3)
mat1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
nums2 <- c(1:4, 9, 2, 7, 3, 6)
mat2 <- matrix(nums2, nrow=3, ncol=3, byrow=TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    9    2
## [3,]    7    3    6

3.4.2 Matrix operations

3.4.2.1 Matrix transpose

To obtain the transpose of a matrix, use t():

mat1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
t(mat1)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

3.4.2.2 Matrix inverse

To obtain the inverse of a matrix, use solve():

mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    9    2
## [3,]    7    3    6
solve(mat2)
##        [,1]   [,2]   [,3]
## [1,] -0.384  0.024  0.184
## [2,]  0.080  0.120 -0.080
## [3,]  0.408 -0.088 -0.008

solve() will return an error if your matrix is not invertible.
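
For example (a small sketch), a matrix whose columns are linearly dependent has no inverse:

singular_mat <- matrix(c(1, 2, 2, 4), nrow=2, ncol=2)  # second column is twice the first
# solve(singular_mat)  # uncommenting this line throws an error because the matrix is singular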

3.4.2.3 Matrix multiplication

To multiply two matrices together, use %*%:

mat1 %*% mat2
##      [,1] [,2] [,3]
## [1,]   66   59   53
## [2,]   78   73   64
## [3,]   90   87   75

You cannot use the usual multiplication operator because that will perform element-wise multiplication, rather than matrix multiplication:

mat1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    9    2
## [3,]    7    3    6
mat1 * mat2
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   45   16
## [3,]   21   18   54

3.4.3 Example: Question K, page 174 (selected parts)

From the given data, we can start by constructing our matrices.

Y <- matrix(c(7.2, 8.1, 9.8, 12.3, 12.9), nrow=5, ncol=1)
Y
##      [,1]
## [1,]  7.2
## [2,]  8.1
## [3,]  9.8
## [4,] 12.3
## [5,] 12.9

Don't forget the column of ones in the design matrix that corresponds to the intercept! Let's also supply some column names to the design matrix to keep track of the columns and so that our resulting coefficient estimates will have names.

X <- matrix(
  c(rep(1, 5), -1, -1, 0, 1, 1, -1, 0, 0, 0, 1),
  nrow=5, ncol=3,
  dimnames = list(NULL, c("Intercept", "X1", "X2"))
)
X
##      Intercept X1 X2
## [1,]         1 -1 -1
## [2,]         1 -1  0
## [3,]         1  0  0
## [4,]         1  1  0
## [5,]         1  1  1

When supplying names to a matrix during its construction, dimnames must be given a list of length 2 specifying the row and column names, respectively. Since I don't need any row names, I supply NULL as the first element of the list. Note that this list still has length 2: NULL is one element, and c("Intercept", "X1", "X2") is the other. An alternative is to build the matrix without names and set them afterward using:

rownames(X) <- c("row1", "row2", "etc.")  # vector length must equal nrow(X)
colnames(X) <- c("col1", "col2", "etc.")  # vector length must equal ncol(X)

3.4.3.1 Obtain the coefficient estimates

We can obtain the coefficient estimates using the equation:

\[\widehat{\beta} \,=\, (X^{T}X)^{-1}\,X^{T}\,Y\]

betahat <- solve(t(X) %*% X) %*% t(X) %*% Y
betahat
##            [,1]
## Intercept 10.06
## X1         2.10
## X2         0.75

The equation of the fitted line is:

\[\widehat{y}_{i} \,=\, 10.06 \,+\, 2.10x_{1,\,i} \,+\, 0.75x_{2,\,i}\]

3.4.3.2 Compute the following regression sums of squares

3.4.3.2.1 (i)

\[SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,=\, \widehat{\beta}^{T}\,X^{T}\,Y\]

t(betahat) %*% t(X) %*% Y
##         [,1]
## [1,] 531.083

3.4.3.2.2 (ii)

\[\begin{align*} SS(\widehat{\beta}_{1},\, \widehat{\beta}_{2} \,|\, \widehat{\beta}_{0}) &= SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0})\\ &= \widehat{\beta}^{T}\,X^{T}\,Y \,-\, \frac{1}{n}\left(\sum_{i}Y_{i}\right)^{2}\\ &= \widehat{\beta}^{T}\,X^{T}\,Y \,-\, \frac{1}{n}Y^{T}\,\mathbf{1}\,\mathbf{1}^{T}\,Y \end{align*}\]

ones <- matrix(1, nrow=5)
ones
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1
## [5,]    1
(t(betahat) %*% t(X) %*% Y) - (1/5)*(t(Y) %*% ones %*% t(ones) %*% Y)
##        [,1]
## [1,] 25.065

3.4.3.2.3 (iii)

\[SS(\widehat{\beta}_{2} \,|\, \widehat{\beta}_{1},\, \widehat{\beta}_{0}) \,=\, SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0},\,\widehat{\beta}_{1})\]

We already have the first term. The second term is obtained by fitting a model without the \(X_{2}\) variable and computing

\[SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1}) \,=\, \widehat{\beta}^{T}\,X^{T}\,Y\]

X_reduced <- X[, -3]
X_reduced
##      Intercept X1
## [1,]         1 -1
## [2,]         1 -1
## [3,]         1  0
## [4,]         1  1
## [5,]         1  1
betahat_reduced <- solve(t(X_reduced) %*% X_reduced) %*% t(X_reduced) %*% Y
betahat_reduced
##             [,1]
## Intercept 10.060
## X1         2.475
t(betahat_reduced) %*% t(X_reduced) %*% Y
##          [,1]
## [1,] 530.5205

Therefore the quantity

\[SS(\widehat{\beta}_{2} \,|\, \widehat{\beta}_{1},\, \widehat{\beta}_{0}) \,=\, SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\,\widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0},\,\widehat{\beta}_{1})\]

can be computed as:

(t(betahat) %*% t(X) %*% Y) - (t(betahat_reduced) %*% t(X_reduced) %*% Y)
##        [,1]
## [1,] 0.5625

3.4.3.3 Obtain the residual sum of squares

\[SSE \,=\, Y^{T}\,Y \,-\, \widehat{\beta}^{T}\,X^{T}\,Y\]

SSE <- (t(Y) %*% Y) - (t(betahat) %*% t(X) %*% Y)
SSE
##       [,1]
## [1,] 0.107

3.4.3.4 Obtain an estimate for error term variance

\[\widehat{\sigma}^{2} \,=\, S^{2} \,=\, MSE \,=\, \frac{SSE}{\text{df}_{SSE}}\]

The degrees of freedom of the SSE is \(n - p - 1\) where \(n\) is the number of observations, and \(p\) is the number of non-intercept parameters estimated. Here, \(n = 5\) and \(p = 2\). So

\[\text{df}_{SSE} \,=\, n - p - 1 \,=\, 5 - 2 - 1 \,=\, 2\]

MSE <- SSE / 2
MSE <- as.numeric(MSE)
MSE
## [1] 0.0535

Note: We need to wrap MSE in as.numeric() to coerce it to a scalar for future calculations. Otherwise, in R, this value is still treated as a 1x1 matrix and matrix operations may fail due to incompatible matrix dimensions.
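
A small illustration of the problem, using a throwaway 1x1 matrix m:

m <- matrix(2)            # a 1x1 matrix, not a scalar
# m * diag(3)             # error: non-conformable arrays
as.numeric(m) * diag(3)   # works: ordinary scalar multiplication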

3.4.3.5 Obtain the variance-covariance matrix of the coefficient estimates

\[\textbf{Var}(\widehat{\beta}) \,=\, \widehat{\sigma}^{2}(X^{T}\,X)^{-1}\]

vcov_betahat <- MSE * solve(t(X) %*% X)
vcov_betahat
##           Intercept       X1       X2
## Intercept    0.0107  0.00000  0.00000
## X1           0.0000  0.02675 -0.02675
## X2           0.0000 -0.02675  0.05350

The variances are along the diagonal of the matrix. The covariances are the off-diagonal terms. If we wanted the standard errors for all coefficient estimates we would need to take the square root of all the diagonal terms. We can do this by first extracting the diagonal elements by wrapping vcov_betahat with diag(), and then wrapping it in sqrt().

se_betahat <- sqrt(diag(vcov_betahat))
se_betahat
## Intercept        X1        X2 
## 0.1034408 0.1635543 0.2313007

3.4.3.6 Predict the mean response value at a given point

Find \(\widehat{Y}_{0}\) at point \((X_{1,\,0},\, X_{2,\,0}) \,=\, (0.5, 0)\).

\[\widehat{y}_{0} \,=\, x_{0}^{T}\widehat{\beta} \,=\, \begin{bmatrix}1 &0.5 &0\end{bmatrix} \begin{bmatrix}10.06\\ 2.10\\ 0.75\end{bmatrix}\]

x0 <- matrix(c(1, 0.5, 0), nrow=3, ncol=1)
x0
##      [,1]
## [1,]  1.0
## [2,]  0.5
## [3,]  0.0
t(x0) %*% betahat
##       [,1]
## [1,] 11.11

3.4.3.7 Obtain the standard error of the mean response at the given point

\[\textbf{se}(\widehat{Y}_{0}) \,=\, \sqrt{\widehat{\sigma}^{2}x_{0}^{T}(X^{T}\,X)^{-1}x_{0}}\]

sqrt(MSE * t(x0) %*% solve(t(X) %*% X) %*% x0)
##           [,1]
## [1,] 0.1318617

3.4.3.8 Weighted least squares

From the information given in the question, the variance of the error term is now

\[\textbf{Var}(\varepsilon) \,=\, \sigma^{2}V\]

where

\[V \,=\, \begin{bmatrix} 1 &0 &0 &0 &0\\ 0 &1 &0 &0 &0\\ 0 &0 &1/4 &0 &0\\ 0 &0 &0 &1 &0\\ 0 &0 &0 &0 &1 \end{bmatrix}\]

and \(V^{-1}\) is the "weight matrix".

The new estimates are computed using:

\[\widehat{\beta} \,=\, (X^{T}\,V^{-1}\,X)^{-1}\,X^{T}\,V^{-1}\,Y\]

V <- diag(c(1, 1, 1/4, 1, 1))
V
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0 0.00    0    0
## [2,]    0    1 0.00    0    0
## [3,]    0    0 0.25    0    0
## [4,]    0    0 0.00    1    0
## [5,]    0    0 0.00    0    1

Recall that the inverse of a diagonal matrix is again a diagonal matrix whose diagonal entries are the reciprocals of the original diagonal entries.

V_inv <- solve(V)
V_inv
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    4    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
betahat_weighted <- solve(t(X) %*% V_inv %*% X) %*% t(X) %*% V_inv %*% Y
betahat_weighted
##             [,1]
## Intercept 9.9625
## X1        2.1000
## X2        0.7500
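
Since V is diagonal, we could also have built the weight matrix directly from the reciprocals of its diagonal entries; as a quick check, this agrees with solve():

# Equivalent to solve(V) because V is diagonal
V_inv_direct <- diag(1 / c(1, 1, 1/4, 1, 1))
all.equal(V_inv, V_inv_direct)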

3.5 Simple linear regression — simultaneous confidence regions

Let us return to the rock data set and consider the simple linear regression model of area as the response variable and peri as the predictor.

data(rock)

slr_rock <- lm(area ~ peri, data=rock)

Suppose we are interested in constructing a 90% confidence region for \((\beta_{0}, \beta_{1})\) with individual 95% confidence intervals. We can start by obtaining the path of this ellipse by passing our linear model to ellipse(). The default confidence level for the ellipse is 95% so we need to make a slight adjustment to the value of the level parameter. In addition, ellipse() returns a matrix of coordinates for the path of the ellipse. Since we'll be passing the coordinates of the path of the ellipse to ggplot later, we should also convert it to a data frame.

ellipse_path <- slr_rock %>%
  ellipse(level=0.90) %>%
  as_tibble()

ellipse_path
## # A tibble: 100 x 2
##    `(Intercept)`  peri
##            <dbl> <dbl>
##  1         3305.  1.63
##  2         3239.  1.65
##  3         3174.  1.67
##  4         3107.  1.69
##  5         3041.  1.71
##  6         2974.  1.73
##  7         2908.  1.74
##  8         2842.  1.76
##  9         2777.  1.78
## 10         2714.  1.79
## # ... with 90 more rows

The values under the (Intercept) and peri columns represent the x- and y-values, respectively, of the path of the ellipse. Note that ellipse() is another example of a generic function: when we pass it a linear model object (an object of class lm), it dispatches to the more specific method ellipse.lm().

Now that we have the path of the ellipse, we also need the point \((\widehat{\beta}_{0},\, \widehat{\beta}_{1})\) and the individual 95% confidence intervals for \(\beta_{0}\) and \(\beta_{1}\). We can obtain these by passing our linear model to tidy() (which dispatches to tidy.lm()) with conf.int=TRUE, giving both the coefficient table and the corresponding 95% confidence intervals.

coef_table <- slr_rock %>%
  tidy(conf.int=TRUE)

coef_table
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  3052.     477.         6.40 7.26e- 8  2092.     4012.  
## 2 peri            1.54     0.157      9.81 7.51e-13     1.23      1.86

Before we can begin plotting, we have to do some data reshaping. Recall that when plotting with ggplot(), we need all of our x-values in one column and all our y-values in another column.

We need a data frame where the estimate of \(\beta_{0}\) is in one column and the estimate of \(\beta_{1}\) is in another column.

coef_point <- coef_table %>%
  pivot_wider(id_cols=c(term, estimate), names_from=term, values_from=estimate)

coef_point
## # A tibble: 1 x 2
##   `(Intercept)`  peri
##           <dbl> <dbl>
## 1         3052.  1.54

We need a data frame where the lower and upper bounds of the confidence interval for \(\beta_{0}\) are in a single column.

beta0 <- coef_table %>%
  slice(1) %>%
  pivot_longer(cols=contains("conf"))

beta0
## # A tibble: 2 x 7
##   term        estimate std.error statistic      p.value name      value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl> <chr>     <dbl>
## 1 (Intercept)    3052.      477.      6.40 0.0000000726 conf.low  2092.
## 2 (Intercept)    3052.      477.      6.40 0.0000000726 conf.high 4012.

We need a data frame where the lower and upper bounds of the confidence interval for \(\beta_{1}\) are in a single column.

beta1 <- coef_table %>%
  slice(2) %>%
  pivot_longer(cols=contains("conf"))

beta1
## # A tibble: 2 x 7
##   term  estimate std.error statistic  p.value name      value
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl> <chr>     <dbl>
## 1 peri      1.54     0.157      9.81 7.51e-13 conf.low   1.23
## 2 peri      1.54     0.157      9.81 7.51e-13 conf.high  1.86

Now that our data is "in shape", we can start building the plot. This plot will be constructed by taking advantage of the fact that every geom_*() layer has its own data argument. This means that we can use a different data set for each layer of our plot. Since we don't have a common data set or common aesthetics that will be shared by all layers, we will not supply anything to the call to ggplot(). Try running the code below, layer by layer, to see how the plot is being constructed!

ggplot()+
  geom_vline(data=beta0, aes(xintercept=value), colour="#3366FF", lty=2)+
  geom_hline(data=beta1, aes(yintercept=value), colour="#3366FF", lty=2)+
  geom_point(data=coef_point, aes(x=`(Intercept)`, y=peri), colour="red")+
  geom_path(data=ellipse_path, aes(x=`(Intercept)`, y=peri))+
  labs(
    x="Intercept", y="Slope",
    caption="Solid line represents the boundary of the 90% confidence region. Dashed lines represent
    the boundaries of the 95% confidence intervals."
  )

Note that points contained within the ellipse are pairs \((\beta_{0},\,\beta_{1})\) that the data suggest are jointly plausible values for the parameters. Meanwhile, the rectangular region formed by the dashed lines contains the values that are plausible for each parameter when considered individually.