3 Lab 5 — October 13
3.1 Packages
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
library(broom)
library(ellipse)
library(palmerpenguins)
theme_set(theme_bw())
The tidyr package is another component package of the tidyverse that contains functions for cleaning up and reshaping data.
The palmerpenguins package will be used for its penguins data set in the multiple linear regression example today.
We will be using the ellipse package for drawing ellipses (confidence regions). This package contains functions for drawing confidence regions associated with the estimates obtained from linear regression. Unlike the code provided in class for drawing confidence regions with the component confidence intervals, which took raw data as input, the ellipse::ellipse.lm() function allows you to pass in your linear model object instead. When using the ellipse package, however, we will need to draw the lines corresponding to the component confidence intervals ourselves (though this is not difficult to do).
3.2 The pipe
The pipe operator, %>%, can be found in the package magrittr, which should have been installed when you installed the dplyr package. When dplyr is loaded (by calling library(dplyr)), the magrittr pipe is also imported.
The pipe works by taking the result on the left and passing it to the first argument of the function on the right. The pipe allows you to write cleaner code by converting something like this:
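(The two chunks below are a sketch using a hypothetical data frame df with made-up columns height and mass.)
# a nested call: it has to be read from the inside out
result <- mutate(
  select(
    filter(df, height > 100),   # keep rows with height above 100
    height, mass                # keep only these two columns
  ),
  bmi = mass / (height / 100)^2 # add a new column
)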
into this:
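# the same steps written with the pipe, read top to bottom
result <- df %>%
  filter(height > 100) %>%
  select(height, mass) %>%
  mutate(bmi = mass / (height / 100)^2)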
revealing the underlying sequential logic of your workflow. Since the pipe passes the result on the left to the first argument of the function on the right, it is especially convenient when wrangling data with dplyr: every dplyr data wrangling function takes a data frame as its first argument and, in most cases, returns a data frame.
In the example above, the first argument to filter(), select(), and mutate() is a data frame. After evaluation, each function also returns a data frame. Notice that as a result of piping a data frame into each of these functions, we don't need to specify where it goes (it goes to the first argument by default) and we don't need to create extra variables for intermediate steps.
3.3 Multiple linear regression — hypothesis testing
3.3.1 The penguins data
Let us load the penguins data set. Some of the variables that we are interested in working with contain missing values. Immediately after loading the data, we can pass it into the drop_na() function (from tidyr), which will drop all rows (observations) that contain missing values.
In addition, let's also rename some of our variables to shorten some of the code that we need to type. This can be done by piping our data into the rename() function and supplying name pairs in the format new_name = old_name.
penguins <- palmerpenguins::penguins %>%
drop_na() %>%
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
)
penguins
## # A tibble: 333 x 8
## species island bill_length bill_depth flipper_length body_mass sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema~ 2007
## 4 Adelie Torgersen 36.7 19.3 193 3450 fema~ 2007
## 5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 6 Adelie Torgersen 38.9 17.8 181 3625 fema~ 2007
## 7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 8 Adelie Torgersen 41.1 17.6 182 3200 fema~ 2007
## 9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
## 10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
## # ... with 323 more rows
3.3.2 Subsetting the data
This data set contains penguins from three species across three islands:
penguins %>%
distinct(species, island)
## # A tibble: 5 x 2
## species island
## <fct> <fct>
## 1 Adelie Torgersen
## 2 Adelie Biscoe
## 3 Adelie Dream
## 4 Gentoo Biscoe
## 5 Chinstrap Dream
For the following example, let us focus only on the penguins of species Adelie from the island Biscoe. We can obtain this subset of the penguins data using the following code:
adeliebiscoe <- penguins %>%
filter(species == "Adelie", island == "Biscoe")
(Note that in the code above, we are using two equal signs since we are making a comparison and not setting values).
3.3.3 Visual check for linear relationship
Suppose we are interested in fitting a model using body mass as the response and bill length, bill depth, and flipper length as the predictors. Before building the model, we should verify that there is a linear relationship between the response and the individual predictors.
ggplot(adeliebiscoe, aes(x=bill_length, y=body_mass))+
geom_point()
ggplot(adeliebiscoe, aes(x=bill_depth, y=body_mass))+
geom_point()
ggplot(adeliebiscoe, aes(x=flipper_length, y=body_mass))+
geom_point()
The plots look okay. Let's go ahead and fit the model.
3.3.4 Fit the model
ad_lm <- lm(body_mass ~ bill_length + bill_depth + flipper_length, data=adeliebiscoe)
summary(ad_lm)
##
## Call:
## lm(formula = body_mass ~ bill_length + bill_depth + flipper_length,
## data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -836.50 -186.29 19.01 183.14 486.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6122.000 1341.697 -4.563 4.71e-05 ***
## bill_length 70.743 21.445 3.299 0.002046 **
## bill_depth 165.196 43.802 3.771 0.000526 ***
## flipper_length 21.397 7.262 2.947 0.005336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 298.8 on 40 degrees of freedom
## Multiple R-squared: 0.6508, Adjusted R-squared: 0.6246
## F-statistic: 24.85 on 3 and 40 DF, p-value: 3.048e-09
From the above output, the equation of our fitted line is:
\[\widehat{y}_{i} \,=\, -6122 \,+\, 70.743x_{1,\,i} \,+\, 165.196x_{2,\,i} + 21.397x_{3,\,i}\]
where:
- \(\widehat{y}_{i}\) is the predicted body mass
- \(x_{1,\,i}\) is the bill length
- \(x_{2,\,i}\) is the bill depth
- \(x_{3,\,i}\) is the flipper length
3.3.5 Fitting a model without an intercept
By default, linear models are fitted with an intercept. We can fit a linear model without an intercept by adding a + 0 or a - 1 in the formula specification (see the Details section of ?lm). To make the hypothesis tests that we will do in the next section a bit more interesting, let us remove the intercept from the model that we just fit and overwrite it.
As shown in Lab 1, we can do this by fitting a brand new model using lm():
ad_lm <- lm(body_mass ~ bill_length + bill_depth + flipper_length - 1, data=adeliebiscoe)
or by using update():
ad_lm <- update(ad_lm, formula = . ~ . - 1)
Recall that the above line of code means that (the new) ad_lm is constructed by taking the existing ad_lm and updating its formula. The response variable remains the same (represented by the dot on the left of the tilde), the predictors remain the same (represented by the dot on the right of the tilde), and the - 1 denotes that we do not want an intercept.
Let's take a look at our new model.
summary(ad_lm)
##
## Call:
## lm(formula = body_mass ~ bill_length + bill_depth + flipper_length -
## 1, data = adeliebiscoe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -737.69 -296.78 41.82 236.37 769.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## bill_length 58.741 25.922 2.266 0.0288 *
## bill_depth 125.323 52.276 2.397 0.0212 *
## flipper_length -4.635 5.471 -0.847 0.4018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 363.9 on 41 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9905
## F-statistic: 1536 on 3 and 41 DF, p-value: < 2.2e-16
Interestingly, by removing the intercept from the model, our adjusted R-squared value went from 0.6246 to 0.9905 👀. Be careful with this comparison: when a model has no intercept, R computes R-squared (and adjusted R-squared) relative to a baseline of zero rather than the mean of the response, so these values are not comparable to those from a model with an intercept.
3.3.6 ANOVA with multiple linear regression models
anova(ad_lm)
## Analysis of Variance Table
##
## Response: body_mass
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length 1 609528281 609528281 4601.6610 < 2e-16 ***
## bill_depth 1 683988 683988 5.1638 0.02837 *
## flipper_length 1 95065 95065 0.7177 0.40182
## Residuals 41 5430791 132458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For multiple linear regression, the output resulting from passing our linear model to anova() is slightly different from what we saw with simple linear regression. In simple linear regression, wrapping our linear model with anova() produced the "classic" \(F\)-table: the sum of squares for the first row was the SSR, the sum of squares for the second row was the SSE, and there was a single \(F\)-value and corresponding \(p\)-value.
In the above output, we now have three \(F\)-values and three corresponding \(p\)-values! However, the interpretation here is slightly different. The sum of squares column represents the sequential increase in SSR (or decrease in SSE) obtained by adding each variable to the model containing the variables listed above it. As such:
- The sum of squares on the bill_length row is the increase in SSR (or decrease in SSE) from including only bill_length in the model (because our base model has no intercept)
- The sum of squares on the bill_depth row is the increase in SSR (or decrease in SSE) from including bill_depth in the model that already includes bill_length
- The sum of squares on the flipper_length row is the increase in SSR (or decrease in SSE) from including flipper_length in the model that already includes bill_length and bill_depth
- The sum of squares on the Residuals row is the SSE of the model that includes all of bill_length, bill_depth, and flipper_length
Using only the output above, if we wish to perform a hypothesis test to check whether a subset of the parameters is different from zero, we need to do some extra math. But since we are using R, let's not do math! We can perform the partial \(F\)-tests that we are interested in by constructing reduced models (and this is where the update() function becomes extremely handy).
3.3.7 Hypothesis test: three parameters
Suppose we are interested in testing the hypotheses:
\[H_{0}: \beta_{1} \,=\, \beta_{2} \,=\, \beta_{3} \,=\, 0, \quad H_{A}: \text{At least one non-zero}\]
The reduced model here contains no parameters at all: since our full model has no intercept, removing every predictor leaves the constant \((Y \,=\, 0)\) model that passes through the origin.
ad_lm_origin_only <- update(ad_lm, formula = . ~ 0)
The full model is the model that has all the predictors (ad_lm).
To test the above hypotheses, we pass our reduced and full models to anova():
anova(ad_lm_origin_only, ad_lm)
## Analysis of Variance Table
##
## Model 1: body_mass ~ 1 - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 615738125
## 2 41 5430791 3 610307334 1535.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This produces an \(F\)-value of 1535.8 with 3 numerator degrees of freedom and 41 denominator degrees of freedom. But we don't need to use these values since a \(p\)-value is also given. Using a significance level of \(\alpha = 0.05\), since the \(p\)-value is less than 0.05, we reject the null hypothesis and conclude that at least one of the parameters is non-zero; as such, we should not consider the constant \((Y \,=\, 0)\) model. It is important to note that the result of this hypothesis test tells us that at least one parameter is non-zero, but it doesn't tell us which.
A hypothesis test that has a null hypothesis where all parameters are equal to zero is often known as a test for model usefulness. If we fail to reject this null hypothesis, it suggests that the current model is not a substantial improvement over a constant \((Y \,=\, c)\) model.
3.3.8 Hypothesis test: two parameters
Suppose we are interested in testing the hypotheses:
\[H_{0}: \beta_{2} \,=\, \beta_{3} \,=\, 0, \quad H_{A}: \text{At least one non-zero}\]
The reduced model is the model that contains only bill_length. We can construct this model using update() either by specifying the variables we want:
# Don't forget the no-intercept term!!
ad_lm1 <- update(ad_lm, formula = . ~ bill_length - 1)
or the variables we do not want:
ad_lm1 <- update(ad_lm, formula = . ~ . - bill_depth - flipper_length)
To test the above hypotheses, we pass our reduced and full models to anova():
anova(ad_lm1, ad_lm)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_length - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 6209844
## 2 41 5430791 2 779053 2.9407 0.06405 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This produces an \(F\)-value of 2.9407 with 2 numerator degrees of freedom and 41 denominator degrees of freedom. Using the usual significance level of \(\alpha = 0.05\), since the \(p\)-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is insufficient evidence to support the claim that at least one of \(\beta_{2}\) or \(\beta_{3}\) is non-zero.
3.3.9 Hypothesis test: one parameter
Suppose we are interested in testing the hypotheses:
\[H_{0}: \beta_{3} \,=\, 0, \quad H_{A}: \beta_{3} \,\neq\, 0\]
This hypothesis test can be carried out by performing an \(F\)-test or a \(t\)-test (the \(t\)-test is much simpler). If we want to perform an \(F\)-test, we take the usual steps of first constructing the reduced model. The reduced model is the model that contains bill_length and bill_depth. We can construct this model by using update() and removing flipper_length.
ad_lm12 <- update(ad_lm, formula = . ~ . - flipper_length)
To test the above hypotheses, we pass our reduced and full models to anova():
anova(ad_lm12, ad_lm)
## Analysis of Variance Table
##
## Model 1: body_mass ~ bill_length + bill_depth - 1
## Model 2: body_mass ~ bill_length + bill_depth + flipper_length - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 5525856
## 2 41 5430791 1 95065 0.7177 0.4018
This produces an \(F\)-value of 0.7177 with 1 numerator degree of freedom and 41 denominator degrees of freedom. Using the usual significance level of \(\alpha = 0.05\), since the \(p\)-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is insufficient evidence to support the claim that \(\beta_{3}\) is different from zero.
To test whether a single parameter is non-zero, we could instead have performed a \(t\)-test, whose corresponding \(p\)-value can be read off the coefficient summary table.
tidy(ad_lm)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bill_length 58.7 25.9 2.27 0.0288
## 2 bill_depth 125. 52.3 2.40 0.0212
## 3 flipper_length -4.64 5.47 -0.847 0.402
The corresponding \(p\)-value for this \(t\)-test is 0.402. Since this \(p\)-value is greater than 0.05, we fail to reject the null hypothesis once again.
3.3.10 Caution ❗
Suppose we had output that looked like:
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <chr> <chr> <chr> <dbl>
## 1 var1 ... ... ... 0.0001
## 2 var2 ... ... ... 0.497
## 3 var3 ... ... ... 0.885
It is not correct to say:
"The \(p\)-values for the tests of var2 different from zero and var3 different from zero are 0.4971 and 0.8846, respectively. Since both \(p\)-values are larger than 0.05, we fail to reject the null hypothesis in both tests. Therefore we can simultaneously remove var2 and var3 from our model."
The \(p\)-value of 0.497 corresponds to testing var2 different from zero, assuming var1 and var3 are included in the model. Similarly, the \(p\)-value of 0.885 corresponds to testing var3 different from zero, assuming that var1 and var2 are included in the model. Therefore the proper procedure using \(t\)-tests would be to test for a single parameter being non-zero, remove the variable from the model (assuming we failed to reject the null hypothesis), refit a new model without this variable, and then perform a second \(t\)-test.
3.4 Multiple linear regression — working with matrices in R
3.4.1 Creating a matrix in R
To begin, I recommend checking out the related documentation by calling ?matrix. The main arguments of interest are:
- data: the data for your matrix
- nrow: the number of rows your matrix should have
- ncol: the number of columns your matrix should have
- byrow: when set to TRUE, the supplied data are filled in horizontally (by row); otherwise the data are filled in vertically (by column). By default, data is filled in vertically.
To illustrate how the data gets filled in depending on the value of byrow:
nums <- 1:9
mat1 <- matrix(nums, nrow=3, ncol=3)
mat1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
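For comparison, mat2 is filled in by row. A construction consistent with the output below (and with mat2 as it is used later in this section) would be:
# byrow=TRUE fills the data horizontally, one row at a time
mat2 <- matrix(c(1, 2, 3, 4, 9, 2, 7, 3, 6), nrow=3, ncol=3, byrow=TRUE)
mat2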
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 9 2
## [3,] 7 3 6
3.4.2 Matrix operations
3.4.2.1 Matrix transpose
To obtain the transpose of a matrix, use t():
mat1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
t(mat1)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
3.4.2.2 Matrix inverse
To obtain the inverse of a matrix, use solve():
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 9 2
## [3,] 7 3 6
solve(mat2)
## [,1] [,2] [,3]
## [1,] -0.384 0.024 0.184
## [2,] 0.080 0.120 -0.080
## [3,] 0.408 -0.088 -0.008
solve() will return an error if your matrix is not invertible.
3.4.2.3 Matrix multiplication
To multiply two matrices together, use %*%:
mat1 %*% mat2
## [,1] [,2] [,3]
## [1,] 66 59 53
## [2,] 78 73 64
## [3,] 90 87 75
You cannot use the usual multiplication operator because that will perform element-wise multiplication, rather than matrix multiplication:
mat1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 9 2
## [3,] 7 3 6
mat1 * mat2
## [,1] [,2] [,3]
## [1,] 1 8 21
## [2,] 8 45 16
## [3,] 21 18 54
3.4.3 Example: Question K, page 174 (selected parts)
From the given data, we can start by constructing our matrices.
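For instance, the response vector can be entered as a 5-by-1 matrix (the variable name Y is assumed here, chosen to match the formulas below):
# response values from the question, stored as a column vector
Y <- matrix(c(7.2, 8.1, 9.8, 12.3, 12.9), nrow=5)
Y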
## [,1]
## [1,] 7.2
## [2,] 8.1
## [3,] 9.8
## [4,] 12.3
## [5,] 12.9
Don't forget the column of ones in the design matrix that corresponds to the intercept! Let's also supply some column names to the design matrix to keep track of the columns and so that our resulting coefficient estimates will have names.
X <- matrix(
c(rep(1, 5), -1, -1, 0, 1, 1, -1, 0, 0, 0, 1),
nrow=5, ncol=3,
dimnames = list(NULL, c("Intercept", "X1", "X2"))
)
X
## Intercept X1 X2
## [1,] 1 -1 -1
## [2,] 1 -1 0
## [3,] 1 0 0
## [4,] 1 1 0
## [5,] 1 1 1
In supplying names to a matrix during its construction, a list of length 2 must be supplied specifying the row and column names, respectively. Since I don't need any row names, I supply NULL to the first element of the list. Note that this list has length 2: NULL is one element, and c("Intercept", "X1", "X2") is another element. An alternative is to build a matrix with no names and set the names afterward using:
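# for example, setting only the column names after construction:
colnames(X) <- c("Intercept", "X1", "X2")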
3.4.3.1 Obtain the coefficient estimates
We can obtain the coefficient estimates using the equation:
\[\widehat{\beta} \,=\, (X^{T}X)^{-1}\,X^{T}\,Y\]
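In R, this can be computed directly from the matrices defined above (betahat is the name used later in this example):
betahat <- solve(t(X) %*% X) %*% t(X) %*% Y
betahat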
## [,1]
## Intercept 10.06
## X1 2.10
## X2 0.75
The equation of the fitted line is:
\[\widehat{y}_{i} \,=\, 10.06 \,+\, 2.10x_{1,\,i} \,+\, 0.75x_{2,\,i}\]
3.4.3.2 Compute the following regression sum of squares
3.4.3.2.1 (i)
\[SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,=\, \widehat{\beta}^{T}\,X^{T}\,Y\]
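A sketch of this computation in R (the name SS_012 is our own, chosen to mirror the notation):
SS_012 <- t(betahat) %*% t(X) %*% Y
SS_012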
## [,1]
## [1,] 531.083
3.4.3.2.2 (ii)
\[\begin{align*} SS(\widehat{\beta}_{1},\, \widehat{\beta}_{2} \,|\, \widehat{\beta}_{0}) &= SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0})\\ &= \widehat{\beta}^{T}\,X^{T}\,Y \,-\, \frac{1}{n}\left(\sum_{i}Y_{i}\right)^{2}\\ &= \widehat{\beta}^{T}\,X^{T}\,Y \,-\, \frac{1}{n}Y^{T}\,\mathbf{1}\,\mathbf{1}^{T}\,Y \end{align*}\]
ones <- matrix(1, nrow=5)
ones
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
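Using the quantities above (with n = 5), one way to compute this is:
# SS(beta1, beta2 | beta0) = betahat' X' Y - (1/n) Y' 1 1' Y
SS_012 - (1/5) * t(Y) %*% ones %*% t(ones) %*% Y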
## [,1]
## [1,] 25.065
3.4.3.2.3 (iii)
\[SS(\widehat{\beta}_{2} \,|\, \widehat{\beta}_{1},\, \widehat{\beta}_{0}) \,=\, SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\, \widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0},\,\widehat{\beta}_{1})\]
We already have the first term. The second term is obtained by fitting a model without the \(X_{2}\) variable and computing
\[SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1}) \,=\, \widehat{\beta}^{T}\,X^{T}\,Y\]
X_reduced <- X[, -3]
X_reduced
## Intercept X1
## [1,] 1 -1
## [2,] 1 -1
## [3,] 1 0
## [4,] 1 1
## [5,] 1 1
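The coefficient estimates for the reduced model can then be obtained in the same way (the name betahat_reduced is our own):
betahat_reduced <- solve(t(X_reduced) %*% X_reduced) %*% t(X_reduced) %*% Y
betahat_reduced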
## [,1]
## Intercept 10.060
## X1 2.475
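and the corresponding regression sum of squares (the name SS_01 is our own):
SS_01 <- t(betahat_reduced) %*% t(X_reduced) %*% Y
SS_01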
## [,1]
## [1,] 530.5205
Therefore the quantity
\[SS(\widehat{\beta}_{2} \,|\, \widehat{\beta}_{1},\, \widehat{\beta}_{0}) \,=\, SS(\widehat{\beta}_{0},\, \widehat{\beta}_{1},\,\widehat{\beta}_{2}) \,-\, SS(\widehat{\beta}_{0},\,\widehat{\beta}_{1})\]
can be computed as:
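# difference of the two regression sums of squares computed above
SS_012 - SS_01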
## [,1]
## [1,] 0.5625
3.4.3.3 Obtain the residual sum of squares
\[SSE \,=\, Y^{T}\,Y \,-\, \widehat{\beta}^{T}\,X^{T}\,Y\]
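In R (the name SSE matches its use in the MSE calculation below):
SSE <- t(Y) %*% Y - t(betahat) %*% t(X) %*% Y
SSE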
## [,1]
## [1,] 0.107
3.4.3.4 Obtain an estimate for error term variance
\[\widehat{\sigma}^{2} \,=\, S^{2} \,=\, MSE \,=\, \frac{SSE}{\text{df}_{SSE}}\]
The degrees of freedom of the SSE is \(n - p - 1\) where \(n\) is the number of observations, and \(p\) is the number of non-intercept parameters estimated. Here, \(n = 5\) and \(p = 2\). So
\[\text{df}_{SSE} \,=\, n - p - 1 \,=\, 5 - 2 - 1 \,=\, 2\]
MSE <- SSE / 2
MSE <- as.numeric(MSE)
MSE
## [1] 0.0535
Note: We need to wrap MSE
in as.numeric()
to coerce it to a scalar for future calculations.
Otherwise, in R, this value is still treated as a 1x1 matrix and matrix operations may fail due
to incompatible matrix dimensions.
3.4.3.5 Obtain the variance-covariance matrix of the coefficient estimates
\[\textbf{Var}(\widehat{\beta}) \,=\, \widehat{\sigma}^{2}(X^{T}\,X)^{-1}\]
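In R (vcov_betahat is the name referenced below):
vcov_betahat <- MSE * solve(t(X) %*% X)
vcov_betahat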
## Intercept X1 X2
## Intercept 0.0107 0.00000 0.00000
## X1 0.0000 0.02675 -0.02675
## X2 0.0000 -0.02675 0.05350
The variances are along the diagonal of the matrix. The covariances are the off-diagonal terms. If we wanted the standard errors for all coefficient estimates, we would need to take the square root of all the diagonal terms. We can do this by first extracting the diagonal elements by wrapping vcov_betahat with diag(), and then wrapping it in sqrt().
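# standard errors of the coefficient estimates
sqrt(diag(vcov_betahat))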
## Intercept X1 X2
## 0.1034408 0.1635543 0.2313007
3.4.3.6 Predict the mean response value at a given point
Find \(\widehat{Y}_{0}\) at point \((X_{1,\,0},\, X_{2,\,0}) \,=\, (0.5, 0)\).
\[\widehat{y}_{0} \,=\, x_{0}^{T}\widehat{\beta} \,=\, \begin{bmatrix}1 &0.5 &0\end{bmatrix} \begin{bmatrix}10.06\\ 2.10\\ 0.75\end{bmatrix}\]
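The point x0 can be set up as a column vector (the name x0 matches its use below):
x0 <- matrix(c(1, 0.5, 0), nrow=3)
x0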
## [,1]
## [1,] 1.0
## [2,] 0.5
## [3,] 0.0
t(x0) %*% betahat
## [,1]
## [1,] 11.11
3.4.3.7 Obtain the standard error of the mean response at the given point
\[\textbf{se}(\widehat{Y}_{0}) \,=\, \sqrt{\widehat{\sigma}^{2}x_{0}^{T}(X^{T}\,X)^{-1}x_{0}}\]
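which can be computed as:
sqrt(MSE * t(x0) %*% solve(t(X) %*% X) %*% x0)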
## [,1]
## [1,] 0.1318617
3.4.3.8 Weighted least squares
From the information given in the question, the variance of the error term is now
\[\textbf{Var}(\varepsilon) \,=\, \sigma^{2}V\]
where
\[V \,=\, \begin{bmatrix} 1 &0 &0 &0 &0\\ 0 &1 &0 &0 &0\\ 0 &0 &1/4 &0 &0\\ 0 &0 &0 &1 &0\\ 0 &0 &0 &0 &1 \end{bmatrix}\]
and \(V^{-1}\) is the "weight matrix".
The new estimates are computed using:
\[\widehat{\beta} \,=\, (X^{T}\,V^{-1}\,X)^{-1}\,X^{T}\,V^{-1}\,Y\]
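The matrix V can be set up with diag() (the name V matches its use below; building it with diag() is one of several equivalent options):
V <- diag(c(1, 1, 1/4, 1, 1))
V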
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0.00 0 0
## [2,] 0 1 0.00 0 0
## [3,] 0 0 0.25 0 0
## [4,] 0 0 0.00 1 0
## [5,] 0 0 0.00 0 1
Recall that the inverse of a diagonal matrix is still a diagonal matrix where all the diagonal terms have been reciprocated.
V_inv <- solve(V)
V_inv
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 4 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
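The weighted least squares estimates can then be computed as (the name betahat_wls is our own):
betahat_wls <- solve(t(X) %*% V_inv %*% X) %*% t(X) %*% V_inv %*% Y
betahat_wls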
## [,1]
## Intercept 9.9625
## X1 2.1000
## X2 0.7500
3.5 Simple linear regression — simultaneous confidence regions
Let us return to the rock data set and consider the simple linear regression model with area as the response variable and peri as the predictor.
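This model can be fit as follows (slr_rock is the name used in the code below):
slr_rock <- lm(area ~ peri, data=rock)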
Suppose we are interested in constructing a 90% confidence region for \((\beta_{0}, \beta_{1})\) with individual 95% confidence intervals. We can start by obtaining the path of this ellipse by passing our linear model to ellipse(). The default confidence level for the ellipse is 95%, so we need to make a slight adjustment to the value of the level parameter. In addition, ellipse() returns a matrix of coordinates for the path of the ellipse. Since we'll be passing the coordinates of the path of the ellipse to ggplot later, we should also convert it to a data frame.
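One way to do this (ellipse_path is the name used in the plotting code below; converting with as_tibble() is one option for getting a data frame):
ellipse_path <- ellipse(slr_rock, level=0.90) %>%
  as_tibble()
ellipse_path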
## # A tibble: 100 x 2
## `(Intercept)` peri
## <dbl> <dbl>
## 1 3305. 1.63
## 2 3239. 1.65
## 3 3174. 1.67
## 4 3107. 1.69
## 5 3041. 1.71
## 6 2974. 1.73
## 7 2908. 1.74
## 8 2842. 1.76
## 9 2777. 1.78
## 10 2714. 1.79
## # ... with 90 more rows
The values under the (Intercept) and peri columns represent the x- and y-values, respectively, of the path of the ellipse. Note that ellipse() is another example of a generic function. When passing a linear model object (an object of class lm) to ellipse(), it calls the more specific ellipse.lm().
Now that we have the path of the ellipse, we also need to obtain the point \((\widehat{\beta}_{0},\, \widehat{\beta}_{1})\) and the individual 95% confidence intervals for \(\beta_{0}\) and \(\beta_{1}\). We can obtain these by passing our linear model to tidy() (which dispatches to tidy.lm()) and specifying conf.int=TRUE to obtain both the coefficient table and the corresponding 95% confidence intervals.
coef_table <- slr_rock %>%
tidy(conf.int=TRUE)
coef_table
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3052. 477. 6.40 7.26e- 8 2092. 4012.
## 2 peri 1.54 0.157 9.81 7.51e-13 1.23 1.86
Before we can begin plotting, we have to do some data reshaping. Recall that when plotting with ggplot(), we need all of our x-values in one column and all of our y-values in another column.
We need a data frame where the estimate of \(\beta_{0}\) is in one column and the estimate of \(\beta_{1}\) is in another column.
coef_point <- coef_table %>%
pivot_wider(id_cols=c(term, estimate), names_from=term, values_from=estimate)
coef_point
## # A tibble: 1 x 2
## `(Intercept)` peri
## <dbl> <dbl>
## 1 3052. 1.54
We need a data frame where the lower and upper bounds of the confidence interval for \(\beta_{0}\) are in a single column.
beta0 <- coef_table %>%
slice(1) %>%
pivot_longer(cols=contains("conf"))
beta0
## # A tibble: 2 x 7
## term estimate std.error statistic p.value name value
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 (Intercept) 3052. 477. 6.40 0.0000000726 conf.low 2092.
## 2 (Intercept) 3052. 477. 6.40 0.0000000726 conf.high 4012.
We need a data frame where the lower and upper bounds of the confidence interval for \(\beta_{1}\) are in a single column.
beta1 <- coef_table %>%
slice(2) %>%
pivot_longer(cols=contains("conf"))
beta1
## # A tibble: 2 x 7
## term estimate std.error statistic p.value name value
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 peri 1.54 0.157 9.81 7.51e-13 conf.low 1.23
## 2 peri 1.54 0.157 9.81 7.51e-13 conf.high 1.86
Now that our data is "in shape", we can start building the plot. This plot will be constructed by taking advantage of the fact that every geom_*() layer has its own data argument. This means that we can use a different data set for each layer of our plot. Since we don't have a common data set or common aesthetics that will be shared by all layers, we will not supply anything to the call to ggplot(). Try running the code below, layer by layer, to see how the plot is being constructed!
ggplot()+
geom_vline(data=beta0, aes(xintercept=value), colour="#3366FF", lty=2)+
geom_hline(data=beta1, aes(yintercept=value), colour="#3366FF", lty=2)+
geom_point(data=coef_point, aes(x=`(Intercept)`, y=peri), colour="red")+
geom_path(data=ellipse_path, aes(x=`(Intercept)`, y=peri))+
labs(
x="Intercept", y="Slope",
caption="Solid line represents the boundary of the 90% confidence region. Dashed lines represent
the boundaries of the 95% confidence intervals."
)
Note that points contained within the ellipse are values \((\beta_{0},\,\beta_{1})\) that the data suggest are jointly reasonable for the parameters. Meanwhile, within the rectangular region are plausible values for the parameters when considered individually.