Press o for the presentation overview (and jump to a specific slide)
To export slides to PDF, press e to activate PDF export mode, then print to PDF (note that some features may not function properly when exported to a PDF)
To copy a code block, hover over the top-right corner of the code block and click the clipboard icon
Today's packages
We will be using the Tidyverse extensively today since we are working with data frames.
As usual, a package can be installed using install.packages("package-name"); this only needs to be done once per device.
theme_set(theme_bw())
We will be plotting with the ggplot2 package later. theme_set(theme_bw()) sets a plotting theme.
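Putting the setup together, a minimal sketch (the tidyverse metapackage loads ggplot2 among other packages):

```r
# Install once per device (uncomment if the package is missing):
# install.packages("tidyverse")

# Load for the current session:
library(tidyverse)

# Use a clean black-and-white default theme for all ggplot2 plots:
theme_set(theme_bw())
```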
Read in the data
migraine <- read_csv("./data/Pdata.csv")
migraine
# A tibble: 50 x 3
       N   Trt   sBMI
   <dbl> <dbl>  <dbl>
 1     6     0  0.266
 2    16     0 -0.925
 3    11     0 -0.230
 4     1     0 -1.99
 5    33     0  0.580
 6    14     0  1.29
 7     9     0 -0.513
 8    13     0 -0.271
 9     6     0 -1.25
10    20     0 -0.167
# ... with 40 more rows
Note: although the lab instructions say to extract the counts as N and the covariates as X from the data set, this is not necessary (and in my opinion, not recommended).
Fit a basic GLM
model <- glm(N ~ ., family = "poisson", data = migraine)
In the formula N ~ .:
The N on the left hand side means that we want N to be the dependent variable
The . on the right hand side means that we want all other variables appearing in the data set (other than N) to be used as covariates
The usage of the . in formulas is especially useful when our model uses all variables found in a data set, saving us from needing to type out the names of all the variables individually
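As a quick sanity check that N ~ . and the fully spelled-out formula agree (using a small toy data frame in place of the migraine data):

```r
toy <- data.frame(
  N    = c(6, 16, 11, 1, 33, 14),
  Trt  = c(0, 0, 0, 1, 1, 1),
  sBMI = c(0.27, -0.93, -0.23, -1.99, 0.58, 1.29)
)

# `.` expands to every column other than the response:
model_dot  <- glm(N ~ .,          family = "poisson", data = toy)
model_full <- glm(N ~ Trt + sBMI, family = "poisson", data = toy)

all.equal(coef(model_dot), coef(model_full))  # the two fits agree
```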
Inspect the GLM
summary(model)
Call:
glm(formula = N ~ ., family = "poisson", data = migraine)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.634 -1.574 -0.333 0.758 4.400
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.54493 0.05627 45.224 < 2e-16 ***
Trt -0.52712 0.09102 -5.791 7e-09 ***
sBMI 0.12749 0.04278 2.980 0.00288 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 247.99 on 49 degrees of freedom
Residual deviance: 204.65 on 47 degrees of freedom
AIC: 407.16
Number of Fisher Scoring iterations: 5
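Since the Poisson GLM uses a log link, exponentiating a coefficient converts it into a multiplicative effect on the expected count. For example, using the Trt estimate from the summary above:

```r
# Rate ratio for treatment (estimate taken from the summary output):
trt_rate_ratio <- exp(-0.52712)
round(trt_rate_ratio, 3)
# Treated patients are expected to have roughly 0.59 times as many
# migraines as untreated patients, holding sBMI fixed.
```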
Non-parametric bootstrap
The non-parametric bootstrap is a procedure that begins with simulating the observation of "new" data by sampling the current data with replacement
This new resampled data is then used to fit a new model in order to obtain new parameter estimates
Once we have resampled our data and fit new models a sufficient number of times, we can aggregate all the parameter estimates and perform inference
Task: create a function that performs the non-parametric bootstrap and returns the parameter estimates from each re-fitting.
✔️ yourself before you 💥 yourself
We should make sure that our existing and new variable names are clear
We will need to compare our bootstrap results with the original model so we don't want to get them mixed up
We want to make sure that we don't end up in a situation where we resample from an already-resampled data set
Since one observation corresponds to one row in our data set, we need to make sure that we are resampling entire rows from the data set and that we are not resampling the response values independently of the covariate values
In other words, there should not be any "shuffling" of response values among covariate values
Procedure outline
Initialize a variable that will store our parameter estimates. We do not need to store the resampled data, or the re-fitted models!
Create a resample of the original data set.
Fit a new model using the resampled data.
Extract parameter estimates from the model and store them for later.
Repeat the resampling, refitting, and extraction steps a pre-specified number of times.
Checklist for building our function
Our function should have arguments for:
The data set that needs to be resampled
We don't want the function to depend on the existence of a data set in our global environment called migraine. This will allow us to reuse this function in the future with different data sets, if needed
The original model that we created
We will take advantage of the update function that exists in base-R
This function will allow us to "swap" the data set used in the original model with our resampled data and the model will automatically get re-fitted
The number of times we wish to repeat the procedure
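A minimal illustration of update on toy data (the data frame and column names here are made up for the example):

```r
set.seed(123)
toy <- data.frame(N = rpois(10, 5), x = rep(0:1, 5))
m1 <- glm(N ~ x, family = "poisson", data = toy)

# Resample rows with replacement, then "swap" the data and refit:
resampled_data <- toy[sample(nrow(toy), replace = TRUE), ]
m2 <- update(m1, data = resampled_data)
coef(m2)  # same terms as m1, estimated from the resampled rows
```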
Build our function
bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow = n,
    ncol = length(coef(original_model)),
    dimnames = list(NULL, names(coef(original_model)))
  )
}
We start by creating a matrix with n rows, and columns according to the number of parameters in the original model. This matrix has no row names. The column names will be the names of the covariates in our original model. We will coerce this matrix to a tibble at the end.
We pass our original model to the update function and specify data=resampled_data. This means that we want to update the fit of our original model by updating the data used to fit the model.
We coerce our results into a nice tibble before returning it. Note that this function is general enough that, in addition to glm objects, it also supports models of class lm.
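Putting the pieces above together, one possible completion looks like this (a sketch; slice_sample from dplyr is one convenient way to resample rows with replacement):

```r
library(dplyr)
library(tibble)

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow = n,
    ncol = length(coef(original_model)),
    dimnames = list(NULL, names(coef(original_model)))
  )
  for (i in seq_len(n)) {
    # Resample entire rows (with replacement) so responses stay paired
    # with their covariates:
    resampled_data <- slice_sample(original_data, n = nrow(original_data),
                                   replace = TRUE)
    # "Swap" in the resampled data and refit the original model:
    refit <- update(original_model, data = resampled_data)
    out[i, ] <- coef(refit)
  }
  as_tibble(out)
}
```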
We will need our original data for resampling and our original model to use as a template.
B is the number of bootstrap resamples, R is the number of bootstrap-within-bootstrap resamples (used to calculate the bootstrap standard error of the bth bootstrap estimate of \widehat{\beta}_{1})
level=0.95: by default, the function will compute a 95% CI
term=2: when we call coef for any model, \widehat{\beta}_{1} is the second term
boot_estimates will keep track of the \widehat{\beta}_{1}s from each refitted model.
boot_se will keep track of the bootstrap standard error estimates for the bth bootstrap sample.
For effective memory consumption, we initialize boot_estimates and boot_se as vectors of zeroes with pre-defined length B. As we progress through our main loop, we will replace these placeholder zeroes with the required values.
This time, we are only after a particular term of the model, rather than all the terms. After the model is refitted, it is passed to coef to obtain the coefficients, and then to pluck to extract the desired term (in this case, the second coefficient).
original_estimate is the observed estimate, \widehat{\theta}. In this case, we are extracting the value of \widehat{\beta}_{1} from our original model.
We compute all of the required z^{(b)} values by taking advantage of the fact that operations such as subtraction and division are vectorized.
se_boot_estimates is the standard error of all the bootstrapped estimates of \beta_{1} (this is the quantity denoted \widehat{\text{se}}_{B} in the Week 6 notes).
Once again, we take advantage of the vectorization of operations such as subtraction and multiplication.
These quantities are passed to rev (reverse) to reverse the order. Recall that the lower bound of the interval is computed using the upper quantile, and the upper bound is computed using the lower quantile!
Finally, these values are passed to set_names to create a named vector, as is the convention with any confidence intervals in R.
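Assembled from the pieces described above, a sketch of the whole bootstrap-t procedure (function and argument names follow the slides; slice_sample for row resampling is an assumption):

```r
library(dplyr)
library(purrr)

bootstrap_t_ci <- function(original_data, original_model, B, R,
                           level = 0.95, term = 2) {
  # Observed estimate (here, beta_1-hat) from the original model:
  original_estimate <- pluck(coef(original_model), term)

  # Pre-allocate for effective memory consumption:
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)

  for (b in seq_len(B)) {
    resampled <- slice_sample(original_data, n = nrow(original_data),
                              replace = TRUE)
    refit <- update(original_model, data = resampled)
    boot_estimates[b] <- pluck(coef(refit), term)

    # Inner bootstrap: SE of the bth bootstrap estimate.
    inner <- numeric(R)
    for (r in seq_len(R)) {
      inner_data <- slice_sample(resampled, n = nrow(resampled),
                                 replace = TRUE)
      inner[r] <- pluck(coef(update(refit, data = inner_data)), term)
    }
    boot_se[b] <- sd(inner)
  }

  # Studentized statistics (vectorized subtraction and division):
  z <- (boot_estimates - original_estimate) / boot_se
  se_boot_estimates <- sd(boot_estimates)

  alpha <- 1 - level
  # The lower bound uses the *upper* quantile of z, and vice versa:
  q <- quantile(z, c(1 - alpha / 2, alpha / 2), names = FALSE)
  set_names(original_estimate - q * se_boot_estimates,
            c("lower", "upper"))
}
```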
Calculate a bootstrap-t CI
Unsurprisingly, this will take a while to run. Adjust the values of B and R as needed. I will keep the defaults of B=500 and R=100.