Press o for the presentation overview (and jump to a specific slide)
To export slides to PDF, press e to activate PDF export mode, then print to PDF (note that some features may not function properly when exported to a PDF)
To copy a code block, hover over the top-right corner of the code block and click the clipboard icon
Today's packages
We will be using the Tidyverse extensively today since we are working with data frames.
As usual, packages can be installed using install.packages("package-name") and only needs to be done once per device.
theme_set(theme_bw())
We will be plotting with the ggplot2 package later. theme_set(theme_bw()) sets a plotting theme.
Read in the data
migraine <-read_csv("./data/Pdata.csv")migraine
# A tibble: 50 x 3
N Trt sBMI
<dbl><dbl><dbl> 1 6 0 0.266
2 16 0 -0.925 3 11 0 -0.230 4 1 0 -1.99 5 33 0 0.580
6 14 0 1.29
7 9 0 -0.513 8 13 0 -0.271 9 6 0 -1.2510 20 0 -0.167# ... with 40 more rows
Note: although the lab instructions say to extract the counts as N and the covariates as X from the data set, this is not necessary (and in my opinion, not recommended).
Fit a basic GLM
model <-glm(N ~ ., family="poisson", data=migraine)
In the formula N ~ .:
The N on the left hand side means that we want N to be the dependent variable
The . on the right hand side means that we want all other variables appearing in the data set (other than N) to be used as covariates
The usage of the . in formulas is especially useful when our model uses all variables found in a data set, saving us from needing to type out the names of all the variables individually
Inspect the GLM
summary(model)
Call:
glm(formula = N ~ ., family = "poisson", data = migraine)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.634 -1.574 -0.333 0.758 4.400
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.54493 0.05627 45.224 < 2e-16 ***
Trt -0.52712 0.09102 -5.791 7e-09 ***
sBMI 0.12749 0.04278 2.980 0.00288 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 247.99 on 49 degrees of freedom
Residual deviance: 204.65 on 47 degrees of freedom
AIC: 407.16
Number of Fisher Scoring iterations: 5
The parametric bootstrap
The parametric bootstrap is a procedure that begins with simulating the observation of "new" data by sampling from some distribution where initial parameter estimates are taken to be the truth
This newly simulated data is then used to obtain new parameter estimates
Once we have resimulated our data and estimated our parameter of interest a sufficient number of times, we can aggregate all the parameter estimates and perform inference
Task: create a function that performs the parametric bootstrap for the coefficients of a Poisson GLM and returns the parameter estimates from each re-fitting.
Procedure outline
Initialize a variable that will store our parameter estimates. We do not need to store any of the simulated data, or the re-fitted models!
Create a simulated data set using parameter estimates of the original model.
Fit a new model using the simulated data.
Extract parameter estimates from the model and store them for later.
Repeat steps 1 to 3 a pre-specified number of times.
Checklist for building our function
Our function should have arguments for:
The data set that needs to be resampled
We don't want the function to depend on the existence of a data set in our global environment called migraine. This will allow us to reuse this function in the future with different data sets, if needed
The original model that we created
We will take advantage of the update function that exists in base-R
This function will allow us to "swap" the data set used in the original model with our simulated data and the model will automatically get re-fitted
The number of times we wish to repeat the procedure
For a data set consisting of 50 observations, we will
Estimate \mu_{i} using each set of covariate values
Pass this value of \widehat{\mu}_{i} to the lambda argument of the rpois function and generate a single observation
Use these newly generated values as our new vector of N_{i}s
Build our function
para_bootstrap_poisson <-function(original_data, original_model, n) { out <-matrix(nrow=n, ncol=length(coef(original_model)),dimnames=list(NULL, names(coef(original_model))) )}
We start by creating a matrix with n rows, and columns according to the number of parameters in the original model. This matrix has no row names. The column names will be the names of the covariates in our original model. We will coerce this matrix to a tibble at the end.
We create the variable mu_hat to store the predicted values of \mu_{i}. We do this outside of the loop that we will eventually use because these values stay constant and do not need to recalculated repeatedly.
Our simulated data will be based off of the original data. We pass it to mutate to overwite the existing values of N with our newly generated values of N.
We pass our original model to the update function and specify data=simulated_data. This means that we want to update the fit of our original model by updating the data used to fit the model.