Stat 4604: Lab 3

Adam Shen

October 19, 2022

Slide navigation

  • Type ? to bring up the presentation menu

  • Press o for the presentation overview (and jump to a specific slide)

  • To export slides to PDF, press e to activate PDF export mode, then print to PDF (note that some features may not function properly when exported to a PDF)

  • To copy a code block, hover over the top-right corner of the code block and click the clipboard icon

Today's packages

We will be using the Tidyverse extensively today since we are working with data frames.

library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.8     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

As usual, packages can be installed using install.packages("package-name"); installation only needs to be done once per device.

theme_set(theme_bw())

We will be plotting with the ggplot2 package later. theme_set(theme_bw()) sets the default theme (here, the black-and-white theme) for all subsequent plots.

Read in the data

migraine <- read_csv("./data/Pdata.csv")

migraine
# A tibble: 50 x 3
       N   Trt   sBMI
   <dbl> <dbl>  <dbl>
 1     6     0  0.266
 2    16     0 -0.925
 3    11     0 -0.230
 4     1     0 -1.99 
 5    33     0  0.580
 6    14     0  1.29 
 7     9     0 -0.513
 8    13     0 -0.271
 9     6     0 -1.25 
10    20     0 -0.167
# ... with 40 more rows

Note: although the lab instructions say to extract the counts as N and the covariates as X from the data set, this is not necessary (and in my opinion, not recommended).

Fit a basic GLM

model <- glm(N ~ ., family="poisson", data=migraine)

  • In the formula N ~ .:

    • The N on the left hand side means that we want N to be the dependent variable

    • The . on the right hand side means that we want all other variables appearing in the data set (other than N) to be used as covariates

    • The usage of the . in formulas is especially useful when our model uses all variables found in a data set, saving us from needing to type out the names of all the variables individually
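
For instance, with the migraine data the two calls below fit the same model (the object names model_dot and model_full are just for this comparison):

model_dot  <- glm(N ~ .,          family="poisson", data=migraine)
model_full <- glm(N ~ Trt + sBMI, family="poisson", data=migraine)

all.equal(coef(model_dot), coef(model_full))  # should be TRUE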

Inspect the GLM

summary(model)

Call:
glm(formula = N ~ ., family = "poisson", data = migraine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.634  -1.574  -0.333   0.758   4.400  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.54493    0.05627  45.224  < 2e-16 ***
Trt         -0.52712    0.09102  -5.791    7e-09 ***
sBMI         0.12749    0.04278   2.980  0.00288 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 247.99  on 49  degrees of freedom
Residual deviance: 204.65  on 47  degrees of freedom
AIC: 407.16

Number of Fisher Scoring iterations: 5
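
Since the model uses a log link, exponentiated coefficients act multiplicatively on the mean count. For example, the estimated multiplicative effect of treatment:

exp(coef(model)[["Trt"]])

Here this is roughly 0.59: the treatment group's expected migraine count is about 59% of the control group's, holding sBMI fixed.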

The parametric bootstrap

  • The parametric bootstrap is a procedure that begins with simulating the observation of "new" data by sampling from some distribution where initial parameter estimates are taken to be the truth
  • This newly simulated data is then used to obtain new parameter estimates
  • Once we have resimulated our data and estimated our parameter of interest a sufficient number of times, we can aggregate all the parameter estimates and perform inference (a small toy example follows below)
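
As a toy illustration (using a made-up Poisson sample rather than the lab data), we could bootstrap the estimate of a single Poisson rate:

set.seed(1)
x <- rpois(30, lambda=4)      # pretend this is our observed data
lambda_hat <- mean(x)         # MLE of the rate, treated as the truth

# Resimulate and re-estimate 1000 times
boot_rates <- replicate(1000, mean(rpois(length(x), lambda=lambda_hat)))

quantile(boot_rates, probs=c(0.025, 0.975))  # percentile interval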

Task: create a function that performs the parametric bootstrap for the coefficients of a Poisson GLM and returns the parameter estimates from each re-fitting.

Procedure outline

  1. Initialize a variable that will store our parameter estimates. We do not need to store any of the simulated data or the re-fitted models!

  2. Create a simulated data set using parameter estimates of the original model.

  3. Fit a new model using the simulated data.

  4. Extract parameter estimates from the model and store them for later.

Repeat steps 2 to 4 a pre-specified number of times.

Checklist for building our function

Our function should have arguments for:

  • The data set that needs to be resampled

    • We don't want the function to depend on the existence of a data set in our global environment called migraine. This will allow us to reuse this function in the future with different data sets, if needed
  • The original model that we created

    • We will take advantage of the update function that exists in base-R

    • This function will allow us to "swap" the data set used in the original model for our simulated data, and the model will automatically be re-fitted (see the short illustration after this list)

  • The number of times we wish to repeat the procedure
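
As a quick illustration of update (re-using the model and data from earlier), re-fitting on the same data reproduces the original coefficients:

refit <- update(model, data=migraine)  # same data, so the same fit
all.equal(coef(refit), coef(model))    # should be TRUE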

Sampling from the Poisson distribution

  • We make the assumption that

N_{i} \,\sim\, \text{Pois}(\lambda=\mu_{i}),

where

\mu_{i} \,=\, \exp{\lbrace\beta_{0} \,+\, \beta_{1}x_{i1} \,+\, \beta_{2}x_{i2}\rbrace}

  • To get the estimate, \widehat{\mu}_{i}, for each set of covariate values, we will need to use
predict(model, type="response")

Sampling from the Poisson distribution

  • If instead, we used:
predict(model)

this would make predictions on the scale of the linear predictor, i.e.

\log{(\widehat{\mu}_{i})} \,=\, \widehat{\beta}_{0} \,+\, \widehat{\beta}_{1}x_{i1} \,+\, \widehat{\beta}_{2}x_{i2}
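
We can check this relationship directly: exponentiating the link-scale predictions should recover the response-scale predictions.

all.equal(exp(predict(model)), predict(model, type="response"))  # should be TRUE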

  • For a data set consisting of 50 observations, we will

    • Estimate \mu_{i} using each set of covariate values

    • Pass this value of \widehat{\mu}_{i} to the lambda argument of the rpois function and generate a single observation

    • Use these newly generated values as our new vector of N_{i}s
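
Putting these steps together once (outside of any function), a single simulated vector of counts could be generated like this:

mu_hat <- predict(model, type="response")         # one estimated mean per observation
N_new <- rpois(n=length(mu_hat), lambda=mu_hat)   # one new count per observation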

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
}

We start by creating a matrix with n rows, and columns according to the number of parameters in the original model. This matrix has no row names. The column names will be the names of the covariates in our original model. We will coerce this matrix to a tibble at the end.
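
For reference, coef(model) on the original fit is a named vector, and its names become the column names of the matrix:

names(coef(model))
[1] "(Intercept)" "Trt"         "sBMI"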

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
}

We create the variable mu_hat to store the predicted values of \mu_{i}. We do this outside of the loop that we will eventually use because these values stay constant and do not need to be recalculated repeatedly.

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
  
  for (iter in 1:n) {
  
  }
}

We will use a for loop since we know exactly how many times to repeat the procedure.

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
  
  for (iter in 1:n) {
    simulated_data <- original_data %>%
      mutate(N = rpois(nrow(original_data), lambda=mu_hat))
  }
}

Our simulated data are based on the original data. We pass the original data to mutate to overwrite the existing values of N with our newly generated values of N.

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
  
  for (iter in 1:n) {
    simulated_data <- original_data %>%
      mutate(N = rpois(nrow(original_data), lambda=mu_hat))
    
    refitted_model <- original_model %>%
      update(data=simulated_data)
  }
}

We pass our original model to the update function and specify data=simulated_data. This means that we want to update the fit of our original model by updating the data used to fit the model.

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
  
  for (iter in 1:n) {
    simulated_data <- original_data %>%
      mutate(N = rpois(nrow(original_data), lambda=mu_hat))
    
    refitted_model <- original_model %>%
      update(data=simulated_data)
    
    out[iter, ] <- coef(refitted_model)
  }
}

We store the coefficients of the re-fitted model in the iterth row of the matrix that holds our parameter estimates.

Build our function

para_bootstrap_poisson <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  mu_hat <- predict(original_model, type="response")
  
  for (iter in 1:n) {
    simulated_data <- original_data %>%
      mutate(N = rpois(nrow(original_data), lambda=mu_hat))
    
    refitted_model <- original_model %>%
      update(data=simulated_data)
    
    out[iter, ] <- coef(refitted_model)
  }
  
  as_tibble(out)
}

We coerce our results into a tibble before returning them. Since as_tibble(out) is the last expression in the function body, its value is returned automatically.

Test our function

set.seed(99)

para_boot_coefs <- para_bootstrap_poisson(migraine, model, n=5000)

para_boot_coefs
# A tibble: 5,000 x 3
   `(Intercept)`    Trt   sBMI
           <dbl>  <dbl>  <dbl>
 1          2.50 -0.478 0.170 
 2          2.51 -0.419 0.0961
 3          2.64 -0.746 0.126 
 4          2.63 -0.564 0.160 
 5          2.55 -0.725 0.0564
 6          2.56 -0.540 0.0817
 7          2.62 -0.484 0.0844
 8          2.53 -0.365 0.122 
 9          2.43 -0.307 0.120 
10          2.54 -0.574 0.183 
# ... with 4,990 more rows
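
From here we can aggregate the estimates however we like. For example, the column-wise standard deviations give bootstrap standard errors for each coefficient (a quick dplyr sketch):

para_boot_coefs %>%
  summarise(across(everything(), sd))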

Histogram for bootstrapped \widehat{\beta}_{1}

ggplot(para_boot_coefs, aes(x=Trt)) +
  geom_histogram(
    aes(y=after_stat(density)), 
    colour="#56B4E9", fill="#0072B2", alpha=0.7
  )

Overlay a fitted normal density

mean_para_boot_beta1 <- mean(para_boot_coefs$Trt)
sd_para_boot_beta1 <- sd(para_boot_coefs$Trt)

ggplot(para_boot_coefs, aes(x=Trt)) +
  geom_histogram(
    aes(y=after_stat(density)),
    colour="#56B4E9", fill="#0072B2", alpha=0.7
  ) +
  geom_function(
    fun=dnorm, args=list(mean=mean_para_boot_beta1, sd=sd_para_boot_beta1),
    colour="#D55E00", size=1.2
  )

Compute confidence intervals for \beta_{1}

Using a 95% confidence level.

list(
  boot_norm = tibble(
    lwr = mean_para_boot_beta1 - qnorm(0.975) * sd_para_boot_beta1,
    upr = mean_para_boot_beta1 + qnorm(0.975) * sd_para_boot_beta1
  ),
  
  boot_perc = tibble(
    lwr = quantile(para_boot_coefs$Trt, 0.025),
    upr = quantile(para_boot_coefs$Trt, 0.975)
  ),
  
  wald = model %>%
    confint.default() %>%
    as_tibble() %>%
    slice(2) %>%
    rename(lwr=1, upr=2)
)
$boot_norm
# A tibble: 1 x 2
     lwr    upr
   <dbl>  <dbl>
1 -0.708 -0.346

$boot_perc
# A tibble: 1 x 2
     lwr    upr
   <dbl>  <dbl>
1 -0.709 -0.344

$wald
# A tibble: 1 x 2
     lwr    upr
   <dbl>  <dbl>
1 -0.706 -0.349
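
All three intervals are very close and all exclude zero, which is consistent with the highly significant Trt coefficient from the model summary.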