Stat 4604: Lab 4

Adam Shen

November 2, 2022

Slide navigation

  • Type ? to bring up the presentation menu

  • Press o for the presentation overview (and jump to a specific slide)

  • To export slides to PDF, press e to activate PDF export mode, then print to PDF (note that some features may not function properly when exported to a PDF)

  • To copy a code block, hover over the top-right corner of the code block and click the clipboard icon

Today's packages

We will be using the Tidyverse extensively today since we are working with data frames.

library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.8     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

As usual, packages can be installed using install.packages("package-name"); installation only needs to be done once per device.

theme_set(theme_bw())

We will be plotting with the ggplot2 package later. theme_set(theme_bw()) sets a plotting theme.

Read in the data

migraine <- read_csv("./data/Pdata.csv")

migraine
# A tibble: 50 x 3
       N   Trt   sBMI
   <dbl> <dbl>  <dbl>
 1     6     0  0.266
 2    16     0 -0.925
 3    11     0 -0.230
 4     1     0 -1.99 
 5    33     0  0.580
 6    14     0  1.29 
 7     9     0 -0.513
 8    13     0 -0.271
 9     6     0 -1.25 
10    20     0 -0.167
# ... with 40 more rows

Note: although the lab instructions say to extract the counts as N and the covariates as X from the data set, this is not necessary (and in my opinion, not recommended).

Fit a basic GLM

model <- glm(N ~ ., family="poisson", data=migraine)
  • In the formula N ~ .:

    • The N on the left hand side means that we want N to be the dependent variable

    • The . on the right hand side means that we want all other variables appearing in the data set (other than N) to be used as covariates

    • The usage of the . in formulas is especially useful when our model uses all variables found in a data set, saving us from needing to type out the names of all the variables individually
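
Since migraine contains only N, Trt, and sBMI, this call is equivalent to spelling the covariates out explicitly:

model <- glm(N ~ Trt + sBMI, family="poisson", data=migraine)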

Inspect the GLM

summary(model)

Call:
glm(formula = N ~ ., family = "poisson", data = migraine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.634  -1.574  -0.333   0.758   4.400  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.54493    0.05627  45.224  < 2e-16 ***
Trt         -0.52712    0.09102  -5.791    7e-09 ***
sBMI         0.12749    0.04278   2.980  0.00288 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 247.99  on 49  degrees of freedom
Residual deviance: 204.65  on 47  degrees of freedom
AIC: 407.16

Number of Fisher Scoring iterations: 5

Non-parametric bootstrap

  • The non-parametric bootstrap is a procedure that begins with simulating the observation of "new" data by sampling the current data with replacement
  • This new resampled data is then used to fit a new model in order to obtain new parameter estimates
  • Once we have resampled our data and fit new models a sufficient number of times, we can aggregate all the parameter estimates and perform inference
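
Before building the full function, here is the core idea on a toy vector (arbitrary values, for illustration only): sampling with replacement may pick some observations several times and miss others entirely.

x <- c(10, 20, 30, 40, 50)
sample(x, size=length(x), replace=TRUE)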

Task: create a function that performs the non-parametric bootstrap and returns the parameter estimates from each re-fitting.

✔️ yourself before you 💥 yourself

  • We should make sure that our existing and new variable names are clear

    • We will need to compare our bootstrap results with the original model so we don't want to get them mixed up

    • We want to make sure that we don't end up in a situation where we resample from a recently resampled data set

  • Since one observation corresponds to one row in our data set, we need to make sure that we are resampling entire rows from the data set and that we are not resampling the response values independently of the covariate values

    • In other words, there should not be any "shuffling" of response values among covariate values
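
As a small sketch of the distinction on a toy data frame (illustration only):

toy <- tibble(y=c(1, 2, 3), x=c("a", "b", "c"))

# Correct: whole rows are resampled, so each y stays with its own x
slice_sample(toy, n=nrow(toy), replace=TRUE)

# Incorrect: y and x are resampled independently, shuffling the pairs
tibble(y=sample(toy$y, replace=TRUE), x=sample(toy$x, replace=TRUE))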

Procedure outline

  1. Initialize a variable that will store our parameter estimates. We do not need to store the resampled data or the re-fitted models!

  2. Create a resample of the original data set.

  3. Fit a new model using the resampled data.

  4. Extract parameter estimates from the model and store them for later.

  5. Repeat steps 2 to 4 a pre-specified number of times.

Checklist for building our function

Our function should have arguments for:

  • The data set that needs to be resampled

    • We don't want the function to depend on the existence of a data set called migraine in our global environment. This will allow us to reuse the function with different data sets in the future, if needed

  • The original model that we created

    • We will take advantage of the update function from base R

    • This function will allow us to "swap" the data set used in the original model with our resampled data and the model will automatically get re-fitted

  • The number of times we wish to repeat the procedure

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
}

We start by creating a matrix with n rows, and columns according to the number of parameters in the original model. This matrix has no row names. The column names will be the names of the covariates in our original model. We will coerce this matrix to a tibble at the end.
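
To see the shape of this storage container, here is a standalone example with n=2 and our model's three coefficient names:

matrix(
  nrow=2, ncol=3,
  dimnames=list(NULL, c("(Intercept)", "Trt", "sBMI"))
)

Unfilled matrix entries default to NA until we overwrite them inside the loop.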

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  for (iter in 1:n) {
  
  }
}

We will use a for loop since we know exactly how many times to repeat the procedure.

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  for (iter in 1:n) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
  }
}

We resample from original_data using the slice_sample function (from dplyr), which returns a random sample of the rows of a data set. We specify:

  • n=nrow(original_data) since we want our resampled data to have the same number of rows as the original data

  • replace=TRUE because we want to sample with replacement
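
For reference, an equivalent resampling step in base R (without dplyr) indexes the rows directly:

resampled_data <- original_data[sample(nrow(original_data), replace=TRUE), ]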

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  for (iter in 1:n) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    refitted_model <- original_model %>%
      update(data=resampled_data)
  }
}

We pass our original model to the update function and specify data=resampled_data. This re-fits the original model using the resampled data in place of the data it was originally fitted with.
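
As a quick standalone demonstration of the mechanics (not part of the bootstrap itself), update can swap any data set into the original fit, e.g. a subset of the original rows:

update(model, data=head(migraine, 25))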

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  for (iter in 1:n) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    refitted_model <- original_model %>%
      update(data=resampled_data)
    
    out[iter, ] <- coef(refitted_model)
  }
}

We update the iterth row of the matrix that stores our parameter estimates.

Build our function

bootstrap_glm <- function(original_data, original_model, n) {
  out <- matrix(
    nrow=n, ncol=length(coef(original_model)),
    dimnames=list(NULL, names(coef(original_model)))
  )
  
  for (iter in 1:n) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    refitted_model <- original_model %>%
      update(data=resampled_data)
    
    out[iter, ] <- coef(refitted_model)
  }
  
  as_tibble(out)
}

We coerce our results into a nice tibble before returning it. Note that this function is general enough that, in addition to models of class glm, it also supports models of class lm.
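
For instance, a quick sketch bootstrapping a linear model (reusing the migraine data purely for illustration):

lm_fit <- lm(N ~ ., data=migraine)
bootstrap_glm(migraine, lm_fit, n=10)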

Test our function

set.seed(99)

boot_coefs <- bootstrap_glm(migraine, model, n=5000)

boot_coefs
# A tibble: 5,000 x 3
   `(Intercept)`    Trt   sBMI
           <dbl>  <dbl>  <dbl>
 1          2.23 -0.205 0.219 
 2          2.66 -0.703 0.184 
 3          2.83 -0.885 0.116 
 4          2.41 -0.292 0.202 
 5          2.38 -0.356 0.184 
 6          2.65 -0.586 0.0927
 7          2.61 -0.551 0.114 
 8          2.64 -0.752 0.146 
 9          2.59 -1.03  0.211 
10          2.56 -0.624 0.202 
# ... with 4,990 more rows

Histogram for bootstrapped \widehat{\beta}_{1}

ggplot(boot_coefs, aes(x=Trt)) +
  geom_histogram(
    aes(y=after_stat(density)), 
    colour="#56B4E9", fill="#0072B2", alpha=0.7
  )

Question: Is B = 5000 sufficient?

Overlay a predicted density

mean_boot_beta1 <- mean(boot_coefs$Trt)
sd_boot_beta1 <- sd(boot_coefs$Trt)

ggplot(boot_coefs, aes(x=Trt)) +
  geom_histogram(
    aes(y=after_stat(density)),
    colour="#56B4E9", fill="#0072B2", alpha=0.7
  ) +
  geom_function(
    fun=dnorm, args=list(mean=mean_boot_beta1, sd=sd_boot_beta1),
    colour="#D55E00", size=1.2
  )

Reload results from the parametric bootstrap

para_boot_coefs <- read_rds("./data/para_boot_coefs.rds")

para_boot_coefs
# A tibble: 5,000 x 3
   `(Intercept)`    Trt   sBMI
           <dbl>  <dbl>  <dbl>
 1          2.50 -0.478 0.170 
 2          2.51 -0.419 0.0961
 3          2.64 -0.746 0.126 
 4          2.63 -0.564 0.160 
 5          2.55 -0.725 0.0564
 6          2.56 -0.540 0.0817
 7          2.62 -0.484 0.0844
 8          2.53 -0.365 0.122 
 9          2.43 -0.307 0.120 
10          2.54 -0.574 0.183 
# ... with 4,990 more rows

Bind the data sets together with an identifier

boot_results <- bind_rows(
  parametric = para_boot_coefs,
  nonparametric = boot_coefs,
  .id = "method"
)

slice(boot_results, 1:5)
# A tibble: 5 x 4
  method     `(Intercept)`    Trt   sBMI
  <chr>              <dbl>  <dbl>  <dbl>
1 parametric          2.50 -0.478 0.170 
2 parametric          2.51 -0.419 0.0961
3 parametric          2.64 -0.746 0.126 
4 parametric          2.63 -0.564 0.160 
5 parametric          2.55 -0.725 0.0564
slice(boot_results, 5001:5006)
# A tibble: 6 x 4
  method        `(Intercept)`    Trt   sBMI
  <chr>                 <dbl>  <dbl>  <dbl>
1 nonparametric          2.23 -0.205 0.219 
2 nonparametric          2.66 -0.703 0.184 
3 nonparametric          2.83 -0.885 0.116 
4 nonparametric          2.41 -0.292 0.202 
5 nonparametric          2.38 -0.356 0.184 
6 nonparametric          2.65 -0.586 0.0927

Compare results

ggplot(boot_results, aes(x=Trt, y=after_stat(density), colour=method, fill=method)) +
  geom_histogram(alpha=0.7) +
  scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
  scale_fill_manual(values = c("#0072B2", "#D55E00"))

Compare results

What did we notice?

  • For both methods, the centres of the distributions are very similar.

  • The spread (variance) of the non-parametric distribution is much larger than that of the parametric distribution.

Bootstrap-t CI

(Also known as the studentized bootstrap interval)

See Week 6 lecture notes for steps.

Alternatively, see page 229 of Statistical Computing with R for steps that (in my opinion) are a bit clearer.
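
In brief, writing \widehat{\theta} for the original estimate: compute z^{(b)} = (\widehat{\theta}^{(b)} - \widehat{\theta}) / \widehat{\text{se}}(\widehat{\theta}^{(b)}) for each bootstrap sample, take the \frac{\alpha}{2} and 1 - \frac{\alpha}{2} sample quantiles q_{L} and q_{U} of the z^{(b)}s, and report the interval \left(\widehat{\theta} - q_{U}\,\widehat{\text{se}}_{B},\; \widehat{\theta} - q_{L}\,\widehat{\text{se}}_{B}\right). This is exactly what the function we build below implements.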

Things are going to get messy...

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {

}
  • We will need our original data for resampling and our original model to use as a template.

  • B is the number of bootstrap resamples, R is the number of bootstrap-within-bootstrap resamples (used to calculate the bootstrap standard error of the bth bootstrap estimate of \widehat{\beta}_{1})

  • level=0.95: by default, the function will compute a 95% CI

  • term=2: when we call coef for any model, \widehat{\beta}_{1} is the second term

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
}

Initialising our storage vectors:

  • boot_estimates will keep track of the \widehat{\beta}_{1}s from each refitted model.

  • boot_se will keep track of the bootstrap standard error estimates for the bth bootstrap sample.

  • For efficient memory use, we initialise boot_estimates and boot_se as vectors of zeroes with pre-defined length B. As we progress through our main loop, we will replace these placeholder zeroes with the required values.

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
  }
}
  • We create a resampled data set as usual.

  • This time, we are only after a particular term of the model, rather than all of the terms. After the model is refitted, it is passed to coef to obtain the coefficient vector, and then to pluck to extract the coefficient in position term (here, the second).
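
Since coef returns a plain named vector, pluck is doing simple positional extraction here. For example, with the original model and term=2:

model %>%
  coef() %>%
  pluck(2)   # equivalent to coef(model)[[2]]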

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {

    }
  }
}
  • We initialise boot_vals to keep track of the estimates obtained from the bootstrap-within-bootstrap procedure.

  • This will be used later to calculate the bootstrap standard error for the bth bootstrap sample.

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {
      re_resampled_data <- resampled_data %>%
        slice_sample(n=nrow(original_data), replace=TRUE)
      
      boot_vals[r] <- original_model %>%
        update(data=re_resampled_data) %>%
        coef() %>%
        pluck(term)
    }
  }
}
  • We resample from the resampled data.

  • We then fit a new model on the re-resampled data and extract the value of \widehat{\beta}_{1}.

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {
      re_resampled_data <- resampled_data %>%
        slice_sample(n=nrow(original_data), replace=TRUE)
      
      boot_vals[r] <- original_model %>%
        update(data=re_resampled_data) %>%
        coef() %>%
        pluck(term)
    }
    
    boot_se[b] <- sd(boot_vals)
  }
}
  • Once we have exited the bootstrap-within-bootstrap loop, we can update the bootstrap standard error of the bth resample.

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {
      re_resampled_data <- resampled_data %>%
        slice_sample(n=nrow(original_data), replace=TRUE)
      
      boot_vals[r] <- original_model %>%
        update(data=re_resampled_data) %>%
        coef() %>%
        pluck(term)
    }
    
    boot_se[b] <- sd(boot_vals)
  }
  
  original_estimate <- original_model %>%
    coef() %>%
    pluck(term)
  
  z <- (boot_estimates - original_estimate) / boot_se
}
  • original_estimate is the observed estimate, \widehat{\theta}. In this case, we are extracting the value of \widehat{\beta}_{1} from our original model.

  • We compute all of the required z^{(b)} values by taking advantage of the fact that operations such as subtraction and division are vectorized.
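
A tiny illustration of this vectorization, using made-up numbers:

(c(2.1, 2.6, 2.4) - 2.5) / c(0.1, 0.2, 0.1)
[1] -4.0  0.5 -1.0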

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {
      re_resampled_data <- resampled_data %>%
        slice_sample(n=nrow(original_data), replace=TRUE)
      
      boot_vals[r] <- original_model %>%
        update(data=re_resampled_data) %>%
        coef() %>%
        pluck(term)
    }
    
    boot_se[b] <- sd(boot_vals)
  }
  
  original_estimate <- original_model %>%
    coef() %>%
    pluck(term)
  
  z <- (boot_estimates - original_estimate) / boot_se

  alpha <- 1 - level
  q <- quantile(z, probs=c(alpha/2, 1-alpha/2), names=FALSE)
}
  • We now calculate the \frac{\alpha}{2} and 1 - \frac{\alpha}{2} sample quantiles of the z^{(b)} values, giving us q_{L} and q_{U}, respectively.

Build a function for bootstrap-t CI

boot_t_ci_glm <- function(original_data, original_model, B=500, R=100, level=0.95, term=2) {
  boot_estimates <- numeric(B)
  boot_se <- numeric(B)
  
  for (b in 1:B) {
    resampled_data <- original_data %>%
      slice_sample(n=nrow(original_data), replace=TRUE)
    
    boot_estimates[b] <- original_model %>%
      update(data=resampled_data) %>%
      coef() %>%
      pluck(term)
    
    boot_vals <- numeric(R)
    
    for (r in 1:R) {
      re_resampled_data <- resampled_data %>%
        slice_sample(n=nrow(original_data), replace=TRUE)
      
      boot_vals[r] <- original_model %>%
        update(data=re_resampled_data) %>%
        coef() %>%
        pluck(term)
    }
    
    boot_se[b] <- sd(boot_vals)
  }
  
  original_estimate <- original_model %>%
    coef() %>%
    pluck(term)
  
  z <- (boot_estimates - original_estimate) / boot_se

  alpha <- 1 - level
  q <- quantile(z, probs=c(alpha/2, 1-alpha/2), names=FALSE)
  
  se_boot_estimates <- sd(boot_estimates)
  
  (original_estimate - q * se_boot_estimates) %>%
    rev() %>%
    set_names(c("lwr" ,"upr"))
}
  • se_boot_estimates is the standard error of all the bootstrapped estimates of \beta_{1} (this is the quantity denoted \widehat{\text{se}}_{B} in the Week 6 notes).

  • Once again, we take advantage of the vectorization of operations such as subtraction and multiplication.

  • These quantities are passed to rev (reverse) to reverse the order. Recall that the lower bound of the interval is computed using the upper quantile, and the upper bound is computed using the lower quantile!

  • Finally, these values are passed to set_names to create a named vector, as is the convention with any confidence intervals in R.
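
As a standalone illustration of these last two steps (placeholder numbers, not real bounds):

c(0.5, -1.2) %>%   # in (upper, lower) order, as produced above
  rev() %>%
  set_names(c("lwr", "upr"))
 lwr  upr 
-1.2  0.5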

Calculate a bootstrap-t CI

Unsurprisingly, this will take a while to run. Adjust the values of B and R as needed. I will keep the defaults of B=500 and R=100.

set.seed(120)

boot_t_ci_glm(migraine, model)
       lwr        upr 
-0.8988935 -0.1350479