Stat 4604: Lab 8

Adam Shen

November 30, 2022

Slide navigation

  • Type ? to bring up the presentation menu

  • Press o for the presentation overview (and jump to a specific slide)

  • To export slides to PDF, press e to activate PDF export mode, then print to PDF (note that some features may not function properly when exported to a PDF)

  • To copy a code block, hover over the top-right corner of the code block and click the clipboard icon

Expectation by simulation

Suppose we are interested in the quantity \textbf{E}(m(X)), where

m(X) \,=\, 5e^{-3(X-4)^{4}}

and X \,\sim\, t_{7}.

We want to estimate this quantity through simulation.

Task: Using various sample sizes from small to large, sample from the t_{7} distribution and calculate the mean and variance of the m(x) values.

Define m

We start by defining a function for m(X).

m <- function(x) {
  5 * exp(-3 * (x - 4)^4)
}

Note that this function is vectorized since mathematical operations are vectorized. In other words, we can pass in more than one value and the function will operate on the values element-wise. For example:

m(c(2, 4, 0, 7))
[1]  7.125820e-21  5.000000e+00  0.000000e+00 1.463561e-105

Create the function m_t

I will now create a function that I will call m_t. This function will:

  • Generate a sample from the t distribution

  • Evaluate m(x) over each value of x generated

  • Return a vector containing the mean and variance of the m(x) values

Create the function m_t

m_t <- function(sample_size, df, func) {
  
}
  • sample_size: the sample size that we generate from the t distribution. This is passed to the rt function which the function that generates deviates from the t distribution.

  • df: the degrees of freedom for the desired t distribution. This is also passed to rt.

  • func: the function that we wish to evaluate on our sample values. I have included this to keep it general, i.e. I don't want to hard code m(X) in case we decide to change the definition of m(X) in the future.

Create the function m_t

m_t <- function(sample_size, df, func) {
  m_values <- rt(sample_size, df=df) |>
    func()
}

We generate our sample from the t-distribution and then pass it to the supplied function. We store the values in a local variable called m_values.

Create the function m_t

m_t <- function(sample_size, df, func) {
  m_values <- rt(sample_size, df=df) |>
    func()
  
  c("Mean" = mean(m_values), "Variance" = var(m_values))
}

We return a named vector, constructed by supplying name-value pairs.

Test m_t

set.seed(20)
m_t(sample_size=10, df=7, func=m)
     Mean  Variance 
0.4995431 2.4954327 

Vary the sample size and repeat a few times

We wish to vary the sample size and for each value of the sample size, repeat the procedure a few times to get a feel for these means and variances.

We can do this using the replicate function. The replicate function calls an expression a specific number of times. This is a cleaner alternative to using a for loop.

Vary the sample size and repeat a few times

Sample size: 10

set.seed(20)
replicate(n=5, m_t(sample_size=10, df=7, func=m))
              [,1]          [,2]          [,3]          [,4]          [,5]
Mean     0.4995431 1.388976e-210  3.384022e-62  8.885861e-59  1.995046e-69
Variance 2.4954327  0.000000e+00 1.145161e-122 7.895853e-116 3.980210e-137

Vary the sample size and repeat a few times

Sample size: 50

set.seed(20)
replicate(n=5, m_t(sample_size=50, df=7, func=m))
               [,1]       [,2]        [,3]         [,4]        [,5]
Mean     0.09990861 0.03046291 0.005574462 6.885981e-05 0.006783616
Variance 0.49908654 0.03000931 0.001298717 9.166538e-08 0.002300861

Vary the sample size and repeat a few times

Sample size: 100

set.seed(20)
replicate(n=5, m_t(sample_size=100, df=7, func=m))
               [,1]        [,2]       [,3]         [,4]         [,5]
Mean     0.06518576 0.002821661 0.05135831 1.334204e-05 8.728140e-14
Variance 0.26309358 0.000650499 0.23090031 1.581584e-08 7.618043e-25

Vary the sample size and repeat a few times

Sample size: 1000

set.seed(20)
replicate(n=5, m_t(sample_size=1000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.01868417 0.03164596 0.03824293 0.02340458 0.03278975
Variance 0.07675311 0.14727127 0.13869585 0.09554089 0.13254573

Vary the sample size and repeat a few times

Sample size: 5000

set.seed(20)
replicate(n=5, m_t(sample_size=5000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.02895348 0.03197802 0.02902119 0.02174979 0.03463668
Variance 0.11811573 0.11977340 0.11049475 0.08076089 0.14428099

Vary the sample size and repeat a few times

Sample size: 10000

set.seed(20)
replicate(n=5, m_t(sample_size=10000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.03046575 0.02538549 0.03405653 0.02105902 0.02758749
Variance 0.11893496 0.09563148 0.13793839 0.08514177 0.10923151

Approximate the variation of the sample means

Now, we wish to approximate, by simulation, the variation in the means of the m(x) values for different sample sizes. This is similar to what we have been doing on the past few slides, but we will do it more than five times.

We will store our results in matrices with dimension B \times S, where

  • B is the number of repetitions

  • S is the sample size

Approximate the variation of the sample means

From these matrices, we can compute the mean of each sample of m(x) values (the row means), and then compute the variance of these B means.

This means we will go from a B \times S matrix to a vector of length B to a vector of length one.

B <- 2000

Approximate the variation of the sample means

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

We first initialize our vector sample sizes and the vector that will store our results.

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {

}

We will use a for loop since this will run a fixed number of times.

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7)
}

We first generate B \times S values from the t_{7} distribution. We can actually batch generate all of the sample values at once rather than generating sample by sample.

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7) |>
    m()
}

We pass these deviates from the t_{7} distribution to the m function to evaluate m(x). Once again, m is a vectorized function, and as such, will apply m to each element in the sample (which is what we want).

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7) |>
    m() |>
    matrix(nrow=B, ncol=S[i])
}

These m(x) values are then passed to matrix with rows equal to the number of samples, B, and columns equal to the number of elements in a single sample, S.

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7) |>
    m() |>
    matrix(nrow=B, ncol=S[i]) |>
    apply(MARGIN=1, FUN=mean)
}

The matrix is then passed to the apply function. The apply function applies a function to the rows of a matrix (MARGIN = 1), or the columns of a matrix (MARGIN = 2). Here, we want the row means, so we specify MARGIN = 1 and FUN = mean.

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7) |>
    m() |>
    matrix(nrow=B, ncol=S[i]) |>
    apply(MARGIN=1, FUN=mean) |>
    var()
}

Finally, these means are passed to the variance function to calculate the variance of these means.

Visualize results

plot(results ~ S, type="b", col="darkred", xlab="Sample size", ylab="Variance of means")
text(x=S, y=results, labels=as.character(S), adj=c(0.5, 0))

The variance of the means seems to begin to stabilize around S = 1000.

Importance sampling

Now, suppose that instead of the t_{7} distribution, the data came from a N(4,1) distribution. We can use importance sampling to estimate \textbf{E}(m(X)) using:

\frac{1}{S}\sum_{s=1}^{S}\frac{f(X_{s})}{h(X_{s})} \cdot m(X_{s}),

where

  • f(\cdot) is the density of the t_{7} distribution

  • h(\cdot) is the density of the N(4,1) distribution

  • m(\cdot) is the function defined in the question

Create m_norm

m_norm <- function(sample_size, df, mean, sd, func) {
  
}

This time, we accept additional arguments mean and sd which will be passed to dnorm.

Create m_norm

m_norm <- function(sample_size, df, mean, sd, func) {
  t_values <- rt(sample_size, df=df)
}

First, we generate our values from the t distribution.

Create m_norm

m_norm <- function(sample_size, df, mean, sd, func) {
  t_values <- rt(sample_size, df=df)
  
  (1 / sample_size) * sum(dt(t_values, df=df) / dnorm(t_values, mean=mean, sd=sd) * func(t_values))
}

Next, we apply the formula given earlier.

Vary the sample size and repeat a few times

Sample size: 10

set.seed(20)
replicate(n=5, m_t(sample_size=10, df=7, func=m))
              [,1]          [,2]          [,3]          [,4]          [,5]
Mean     0.4995431 1.388976e-210  3.384022e-62  8.885861e-59  1.995046e-69
Variance 2.4954327  0.000000e+00 1.145161e-122 7.895853e-116 3.980210e-137
set.seed(20)
replicate(n=5, m_norm(sample_size=10, df=7, mean=4, sd=1, func=m))
[1]  5.020335e-03 6.823781e-208  3.824304e-61  8.696391e-58  3.005171e-68

Vary the sample size and repeat a few times

Sample size: 50

set.seed(20)
replicate(n=5, m_t(sample_size=50, df=7, func=m))
               [,1]       [,2]        [,3]         [,4]        [,5]
Mean     0.09990861 0.03046291 0.005574462 6.885981e-05 0.006783616
Variance 0.49908654 0.03000931 0.001298717 9.166538e-08 0.002300861
set.seed(20)
replicate(n=5, m_norm(sample_size=50, df=7, mean=4, sd=1, func=m))
[1] 1.004067e-03 1.291549e-03 3.378956e-04 8.621854e-06 3.699058e-04

Vary the sample size and repeat a few times

Sample size: 100

set.seed(20)
replicate(n=5, m_t(sample_size=100, df=7, func=m))
               [,1]        [,2]       [,3]         [,4]         [,5]
Mean     0.06518576 0.002821661 0.05135831 1.334204e-05 8.728140e-14
Variance 0.26309358 0.000650499 0.23090031 1.581584e-08 7.618043e-25
set.seed(20)
replicate(n=5, m_norm(sample_size=100, df=7, mean=4, sd=1, func=m))
[1] 1.147808e-03 1.732587e-04 8.690805e-04 1.705837e-06 4.183762e-14

Vary the sample size and repeat a few times

Sample size: 1000

set.seed(20)
replicate(n=5, m_t(sample_size=1000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.01868417 0.03164596 0.03824293 0.02340458 0.03278975
Variance 0.07675311 0.14727127 0.13869585 0.09554089 0.13254573
set.seed(20)
replicate(n=5, m_norm(sample_size=1000, df=7, mean=4, sd=1, func=m))
[1] 0.0003328647 0.0004058555 0.0007563775 0.0003397010 0.0005798904

Vary the sample size and repeat a few times

Sample size: 5000

set.seed(20)
replicate(n=5, m_t(sample_size=5000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.02895348 0.03197802 0.02902119 0.02174979 0.03463668
Variance 0.11811573 0.11977340 0.11049475 0.08076089 0.14428099
set.seed(20)
replicate(n=5, m_norm(sample_size=5000, df=7, mean=4, sd=1, func=m))
[1] 0.0004829378 0.0005979614 0.0005594219 0.0004160453 0.0004897032

Vary the sample size and repeat a few times

Sample size: 10000

set.seed(20)
replicate(n=5, m_t(sample_size=10000, df=7, func=m))
               [,1]       [,2]       [,3]       [,4]       [,5]
Mean     0.03046575 0.02538549 0.03405653 0.02105902 0.02758749
Variance 0.11893496 0.09563148 0.13793839 0.08514177 0.10923151
set.seed(20)
replicate(n=5, m_norm(sample_size=10000, df=7, mean=4, sd=1, func=m))
[1] 0.0005404496 0.0004877336 0.0005361786 0.0003476668 0.0004580410

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results[i] <- rt(B * S[i], df=7) |>
    m() |>
    matrix(nrow=B, ncol=S[i]) |>
    apply(MARGIN=1, FUN=mean) |>
    var()
}

We proceed with an approach that is nearly identical to the previous when we only used the t_{7} distribution, only that we need to swap out the ordinary mean function for a custom function that uses the proper weights.

Custom mean function

custom_mean <- function(x, df, mean, sd, func) {
  (1 / length(x)) * sum(dt(x, df=df) / dnorm(x, mean=mean, sd=sd) * func(x))
}

Variation of sample means

S <- c(10, 50, 100, 1000, 5000, 10000)
results2 <- rep(NA_real_, length(S))

set.seed(20)

for (i in 1:length(S)) {
  results2[i] <- rt(B * S[i], df=7) |>
    m() |>
    matrix(nrow=B, ncol=S[i]) |>
    apply(MARGIN=1, FUN=custom_mean, df=7, mean=4, sd=1, func=m) |>
    var()
}

Create the new variable results2 to store our results and compare them with our earlier results in the variable results.

Visualize results

plot(results ~ S, type="b", col="darkred", xlab="Sample size", ylab="Variance of means")
text(x=S, y=results, labels=as.character(S), adj=c(0.5, 0))
points(results2 ~ S, type="b", col="royalblue")

View results

results # Using t distribution only
[1] 1.034922e-02 2.047549e-03 1.067166e-03 1.072433e-04 2.144838e-05
[6] 1.058633e-05
results2 # Using t and normal distribution via importance sampling
[1] 4.844602e-07 1.040104e-07 4.893394e-08 4.821039e-09 9.564740e-10
[6] 4.711147e-10

We can see that the variability of the means is much smaller in the importance sampling case, even for extremely small sample sizes. As such, this does improve the performance of the approximation.