November 30, 2022
Suppose we are interested in the quantity \textbf{E}(m(X)), where
m(X) \,=\, 5e^{-3(X-4)^{4}}
and X \,\sim\, t_{7}.
We want to estimate this quantity through simulation.
Task: Using various sample sizes from small to large, sample from the t_{7} distribution and calculate the mean and variance of the m(x) values.
We start by defining a function for m(X).
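This can be translated directly into R (a one-line sketch; the function name m follows the slides):

```r
# m(x) = 5 * exp(-3 * (x - 4)^4)
# Written with vectorized arithmetic, so it evaluates elementwise on a vector.
m <- function(x) {
  5 * exp(-3 * (x - 4)^4)
}
```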
I will now create a function that I will call m_t. This function will:
Generate a sample from the t distribution
Evaluate m(x) over each value of x generated
Return a vector containing the mean and variance of the m(x) values
The m_t function takes three arguments:

sample_size: the sample size that we generate from the t distribution. This is passed to the rt function, which is the function that generates deviates from the t distribution.

df: the degrees of freedom for the desired t distribution. This is also passed to rt.

func: the function that we wish to evaluate on our sample values. I have included this to keep the function general, i.e. I don't want to hard-code m(X) in case we decide to change the definition of m(X) in the future.
Inside m_t, we generate our sample from the t distribution and then pass it to the supplied function. We store the values in a local variable called m_values.
We then return a named vector, constructed by supplying name-value pairs.
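A sketch of m_t under this description (the exact body is an assumption; the argument names and the Mean/Variance element names follow the text and the printed output):

```r
m_t <- function(sample_size, df, func) {
  x <- rt(sample_size, df = df)  # deviates from the t distribution
  m_values <- func(x)            # evaluate the supplied function elementwise
  c(Mean = mean(m_values), Variance = var(m_values))  # named vector
}
```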
We wish to vary the sample size and for each value of the sample size, repeat the procedure a few times to get a feel for these means and variances.
We can do this using the replicate function, which evaluates an expression a specified number of times. This is a cleaner alternative to a for loop.
We run the procedure for sample sizes of 10, 50, 100, 1000, 5000, and 10000.
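Assuming the m and m_t functions sketched earlier, this could be produced as follows (using five repetitions per sample size is an assumption, based on the five columns in the output shown later):

```r
m <- function(x) 5 * exp(-3 * (x - 4)^4)
m_t <- function(sample_size, df, func) {
  m_values <- func(rt(sample_size, df = df))
  c(Mean = mean(m_values), Variance = var(m_values))
}

# replicate() evaluates the expression five times and column-binds the
# returned named vectors into a 2 x 5 matrix, one column per repetition.
for (S in c(10, 50, 100, 1000, 5000, 10000)) {
  cat("Sample size:", S, "\n")
  print(replicate(5, m_t(S, df = 7, func = m)))
}
```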
Now, we wish to approximate, by simulation, the variation in the means of the m(x) values for different sample sizes. This is similar to what we have been doing on the past few slides, but we will do it more than five times.
We will store our results in matrices with dimension B \times S, where
B is the number of repetitions
S is the sample size
From these matrices, we can compute the mean of each sample of m(x) values (the row means), and then compute the variance of these B means.
This means we will go from a B \times S matrix to a vector of length B to a vector of length one.
We first initialize our vector of sample sizes and the vector that will store our results. We use a for loop, since it runs a fixed number of times.
We first generate B \times S values from the t_{7} distribution. We can actually batch generate all of the sample values at once rather than generating sample by sample.
We pass these deviates from the t_{7} distribution to the m function to evaluate m(x). Since m is vectorized, it is applied to each element in the sample (which is what we want).
These m(x) values are then passed to matrix, with the number of rows equal to the number of repetitions, B, and the number of columns equal to the number of elements in a single sample, S.
The matrix is then passed to the apply function, which applies a function to the rows of a matrix (MARGIN = 1) or to its columns (MARGIN = 2). Here, we want the row means, so we specify MARGIN = 1 and FUN = mean.
Finally, these means are passed to the var function to calculate the variance of the B means.
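Put together, the loop might look like the following (B = 100 repetitions is an assumption; the deck does not state the value of B):

```r
m <- function(x) 5 * exp(-3 * (x - 4)^4)

B <- 100                                    # number of repetitions (assumed)
sizes <- c(10, 50, 100, 1000, 5000, 10000)  # sample sizes S
results <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  S <- sizes[i]
  m_values <- m(rt(B * S, df = 7))          # batch-generate all B * S deviates
  m_mat <- matrix(m_values, nrow = B, ncol = S)
  means <- apply(m_mat, MARGIN = 1, FUN = mean)  # B row means
  results[i] <- var(means)                  # variance of the B means
}
results
```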
The variance of the means seems to begin to stabilize around S = 1000.
Now, suppose that instead of the t_{7} distribution, the data came from a N(4,1) distribution. We can use importance sampling to estimate \textbf{E}(m(X)) using:
\frac{1}{S}\sum_{s=1}^{S}\frac{f(X_{s})}{h(X_{s})} \cdot m(X_{s}),
where
f(\cdot) is the density of the t_{7} distribution
h(\cdot) is the density of the N(4,1) distribution
m(\cdot) is the function defined in the question
We call our importance sampling function m_norm. This time, we accept additional arguments mean and sd, which will be passed to dnorm.
Inside m_norm, we first generate our values from the t distribution.
Next, we apply the formula given earlier.
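A sketch of m_norm under this description (the body, the default values, and the argument names other than mean and sd are assumptions; the f/h weights use dt for the t_{7} density and dnorm for the N(4,1) density):

```r
m_norm <- function(sample_size, df, func, mean = 4, sd = 1) {
  x <- rt(sample_size, df = df)                         # values from the t distribution
  w <- dt(x, df = df) / dnorm(x, mean = mean, sd = sd)  # f(x_s) / h(x_s)
  weighted <- w * func(x)                               # f(x_s)/h(x_s) * m(x_s)
  # base::mean avoids the clash with the numeric argument named `mean`
  c(Mean = base::mean(weighted), Variance = var(weighted))
}
```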
Sample size: 10
[,1] [,2] [,3] [,4] [,5]
Mean 0.4995431 1.388976e-210 3.384022e-62 8.885861e-59 1.995046e-69
Variance 2.4954327 0.000000e+00 1.145161e-122 7.895853e-116 3.980210e-137
[1] 5.020335e-03 6.823781e-208 3.824304e-61 8.696391e-58 3.005171e-68
Sample size: 50
[,1] [,2] [,3] [,4] [,5]
Mean 0.09990861 0.03046291 0.005574462 6.885981e-05 0.006783616
Variance 0.49908654 0.03000931 0.001298717 9.166538e-08 0.002300861
[1] 1.004067e-03 1.291549e-03 3.378956e-04 8.621854e-06 3.699058e-04
Sample size: 100
[,1] [,2] [,3] [,4] [,5]
Mean 0.06518576 0.002821661 0.05135831 1.334204e-05 8.728140e-14
Variance 0.26309358 0.000650499 0.23090031 1.581584e-08 7.618043e-25
[1] 1.147808e-03 1.732587e-04 8.690805e-04 1.705837e-06 4.183762e-14
Sample size: 1000
[,1] [,2] [,3] [,4] [,5]
Mean 0.01868417 0.03164596 0.03824293 0.02340458 0.03278975
Variance 0.07675311 0.14727127 0.13869585 0.09554089 0.13254573
[1] 0.0003328647 0.0004058555 0.0007563775 0.0003397010 0.0005798904
Sample size: 5000
[,1] [,2] [,3] [,4] [,5]
Mean 0.02895348 0.03197802 0.02902119 0.02174979 0.03463668
Variance 0.11811573 0.11977340 0.11049475 0.08076089 0.14428099
[1] 0.0004829378 0.0005979614 0.0005594219 0.0004160453 0.0004897032
Sample size: 10000
[,1] [,2] [,3] [,4] [,5]
Mean 0.03046575 0.02538549 0.03405653 0.02105902 0.02758749
Variance 0.11893496 0.09563148 0.13793839 0.08514177 0.10923151
[1] 0.0005404496 0.0004877336 0.0005361786 0.0003476668 0.0004580410
We proceed with an approach that is nearly identical to the previous one, which used only the t_{7} distribution, except that we swap out the ordinary mean function for a custom function that applies the proper weights. We create a new variable, results2, to store our results so we can compare them with our earlier results in results.
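That change could be sketched as follows (weighted_mean is a hypothetical helper name, and B = 100 is again an assumption):

```r
m <- function(x) 5 * exp(-3 * (x - 4)^4)

# Importance-sampling estimate for one sample: the mean of f/h * m(x)
weighted_mean <- function(x) {
  mean(dt(x, df = 7) / dnorm(x, mean = 4, sd = 1) * m(x))
}

B <- 100
sizes <- c(10, 50, 100, 1000, 5000, 10000)
results2 <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  x_mat <- matrix(rt(B * sizes[i], df = 7), nrow = B)
  results2[i] <- var(apply(x_mat, MARGIN = 1, FUN = weighted_mean))
}
results2
```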
[1] 1.034922e-02 2.047549e-03 1.067166e-03 1.072433e-04 2.144838e-05
[6] 1.058633e-05
[1] 4.844602e-07 1.040104e-07 4.893394e-08 4.821039e-09 9.564740e-10
[6] 4.711147e-10
We can see that the variability of the means is much smaller in the importance sampling case, even for extremely small sample sizes. As such, this does improve the performance of the approximation.