Lab 6, Stat 3503 Fall 2021

class: language-r
layout: true

---

# Lab 6 &mdash; November 8

```r
library(tidyverse)
```

<PRE class="fansi fansi-message"><CODE>## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
</CODE></PRE><PRE class="fansi fansi-message"><CODE>## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
</CODE></PRE><PRE class="fansi fansi-message"><CODE>## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
</CODE></PRE>

```r
library(broom)
theme_set(theme_bw())
```

---

# Diamonds data set

```r
data(diamonds)

diamonds
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows
</CODE></PRE>

---

# Create a variable for the volume

```r
diamonds <- diamonds %>%
 mutate(volume = x * y * z)

diamonds
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 53,940 x 11
##    carat cut       color clarity depth table price     x     y     z volume
##    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43   38.2
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31   34.5
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31   38.1
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63   46.7
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75   51.9
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48   38.7
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47   38.8
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53   42.3
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49   36.4
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39   38.7
## # ... with 53,930 more rows
</CODE></PRE>

---

# Remove zero-volume diamonds

```r
diamonds <- diamonds %>%
 filter(volume > 0)
```
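
To see how many rows this filter drops, one can count the zero-volume rows first. The sketch below starts from a fresh copy of `ggplot2::diamonds`; the names `diamonds_raw` and `zero_counts` are illustrative and not part of the lab.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Fresh copy with the volume column, before any filtering
diamonds_raw <- ggplot2::diamonds %>%
  mutate(volume = x * y * z)

# A zero volume means at least one of x, y, z was recorded as 0
zero_counts <- diamonds_raw %>%
  summarise(
    n_total = n(),
    n_zero  = sum(volume == 0)
  )

zero_counts
```

The difference between the row counts printed before and after the filter (53,940 and 53,920) should match `n_zero`.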

---

# One diamond has a suspiciously high volume

There is one observation with an extremely large volume.

```r
diamonds %>%
  arrange(desc(volume))
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 53,920 x 11
##    carat cut       color clarity depth table price     x     y     z volume
##    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
##  1  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06  3841.
##  2  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8    839.
##  3  0.51 Ideal     E     VS1      61.8  55    2075  5.15 31.8   5.12   839.
##  4  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98   790.
##  5  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72   698.
##  6  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43   633.
##  7  4.01 Premium   I     I1       61    61   15223 10.1  10.1   6.17   632.
##  8  4    Very Good I     I1       63.3  58   15984 10.0   9.94  6.31   628.
##  9  4.01 Premium   J     I1       62.5  62   15223 10.0   9.94  6.24   621.
## 10  3.67 Premium   I     I1       62.4  56   16193  9.86  9.81  6.13   593.
## # ... with 53,910 more rows
</CODE></PRE>

---

# One diamond has a suspiciously high volume

There is one observation with an extremely large volume.

We can extract this row using:

```r
diamonds %>%
  slice_max(volume)
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 1 x 11
##   carat cut     color clarity depth table price     x     y     z volume
##   &lt;dbl&gt; &lt;ord&gt;   &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
## 1     2 Premium H     SI2      58.9    57 12210  8.09  58.9  8.06  3841.
</CODE></PRE>

---

# One diamond has a suspiciously high volume

We can extract the volume value using:

```r
diamonds %>%
  slice_max(volume) %>%
  pull(volume)
```

```
## [1] 3840.598
```

We do not need the exact value to remove this observation. Any cutoff between the two largest
volumes (about 839 and 3841) works, for example 3000.
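
As an alternative to a hand-picked cutoff, the row attaining the maximum volume can be dropped directly. This is only a sketch of a different approach, not the method used in the lab; `diamonds_vol` and `diamonds_trimmed` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the data up to this point: add volume, drop zero-volume rows
diamonds_vol <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0)

# Drop whichever row(s) attain the maximum volume, without hard-coding a cutoff
diamonds_trimmed <- diamonds_vol %>%
  filter(volume < max(volume))

# Compare row counts and inspect the new largest volume
c(before = nrow(diamonds_vol), after = nrow(diamonds_trimmed))
max(diamonds_trimmed$volume)
```

Note that this version removes every row tied for the maximum, so it is worth checking the row counts afterwards.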

---

# Remove this diamond from the data

To remove this observation, we can use:

```r
diamonds <- diamonds %>%
 filter(volume < 3000)
```

Check our work:

```r
diamonds %>%
  arrange(desc(volume))
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 53,919 x 11
##    carat cut       color clarity depth table price     x     y     z volume
##    &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;ord&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
##  1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8    839.
##  2  0.51 Ideal     E     VS1      61.8  55    2075  5.15 31.8   5.12   839.
##  3  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98   790.
##  4  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72   698.
##  5  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43   633.
##  6  4.01 Premium   I     I1       61    61   15223 10.1  10.1   6.17   632.
##  7  4    Very Good I     I1       63.3  58   15984 10.0   9.94  6.31   628.
##  8  4.01 Premium   J     I1       62.5  62   15223 10.0   9.94  6.24   621.
##  9  3.67 Premium   I     I1       62.4  56   16193  9.86  9.81  6.13   593.
## 10  3.65 Fair      H     I1       67.1  53   11668  9.53  9.48  6.38   576.
## # ... with 53,909 more rows
</CODE></PRE>

---

# Make a scatterplot

.pull-left[

We wish to fit a linear model that predicts carat from volume. First, let us check that the
relationship between the two variables is roughly linear.

```r
ggplot(diamonds, aes(x=volume, y=carat))+
  geom_point()
```

There are actually .red[**two**] points with the same carat (0.51) and volume (about 839), a volume
that is abnormally large for such a low carat value (see the output on the previous slide).

For now, we will ignore these points.

]

.pull-right[

]
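
The two unusual points mentioned on this slide can be pulled out with a filter. The thresholds below (`carat < 1`, `volume > 500`) are assumptions read off the sorted output, not values given in the lab, and `diamonds_clean` and `suspect` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data used for the scatterplot
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)

# Low-carat diamonds with implausibly large volume (thresholds are a guess)
suspect <- diamonds_clean %>%
  filter(carat < 1, volume > 500) %>%
  select(carat, x, y, z, volume)

suspect
```

In both rows a single dimension is recorded as 31.8, which points to a data-entry problem rather than a genuinely large stone.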

---

# Fit the model

```r
carat_lm <- lm(carat ~ volume, data=diamonds)

summary(carat_lm)
```

```
## 
## Call:
## lm(formula = carat ~ volume, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6600 -0.0081 -0.0016  0.0056  1.0073 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.324e-03  3.037e-04  -10.95   <2e-16 ***
## volume       6.170e-03  2.015e-06 3062.32   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03582 on 53917 degrees of freedom
## Multiple R-squared: 0.9943,	Adjusted R-squared: 0.9943 
## F-statistic: 9.378e+06 on 1 and 53917 DF, p-value: < 2.2e-16
```

The equation of the fitted line is:

`$$\widehat{Y}_{i} \,=\, -3.324 \times 10^{-3} \,+\, 6.170 \times 10^{-3}\,X_{i}$$`
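
As a quick check of the fitted equation, the sketch below predicts the carat of a hypothetical diamond with a volume of 100, once by plugging into the equation and once with `predict()`. The volume of 100 and the names `diamonds_clean`, `by_hand`, and `by_predict` are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# Hypothetical new diamond (volume chosen only for illustration)
new_diamond <- data.frame(volume = 100)

# Intercept + slope * volume, using the unrounded coefficients
b <- coef(carat_lm)
by_hand <- unname(b["(Intercept)"] + b["volume"] * new_diamond$volume)

# The same prediction via predict()
by_predict <- unname(predict(carat_lm, newdata = new_diamond))

c(by_hand = by_hand, by_predict = by_predict)
```

With the rounded coefficients shown above, the hand calculation is roughly `-0.003324 + 0.006170 * 100`, or about 0.61 carat.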

---

# Create 80% confidence intervals for the parameter estimates

Produce 80% interval estimates for the intercept and for the effect of a 1-unit change in diamond
volume, i.e. confidence intervals for `$\beta_{0}$` and `$\beta_{1}$`.

```r
confint(carat_lm, level=0.80)
```

```
##                     10 %         90 %
## (Intercept) -0.003712794 -0.002934451
## volume       0.006167081  0.006172245
```
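
The same intervals can be reproduced by hand as estimate ± critical value × standard error, where an 80% interval uses the 0.90 quantile of the `$t$` distribution. This is a sketch for checking the `confint()` output; `diamonds_clean`, `t_crit`, and `manual_ci` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(carat_lm))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# 80% two-sided interval leaves 10% in each tail
t_crit <- qt(0.90, df = df.residual(carat_lm))

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
```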

---

# Hypothesis testing

Test the one-sided null hypothesis that the slope is less than or equal to 0.005.

`$$H_{0}: \beta_{1} \leq 0.005, \quad H_{A}: \beta_{1} > 0.005$$`

The value of the test statistic is computed as:

`$$T \,=\, \frac{6.170 \times 10^{-3} \,-\, 5 \times 10^{-3}}{2.015 \times 10^{-6}}$$`

```r
tstat <- carat_lm %>%
 tidy() %>%
 filter(term == "volume") %>%
 summarise(tstat = (estimate - 0.005) / (std.error)) %>%
 pull(tstat)

tstat
```

```
## [1] 580.5642
```

```r
tstat_df <- glance(carat_lm) %>%
 pull(df.residual)

tstat_df
```

```
## [1] 53917
```
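
An equivalent way to reach a decision is to compare the test statistic with the upper-tail critical value. The sketch below recomputes the statistic with base R and assumes `$\alpha = 0.05$`; `diamonds_clean`, `slope`, and `t_crit` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# Slope row of the coefficient table as a named vector
slope <- coef(summary(carat_lm))["volume", ]

# Test statistic for H0: beta1 <= 0.005
tstat <- unname((slope["Estimate"] - 0.005) / slope["Std. Error"])

# Upper-tail critical value at alpha = 0.05
t_crit <- qt(0.95, df = df.residual(carat_lm))

# Reject H0 when the statistic exceeds the critical value
c(tstat = tstat, t_crit = t_crit, reject = tstat > t_crit)
```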

---

# Hypothesis testing

Since this is an upper-tailed test, the `$p$`-value is the area to the right of `tstat`.

```r
pt(tstat, df=tstat_df, lower.tail=FALSE)
```

```
## [1] 0
```

R reports the `$p$`-value as 0 because it is too small to represent in floating point. Since the
`$p$`-value is less than `$\alpha = 0.05$`, we reject the null hypothesis and conclude that the true
value of the slope parameter is greater than 0.005.
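
When a `$p$`-value is this small, one option is to ask `pt()` for it on the log scale with `log.p = TRUE`. This is a sketch of that idea rather than part of the lab; `diamonds_clean` and `log_p` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# Test statistic for H0: beta1 <= 0.005
slope <- coef(summary(carat_lm))["volume", ]
tstat <- unname((slope["Estimate"] - 0.005) / slope["Std. Error"])

# Natural log of the upper-tail p-value
log_p <- pt(tstat, df = df.residual(carat_lm), lower.tail = FALSE, log.p = TRUE)

# Compare on the log scale with log(alpha)
c(log_p = log_p, log_alpha = log(0.05))
```

The decision is unchanged: the log `$p$`-value is compared with `log(0.05)` instead of comparing the `$p$`-value with 0.05.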

---

# Test for model usefulness

Test the null hypothesis that the regression model in (b) is not relevant.

`$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$`

In the one-parameter case, this test can be carried out with a `$t$`-test.

```r
tidy(carat_lm) %>%
  filter(term == "volume")
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 1 x 5
##   term   estimate  std.error statistic p.value
##   &lt;chr&gt;     &lt;dbl&gt;      &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;
## 1 volume  0.00617 0.00000201     3062.       0
</CODE></PRE>

The `$p$`-value of this test is reported as zero. Since the `$p$`-value is less than `$\alpha=0.05$`,
we can reject the null hypothesis and conclude that our model is useful.

---

# Test for model usefulness

Test the null hypothesis that the regression model in (b) is not relevant.

`$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$`

For one or more parameters, this test can be carried out with an `$F$`-test.

```r
anova(carat_lm) %>%
  tidy()
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 2 x 6
##   term         df   sumsq      meansq statistic p.value
##   &lt;chr&gt;     &lt;int&gt;   &lt;dbl&gt;       &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;
## 1 volume        1 12033.  12033.       9377824.       0
## 2 Residuals 53917    69.2     0.00128       NA       NA
</CODE></PRE>

The `$p$`-value is zero so we can once again reject the null hypothesis and conclude that our model is
useful.
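
With a single predictor, the `$F$` statistic equals the square of the slope's `$t$` statistic, which is why the two tests agree. The sketch below checks this numerically; `diamonds_clean`, `t_value`, and `f_value` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set
library(broom)

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# t statistic for the slope
t_value <- tidy(carat_lm) %>%
  filter(term == "volume") %>%
  pull(statistic) %>%
  unname()

# Overall F statistic for the model
f_value <- glance(carat_lm) %>%
  pull(statistic) %>%
  unname()

c(t_squared = t_value^2, f_value = f_value)
```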

---

# Test for model usefulness

Test the null hypothesis that the regression model in (b) is not relevant.

`$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$`

We could also have read the result of the `$F$`-test directly from the last line of the model summary.

```r
## Call:
## lm(formula = carat ~ volume, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6600 -0.0081 -0.0016  0.0056  1.0073 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.324e-03  3.037e-04  -10.95   <2e-16 ***
## volume       6.170e-03  2.015e-06 3062.32   <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 0.03582 on 53917 degrees of freedom
## Multiple R-squared: 0.9943,	Adjusted R-squared: 0.9943 
*## F-statistic: 9.378e+06 on 1 and 53917 DF, p-value: < 2.2e-16
```

---

# Test for model usefulness

Test the null hypothesis that the regression model in (b) is not relevant.

`$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$`

Or using `broom::glance()`:

```r
glance(carat_lm) %>%
  select(statistic, df, df.residual, p.value) %>%
  rename(
    fvalue = statistic,
    df1 = df,
    df2 = df.residual
  )
```

<PRE class="fansi fansi-output"><CODE>## # A tibble: 1 x 4
##     fvalue   df1   df2 p.value
##      &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;   &lt;dbl&gt;
## 1 9377824.     1 53917       0
</CODE></PRE>
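
The `$p$`-value in this table is the upper-tail area of an `$F$` distribution with `df1` and `df2` degrees of freedom. A minimal sketch of that calculation using base R follows; `diamonds_clean`, `f_info`, and `p_value` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds data set

# Rebuild the cleaned data and refit the model from the slides
diamonds_clean <- ggplot2::diamonds %>%
  mutate(volume = x * y * z) %>%
  filter(volume > 0, volume < 3000)
carat_lm <- lm(carat ~ volume, data = diamonds_clean)

# Named vector with elements value, numdf, dendf
f_info <- summary(carat_lm)$fstatistic

# Upper-tail area of the F distribution
p_value <- pf(
  f_info[["value"]],
  df1 = f_info[["numdf"]],
  df2 = f_info[["dendf"]],
  lower.tail = FALSE
)

p_value
```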