class: language-r layout: true --- # Lab 6 — November 8 ```r library(tidyverse) ``` <PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Attaching packages</span> --------------------------------------- tidyverse 1.3.1 -- </CODE></PRE><PRE class="fansi fansi-message"><CODE>## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>ggplot2</span> 3.3.6 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>purrr </span> 0.3.4 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tibble </span> 3.1.6 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>dplyr </span> 1.0.9 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tidyr </span> 1.2.0 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>stringr</span> 1.4.0 ## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>readr </span> 2.1.2 <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>forcats</span> 0.5.1 </CODE></PRE><PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Conflicts</span> ------------------------------------------ tidyverse_conflicts() -- ## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>filter()</span> masks <span style='color: #0000BB;'>stats</span>::filter() ## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>lag()</span> masks <span style='color: #0000BB;'>stats</span>::lag() </CODE></PRE> ```r library(broom) theme_set(theme_bw()) ``` --- # Diamonds data set ```r data(diamonds) diamonds ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 53,940 x 10</span> ## carat cut color clarity depth table price x y z ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## <span style='color: #555555;'> 2</span> 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## <span style='color: #555555;'> 3</span> 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## <span style='color: #555555;'> 4</span> 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## <span style='color: #555555;'> 5</span> 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## <span style='color: #555555;'> 6</span> 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## <span style='color: #555555;'> 7</span> 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## <span style='color: #555555;'> 8</span> 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## <span style='color: #555555;'> 9</span> 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## <span style='color: #555555;'>10</span> 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## <span style='color: #555555;'># ... with 53,930 more rows</span> </CODE></PRE> --- # Create a variable for the volume ```r diamonds <- diamonds %>% mutate(volume = x * y * z) diamonds ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 53,940 x 11</span> ## carat cut color clarity depth table price x y z volume ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 38.2 ## <span style='color: #555555;'> 2</span> 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 34.5 ## <span style='color: #555555;'> 3</span> 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 38.1 ## <span style='color: #555555;'> 4</span> 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 46.7 ## <span style='color: #555555;'> 5</span> 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 51.9 ## <span style='color: #555555;'> 6</span> 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 38.7 ## <span style='color: #555555;'> 7</span> 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 38.8 ## <span style='color: #555555;'> 8</span> 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 42.3 ## <span style='color: #555555;'> 9</span> 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 36.4 ## <span style='color: #555555;'>10</span> 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 38.7 ## <span style='color: #555555;'># ... with 53,930 more rows</span> </CODE></PRE> --- # Remove zero-volume diamonds ```r diamonds <- diamonds %>% filter(volume > 0) ``` --- # One diamond has a suspiciously high volume There is one observation with an extremely large volume. ```r diamonds %>% arrange(desc(volume)) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 53,920 x 11</span> ## carat cut color clarity depth table price x y z volume ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> 2 Premium H SI2 58.9 57 <span style='text-decoration: underline;'>12</span>210 8.09 58.9 8.06 <span style='text-decoration: underline;'>3</span>841. ## <span style='color: #555555;'> 2</span> 0.51 Very Good E VS1 61.8 54.7 <span style='text-decoration: underline;'>1</span>970 5.12 5.15 31.8 839. ## <span style='color: #555555;'> 3</span> 0.51 Ideal E VS1 61.8 55 <span style='text-decoration: underline;'>2</span>075 5.15 31.8 5.12 839. ## <span style='color: #555555;'> 4</span> 5.01 Fair J I1 65.5 59 <span style='text-decoration: underline;'>18</span>018 10.7 10.5 6.98 790. ## <span style='color: #555555;'> 5</span> 4.5 Fair J I1 65.8 58 <span style='text-decoration: underline;'>18</span>531 10.2 10.2 6.72 698. ## <span style='color: #555555;'> 6</span> 4.13 Fair H I1 64.8 61 <span style='text-decoration: underline;'>17</span>329 10 9.85 6.43 633. ## <span style='color: #555555;'> 7</span> 4.01 Premium I I1 61 61 <span style='text-decoration: underline;'>15</span>223 10.1 10.1 6.17 632. ## <span style='color: #555555;'> 8</span> 4 Very Good I I1 63.3 58 <span style='text-decoration: underline;'>15</span>984 10.0 9.94 6.31 628. ## <span style='color: #555555;'> 9</span> 4.01 Premium J I1 62.5 62 <span style='text-decoration: underline;'>15</span>223 10.0 9.94 6.24 621. ## <span style='color: #555555;'>10</span> 3.67 Premium I I1 62.4 56 <span style='text-decoration: underline;'>16</span>193 9.86 9.81 6.13 593. ## <span style='color: #555555;'># ... with 53,910 more rows</span> </CODE></PRE> --- # One diamond has a suspiciously high volume There is one observation with an extremely large volume. We can extract this row using: ```r diamonds %>% slice_max(volume) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 11</span> ## carat cut color clarity depth table price x y z volume ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> 2 Premium H SI2 58.9 57 <span style='text-decoration: underline;'>12</span>210 8.09 58.9 8.06 <span style='text-decoration: underline;'>3</span>841. </CODE></PRE> --- # One diamond has a suspiciously high volume We can extract the volume value using: ```r diamonds %>% slice_max(volume) %>% pull(volume) ``` ``` ## [1] 3840.598 ``` We don't actually need the exact value to remove it from the data set. We can just use an arbitrary cutoff such as 3000. --- # Remove this diamond from the data To remove this observation, we can use: ```r diamonds <- diamonds %>% filter(volume < 3000) ``` Check our work: ```r diamonds %>% arrange(desc(volume)) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 53,919 x 11</span> ## carat cut color clarity depth table price x y z volume ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><ord></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'> 1</span> 0.51 Very Good E VS1 61.8 54.7 <span style='text-decoration: underline;'>1</span>970 5.12 5.15 31.8 839. ## <span style='color: #555555;'> 2</span> 0.51 Ideal E VS1 61.8 55 <span style='text-decoration: underline;'>2</span>075 5.15 31.8 5.12 839. ## <span style='color: #555555;'> 3</span> 5.01 Fair J I1 65.5 59 <span style='text-decoration: underline;'>18</span>018 10.7 10.5 6.98 790. ## <span style='color: #555555;'> 4</span> 4.5 Fair J I1 65.8 58 <span style='text-decoration: underline;'>18</span>531 10.2 10.2 6.72 698. ## <span style='color: #555555;'> 5</span> 4.13 Fair H I1 64.8 61 <span style='text-decoration: underline;'>17</span>329 10 9.85 6.43 633. ## <span style='color: #555555;'> 6</span> 4.01 Premium I I1 61 61 <span style='text-decoration: underline;'>15</span>223 10.1 10.1 6.17 632. ## <span style='color: #555555;'> 7</span> 4 Very Good I I1 63.3 58 <span style='text-decoration: underline;'>15</span>984 10.0 9.94 6.31 628. ## <span style='color: #555555;'> 8</span> 4.01 Premium J I1 62.5 62 <span style='text-decoration: underline;'>15</span>223 10.0 9.94 6.24 621. ## <span style='color: #555555;'> 9</span> 3.67 Premium I I1 62.4 56 <span style='text-decoration: underline;'>16</span>193 9.86 9.81 6.13 593. ## <span style='color: #555555;'>10</span> 3.65 Fair H I1 67.1 53 <span style='text-decoration: underline;'>11</span>668 9.53 9.48 6.38 576. ## <span style='color: #555555;'># ... with 53,909 more rows</span> </CODE></PRE> --- # Make a scatterplot .pull-left[ We wish to make a linear model to predict carat using volume. Let us check that there exists a linear relationship between the two variables. ```r ggplot(diamonds, aes(x=volume, y=carat))+ geom_point() ``` There are actually .red[**two**] points with identical carat and volume values where the volume is abnormally large for such low carat values. (See output from previous slide). For now, we will ignore these points. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ] --- # Fit the model ```r carat_lm <- lm(carat ~ volume, data=diamonds) summary(carat_lm) ``` ``` ## ## Call: ## lm(formula = carat ~ volume, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.6600 -0.0081 -0.0016 0.0056 1.0073 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -3.324e-03 3.037e-04 -10.95 <2e-16 *** ## volume 6.170e-03 2.015e-06 3062.32 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.03582 on 53917 degrees of freedom ## Multiple R-squared: 0.9943, Adjusted R-squared: 0.9943 ## F-statistic: 9.378e+06 on 1 and 53917 DF, p-value: < 2.2e-16 ``` The equation of the fitted line is: `$$\widehat{Y}_{i} \,=\, -3.324*10^{-3} \,+\, 6.170*10^{-3}X_{i}$$` --- # Create 80% confidence intervals for the parameter estimates Produce 80% interval estimates for the intercept and effect of 1 unit change in diamond volume, i.e. confidence intervals for `\(\beta_{0}\)` and `\(\beta_{1}\)`. ```r confint(carat_lm, level=0.80) ``` ``` ## 10 % 90 % ## (Intercept) -0.003712794 -0.002934451 ## volume 0.006167081 0.006172245 ``` --- # Hypothesis testing Test the one sided null hypothesis that the slope is less than 0.005. `$$H_{0}: \beta_{1} \leq 0.005, \quad H_{A}: \beta_{1} > 0.005$$` The value of the test statistic is computed as: `$$T \,=\, \frac{6.170*10^{-3} \,-\, 5*10^{-3}}{2.015*10^{-6}}$$` ```r tstat <- carat_lm %>% tidy() %>% filter(term == "volume") %>% summarise(tstat = (estimate - 0.005) / (std.error)) %>% pull(tstat) tstat ``` ``` ## [1] 580.5642 ``` ```r tstat_df <- glance(carat_lm) %>% pull(df.residual) tstat_df ``` ``` ## [1] 53917 ``` --- # Hypothesis testing Since this is an upper-tailed test, the `\(p\)`-value is the area to the right of `tstat`. ```r pt(tstat, df=tstat_df, lower.tail=FALSE) ``` ``` ## [1] 0 ``` Since the `\(p\)`-value is less than `\(\alpha = 0.05\)`, we reject the null hypothesis and conclude that the true value of the slope parameter is greater than 0.005. --- # Test for model usefulness Test the null hypothesis that the regression model in (b) is not relevant. `$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$` For the one parameter case, this test can be carried out via `\(t\)`-test. ```r tidy(carat_lm) %>% filter(term == "volume") ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 5</span> ## term estimate std.error statistic p.value ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> volume 0.006<span style='text-decoration: underline;'>17</span> 0.000<span style='text-decoration: underline;'>002</span>01 <span style='text-decoration: underline;'>3</span>062. 0 </CODE></PRE> The `\(p\)`-value of this test is zero. Since the `\(p\)`-value is less than `\(\alpha=0.05\)`, we can reject the null hypotheis and conclude that our model is useful. --- # Test for model usefulness Test the null hypothesis that the regression model in (b) is not relevant. `$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$` For one or parameters, this test can be carried out via `\(F\)`-test. ```r anova(carat_lm) %>% tidy() ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 2 x 6</span> ## term df sumsq meansq statistic p.value ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> volume 1 <span style='text-decoration: underline;'>12</span>033. <span style='text-decoration: underline;'>12</span>033. 9<span style='text-decoration: underline;'>377</span>824. 0 ## <span style='color: #555555;'>2</span> Residuals <span style='text-decoration: underline;'>53</span>917 69.2 0.001<span style='text-decoration: underline;'>28</span> <span style='color: #BB0000;'>NA</span> <span style='color: #BB0000;'>NA</span> </CODE></PRE> The `\(p\)`-value is zero so we can once again reject the null hypothesis and conclude that our model is useful. --- # Test for model usefulness Test the null hypothesis that the regression model in (b) is not relevant. `$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$` We could have actually read the results of the `\(F\)`-test from the model summary. ```r ## Call: ## lm(formula = carat ~ volume, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.6600 -0.0081 -0.0016 0.0056 1.0073 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -3.324e-03 3.037e-04 -10.95 <2e-16 *** ## volume 6.170e-03 2.015e-06 3062.32 <2e-16 *** ## --- ## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ## ## Residual standard error: 0.03582 on 53917 degrees of freedom ## Multiple R-squared: 0.9943, Adjusted R-squared: 0.9943 *## F-statistic: 9.378e+06 on 1 and 53917 DF, p-value: < 2.2e-16 ``` --- # Test for model usefulness Test the null hypothesis that the regression model in (b) is not relevant. `$$H_{0}: \beta_{1} \,=\, 0, \quad H_{A}: \beta_{1} \neq 0$$` Or using `broom::glance()`: ```r glance(carat_lm) %>% select(statistic, df, df.residual, p.value) %>% rename( fvalue = statistic, df1 = df, df2 = df.residual ) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 1 x 4</span> ## fvalue df1 df2 p.value ## <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><dbl></span> <span style='color: #555555; font-style: italic;'><int></span> <span style='color: #555555; font-style: italic;'><dbl></span> ## <span style='color: #555555;'>1</span> 9<span style='text-decoration: underline;'>377</span>824. 1 <span style='text-decoration: underline;'>53</span>917 0 </CODE></PRE>