class: language-r
layout: true

---

# Lab 2 — September 27

```r
library(tidyverse)
```

<PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Attaching packages</span> --------------------------------------- tidyverse 1.3.1 --
</CODE></PRE><PRE class="fansi fansi-message"><CODE>## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>ggplot2</span> 3.3.6     <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>purrr  </span> 0.3.4
## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tibble </span> 3.1.6     <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>dplyr  </span> 1.0.9
## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>tidyr  </span> 1.2.0     <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>stringr</span> 1.4.0
## <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>readr  </span> 2.1.2     <span style='color: #00BB00;'>v</span> <span style='color: #0000BB;'>forcats</span> 0.5.1
</CODE></PRE><PRE class="fansi fansi-message"><CODE>## -- <span style='font-weight: bold;'>Conflicts</span> ------------------------------------------ tidyverse_conflicts() --
## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>filter()</span> masks <span style='color: #0000BB;'>stats</span>::filter()
## <span style='color: #BB0000;'>x</span> <span style='color: #0000BB;'>dplyr</span>::<span style='color: #00BB00;'>lag()</span>    masks <span style='color: #0000BB;'>stats</span>::lag()
</CODE></PRE>

```r
library(gghighlight)
theme_set(theme_bw())
```

---

# Obtaining the flowers data set

```r
if (!dir.exists("./data")) {
  dir.create("./data")
}
download.file("https://www150.statcan.gc.ca/n1/tbl/csv/32100246-eng.zip",
              destfile="./data/flowers.zip")
unzip("./data/flowers.zip", exdir="./data")
```

```r
file.rename("./data/32100246.csv", "./data/flowers.csv")
file.rename("./data/32100246_MetaData.csv",
"./data/flowers_metadata.csv") ``` ```r flowers <- read_csv("./data/flowers.csv") ``` <PRE class="fansi fansi-message"><CODE>## <span style='font-weight: bold;'>Rows: </span><span style='color: #0000BB;'>1540</span> <span style='font-weight: bold;'>Columns: </span><span style='color: #0000BB;'>16</span> ## <span style='color: #00BBBB;'>--</span> <span style='font-weight: bold;'>Column specification</span> <span style='color: #00BBBB;'>--------------------------------------------------------</span> ## <span style='font-weight: bold;'>Delimiter:</span> "," ## <span style='color: #BB0000;'>chr</span> (9): GEO, DGUID, Flowers and plants, Output, UOM, SCALAR_FACTOR, VECTOR,... ## <span style='color: #00BB00;'>dbl</span> (5): REF_DATE, UOM_ID, SCALAR_ID, VALUE, DECIMALS ## <span style='color: #BBBB00;'>lgl</span> (2): SYMBOL, TERMINATED ## ## <span style='color: #00BBBB;'>i</span> Use `spec()` to retrieve the full column specification for this data. ## <span style='color: #00BBBB;'>i</span> Specify the column types or set `show_col_types = FALSE` to quiet this message. </CODE></PRE> --- # Cleaning up ```r flowers <- flowers %>% rename(year = REF_DATE, location = GEO, type = `Flowers and plants`) ``` - Although not asked, I am also renaming the `Flowers and plants` variable to `type`, because we really shouldn't have spaces in variable names - Variable names that violate the naming conventions can be referenced by wrapping them in backticks as above --- # Flower sales in Canada .pull-left[ ```r can_flower_sales <- flowers %>% filter(location == "Canada", Output == "Sales") ggplot(can_flower_sales, aes(x=year, y=VALUE, group=type, colour=type))+ geom_line() ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ] --- # Why is this plot terrible? 
.pull-left[

- Axis and legend names are not capitalised consistently: `VALUE`, `year`, `type`
- Legend is taking up too much plotting space
- Values in the legend contain unnecessary information
- Unclear what the y-axis, `VALUE`, is measuring
- Scientific notation on the y-axis requires the reader to do math

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — modify `type` values

We will modify the values in the `type` column of the original data set. Alternatively, you can create a copy of the data set and modify the copy instead. The functions `str_remove` and `str_to_sentence` are found in the [stringr](https://stringr.tidyverse.org/) package. Finally, we can re-obtain the required data for this question.

.pull-left[

```r
flowers <- flowers %>%
  mutate(
    type = str_remove(type, pattern="Total\\s"),
    type = str_remove(type, pattern="\\s\\[.*\\]"),
    type = str_to_sentence(type)
  )
```

- Remove the word "Total", along with the space that follows it, from the `type` values
- Remove the space before the square brackets, the square brackets, and anything contained within, from the `type` values
- Convert each value to sentence case (first letter capitalised, the rest lower-case)

```r
can_flower_sales <- flowers %>%
  filter(location == "Canada",
         Output == "Sales")
```

]

--

.pull-right[

Check our modifications:

```r
can_flower_sales %>%
  distinct(type)
```

<PRE class="fansi fansi-output"><CODE>## <span style='color: #555555;'># A tibble: 5 x 1</span>
##   type
##   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>
## <span style='color: #555555;'>1</span> Potted plants
## <span style='color: #555555;'>2</span> Cuttings
## <span style='color: #555555;'>3</span> Cut flowers
## <span style='color: #555555;'>4</span> Ornamental bedding plants
## <span style='color: #555555;'>5</span> Vegetable bedding plants
</CODE></PRE>

]

---

# Improve the plot — change scale of y-axis

To work around the scientific notation, change the unit of the
y-axis from dollars to millions of dollars.

```r
can_flower_sales <- can_flower_sales %>%
  mutate(VALUE = VALUE / 1e6)
```

---

# Improve the plot — make the new plot

.pull-left[

Starting with the base plot:

```r
ggplot(can_flower_sales,
       aes(x=year, y=VALUE, group=type, colour=type))+
  geom_line()
```

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — make the new plot

.pull-left[

Label the lines on the plot:

```r
ggplot(can_flower_sales,
       aes(x=year, y=VALUE, group=type, colour=type))+
  geom_line()+
  gghighlight(label_params = list(alpha=0.6))
```

- By default, `gghighlight::gghighlight` highlights and labels five objects
- Coincidentally, we have exactly five lines here
- Transparency is added to the label so that we can still see the line underneath

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — make the new plot

.pull-left[

Fix axis labels and add titles

```r
ggplot(can_flower_sales,
       aes(x=year, y=VALUE, group=type, colour=type))+
  geom_line()+
  gghighlight(label_params = list(alpha=0.6))+
  labs(
    x="Year",
    y="Total sales (million dollars)",
    title="Plant sales in Canada",
    subtitle="2007 to 2020"
  )
```

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

]

---

# Flower production in Canada in 2020

.pull-left[

```r
can_flower_prod <- flowers %>%
  filter(year == 2020,
         location == "Canada",
         Output == "Production (number)")

ggplot(can_flower_prod, aes(x=type, y=VALUE))+
  geom_col()
```

- Note that `geom_col` is used when the counts already exist in your data (in this case, `VALUE`)
- `geom_bar` is used when you do not already have the counts; in this case, [ggplot2](https://ggplot2.tidyverse.org/) will perform the counting
- Recall that we had previously modified the values of the `type` variable to remove the extra clutter

]
.pull-right[

<img src="index_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

]

---

# How can we improve this plot?

.pull-left[

- Consistent capitalisation of axis names
- Scale `VALUE` so that there is no scientific notation
- Depending on the size of the output plot, the `type` labels may end up overlapping; flip the axes so that this won't happen
- Notice that the `type` axis is sorted by alphabetical order; reorder the bars so that they appear "in order"
- Give a more descriptive label for `VALUE` and add a title to the plot

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — make the new plot

.pull-left[

Flip the axes. Since [ggplot2](https://ggplot2.tidyverse.org/) v3.3.0, we can just swap the aesthetic variables instead of using `coord_flip`.

```r
ggplot(can_flower_prod, aes(x=VALUE, y=type))+
  geom_col()
```

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — change scales and ordering

To work around the scientific notation, change the unit of `VALUE` from a single unit to millions of units. At the same time, we can reorder `type` by *ascending* value of `VALUE` using `fct_reorder` from the [forcats](https://forcats.tidyverse.org/) package. In reordering the levels of `type`, we are converting it from character (text) to a factor (categorical variable) whose levels are arranged by `VALUE`. Note that the result is a regular factor with reordered levels, not an R *ordered* factor.
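To see what the reordering does in isolation, here is a minimal sketch on made-up values. The vectors `toy_type` and `toy_value` are hypothetical and are not taken from the flowers data.

```r
library(forcats)

# Hypothetical toy values, not taken from the flowers data
toy_type  <- c("Cuttings", "Cut flowers", "Potted plants")
toy_value <- c(30, 10, 20)

# Character input is converted to a factor whose levels are sorted
# by ascending toy_value
toy_fct <- fct_reorder(toy_type, toy_value)

levels(toy_fct)      # level with the smallest toy_value comes first
is.ordered(toy_fct)  # FALSE
```

ggplot2 places discrete values along an axis in level order, which is why reordering the levels changes the order of the bars.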
```r
can_flower_prod <- can_flower_prod %>%
  mutate(
    VALUE = VALUE / 1e6,
    type = fct_reorder(type, VALUE)
  )
```

- Since we are making a horizontal bar chart, we reorder `type` by *ascending* value of `VALUE`
- If we had decided to stick with a vertical bar chart, we would need to reorder `type` by *descending* value of `VALUE` using:

```r
fct_reorder(type, VALUE, .desc=TRUE)
```

---

# Improve the plot — make the new plot

.pull-left[

Check our work by re-running the same code that produced the previous plot:

```r
ggplot(can_flower_prod, aes(x=VALUE, y=type))+
  geom_col()
```

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-27-1.svg" style="display: block; margin: auto;" />

]

---

# Improve the plot — make the new plot

.pull-left[

Fix the axis names and add a title.

```r
ggplot(can_flower_prod, aes(x=VALUE, y=type))+
  geom_col()+
  labs(
    x="Number produced (millions)",
    y="Plant type",
    title="Number of plants produced in Canada in 2020, by plant type"
  )
```

- You can add some colour to this plot, but I'm not going to

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />

]
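
---

# Aside: rescaling the axis labels instead of the data

An alternative to dividing `VALUE` by `1e6`, which this lab does not use, is to leave the data untouched and rescale only the axis labels. The sketch below assumes the scales package is available (it is installed as a dependency of ggplot2) and uses a made-up data frame, `toy_prod`, rather than the flowers data.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data, not taken from the flowers data
toy_prod <- data.frame(
  type  = c("Cut flowers", "Cuttings", "Potted plants"),
  VALUE = c(250e6, 120e6, 90e6)
)

# label_number(scale = 1e-6) multiplies each axis break by 1e-6 before
# formatting it, so the labels read in millions while VALUE stays in units
p <- ggplot(toy_prod, aes(x=VALUE, y=type))+
  geom_col()+
  scale_x_continuous(labels = label_number(scale = 1e-6))+
  labs(x="Number produced (millions)", y="Plant type")

p
```

With this approach the column keeps its original unit, which may be convenient if the same data are reused in later calculations.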