Soft drink production

Techniques employed

Data wrangling:
- Filtering data
- Working with dates: months, quarters, years
- Grouping and summarising data
- Row-binding data
Data visualisation:
- Line plots, grouped line plots
- Modifying the legend position and appearance

About the data

From 1950 to 1977, soft-drink production was reported on a quarterly basis.

More information on this record can be found here.

The data and accompanying metadata can be downloaded together as a .zip file from here.

From 1976 to 1995, soft-drink production was reported on a monthly basis.

More information on this record can be found here.

The data and accompanying metadata can be downloaded together as a .zip file from here.

Packages

library(tidyverse)
library(lubridate)
theme_set(theme_bw())

Obtaining the data

For these demos, I will be storing all data files in a folder called data with subfolders for each topic.

if (!dir.exists("./data/drinks")) {
  dir.create("./data/drinks", recursive=TRUE)
}

# Quarterly data from 1950 to 1977

download.file(
  "https://www150.statcan.gc.ca/n1/tbl/csv/16100100-eng.zip",
  destfile = "./data/drinks/qdrinks.zip"
)

unzip(
  "./data/drinks/qdrinks.zip",
  exdir = "./data/drinks"
)

# Monthly data from 1976 to 1995

download.file(
  "https://www150.statcan.gc.ca/n1/tbl/csv/16100099-eng.zip",
  destfile = "./data/drinks/mdrinks.zip"
)

unzip(
  "./data/drinks/mdrinks.zip",
  exdir = "./data/drinks"
)

list.files("./data/drinks", pattern="csv")

## [1] "16100099.csv"          "16100099_MetaData.csv" "16100100.csv"         
## [4] "16100100_MetaData.csv"

Reading in the quarterly data

qdrinks <- read_csv("./data/drinks/16100100.csv")

glimpse(qdrinks)

## Rows: 112
## Columns: 15
## $ REF_DATE                                 <chr> "1950-01", "1950-04", "1950-0~
## $ GEO                                      <chr> "Canada", "Canada", "Canada",~
## $ DGUID                                    <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ `Standard Classification of Goods (SCG)` <chr> "Soft drinks", "Soft drinks",~
## $ UOM                                      <chr> "Gallons", "Gallons", "Gallon~
## $ UOM_ID                                   <dbl> 142, 142, 142, 142, 142, 142,~
## $ SCALAR_FACTOR                            <chr> "thousands", "thousands", "th~
## $ SCALAR_ID                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ VECTOR                                   <chr> "v7717", "v7717", "v7717", "v~
## $ COORDINATE                               <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1,~
## $ VALUE                                    <dbl> 19349, 29730, 31721, 20045, 1~
## $ STATUS                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ SYMBOL                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ TERMINATED                               <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ DECIMALS                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~

The variables of interest are:

Variable	Description
`REF_DATE`	Year and month of record
`VALUE`	Amount produced in thousand gallons

As the dates are supplied as strings of format yyyy-mm, we can use lubridate::ym() to parse them into proper dates.

quarterly <- qdrinks %>%
  select(REF_DATE, VALUE) %>%
  rename(date = REF_DATE, quarterly_value = VALUE) %>%
  mutate(date = ym(date))

quarterly

## # A tibble: 112 x 2
##    date       quarterly_value
##    <date>               <dbl>
##  1 1950-01-01           19349
##  2 1950-04-01           29730
##  3 1950-07-01           31721
##  4 1950-10-01           20045
##  5 1951-01-01           17398
##  6 1951-04-01           25893
##  7 1951-07-01           28477
##  8 1951-10-01           19923
##  9 1952-01-01           19248
## 10 1952-04-01           26232
## # ... with 102 more rows

Reading in the monthly data

mdrinks <- read_csv("./data/drinks/16100099.csv")

glimpse(mdrinks)

## Rows: 240
## Columns: 15
## $ REF_DATE                                 <chr> "1976-01", "1976-02", "1976-0~
## $ GEO                                      <chr> "Canada", "Canada", "Canada",~
## $ DGUID                                    <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ `Standard Classification of Goods (SCG)` <chr> "Soft drinks", "Soft drinks",~
## $ UOM                                      <chr> "Gallons", "Gallons", "Gallon~
## $ UOM_ID                                   <dbl> 142, 142, 142, 142, 142, 142,~
## $ SCALAR_FACTOR                            <chr> "thousands", "thousands", "th~
## $ SCALAR_ID                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ VECTOR                                   <chr> "v7718", "v7718", "v7718", "v~
## $ COORDINATE                               <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1,~
## $ VALUE                                    <dbl> 20680, 23392, 21553, 24304, 2~
## $ STATUS                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ SYMBOL                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ TERMINATED                               <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ DECIMALS                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~

The variables of interest are:

Variable	Description
`REF_DATE`	Year and month of record
`VALUE`	Amount produced in thousand gallons

The monthly data can be processed similar to the quarterly data.

monthly <- mdrinks %>%
  select(REF_DATE, VALUE) %>%
  rename(date = REF_DATE, monthly_value = VALUE) %>%
  mutate(date = ym(date))

monthly

## # A tibble: 240 x 2
##    date       monthly_value
##    <date>             <dbl>
##  1 1976-01-01         20680
##  2 1976-02-01         23392
##  3 1976-03-01         21553
##  4 1976-04-01         24304
##  5 1976-05-01         27791
##  6 1976-06-01         32838
##  7 1976-07-01         32475
##  8 1976-08-01         32503
##  9 1976-09-01         28990
## 10 1976-10-01         23988
## # ... with 230 more rows

We want to expand the quarterly production data by converting our monthly data into quarterly data, and appending it to the end of the original quarterly data. The final result should be a data set that ranges from 1950 to 1995 with production values being reported each quarter.

Converting monthly data to quarterly data

Starting from the monthly data that we have, we create a new variable called qdate to keep track of the quarter that each year-month combination belongs to. Grouping the data by the resulting quarters and summing over the monthly production values results in the desired quarterly production values. Finally, the quarters created under qdate are converted back to full dates in preparation for the row binding of the two quarterly data sets.

quarterly2 <- monthly %>%
  mutate(qdate = quarter(date, with_year=TRUE)) %>%
  group_by(qdate) %>%
  summarise(quarterly_value = sum(monthly_value)) %>%
  mutate(date = yq(qdate)) %>%
  select(date, quarterly_value)

quarterly2

## # A tibble: 80 x 2
##    date       quarterly_value
##    <date>               <dbl>
##  1 1976-01-01           65625
##  2 1976-04-01           84933
##  3 1976-07-01           93968
##  4 1976-10-01           78708
##  5 1977-01-01           73865
##  6 1977-04-01           92938
##  7 1977-07-01           91812
##  8 1977-10-01           76821
##  9 1978-01-01           71130
## 10 1978-04-01           92965
## # ... with 70 more rows

Combining the two quarterly data sets

Since there is an overlap in the dates between the quarterly and monthly data, we need to make sure that all quarters are only included once when row binding quarterly and quarterly2. Let's find the end date from the original quarterly data and keep only the observations from quarterly2 that occur after this date.

quarterly_end_date <- quarterly %>%
  slice_max(date) %>%
  pull(date)

quarterly2 <- quarterly2 %>%
  filter(date > quarterly_end_date)

full_quarterly <- bind_rows(quarterly, quarterly2)

Checking our work

full_quarterly %>%
  slice_min(date)

## # A tibble: 1 x 2
##   date       quarterly_value
##   <date>               <dbl>
## 1 1950-01-01           19349

full_quarterly %>%
  slice_max(date)

## # A tibble: 1 x 2
##   date       quarterly_value
##   <date>               <dbl>
## 1 1995-10-01          110997

Since the data ranges from 1950 to 1995 (46 years), we should have 46 * 4 = 184 rows in our data.

full_quarterly %>%
  nrow()

## [1] 184

Looks good!

Quarterly soft-drink production

In the original data, the production values were in the unit of thousands of gallons. Yet, the values that we have are still quite large. Let's scale all the production values by one thousand so that they will be in the units of million gallons.

full_quarterly <- full_quarterly %>%
  mutate(quarterly_value = quarterly_value / 1e3)

ggplot(full_quarterly, aes(x=date, y=quarterly_value))+
  geom_line()+
  labs(
    x="Date", y="Amount produced (million gallons)",
    title="Quarterly soft-drink production in Canada",
    subtitle="1950 to 1995"
  )

Annual soft-drink production

An alternative to the previous plot is to colour the production values by quarter. In doing so, the x-axis should be the year rather than the full date so that all the quarters of a given year line up.

full_quarterly %>%
  mutate(
    Quarter = as_factor(quarter(date)),
    Year = year(date)
  ) %>%
  ggplot(aes(x=Year, y=quarterly_value, colour=Quarter))+
  geom_line(alpha=0.6)+
  theme(legend.position="bottom")+
  guides(colour = guide_legend(label.position="bottom"))+
  labs(
    x="Year", y="Amount produced (million gallons)",
    title="Annual soft-drink production in Canada by quarter",
    subtitle="1950 to 1995"
  )

Finally, we can also look at the annual production without regard to the quarterly values.

full_quarterly %>%
  mutate(Year = year(date)) %>%
  group_by(Year) %>%
  summarise(annual_value = sum(quarterly_value)) %>%
  ggplot(aes(x=Year, y=annual_value))+
  geom_line()+
  labs(
    x="Year", y="Amount produced (million gallons)",
    title="Annual soft-drink production in Canada",
    subtitle="1950 to 1995"
  )

Notes

In each usage of group_by() followed by summarise(), we never called ungroup(). This is because summarise() automatically drops one grouping level after completion and in each usage, we only grouped by a single variable. As such, after summarising, the data is fully ungrouped.