From 1950 to 1977, soft-drink production was reported on a quarterly basis.
More information on this record can be found here.
The data and accompanying metadata can be downloaded together as a .zip file from here.
From 1976 to 1995, soft-drink production was reported on a monthly basis.
More information on this record can be found here.
The data and accompanying metadata can be downloaded together as a .zip file from here.
For these demos, I will be storing all data files in a folder called data
with subfolders for each topic.
if (!dir.exists("./data/drinks")) {
dir.create("./data/drinks", recursive=TRUE)
}
# Quarterly data from 1950 to 1977
download.file(
"https://www150.statcan.gc.ca/n1/tbl/csv/16100100-eng.zip",
destfile = "./data/drinks/qdrinks.zip"
)
unzip(
"./data/drinks/qdrinks.zip",
exdir = "./data/drinks"
)
# Monthly data from 1976 to 1995
download.file(
"https://www150.statcan.gc.ca/n1/tbl/csv/16100099-eng.zip",
destfile = "./data/drinks/mdrinks.zip"
)
unzip(
"./data/drinks/mdrinks.zip",
exdir = "./data/drinks"
)
list.files("./data/drinks", pattern="csv")
## [1] "16100099.csv" "16100099_MetaData.csv" "16100100.csv"
## [4] "16100100_MetaData.csv"
## Rows: 112
## Columns: 15
## $ REF_DATE <chr> "1950-01", "1950-04", "1950-0~
## $ GEO <chr> "Canada", "Canada", "Canada",~
## $ DGUID <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ `Standard Classification of Goods (SCG)` <chr> "Soft drinks", "Soft drinks",~
## $ UOM <chr> "Gallons", "Gallons", "Gallon~
## $ UOM_ID <dbl> 142, 142, 142, 142, 142, 142,~
## $ SCALAR_FACTOR <chr> "thousands", "thousands", "th~
## $ SCALAR_ID <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ VECTOR <chr> "v7717", "v7717", "v7717", "v~
## $ COORDINATE <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1,~
## $ VALUE <dbl> 19349, 29730, 31721, 20045, 1~
## $ STATUS <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ SYMBOL <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ TERMINATED <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ DECIMALS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
The variables of interest are:
Variable | Description |
---|---|
REF_DATE |
Year and month of record |
VALUE |
Amount produced in thousand gallons |
As the dates are supplied as strings of format yyyy-mm, we can use lubridate::ym()
to parse them into proper dates.
quarterly <- qdrinks %>%
select(REF_DATE, VALUE) %>%
rename(date = REF_DATE, quarterly_value = VALUE) %>%
mutate(date = ym(date))
quarterly
## # A tibble: 112 x 2
## date quarterly_value
## <date> <dbl>
## 1 1950-01-01 19349
## 2 1950-04-01 29730
## 3 1950-07-01 31721
## 4 1950-10-01 20045
## 5 1951-01-01 17398
## 6 1951-04-01 25893
## 7 1951-07-01 28477
## 8 1951-10-01 19923
## 9 1952-01-01 19248
## 10 1952-04-01 26232
## # ... with 102 more rows
## Rows: 240
## Columns: 15
## $ REF_DATE <chr> "1976-01", "1976-02", "1976-0~
## $ GEO <chr> "Canada", "Canada", "Canada",~
## $ DGUID <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ `Standard Classification of Goods (SCG)` <chr> "Soft drinks", "Soft drinks",~
## $ UOM <chr> "Gallons", "Gallons", "Gallon~
## $ UOM_ID <dbl> 142, 142, 142, 142, 142, 142,~
## $ SCALAR_FACTOR <chr> "thousands", "thousands", "th~
## $ SCALAR_ID <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ VECTOR <chr> "v7718", "v7718", "v7718", "v~
## $ COORDINATE <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1,~
## $ VALUE <dbl> 20680, 23392, 21553, 24304, 2~
## $ STATUS <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ SYMBOL <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ TERMINATED <lgl> NA, NA, NA, NA, NA, NA, NA, N~
## $ DECIMALS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
The variables of interest are:
Variable | Description |
---|---|
REF_DATE |
Year and month of record |
VALUE |
Amount produced in thousand gallons |
The monthly data can be processed similar to the quarterly data.
monthly <- mdrinks %>%
select(REF_DATE, VALUE) %>%
rename(date = REF_DATE, monthly_value = VALUE) %>%
mutate(date = ym(date))
monthly
## # A tibble: 240 x 2
## date monthly_value
## <date> <dbl>
## 1 1976-01-01 20680
## 2 1976-02-01 23392
## 3 1976-03-01 21553
## 4 1976-04-01 24304
## 5 1976-05-01 27791
## 6 1976-06-01 32838
## 7 1976-07-01 32475
## 8 1976-08-01 32503
## 9 1976-09-01 28990
## 10 1976-10-01 23988
## # ... with 230 more rows
We want to expand the quarterly production data by converting our monthly data into quarterly data, and appending it to the end of the original quarterly data. The final result should be a data set that ranges from 1950 to 1995 with production values being reported each quarter.
Starting from the monthly
data that we have, we create a new variable called qdate
to keep track of the quarter that each year-month combination belongs to. Grouping the data by the resulting quarters and summing over the monthly production values results in the desired quarterly production values. Finally, the quarters created under qdate
are converted back to full dates in preparation for the row binding of the two quarterly data sets.
quarterly2 <- monthly %>%
mutate(qdate = quarter(date, with_year=TRUE)) %>%
group_by(qdate) %>%
summarise(quarterly_value = sum(monthly_value)) %>%
mutate(date = yq(qdate)) %>%
select(date, quarterly_value)
quarterly2
## # A tibble: 80 x 2
## date quarterly_value
## <date> <dbl>
## 1 1976-01-01 65625
## 2 1976-04-01 84933
## 3 1976-07-01 93968
## 4 1976-10-01 78708
## 5 1977-01-01 73865
## 6 1977-04-01 92938
## 7 1977-07-01 91812
## 8 1977-10-01 76821
## 9 1978-01-01 71130
## 10 1978-04-01 92965
## # ... with 70 more rows
Since there is an overlap in the dates between the quarterly and monthly data, we need to make sure that all quarters are only included once when row binding quarterly
and quarterly2
. Let's find the end date from the original quarterly data and keep only the observations from quarterly2
that occur after this date.
quarterly_end_date <- quarterly %>%
slice_max(date) %>%
pull(date)
quarterly2 <- quarterly2 %>%
filter(date > quarterly_end_date)
full_quarterly <- bind_rows(quarterly, quarterly2)
full_quarterly %>%
slice_min(date)
## # A tibble: 1 x 2
## date quarterly_value
## <date> <dbl>
## 1 1950-01-01 19349
full_quarterly %>%
slice_max(date)
## # A tibble: 1 x 2
## date quarterly_value
## <date> <dbl>
## 1 1995-10-01 110997
Since the data ranges from 1950 to 1995 (46 years), we should have 46 * 4 = 184 rows in our data.
full_quarterly %>%
nrow()
## [1] 184
Looks good!
In the original data, the production values were in the unit of thousands of gallons. Yet, the values that we have are still quite large. Let's scale all the production values by one thousand so that they will be in the units of million gallons.
full_quarterly <- full_quarterly %>%
mutate(quarterly_value = quarterly_value / 1e3)
An alternative to the previous plot is to colour the production values by quarter. In doing so, the x-axis should be the year rather than the full date so that all the quarters of a given year line up.
full_quarterly %>%
mutate(
Quarter = as_factor(quarter(date)),
Year = year(date)
) %>%
ggplot(aes(x=Year, y=quarterly_value, colour=Quarter))+
geom_line(alpha=0.6)+
theme(legend.position="bottom")+
guides(colour = guide_legend(label.position="bottom"))+
labs(
x="Year", y="Amount produced (million gallons)",
title="Annual soft-drink production in Canada by quarter",
subtitle="1950 to 1995"
)
Finally, we can also look at the annual production without regard to the quarterly values.
full_quarterly %>%
mutate(Year = year(date)) %>%
group_by(Year) %>%
summarise(annual_value = sum(quarterly_value)) %>%
ggplot(aes(x=Year, y=annual_value))+
geom_line()+
labs(
x="Year", y="Amount produced (million gallons)",
title="Annual soft-drink production in Canada",
subtitle="1950 to 1995"
)
In each usage of group_by()
followed by summarise()
, we never called ungroup()
. This is because summarise()
automatically drops one grouping level after completion and in each usage, we only grouped by a single variable. As such, after summarising, the data is fully ungrouped.