Grain export

Techniques employed

Data wrangling:
- Filtering data
- Working with dates: months, years
- Anti-joining data
- Grouping and summarising data
- Reordering factor levels by the values of a function of a secondary variable
Data visualisation:
- Line plots
- Facetting by a secondary variable
- Highlighting levels of interest within a facet
String operations:
- Case insensitive string matching
- Partial string matching
- Removing a substring

About the data

The export of Canadian grains has been collected monthly between 1922 to 2018. This data set includes information on the final destination as well as the type of grain being exported (e.g. wheat, oats, barley, etc.).

More information on this record can be found here.

The data and accompanying metadata can be downloaded together as a .zip file from here.

Packages

library(tidyverse)
library(lubridate)
library(gghighlight)
theme_set(theme_bw())

Obtaining the data

For these demos, I will be storing all data files in a folder called data with subfolders for each topic.

if (!dir.exists("./data/grains")) {
  dir.create("./data/grains", recursive=TRUE)
}

download.file(
  "https://www150.statcan.gc.ca/n1/tbl/csv/32100008-eng.zip",
  destfile = "./data/grains/grains.zip"
)

unzip(
  "./data/grains/grains.zip",
  exdir = "./data/grains"
)

list.files("./data/grains", pattern="csv")

## [1] "32100008.csv"          "32100008_MetaData.csv"

Reading in the data

grains <- read_csv("./data/grains/32100008.csv")

glimpse(grains)

## Rows: 358,803
## Columns: 16
## $ REF_DATE      <chr> "1922-01", "1922-02", "1922-03", "1922-04", "1922-05", "~
## $ GEO           <chr> "Canada", "Canada", "Canada", "Canada", "Canada", "Canad~
## $ DGUID         <chr> "2016A000011124", "2016A000011124", "2016A000011124", "2~
## $ Commodity     <chr> "Flaxseed", "Flaxseed", "Flaxseed", "Flaxseed", "Flaxsee~
## $ Destinations  <chr> "Total exports, all destinations", "Total exports, all d~
## $ UOM           <chr> "Tonnes", "Tonnes", "Tonnes", "Tonnes", "Tonnes", "Tonne~
## $ UOM_ID        <dbl> 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 2~
## $ SCALAR_FACTOR <chr> "units", "units", "units", "units", "units", "units", "u~
## $ SCALAR_ID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ VECTOR        <chr> "v125069", "v125069", "v125069", "v125069", "v125069", "~
## $ COORDINATE    <chr> "1.6.1", "1.6.1", "1.6.1", "1.6.1", "1.6.1", "1.6.1", "1~
## $ VALUE         <dbl> 247, 1436, 588, 322, 12241, 5572, 22, 368, 1498, 0, 1025~
## $ STATUS        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ SYMBOL        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ TERMINATED    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ DECIMALS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~

The variables of interest are:

Variable	Description
`REF_DATE`	Year and month of record
`Commodity`	Type of grains exported
`VALUE`	Amount of grains exported in tonnes
`Destinations`	Destination where grains were exported

grains <- grains %>%
  select(REF_DATE, Commodity, VALUE, Destinations) %>%
  rename(
    date = REF_DATE,
    type = Commodity,
    value = VALUE,
    destination = Destinations
  )

Dividing the data

From calling glimpse() it is clear that a portion of the data contains summaries.

grains %>%
  distinct(destination) %>%
  pull(destination) %>%
  head(n=15)

##  [1] "Total exports, all destinations"    "Total Western Europe"              
##  [3] "United Kingdom"                     "Gibraltar"                         
##  [5] "Belgium"                            "Denmark"                           
##  [7] "Greece"                             "Iceland"                           
##  [9] "Italy"                              "Netherlands"                       
## [11] "Norway"                             "Portugal"                          
## [13] "Sweden"                             "Total Eastern Europe"              
## [15] "Union of Soviet Socialist Republic"

Let's divide the data into two chunks: one that contains the summaries, and one without the summaries. We can search for all summaries by looking for values of destination that contain the term "total" (case-insensitive search). We can then isolate the remaining data by anti-joining the original data set with the summaries, which will return all data that did not appear in the summary data. Finally, we can drop any rows that contain missing values.

grains_totals <- grains %>%
  filter(str_detect(destination, pattern="(?i)total")) %>%
  drop_na()

grains <- grains %>%
  anti_join(grains_totals, by="destination") %>%
  drop_na()

Making proper dates

We can make proper dates of the date column by using lubridate::ym(). Let's also create a new variable to keep track of the year of observations and move the date-related columns together.

grains_totals <- grains_totals %>%
  mutate(
    date = ym(date),
    Year = year(date)
  ) %>%
  select(date, Year, type, value, destination)

grains <- grains %>%
  mutate(
    date = ym(date),
    Year = year(date)
  ) %>%
  select(date, Year, type, value, destination)

Creating annual summaries

grains_totals %>%
  group_by(destination) %>%
  slice_min(date) %>%
  ungroup() %>%
  distinct(date)

## # A tibble: 2 x 1
##   date      
##   <date>    
## 1 1966-01-01
## 2 1922-01-01

grains_totals %>%
  group_by(destination) %>%
  slice_max(date) %>%
  ungroup() %>%
  distinct(date)

## # A tibble: 1 x 1
##   date      
##   <date>    
## 1 2018-12-01

It can be seen that the range of the dates in the data set is quite large. Since the data is also occurring monthly, any visualisations that use the raw dates will appear too busy. Therefore, for the following examples, let's only focus on the annual values. Of course, further analysis could be performed at the monthly level or between a smaller range of dates.

grains_totals_annual <- grains_totals %>%
  group_by(destination, type, Year) %>%
  summarise(value = sum(value), .groups="drop")

grains_annual <- grains %>%
  group_by(destination, type, Year) %>%
  summarise(value = sum(value), .groups="drop")

Visualising total annual export to various regions, by grain type

Let's first make another partition for total annual export to all destinations. Instead of filtering on the condition of:

destination == "Total exports, all destinations"

we can use str_detect() to perform partial string matching and do a bit less typing.

grains_totals_annual_all <- grains_totals_annual %>%
  filter(str_detect(destination, "all destinations"))

Plotting the data as is:

ggplot(grains_totals_annual_all, aes(x=Year, y=value, colour=type))+
  geom_line(alpha=0.6)

Improving the plot

It is unclear what the y-axis is measuring
Y-axis is in scientific notation which requires the reader to do math
It may be better to facet by grain type, which will increase the amount of available plotting space by getting rid of the legend
It may also be nice to facet by grain type, while keeping all other grain types greyed out in each facet as a reference
The facetted plot can be made more convenient to read if the facetted grain types were ordered by their respective all-time maximums

Avoiding scientific notation on the y-axis

To avoid scientific notation on the y-axis, we can scale the values by one million. Thus, the new y-values will represent the amount exported in millions of tonnes.

grains_totals_annual_all <- grains_totals_annual_all %>%
  mutate(value = value / 1e6)

Ordering the grain types by their all-time maximums

The type column is the variable whose values are to be reordered. These values will be reordered by the maximum value of each grain type, in descending order.

grains_totals_annual_all <- grains_totals_annual_all %>%
  mutate(type = fct_reorder(type, value, .fun=max, .desc=TRUE))

Building the final plot

ggplot(grains_totals_annual_all, aes(x=Year, y=value, colour=type))+
  geom_line(show.legend=FALSE)+
  gghighlight(use_direct_label=FALSE)+
  facet_wrap(~type)+
  labs(
    y="Amount exported (million tonnes)",
    title="Annual grain amounts exported to all destinations, by grain type",
    subtitle="1922 to 2018 (where data is available)"
  )

Additional plots

grains_totals_annual %>%
  distinct(destination)

## # A tibble: 10 x 1
##    destination                       
##    <chr>                             
##  1 Total Africa                      
##  2 Total Asia                        
##  3 Total Central America and Antilles
##  4 Total Eastern Europe              
##  5 Total exports, all destinations   
##  6 Total Middle East                 
##  7 Total North America               
##  8 Total Oceania                     
##  9 Total South America               
## 10 Total Western Europe

The above plotting procedure can be repeated for the remaining destination totals, as desired.

Visualising total annual wheat export (excluding durum), by region

The plotting procedure is similar to the previous plot, only now, we shall filter the data for grain type "Wheat, excluding durum" and facet by the destination (excluding "Total exports, all destinations"). In addition, we should also remove the word "Total" from the destination names that we will facet upon.

Filtering the data

Here, we wish to include only export data for grains of type "Wheat, excluding durum" and destinations other than "Total exports, all destinations", since the total exports to all destinations was already seen in the previous plot.

total_annual_wheat_export <- grains_totals_annual %>%
  filter(str_detect(type, "Wheat, excluding")) %>%
  filter(str_detect(destination, "all destinations", negate=TRUE))

Avoiding scientific notation on the y-axis

To avoid scientific notation on the y-axis, we can scale the values by one million. Thus, the new y-values will represent the amount exported in millions of tonnes.

total_annual_wheat_export <- total_annual_wheat_export %>%
  mutate(value = value / 1e6)

Adjusting the destination labels

The word "Total" (and the space that comes after it), are to be removed from the destination labels.

total_annual_wheat_export <- total_annual_wheat_export %>%
  mutate(destination = str_remove(destination, pattern="Total "))

Ordering the destinations by their all-time maximums

The destination column is the variable whose values are to be reordered. These values will be reordered by the maximum value of wheat (excluding durum) exported, in descending order.

total_annual_wheat_export <- total_annual_wheat_export %>%
  mutate(destination = fct_reorder(destination, value, .fun=max, .desc=TRUE))

Building the plot

ggplot(total_annual_wheat_export, aes(x=Year, y=value, colour=destination))+
  geom_line(show.legend=FALSE)+
  gghighlight(use_direct_label=FALSE)+
  facet_wrap(~destination)+
  labs(
    y="Amount exported (million tonnes)",
    title="Annual amounts of wheat exported, by destination",
    subtitle="1971 to 2018 (where data is available)"
  )

Additional plots

grains_totals_annual %>%
  distinct(type)

## # A tibble: 9 x 1
##   type                  
##   <chr>                 
## 1 Barley                
## 2 Canola (rapeseed)     
## 3 Durum wheat           
## 4 Flaxseed              
## 5 Malt                  
## 6 Oats                  
## 7 Rye                   
## 8 Wheat flour           
## 9 Wheat, excluding durum

The above plotting procedure can be repeated for the remaining grain types, as desired.

Visualising total annual export to the United States, by grain type

grains_annual %>%
  filter(str_detect(destination, pattern="United States")) %>%
  distinct(destination)

## # A tibble: 3 x 1
##   destination                  
##   <chr>                        
## 1 United States                
## 2 United States Oceania        
## 3 Virgin Islands, United States

For the following, let us only consider the United States mainland (which I assume includes Alaska) and extract the pertinent data.

grains_annual_export_US <- grains_annual %>%
  filter(destination == "United States")

ggplot(grains_annual_export_US, aes(x=Year, y=value, colour=type))+
  geom_line(alpha=0.6)

Data preparation

The data preparation procedure is identical to the previous plots. We will need to:

Scale the values on the y-axis to avoid scientific notation
Reorder the grain types by their all-time maximums, in descending order

grains_annual_export_US <- grains_annual_export_US %>%
  mutate(
    value = value / 1e6,
    type = fct_reorder(type, value, .fun=max, .desc=TRUE)
  )

Building the plot

ggplot(grains_annual_export_US, aes(x=Year, y=value, colour=type))+
  geom_line(show.legend=FALSE)+
  gghighlight(use_direct_label=FALSE)+
  facet_wrap(~type)+
  labs(
    y="Amount exported (million tonnes)",
    title="Annual grain amounts exported to the United States, by grain type",
    subtitle="1966 to 2018 (where data is available)"
  )