9 Re-cleaning the G7 speeches

The G7 consists of seven countries: Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States.

From the results of the models produced in the proof of concept section, and the cleaning of the G10 and G20 speeches, a few adjustments needed to be made for the cleaning of the G7 speeches. This chapter's contents are mostly identical to those in Cleaning text for G7 countries, with the additions of:

More careful removal of introductory remarks and section headers.
A generalised text cleaning function, clean_general(), found in R/clean_by_country.R for cleaning the text of speeches that are less problematic.
Removal of links that don't start with any of: http, https, or www.
Improved handling of punctuation removal.
Remove of mentions of slides, figures, and graphs.
Omitting the removal of stray letters.

9.1 Initialisation

library(tidyverse)
library(pins)
library(pinsqs)
library(AzureStor)

source(here::here("R", "azure_init.R"))

speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token=token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")

9.2 Filter speeches to G7 countries

g7_members <- c("Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")

speeches <- speeches_board %>%
  pin_qread("speeches-with-country") %>%
  filter(country %in% g7_members)

9.3 Fix one date

There was one speech from the United States whose date should be December 2023, not December 2024, as this corpus only goes up to January 2024.

data_update <- tribble(
  ~doc, ~date,
  "r240109a", ymd("2023-12-08")
)

speeches <- speeches %>%
  rows_update(data_update, by="doc")

9.4 Repairs and removals

9.4.1 Remove introductions

Previously, introductory content that gave a brief description of the speech, along with the first sentence of the speech, were removed. Now, the first sentence of each speech is only removed if a gratitude word is detected.

speeches <- speeches %>%
  mutate(
    text = if_else(
      str_detect(first_sentence, pattern="(\\*\\s){3}"),
      if_else(
        str_detect(first_sentence, pattern="(?i)thank|acknowledge|honou?r|grateful|pleas|welcome|delight"),
        str_remove(text, pattern="^[^.]+\\."),
        str_remove(text, pattern="^.*(\\*\\s){3}")
      ),
      str_remove(text, pattern="^[^.]+\\.")
    ),
    text = str_squish(text)
  )

9.4.2 Remove section headers

speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(Introduction|Closing remarks|Conclusion) (?=[:upper:])"))

9.4.3 Remove references section

A general text cleaning function, found in R/clean_by_country.R, was applied to remove the references section and any other concluding remarks that were commonly found among speeches.

source(here::here("R", "clean_by_country.R"))

speeches <- speeches %>%
  mutate(text = clean_general(text))

9.4.4 Miscellaneous removals

Mentions of "BIS central bankers' speeches" within speeches were removed.

speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(?i)BIS central bankers' speeches"))

9.4.5 Repair typos

Repair potential typos of Italy (appearing as Italty).

speeches <- speeches %>%
  mutate(text = str_replace_all(text, "Italty", "Italy"))

9.4.6 Remove mentions of own institution and country

It is of greater interest when a central bank mentions another central bank or another country. Therefore, all self-mentions of the bank, country, and inhabitants were removed. For example, for Canada, words to remove would include: Bank of Canada, BoC, Canada, Canada's, and Canadian. The removal patterns corresponding to each bank are stored in inst/data-misc/bank_country_regex_patterns.csv.

bank_country_regex_patterns <- read_delim(
  here::here("inst", "data-misc", "bank_country_regex_patterns.csv"),
  delim = ",",
  escape_backslash = TRUE
) %>%
  filter(country %in% g7_members) %>%
  select(country, regex_pattern)

speeches <- speeches %>%
  left_join(bank_country_regex_patterns, by="country") %>%
  mutate(text = str_remove_all(text, regex_pattern)) %>%
  select(-regex_pattern)

9.5 General cleaning

9.5.2 Normalisation of select ngrams into acronyms

"Central Bank Digital Currency" is a particular 4-gram of interest and can be converted to its abbreviated form.

speeches <- speeches %>%
  mutate(text = str_replace_all(text, "(?i)Central Bank Digital Currency", "CBDC"))

9.5.3 Remove non-ascii characters, emails, social media handles, and links

The only addition to this chunk of code is the last line, where links that do not begin with http, https, or www, are removed, e.g. google.ca.

speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "[:^ascii:]"),
    text = str_remove_all(text, "([[:alnum:]_.\\-]+)?@[[:alnum:].\\-]+"),
    text = str_remove_all(text, "https?://\\S+"),
    text = str_remove_all(text, "www\\.\\S+"),
    text = str_remove_all(text, "[A-Za-z]+\\.[A-Za-z]+(\\.[A-Za-z]+)*")
  )

9.5.4 Remove/replace stray and/or excessive punctuation

A few minor changes here opting for the replacement of punctuation sequences with spaces, instead of their removal.

speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(\\* )+"),
    text = str_replace_all(text, "\\?|!", "."),
    text = str_remove_all(text, ","),
    text = str_remove_all(text, "\""),
    text = str_replace_all(text, "'{2,}", "'"),
    text = str_remove_all(text, "\\B'(?=[:alpha:])"),
    text = str_remove_all(text, "(?<=[:alpha:])'\\B"),
    text = str_remove_all(text, "\\B'\\B"),
    text = str_replace_all(text, "\\.{3}", "."),
    text = str_replace_all(text, " \\. ", " "),
    text = str_replace_all(text, "-", " "),
    text = str_replace_all(text, "_", " "),
    text = str_remove_all(text, "\\(|\\)|\\{|\\}|\\[|\\]|\\||;|:|\\+")
  )

9.5.5 Remove numerical quantities

References to figures, slides, and graphs were removed, in addition to dollar signs, percent signs, and other numerical quantities.

speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(Figure|Slide|Graph) [:digit:]+"),
    text = str_remove_all(text, "\\$"),
    text = str_remove_all(text, "%"),
    text = str_remove_all(text, "\\b[:digit:]+([.,]+[:digit:]+)*\\b")
  )

9.5.6 Remove excessive whitespace

Excessive whitespace resulting from previous replacements was removed.

speeches <- speeches %>%
  mutate(text = str_squish(text))

9.5.7 Remove unneeded columns

speeches <- speeches %>%
  select(-first_sentence)

9.6 Save the data

Writing the data to the pin board:

speeches_board %>%
  pin_qsave(
    speeches,
    "speeches-g7-cleaned",
    title = "speeches for g7 countries, cleaned"
  )

Making a separate copy of the metadata as well:

speeches_metadata <- speeches %>%
  select(doc, date, institution, country)

speeches_board %>%
  pin_qsave(
    speeches_metadata,
    "speeches-g7-metadata",
    title = "metadata for g7 speeches"
  )

8 Explorations with LDA

10 Cleaning the G10 speeches