12 Cleaning the BIS speeches

Cleaning the BIS speeches is nearly identical to cleaning the G7 and G10 speeches, as the text follows a fairly standard format.

12.1 Initialisation


source(here::here("R", "azure_init.R"))

speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token=token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")

12.2 Filter for BIS speeches

speeches <- speeches_board %>%
  pin_qread("speeches-with-country") %>%
  filter(country == "Other_BIS")

12.3 Repairs and removals

12.3.1 Remove introductions

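As before, when the first sentence contains the "* * *" separator that follows the brief speech description, the whole sentence is removed only if a gratitude word is detected; otherwise, only the description is removed. If no separator is found, the first sentence is removed.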

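speeches <- speeches %>%
  mutate(
    text = if_else(
      # Does the first sentence contain the "* * * " description separator?
      str_detect(first_sentence, pattern="(\\*\\s){3}"),
      if_else(
        str_detect(first_sentence, pattern="(?i)thank|acknowledge|honou?r|grateful|pleas|welcome|delight"),
        str_remove(text, pattern="^[^.]+\\."),       # gratitude word: drop the whole first sentence
        str_remove(text, pattern="^.*(\\*\\s){3}")   # otherwise: drop only the description
      ),
      str_remove(text, pattern="^[^.]+\\.")          # no separator: drop the first sentence
    ),
    text = str_squish(text)
  )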
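
To illustrate the branching, here is a minimal sketch on two made-up speech openings. The helper remove_intro() and the sample strings are hypothetical and are not part of the pipeline; the sketch assumes the description is separated from the body by "* * * " and recomputes the first sentence itself.

```r
library(stringr)

# Hypothetical helper that mirrors the branching in this section on a plain
# character vector (the pipeline itself works on the `text` column).
remove_intro <- function(text) {
  first_sentence <- str_extract(text, "^[^.]+\\.")
  has_separator <- str_detect(first_sentence, "(\\*\\s){3}")
  has_gratitude <- str_detect(
    first_sentence,
    "(?i)thank|acknowledge|honou?r|grateful|pleas|welcome|delight"
  )
  cleaned <- ifelse(
    has_separator & !has_gratitude,
    str_remove(text, "^.*(\\*\\s){3}"),  # keep the sentence, drop the description
    str_remove(text, "^[^.]+\\.")        # drop the whole first sentence
  )
  str_squish(cleaned)
}

# Made-up openings, not taken from the data.
with_thanks <- "Speech by Jane Doe at the Example Forum * * * Thank you for the invitation. Inflation has eased."
without_thanks <- "Speech by Jane Doe at the Example Forum * * * Inflation has eased. Growth remains firm."

remove_intro(c(with_thanks, without_thanks))
# [1] "Inflation has eased."
# [2] "Inflation has eased. Growth remains firm."
```

In the second case the opening sentence carries content rather than a courtesy, so only the description in front of the separator is dropped.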

12.3.2 Remove section headers

speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(Introduction|Closing remarks|Conclusion) (?=[:upper:])"))
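
The lookahead (?=[:upper:]) restricts the removal to header words that are immediately followed by a capitalised word, which is how a header fused onto the start of a paragraph typically appears. A small sketch on a made-up string (the sample text is an assumption, not taken from the data):

```r
library(stringr)

header_pattern <- "(Introduction|Closing remarks|Conclusion) (?=[:upper:])"

# Made-up text: two fused headers and one ordinary use of "Conclusion".
sample_text <- "Introduction Monetary policy has tightened. Conclusion of the review is pending. Conclusion The outlook is uncertain."

str_remove_all(sample_text, header_pattern)
# [1] "Monetary policy has tightened. Conclusion of the review is pending. The outlook is uncertain."
```

"Conclusion of the review" is left alone because the next word starts with a lowercase letter.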

12.3.3 Remove references section

The references and concluding remarks can be removed using the clean_general() function.

source(here::here("R", "clean_by_country.R"))

speeches <- speeches %>%
  mutate(text = clean_general(text))

12.3.4 Miscellaneous removals

Mentions of "BIS central bankers' speeches" within speeches were removed.

speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(?i)BIS central bankers' speeches"))

12.3.5 Remove mentions of own institution and country

It is of greater interest when a central bank mentions another central bank or another country. Therefore, all self-mentions of the bank, country, and inhabitants were removed. For example, for Canada, words to remove would include: Bank of Canada, BoC, Canada, Canada's, and Canadian. The removal patterns corresponding to each bank are stored in inst/data-misc/bank_country_regex_patterns.csv.

bis_regex_pattern <- read_delim(
  here::here("inst", "data-misc", "bank_country_regex_patterns.csv"),
  delim = ",",
  escape_backslash = TRUE
) %>%
  filter(country == "Other_BIS") %>%

speeches <- speeches %>%
  mutate(text = str_remove_all(text, bis_regex_pattern))
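
As an illustration of what such a removal pattern might look like, the sketch below uses the Canada example from the text. The pattern canada_pattern and the sample sentence are hypothetical; the actual patterns live in the CSV file and are not reproduced here.

```r
library(stringr)

# Hypothetical pattern for the Canada example; the real patterns are read from
# bank_country_regex_patterns.csv and may differ.
canada_pattern <- "\\b(Bank of Canada|BoC|Canada's|Canadians?|Canada)\\b"

# Made-up sentence, not taken from the data.
sample_text <- "The Bank of Canada and the Fed agree. Canada's economy and Canadian households are resilient."

str_squish(str_remove_all(sample_text, canada_pattern))
# [1] "The and the Fed agree. economy and households are resilient."
```

Longer alternatives are listed before shorter ones ("Canada's" before "Canada") so that the possessive is removed whole rather than leaving a stray "'s" behind. Mentions of other institutions, such as "the Fed", are untouched.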

12.4 General cleaning

12.4.2 Normalisation of select ngrams into acronyms

"Central Bank Digital Currency" is a particular 4-gram of interest and can be converted to its abbreviated form.

speeches <- speeches %>%
  mutate(text = str_replace_all(text, "(?i)Central Bank Digital Currency", "CBDC"))
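
A quick check of this replacement on two made-up sentences (the sample strings are assumptions). The (?i) flag makes the match case-insensitive, but the pattern only covers the singular form:

```r
library(stringr)

cbdc_pattern <- "(?i)Central Bank Digital Currency"

# Made-up sentences, not taken from the data.
samples <- c(
  "Work on a central bank digital currency continues.",
  "Several central bank digital currencies are being piloted."
)

str_replace_all(samples, cbdc_pattern, "CBDC")
# [1] "Work on a CBDC continues."
# [2] "Several central bank digital currencies are being piloted."
```

The plural "currencies" is left as written, since it does not contain the literal "currency".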

12.4.4 Remove/replace stray and/or excessive punctuation

A few minor changes were made here, opting to replace some punctuation sequences with spaces instead of removing them.

speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(\\* )+"),
    text = str_replace_all(text, "\\?|!", "."),
    text = str_remove_all(text, ","),
    text = str_remove_all(text, "\""),
    text = str_replace_all(text, "'{2,}", "'"),
    text = str_remove_all(text, "\\B'(?=[:alpha:])"),
    text = str_remove_all(text, "(?<=[:alpha:])'\\B"),
    text = str_remove_all(text, "\\B'\\B"),
    text = str_replace_all(text, "\\.{3}", "."),
    text = str_replace_all(text, " \\. ", " "),
    text = str_replace_all(text, "-", " "),
    text = str_replace_all(text, "_", " "),
    text = str_remove_all(text, "\\(|\\)|\\{|\\}|\\[|\\]|\\||;|:|\\+")
  )
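
The three apostrophe patterns are the least obvious part of this step: they strip apostrophes used as quotation marks or trailing possessives while keeping those inside contractions. The sketch below applies the same patterns, in the same order, to a made-up sentence; the wrapper clean_punctuation() and the sample text are hypothetical.

```r
library(stringr)

# Hypothetical wrapper applying the same patterns, in the same order, to a
# plain character vector.
clean_punctuation <- function(x) {
  x <- str_remove_all(x, "(\\* )+")
  x <- str_replace_all(x, "\\?|!", ".")
  x <- str_remove_all(x, ",")
  x <- str_remove_all(x, "\"")
  x <- str_replace_all(x, "'{2,}", "'")
  x <- str_remove_all(x, "\\B'(?=[:alpha:])")   # opening quote: 'word
  x <- str_remove_all(x, "(?<=[:alpha:])'\\B")  # closing quote or possessive: word'
  x <- str_remove_all(x, "\\B'\\B")             # free-standing apostrophe
  x <- str_replace_all(x, "\\.{3}", ".")
  x <- str_replace_all(x, " \\. ", " ")
  x <- str_replace_all(x, "-", " ")
  x <- str_replace_all(x, "_", " ")
  str_remove_all(x, "\\(|\\)|\\{|\\}|\\[|\\]|\\||;|:|\\+")
}

# Made-up sentence, not taken from the data.
sample_text <- "The 'neutral' rate isn't fixed, is it? Bankers' views (see above) differ - a lot!"

str_squish(clean_punctuation(sample_text))
# [1] "The neutral rate isn't fixed is it. Bankers views see above differ a lot."
```

The apostrophe in "isn't" survives because it sits between two letters, so neither side of it is a non-word boundary.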

12.4.5 Remove numerical quantities

References to figures, slides, and graphs were removed, in addition to dollar signs, percent signs, and other numerical quantities.

speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(Figure|Slide|Graph) [:digit:]+"),
    text = str_remove_all(text, "\\$"),
    text = str_remove_all(text, "%"),
    text = str_remove_all(text, "\\b[:digit:]+([.,]+[:digit:]+)*\\b")
  )
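
The effect of these four patterns can be checked on a made-up sentence. The wrapper remove_numbers() and the sample text below are hypothetical and only serve as a sketch.

```r
library(stringr)

# Hypothetical wrapper applying the same four patterns to a character vector.
remove_numbers <- function(x) {
  x <- str_remove_all(x, "(Figure|Slide|Graph) [:digit:]+")
  x <- str_remove_all(x, "\\$")
  x <- str_remove_all(x, "%")
  str_remove_all(x, "\\b[:digit:]+([.,]+[:digit:]+)*\\b")
}

# Made-up sentence, not taken from the data.
sample_text <- "Figure 3 shows G7 inflation at 2.5% and reserves of $1,250.75 billion."

str_squish(remove_numbers(sample_text))
# [1] "shows G7 inflation at and reserves of billion."
```

The word boundaries in the last pattern mean that digits attached to letters, as in "G7", are kept, while free-standing quantities such as "2.5" and "1,250.75" are removed.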

12.4.6 Remove excessive whitespace

Excessive whitespace resulting from previous replacements was removed.

speeches <- speeches %>%
  mutate(text = str_squish(text))

12.4.7 Remove unneeded columns

speeches <- speeches %>%

12.5 Save the data

Writing the data to the pin board:

speeches_board %>%
    title = "speeches for BIS, cleaned"

Making a separate copy of the metadata as well:

speeches_metadata <- speeches %>%
  select(doc, date, institution, country)

speeches_board %>%
    title = "metadata for BIS speeches"