12 Cleaning the BIS speeches
The cleaning of the BIS speeches is nearly identical to that of the G7 and G10 speeches, as the text follows a fairly standard format.
12.1 Initialisation
library(tidyverse)
library(pins)
library(pinsqs)
library(AzureStor)
source(here::here("R", "azure_init.R"))
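# Connect to the Azure storage container and expose it as a pins board.
# `token` is assumed to be created by the sourced R/azure_init.R script.
speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token = token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")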
12.3 Repairs and removals
12.3.1 Remove introductions
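As before, when the first sentence contains the "* * *" separator that follows the brief speech description, the whole first sentence is removed only if a gratitude word is detected. Otherwise, only the brief speech description, up to and including the separator, is removed. When the first sentence contains no separator, it is removed in full.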
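speeches <- speeches %>%
  mutate(
    text = if_else(
      # Does the first sentence contain the "* * *" separator?
      str_detect(first_sentence, pattern = "(\\*\\s){3}"),
      if_else(
        str_detect(first_sentence, pattern = "(?i)thank|acknowledge|honou?r|grateful|pleas|welcome|delight"),
        # Gratitude word found: drop the whole first sentence
        str_remove(text, pattern = "^[^.]+\\."),
        # No gratitude word: drop only the description, up to and including the separator
        str_remove(text, pattern = "^.*(\\*\\s){3}")
      ),
      # No separator: drop the first sentence
      str_remove(text, pattern = "^[^.]+\\.")
    ),
    text = str_squish(text)
  )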
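The branching above can be checked on a few toy inputs. The following is a minimal sketch and not the project's own code: remove_intro() is a hypothetical helper, the example strings are invented, and first_sentence is assumed to hold the first sentence of each speech.

```r
library(stringr)

# Hypothetical helper mirroring the if_else() logic on plain character vectors.
remove_intro <- function(text, first_sentence) {
  has_separator <- str_detect(first_sentence, "(\\*\\s){3}")
  has_gratitude <- str_detect(
    first_sentence,
    "(?i)thank|acknowledge|honou?r|grateful|pleas|welcome|delight"
  )
  drop_sentence <- str_remove(text, "^[^.]+\\.")
  drop_description <- str_remove(text, "^.*(\\*\\s){3}")
  # Only "separator present, no gratitude word" keeps the rest of the first sentence.
  str_squish(ifelse(has_separator & !has_gratitude, drop_description, drop_sentence))
}

# Invented examples (not taken from the BIS corpus).
first_sentence <- c(
  "Speech by the Governor at a conference * * * Thank you for the invitation.",
  "Remarks by the Deputy Governor at a seminar * * * Inflation has eased.",
  "Good morning everyone."
)
text <- str_c(
  first_sentence,
  c(" Inflation has eased.", " Growth is steady.", " Inflation has eased.")
)

cleaned <- remove_intro(text, first_sentence)
print(cleaned)
```

Because ^.* is greedy, a speech with a second "* * *" separator later on the same line would lose everything up to the last separator, which may be worth checking on real data.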
12.3.2 Remove section headers
speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(Introduction|Closing remarks|Conclusion) (?=[:upper:])"))
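The lookahead (?=[:upper:]) limits removal to header words that are immediately followed by a capitalised word, which is presumably how a header fused onto the start of a paragraph appears in the extracted text. A small sketch with invented sentences shows what is and is not removed:

```r
library(stringr)

header_pattern <- "(Introduction|Closing remarks|Conclusion) (?=[:upper:])"

# Invented examples (not taken from the BIS corpus).
x <- c(
  "Introduction Inflation has eased.",
  "In conclusion we remain vigilant. Conclusion The outlook is stable.",
  "The Introduction of the euro was a milestone."
)

cleaned <- str_remove_all(x, header_pattern)
print(cleaned)
```

The pattern is case-sensitive, so "In conclusion" is kept, and "Introduction of" is kept because the next word starts with a lowercase letter.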
12.3.3 Remove references section
The references and concluding remarks can be removed using the clean_general() function.
12.3.4 Miscellaneous removals
Mentions of "BIS central bankers' speeches" within speeches were removed.
speeches <- speeches %>%
  mutate(text = str_remove_all(text, "(?i)BIS central bankers' speeches"))
12.3.5 Remove mentions of own institution and country
It is of greater interest when a central bank mentions another central bank or another country.
Therefore, all self-mentions of the bank, country, and inhabitants were removed. For example, for
Canada, words to remove would include: Bank of Canada, BoC, Canada, Canada's, and Canadian. The
removal patterns corresponding to each bank are stored in inst/data-misc/bank_country_regex_patterns.csv.
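# Read the lookup table of removal patterns and keep the entry labelled "Other_BIS"
bis_regex_pattern <- read_delim(
  here::here("inst", "data-misc", "bank_country_regex_patterns.csv"),
  delim = ",",
  escape_backslash = TRUE
) %>%
  filter(country == "Other_BIS") %>%
  pull(regex_pattern)

# Remove every match of the pattern from the speech text
speeches <- speeches %>%
  mutate(text = str_remove_all(text, bis_regex_pattern))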
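To illustrate the idea with the Canada example from the text, here is a minimal sketch. The pattern below is hypothetical and written only for illustration; the actual patterns are those stored in bank_country_regex_patterns.csv and may differ.

```r
library(stringr)

# Hypothetical self-mention pattern for Canada (an assumption, not the CSV's contents).
# Longer alternatives come first so that "Canada's" is matched before "Canada".
canada_pattern <- "\\b(Bank of Canada|BoC|Canada's|Canadians?|Canada)\\b"

# Invented example sentence.
x <- "The Bank of Canada and the Federal Reserve discussed Canada's exports and Canadian growth."

cleaned <- str_squish(str_remove_all(x, canada_pattern))
print(cleaned)
```

Mentions of other institutions, such as the Federal Reserve here, are left in place, which is the information of interest.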
12.4 General cleaning
12.4.2 Normalisation of select ngrams into acronyms
"Central Bank Digital Currency" is a 4-gram of particular interest and is converted to its acronym, CBDC.
speeches <- speeches %>%
  mutate(text = str_replace_all(text, "(?i)Central Bank Digital Currency", "CBDC"))
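A quick sketch on invented sentences shows the effect of the case-insensitive flag (?i), and also that the pattern as written does not cover the plural "currencies":

```r
library(stringr)

# Invented examples (not taken from the BIS corpus).
x <- c(
  "A Central Bank Digital Currency could help.",
  "Many central bank digital currency pilots exist.",
  "Central bank digital currencies differ."
)

normalised <- str_replace_all(x, "(?i)Central Bank Digital Currency", "CBDC")
print(normalised)
```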
12.4.4 Remove/replace stray and/or excessive punctuation
A few minor changes were made here relative to the earlier cleaning: some punctuation sequences are replaced with spaces rather than removed outright.
speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(\\* )+"),  # leftover "* * *" separators
    text = str_replace_all(text, "\\?|!", "."),  # treat ? and ! as sentence ends
    text = str_remove_all(text, ","),
    text = str_remove_all(text, "\""),
    text = str_replace_all(text, "'{2,}", "'"),  # collapse repeated apostrophes
    text = str_remove_all(text, "\\B'(?=[:alpha:])"),  # apostrophe opening a word
    text = str_remove_all(text, "(?<=[:alpha:])'\\B"),  # apostrophe closing a word
    text = str_remove_all(text, "\\B'\\B"),  # free-standing apostrophe
    text = str_replace_all(text, "\\.{3}", "."),  # ellipsis to a single full stop
    text = str_replace_all(text, " \\. ", " "),  # isolated full stop
    text = str_replace_all(text, "-", " "),
    text = str_replace_all(text, "_", " "),
    text = str_remove_all(text, "\\(|\\)|\\{|\\}|\\[|\\]|\\||;|:|\\+")  # brackets and other symbols
  )
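The apostrophe rules are the least obvious part of this step. The sketch below isolates them in a hypothetical helper, strip_stray_apostrophes(), and applies them to an invented sentence. It suggests that apostrophes inside a word (as in a possessive) are kept, while apostrophes at the start or end of a word, or standing alone, are dropped.

```r
library(stringr)

# Hypothetical helper that applies only the four apostrophe steps, in the same order.
strip_stray_apostrophes <- function(text) {
  text <- str_replace_all(text, "'{2,}", "'")           # collapse repeated apostrophes
  text <- str_remove_all(text, "\\B'(?=[:alpha:])")     # apostrophe opening a word
  text <- str_remove_all(text, "(?<=[:alpha:])'\\B")    # apostrophe closing a word
  text <- str_remove_all(text, "\\B'\\B")               # free-standing apostrophe
  str_squish(text)
}

# Invented example sentence.
x <- "the bank's 'forward guidance' and banks' balance sheets ' remain"

cleaned <- strip_stray_apostrophes(x)
print(cleaned)
```

Note that the trailing apostrophe of a plural possessive ("banks'") is removed as well, since it sits at the end of a word.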
12.4.5 Remove numerical quantities
References to figures, slides, and graphs were removed, in addition to dollar signs, percent signs, and other numerical quantities.
speeches <- speeches %>%
  mutate(
    text = str_remove_all(text, "(Figure|Slide|Graph) [:digit:]+"),  # e.g. "Figure 3"
    text = str_remove_all(text, "\\$"),
    text = str_remove_all(text, "%"),
    # Standalone numbers, including decimal and thousands separators
    text = str_remove_all(text, "\\b[:digit:]+([.,]+[:digit:]+)*\\b")
  )
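As a rough check of these four removals, the sketch below wraps them in a hypothetical helper, remove_numbers(), and runs it on invented sentences. The word boundaries in the last pattern mean that digits embedded in an alphanumeric token, such as "G7", are left alone.

```r
library(stringr)

# Hypothetical helper applying the same four removals, followed by str_squish().
remove_numbers <- function(text) {
  text <- str_remove_all(text, "(Figure|Slide|Graph) [:digit:]+")
  text <- str_remove_all(text, "\\$")
  text <- str_remove_all(text, "%")
  text <- str_remove_all(text, "\\b[:digit:]+([.,]+[:digit:]+)*\\b")
  str_squish(text)
}

# Invented examples (not taken from the BIS corpus).
x <- c(
  "As Figure 3 shows inflation fell from 7.5% to 2% and GDP reached $1,250.75 billion in 2023.",
  "The G7 met in 2023."
)

cleaned <- remove_numbers(x)
print(cleaned)
```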
12.4.6 Remove excessive whitespace
Excessive whitespace resulting from previous replacements was removed.
speeches <- speeches %>%
  mutate(text = str_squish(text))