1 Data setup
In this chapter, the speeches are obtained from the Dropbox (csv
format), some minor adjustments
are applied to the text
and author
fields, and the data is re-saved for further processing in
the subsequent chapters.
1.1 Initialisation
library(tidyverse)
library(pins)
library(pinsqs)
library(AzureStor)
source(here::here("R", "azure_init.R"))
speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token=token) %>%
storage_container(name = "cbspeeches") %>%
board_azure(path = "data-speeches")
1.2 Minor adjustments to raw data
Read in speeches from the Dropbox:
speeches <- read_csv("~/boc_speeches/lda/iteration_04/speeches_processed.csv")
Perform some minor adjustments:
- Rename the column
processed_text
to simplytext
. - Remove the time component of the dates.
- Perform normalization of text and author names to their ASCII form.
- Encode in utf-8.
- Remove excessive spaces in
text
andauthor
fields. - Fix select author names for consistency.
speeches <- speeches %>%
rename(text = processed_text) %>%
mutate(
date = as_date(date),
text = stringi::stri_trans_general(text, "Greek-Latin"),
text = stringi::stri_trans_general(text, "Latin-ASCII"),
text = utf8::as_utf8(text),
text = str_squish(text),
author = stringi::stri_trans_general(author, "Greek-Latin"),
author = stringi::stri_trans_general(author, "Latin-ASCII"),
author = utf8::as_utf8(author),
author = str_squish(author)
) %>%
mutate(
author = str_replace_all(author, "Angelovska-Bezoska", "Angelovska-Bezhoska"),
author = str_replace_all(author, "Angelovska Bezhoska", "Angelovska-Bezhoska"),
text = str_replace_all(text, "Angelovska-Bezoska", "Angelovska-Bezhoska"),
text = str_replace_all(text, "Angelovska Bezhoska", "Angelovska-Bezhoska")
)