6 Final pre-processing
This chapter documents the final pre-processing steps before the text can be transformed into the required document-term matrices and term-document matrices.
6.1 Initialisation
library(tidyverse)
library(tidytext)   # unnest_tokens()
library(SnowballC)  # wordStem()
library(pins)
library(pinsqs)
library(AzureStor)

# azure_init.R is expected to set up Azure authentication, including the
# `token` object used below
source(here::here("R", "azure_init.R"))
speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token = token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")
The list of stop words used was derived from the Snowball stop word lexicon, but with negation terms removed. The code used to obtain this stop word list can be found here.
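For reference, the sketch below shows one way such a list could be built from the Snowball lexicon bundled with {tidytext}; the specific negation terms excluded here are illustrative assumptions, not the project's actual list.
# Sketch only: build a Snowball stop word list with negation terms removed.
# The set of negation terms below is an assumption for illustration.
negation_terms <- c("no", "not", "nor", "isn't", "aren't", "wasn't", "weren't",
                    "don't", "doesn't", "didn't", "won't", "wouldn't",
                    "shouldn't", "couldn't", "can't", "cannot", "mustn't",
                    "hasn't", "haven't", "hadn't")

nonneg_snowball <- tidytext::stop_words %>%
  filter(lexicon == "snowball", !word %in% negation_terms)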
6.2 Pre-processing
The usual pre-processing steps were performed, including:
- Unnesting into tokens. Lowercasing occurs at this step.
- Removal of non-negation stop words (the Snowball list described above).
- Stemming of words.
speeches <- speeches %>%
  # tokenise into single words; unnest_tokens() lowercases by default
  unnest_tokens(output = word, input = text) %>%
  # remove the non-negation Snowball stop words
  anti_join(nonneg_snowball, by = "word") %>%
  # stem each token with SnowballC
  mutate(wordstem = wordStem(word))
A final check verified that no stemmed tokens were spaces or empty strings, since such tokens can make the downstream models unusable.
speeches %>%
  filter(wordstem == " ")

speeches %>%
  filter(stringi::stri_isempty(wordstem))
Making a quick checkpoint:
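The checkpoint chunk is not reproduced here; the sketch below assumes the {pinsqs} helper pin_qsave() is used to write the tokenised data back to the speeches board, and the pin name is illustrative.
# Sketch only: checkpoint the tokenised speeches to the Azure board.
# The pin_qsave() call pattern and the pin name are assumptions.
speeches_board %>%
  pin_qsave(speeches, name = "speeches-tokenised")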
6.3 Create document-term matrix
The document-term matrix is required for topic models via the {topicmodels}
package.
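The construction itself is not shown in this section; a minimal sketch using tidytext::cast_dtm(), assuming each speech carries a unique identifier column (doc_id below is an assumption):
# Sketch only: count stemmed tokens per speech and cast to a DocumentTermMatrix.
# The document identifier column `doc_id` is an assumption.
speeches_dtm <- speeches %>%
  count(doc_id, wordstem) %>%
  cast_dtm(document = doc_id, term = wordstem, value = n)
A term-document matrix, mentioned at the start of the chapter, can be produced analogously with cast_tdm().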