6 Final pre-processing
This chapter documents the final pre-processing steps applied before the text can be transformed into the document-term and term-document matrices required by the models downstream.
6.1 Initialisation
library(tidyverse)
library(tidytext)
library(stringa)
library(SnowballC)
library(pins)
library(pinsqs)
library(AzureStor)
source(here::here("R", "azure_init.R"))
speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token = token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")
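With the board registered, the speeches data saved in an earlier chapter can be read back in. The pin name is not shown in this chapter; a minimal sketch, assuming a pin called "speeches" readable with pins::pin_read() (the qs-backed readers from {pinsqs} may have been used instead):
speeches <- speeches_board %>%
  pin_read("speeches")  # pin name assumed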
The list of stop words used was derived from the Snowball stop word lexicon (obtained from tidytext::stop_words), but with negation terms removed. The full data set of stop words with negation terms removed (which includes stop words from other lexicons) can be found in stringa::nonneg_stop_words. The code used to obtain this stop word list can be found here.
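The pre-processing below filters against nonneg_snowball, the Snowball-only subset of that data set. Its construction is not shown in this chapter; a minimal sketch, assuming nonneg_stop_words retains the lexicon column from tidytext::stop_words:
nonneg_snowball <- stringa::nonneg_stop_words %>%
  filter(lexicon == "snowball")  # lexicon column assumed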
6.2 Pre-processing
The usual pre-processing steps were performed, including:
- Unnesting into tokens. Lowercasing occurs at this step.
- Removal of non-negation stop words.
- Stemming of words.
speeches <- speeches %>%
  # tokenise into one word per row; lowercasing happens here
  unnest_tokens(output = word, input = text) %>%
  # drop the non-negation Snowball stop words
  anti_join(nonneg_snowball, by = "word") %>%
  # stem each token with the Snowball stemmer
  mutate(wordstem = wordStem(word))
A final check was performed to verify that no stemmed tokens were spaces or empty strings, as these can result in unusable models downstream.
# stems that are a single space
speeches %>%
  filter(wordstem == " ")

# stems that are empty strings
speeches %>%
  filter(stringi::stri_isempty(wordstem))
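Had either check returned rows, the offending tokens could be dropped before continuing. A hypothetical defensive filter, not part of the original pipeline:
speeches <- speeches %>%
  # keep only stems that are non-empty after trimming whitespace
  filter(!stringi::stri_isempty(stringi::stri_trim_both(wordstem)))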
Making a quick checkpoint:
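The checkpoint code is not reproduced here; a minimal sketch with pins::pin_write(), assuming a pin name of "speeches-tokenised" (given {pinsqs} is loaded, its qs-backed writer was more likely used):
speeches_board %>%
  pin_write(speeches, name = "speeches-tokenised")  # pin name assumed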
6.3 Create document-term matrix
The document-term matrix is required for fitting topic models via the {topicmodels} package.
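A sketch of that step: counting stems per document and casting with tidytext::cast_dtm(), assuming a document identifier column named doc_id (hypothetical; the actual id column is not shown in this chapter):
speeches_dtm <- speeches %>%
  count(doc_id, wordstem) %>%  # term frequencies per document
  cast_dtm(document = doc_id, term = wordstem, value = n)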