6 Final pre-processing

This chapter documents the final pre-processing steps before the text can be transformed into the required document-term matrices and term-document matrices.

6.1 Initialisation

library(tidyverse)
library(tidytext)
library(SnowballC)
library(pins)
library(pinsqs)
library(AzureStor)

source(here::here("R", "azure_init.R"))

speeches_board <- storage_endpoint("https://cbspeeches1.dfs.core.windows.net/", token=token) %>%
  storage_container(name = "cbspeeches") %>%
  board_azure(path = "data-speeches")
speeches <- speeches_board %>%
  pin_qread("speeches-g7-with-ngrams")

The list of stop words used was derived from the Snowball stop word lexicon, but with negation terms removed. The code used to obtain this stop word list can be found here.

nonneg_snowball <- read_rds(here::here("inst", "data-misc", "nonneg_snowball.rds"))
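
For illustration, a list of this kind could be built roughly as follows. This is a minimal sketch, not the code used to produce `nonneg_snowball`: the object name `nonneg_sketch` and the choice of which words count as negation terms are assumptions made here.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Assumption: this set of negation terms is a guess for illustration,
# not the set actually removed when nonneg_snowball was created.
negation_terms <- c("no", "nor", "not", "cannot")

nonneg_sketch <- stop_words %>%
  # tidytext::stop_words bundles several lexicons; keep only Snowball
  filter(lexicon == "snowball") %>%
  # Drop plain negations and contractions ending in "n't" (e.g. "isn't")
  filter(!word %in% negation_terms, !str_detect(word, "n't$")) %>%
  select(word)
```

Because negation terms are absent from the stop word list, a later `anti_join()` against it leaves words such as "not" in the text.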

6.2 Pre-processing

The usual pre-processing steps were performed, including:

  • Unnesting into tokens. Lowercasing occurs at this step.
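  • Removal of stop words, excluding negation terms (which are retained).
  • Stemming of words.

speeches <- speeches %>%
  # one lowercased token per row
  unnest_tokens(output=word, input=text) %>%
  # drop stop words; negation terms are not in the list, so they are kept
  anti_join(nonneg_snowball, by="word") %>%
  # stem each remaining token
  mutate(wordstem = wordStem(word))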
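
The effect of these three steps can be seen on a small example. The sketch below uses hypothetical toy data (`toy_speeches`) and a hypothetical three-word stop list (`toy_stop_words`) in place of the real objects.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(SnowballC)

# Hypothetical toy data; the real speeches object is assumed to be larger.
toy_speeches <- tibble(
  doc  = c("a", "b"),
  text = c("Inflation is NOT rising", "The banks were lending")
)

# Hypothetical stop word list that, like nonneg_snowball, does not contain "not".
toy_stop_words <- tibble(word = c("is", "the", "were"))

toy_tokens <- toy_speeches %>%
  unnest_tokens(output = word, input = text) %>%  # one lowercased token per row
  anti_join(toy_stop_words, by = "word") %>%      # drop stop words; "not" survives
  mutate(wordstem = wordStem(word))               # Porter stemmer by default

toy_tokens
```

Here "NOT" is lowercased and retained, "is", "The" and "were" are dropped, and "banks" and "lending" are reduced to the stems "bank" and "lend".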

A final check was performed to verify that there were no stemmed tokens that were spaces or empty strings, as this can result in unusable models downstream. Both of the following filters should return zero rows.

speeches %>%
  filter(wordstem == " ")

speeches %>%
  filter(stringi::stri_isempty(wordstem))
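
The two checks could also be folded into a single filter that catches empty strings and strings made up only of whitespace. The following is a sketch on hypothetical toy stems (`toy_stems`), two of which are deliberately blank so that the filter has something to find.

```r
library(dplyr)
library(tibble)

# Hypothetical stems; the second and fourth are deliberately blank.
toy_stems <- tibble(
  doc      = c("a", "a", "b", "b"),
  wordstem = c("inflat", "", "bank", "  ")
)

# A stem is blank if nothing is left after trimming whitespace.
blank_stems <- toy_stems %>%
  filter(!nzchar(trimws(wordstem)))

nrow(blank_stems)
```

On real data, wrapping the result in `stopifnot(nrow(blank_stems) == 0)` would halt the script instead of relying on a visual inspection.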

Making a quick checkpoint:

speeches_board %>%
  pin_qsave(
    speeches,
    "processed-speeches-g7",
    title = "processed speeches for g7 countries. ready for dtm/tdm conversion."
  )

6.3 Create document-term matrix

The document-term matrix is required for topic models via the {topicmodels} package.

speeches_dtm <- speeches %>%
  count(doc, wordstem) %>%
  cast_dtm(doc, wordstem, n)
speeches_board %>%
  pin_qsave(
    speeches_dtm,
    "speeches-g7-dtm",
    title = "dtm of speeches for g7 countries"
  )
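
To make the shape of the result concrete, the sketch below casts a hypothetical set of toy tokens (`toy_tokens`) in the same way. It assumes the {tm} package is installed, as `cast_dtm()` requires it.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stemmed tokens: two documents, four distinct stems.
toy_tokens <- tibble(
  doc      = c("a", "a", "a", "b", "b"),
  wordstem = c("inflat", "rise", "inflat", "bank", "lend")
)

toy_dtm <- toy_tokens %>%
  count(doc, wordstem) %>%    # one row per (document, stem) pair with count n
  cast_dtm(doc, wordstem, n)  # rows are documents, columns are stems

dim(toy_dtm)
```

The result has one row per document and one column per distinct stem, so the toy example yields a 2 x 4 matrix.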

6.4 Create term-document matrix

The term-document matrix (as a plain matrix) is required for NMF models.

speeches_tdm <- speeches %>%
  count(doc, wordstem) %>%
  cast_tdm(wordstem, doc, n) %>%
  as.matrix()
speeches_board %>%
  pin_qsave(
    speeches_tdm,
    "speeches-g7-tdm",
    title = "tdm of speeches for g7 countries"
  )
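
The same hypothetical toy tokens illustrate the transposed layout. As before, this is a sketch that assumes the {tm} package is installed. The resulting matrix holds counts, so all of its entries are non-negative.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stemmed tokens: two documents, four distinct stems.
toy_tokens <- tibble(
  doc      = c("a", "a", "a", "b", "b"),
  wordstem = c("inflat", "rise", "inflat", "bank", "lend")
)

toy_tdm <- toy_tokens %>%
  count(doc, wordstem) %>%
  cast_tdm(wordstem, doc, n) %>%  # rows are stems, columns are documents
  as.matrix()                     # dense base R matrix

toy_tdm["inflat", "a"]  # count of the stem "inflat" in document "a"
```

Converting to a dense matrix stores every zero explicitly, so memory use grows with the number of stems multiplied by the number of documents.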