Text Mining is Fun (with R)!

Aug 22, 2017

I am a big fan of Julia Silge and David Robinson, and after reading their book Text Mining with R, I decided to put their ideas into practice.

Charles Dickens wrote fourteen and a half novels. I am going to analyze five of them — “A Tale of Two Cities”, “Great Expectations”, “A Christmas Carol in Prose; Being a Ghost Story of Christmas”, “Oliver Twist” and “Hard Times”.

Project Gutenberg offers over 53,000 free books. I downloaded Dickens’ novels from there as UTF-8 encoded text, using the gutenbergr package developed by David Robinson. In addition, I will be using the following packages for this project:

library(dplyr)
library(tm.plugin.webmining)
library(purrr)
library(tidytext)
library(gutenbergr)
library(ggplot2)

Downloading the five novels by their Project Gutenberg ID numbers looks like this:

dickens <- gutenberg_download(c(98, 1400, 46, 730, 786))

The unnest_tokens function splits each row so that there is one token (word) in each row of the new data frame (tidy_dickens). The stop words are then removed with an anti_join.

tidy_dickens <- dickens %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
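To see what these two steps do, here is the same pipeline applied to a single made-up sentence. This is only a sketch: the toy data frame and its contents are assumptions for illustration, not part of the Dickens data.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the dickens data frame (hypothetical text)
toy <- tibble(gutenberg_id = 1L,
              text = "The ghost of Marley rattled his chains")

toy_tidy <- toy %>%
  # One lowercase word per row; punctuation is dropped
  unnest_tokens(word, text) %>%
  # Drop every word that appears in tidytext's stop_words table
  anti_join(stop_words, by = "word")

print(toy_tidy)
```

Words such as "the", "of" and "his" are removed, while content words such as "ghost" and "chains" survive.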

After removing the stop words, these are the words that appear most frequently:

## # A tibble: 19,634 × 2
##      word     n
##     <chr> <int>
## 1    time  1218
## 2    hand   918
## 3   night   835
## 4  looked   814
## 5    head   813
## 6  oliver   766
## 7    dear   751
## 8     joe   718
## 9    miss   702
## 10    sir   697
## # ... with 19,624 more rows

Let’s visualize it:
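The chart is not reproduced here, and the code behind the count table is not shown either. As a rough sketch, the counting and plotting could look like the following. The toy tidy_words data frame stands in for tidy_dickens, and the choice of a horizontal bar chart is an assumption about the original figure.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for tidy_dickens: one word per row (hypothetical data)
tidy_words <- tibble(
  word = c("time", "hand", "time", "night", "hand", "time")
)

# Count occurrences of each word, most frequent first
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

# Horizontal bar chart of the ten most frequent words
top_words_plot <- word_counts %>%
  head(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Occurrences")

print(top_words_plot)
```

Reordering the words by count before plotting keeps the bars sorted rather than alphabetical.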

Sentiment

The tidytext package provides several sentiment lexicons; I am using “bing” for the following tasks.

bing_word_counts <- tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts

## # A tibble: 3,145 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    miss  negative   702
## 2    poor  negative   350
## 3    dark  negative   299
## 4    hard  negative   223
## 5    dead  negative   218
## 6  strong  positive   203
## 7    love  positive   202
## 8    fell  negative   198
## 9   death  negative   194
## 10   cold  negative   192
## # ... with 3,135 more rows

Here we have the list of words that contribute to the sentiment categories.

The word “miss” is the most frequent negative word here, but in Dickens’ novels it is mostly used as a title for unmarried women. In particular, Miss Havisham is a significant character in “Great Expectations”, and Dickens describes her as looking like “the witch of the place”. So, in this case, perhaps “miss” should be a negative word after all.

Oh poor Pip!

A word cloud is often a good way to spot trends and patterns that are hard to see in a table. This one compares the most frequently used positive and negative words.
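The code for the word cloud is not shown. A minimal sketch, assuming the wordcloud and reshape2 packages and following the pattern used in Text Mining with R, might look like this. The small bing_counts table reuses a few rows from the output above, and the colors are arbitrary choices.

```r
library(dplyr)
library(reshape2)
library(wordcloud)

# Small stand-in for bing_word_counts (rows copied from the table above)
bing_counts <- tibble(
  word = c("miss", "poor", "dark", "strong", "love"),
  sentiment = c("negative", "negative", "negative", "positive", "positive"),
  n = c(702L, 350L, 299L, 203L, 202L)
)

# Reshape to a word-by-sentiment matrix; missing combinations become 0
sentiment_matrix <- bing_counts %>%
  acast(word ~ sentiment, value.var = "n", fill = 0)

# Draw negative and positive words in contrasting colors
comparison.cloud(sentiment_matrix,
                 colors = c("#F8766D", "#00BFC4"),
                 max.words = 100)
```

With the full bing_word_counts table in place of the stand-in, max.words limits the cloud to the 100 most prominent words.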

Relationships between words

Many interesting text analyses are based on the relationships between words. Pairs of consecutive words are often called “bigrams”. The following lines of code tokenize the text so that each token is a bigram:

dickens_bigrams <- dickens %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
dickens_bigrams

## # A tibble: 616,994 × 2
##    gutenberg_id          bigram
##           <int>           <chr>
## 1            46     a christmas
## 2            46 christmas carol
## 3            46        carol in
## 4            46        in prose
## 5            46     prose being
## 6            46         being a
## 7            46         a ghost
## 8            46     ghost story
## 9            46        story of
## 10           46    of christmas
## # ... with 616,984 more rows

After filtering out stop words, what are the most frequent bigrams?

Apparently, names such as “oliver twist” are the most commonly paired words in Dickens’ novels. There are also some pairings of common nouns, such as “wine shop” from “A Tale of Two Cities”.
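The stop-word filtering for bigrams is not shown. One way to do it, following the approach in Text Mining with R and assuming the tidyr package, is to split each bigram into two columns and drop any pair in which either word is a stop word. The toy_text data frame below is a made-up stand-in for dickens.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy stand-in for the dickens data frame (hypothetical text)
toy_text <- tibble(
  gutenberg_id = 98L,
  text = paste("the wine shop stood near the wine shop door",
               "and madame defarge kept the wine shop")
)

bigram_counts <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram into its two words
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep only bigrams in which neither word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

print(bigram_counts)
```

Filtering on both columns is what removes pairs like "of the" while keeping "wine shop".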

“A Tale of Two Cities”

To continue, I downloaded the plain text file for “A Tale of Two Cities”, left out the Project Gutenberg header and footer information, and then concatenated the lines into paragraphs. Now let’s have a look at a few lines:

##[1] "I. The Period   It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light,"
##[2] "it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
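The exact code used to join the lines is not shown. A minimal base R sketch of one way to do it, assuming blank lines mark the breaks between paragraphs, is below. The raw_lines vector is a made-up stand-in for the lines of the downloaded file.

```r
# Toy stand-in for the raw lines of the book (hypothetical input);
# empty strings mark paragraph breaks
raw_lines <- c("It was the best of times,",
               "it was the worst of times,",
               "",
               "There were a king with a large jaw",
               "and a queen with a plain face.")

# Paragraph id that increases by one at every blank line
paragraph_id <- cumsum(raw_lines == "")

# Drop the blank lines, then paste each paragraph's lines together
keep <- raw_lines != ""
paragraphs <- as.vector(tapply(raw_lines[keep], paragraph_id[keep],
                               paste, collapse = " "))

print(paragraphs)
```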

The text looks right. Next, I apply the NRC sentiment lexicon to this novel. This produces a data frame (tale_nrc) that combines each line number of the book with its sentiment scores. I then extract the positive and negative scores for visualization.
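How tale_nrc is built is not shown. One possibility, and it is only an assumption, is the syuzhet package, whose get_nrc_sentiment function scores each string against the NRC lexicon. The two-element tale vector and the name tale_nrc_sketch are placeholders for illustration.

```r
library(syuzhet)

# Toy stand-in for the paragraphs of the novel
tale <- c("It was the best of times, it was the worst of times,",
          "it was the spring of hope, it was the winter of despair,")

# get_nrc_sentiment() returns one row per input string, with a column
# for each NRC emotion plus "negative" and "positive" counts
tale_nrc_sketch <- cbind(linenumber = seq_along(tale),
                         get_nrc_sentiment(tale))

print(tale_nrc_sketch[, c("linenumber", "positive", "negative")])
```

A data frame of this shape has the linenumber, positive and negative columns that the plotting code expects.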

library(reshape2)  # provides melt()
library(ggthemes)

# Flip the negative scores below zero so they plot beneath the axis
tale_nrc$negative <- -tale_nrc$negative
# Reshape to long format: one row per line number and sentiment
pos_neg <- tale_nrc %>%
  select(linenumber, positive, negative) %>%
  melt(id = "linenumber")
names(pos_neg) <- c("linenumber", "sentiment", "value")

ggplot(data = pos_neg, aes(x = linenumber, y = value, fill = sentiment)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_minimal() +
  ylab("Sentiment") +
  ggtitle("Positive and Negative Sentiment in A Tale of Two Cities") +
  scale_color_manual(values = c("orange", "blue")) +
  scale_fill_manual(values = c("orange", "blue"))

It seems like the positive and negative scores are almost equal overall, which does make sense given the content of the novel.

This was so much fun! Can we do more text analysis with R? Of course. The CRAN Task View on Natural Language Processing collects relevant R packages that support computational linguists in analyzing speech and language on a variety of levels, with a focus on words, syntax, semantics, and pragmatics. I hope to try machine learning methods on text as well, and to share my work with you in future posts.

Written by Susan Li
