Text Mining is Fun (with R)!


I am a big fan of Julia Silge and David Robinson, and after reading their book Text Mining with R, I decided to put their ideas into practice.

Charles Dickens wrote fourteen and a half novels. I am going to analyze five of them — “A Tale of Two Cities”, “Great Expectations”, “A Christmas Carol in Prose; Being a Ghost Story of Christmas”, “Oliver Twist” and “Hard Times”.

Project Gutenberg offers over 53,000 free books. I downloaded Dickens’ novels from there as UTF-8 encoded text, using the gutenbergr package developed by David Robinson. In addition, I will be using the following packages for this project:

The downloading of Dickens’ five novels by Project Gutenberg ID numbers looks like so:
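A minimal sketch of that download step, assuming the gutenbergr package is installed and a network connection is available. ID 46 (A Christmas Carol) matches the output shown later in this post; the other four IDs are assumptions and are worth confirming with `gutenberg_works()`. The names `dickens_ids` and `download_dickens` are hypothetical.

```r
# Project Gutenberg IDs. 46 appears in this post's output; the rest are
# assumed and should be verified with gutenbergr::gutenberg_works().
dickens_ids <- c(
  tale_of_two_cities = 98,
  great_expectations = 1400,
  christmas_carol    = 46,
  oliver_twist       = 730,
  hard_times         = 786
)

# Hypothetical helper: returns a tibble with the columns gutenberg_id and
# text, one row per line of each book. Needs network access when called.
download_dickens <- function(ids = dickens_ids) {
  gutenbergr::gutenberg_download(unname(ids))
}

# dickens <- download_dickens()
```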

The unnest_tokens function splits each row so that there is one token (word) in each row of the new data frame (tidy_dickens); the stop words are then removed with an anti_join.
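The tokenizing step can be sketched on a two-line toy data frame, so it runs without downloading anything. This assumes dplyr and tidytext are installed; `toy` and `tidy_toy` are hypothetical names standing in for the real book data.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the downloaded books: one row per line of text
toy <- tibble(
  gutenberg_id = 98L,
  text = c("It was the best of times,", "it was the worst of times,")
)

tidy_toy <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop words found in tidytext's stop_words

tidy_toy
```

The same two calls apply unchanged to the full data frame of novels.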

After removing the stop words, the following is a list of the words that appear most frequently:
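Counting word frequencies is a one-liner with dplyr. The sketch below uses a small hypothetical vector of words (`words`) in place of the tidied novels.

```r
library(dplyr)

# Hypothetical tokens, standing in for the tidied novels
words <- tibble(word = c("time", "pip", "time", "joe", "time", "pip"))

# sort = TRUE puts the most frequent words first
word_counts <- words %>%
  count(word, sort = TRUE)

word_counts
```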

Let’s visualize it:
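One way to draw such a chart is a horizontal bar plot with ggplot2. This is a sketch on hypothetical toy counts, not the plot from the original analysis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy counts, standing in for the real word frequencies
word_counts <- tibble(word = c("time", "pip", "time", "joe", "time", "pip")) %>%
  count(word, sort = TRUE)

p <- word_counts %>%
  mutate(word = reorder(word, n)) %>%   # order bars by frequency
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                        # horizontal bars are easier to read
  labs(x = NULL, y = "Number of occurrences")

p
```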


The tidytext package contains several sentiment lexicons; I am using “bing” for the following tasks.

## # A tibble: 3,145 × 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 miss   negative    702
##  2 poor   negative    350
##  3 dark   negative    299
##  4 hard   negative    223
##  5 dead   negative    218
##  6 strong positive    203
##  7 love   positive    202
##  8 fell   negative    198
##  9 death  negative    194
## 10 cold   negative    192
## # ... with 3,135 more rows

Here we have the list of words that contribute to the sentiment categories.
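A table of this shape can be produced by joining the tokens to the lexicon and counting. The sketch below assumes dplyr and tidytext are installed and uses a handful of hypothetical tokens (`toy_words`) rather than the full novels.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, standing in for the tidied novels
toy_words <- tibble(word = c("miss", "poor", "love", "miss", "dark", "strong"))

bing_counts <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words the lexicon knows
  count(word, sentiment, sort = TRUE)                  # tally per word and label

bing_counts
```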

The word “miss” is the most frequent negative word here, but in Dickens’ novels it is used as a title for unmarried women. In particular, Miss Havisham is a significant character in “Great Expectations”. Dickens describes her as looking like “the witch of the place”. So, in this case, perhaps “miss” should be a negative word after all.


Oh poor Pip!

A word cloud is usually a good way to identify trends and patterns that would be unclear or difficult to see in a table. Here, it compares the most frequently used positive and negative words.
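A possible way to build such a cloud is to reshape the counts into a word-by-sentiment matrix and pass it to `comparison.cloud()` from the wordcloud package. The sketch assumes that package is installed; the five counts are taken from the table earlier in this post, and `plot_sentiment_cloud` is a hypothetical helper name.

```r
# Counts copied from the sentiment table shown earlier in this post
counts <- data.frame(
  word      = c("miss", "poor", "dark", "strong", "love"),
  sentiment = c("negative", "negative", "negative", "positive", "positive"),
  n         = c(702, 350, 299, 203, 202)
)

# Word-by-sentiment matrix: rows are words, columns are sentiments
m <- tapply(counts$n, list(counts$word, counts$sentiment), sum)
m[is.na(m)] <- 0   # a word missing from one sentiment gets a count of 0

# Hypothetical helper: draws the cloud, one colour per matrix column
plot_sentiment_cloud <- function(term_matrix) {
  wordcloud::comparison.cloud(
    term_matrix,
    colors    = c("orange", "blue"),
    max.words = 100
  )
}

# plot_sentiment_cloud(m)
```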

Relationships between words

Many interesting text analyses are based on the relationships between words. Pairs of two consecutive words are often called “bigrams”. With the following lines of code, each token now represents a bigram:

## # A tibble: 616,994 × 2
##    gutenberg_id bigram
##           <int> <chr>
##  1           46 a christmas
##  2           46 christmas carol
##  3           46 carol in
##  4           46 in prose
##  5           46 prose being
##  6           46 being a
##  7           46 a ghost
##  8           46 ghost story
##  9           46 story of
## 10           46 of christmas
## # ... with 616,984 more rows
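The first ten rows above can be reproduced from the title line alone. This sketch assumes dplyr and tidytext are installed; `title_line` and `toy_bigrams` are hypothetical names.

```r
library(dplyr)
library(tidytext)

# The title line of Project Gutenberg book 46
title_line <- tibble(
  gutenberg_id = 46L,
  text = "A Christmas Carol in Prose; Being a Ghost Story of Christmas"
)

# token = "ngrams" with n = 2 yields overlapping pairs of consecutive words
toy_bigrams <- title_line %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams
```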

After filtering out stop words, what are the most frequent bigrams?

Apparently, names such as “oliver twist” are the most commonly paired words in Dickens’ novels. There are also some pairings of common nouns, such as “wine shop” from “A Tale of Two Cities”.
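Filtering stop words out of bigrams usually means splitting each bigram into two columns, dropping rows where either word is a stop word, and counting what is left. A sketch on a single title line, assuming dplyr, tidyr and tidytext are installed (all object names are hypothetical):

```r
library(dplyr)
library(tidyr)
library(tidytext)

toy_bigrams <- tibble(
  gutenberg_id = 46L,
  text = "A Christmas Carol in Prose; Being a Ghost Story of Christmas"
) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # one column per word
  filter(!word1 %in% stop_words$word,                   # drop the pair if either
         !word2 %in% stop_words$word) %>%               # word is a stop word
  count(word1, word2, sort = TRUE)

bigram_counts
```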

“A Tale of Two Cities”

To continue, I downloaded the plain text file for “A Tale of Two Cities”, left out the Project Gutenberg header and footer information, and then concatenated the lines into paragraphs. Now let’s have a look at a few lines:
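Joining lines into paragraphs can be done in base R by treating blank lines as paragraph breaks. This is a sketch of one approach on a few lines from the novel, not necessarily how the original file was processed; `raw_lines`, `para_id` and `paragraphs` are hypothetical names.

```r
# A few lines from the novel, with a blank line between paragraphs
raw_lines <- c(
  "It was the best of times,",
  "it was the worst of times,",
  "",
  "There were a king with a large jaw",
  "and a queen with a plain face, on the throne of England;"
)

# Every blank line seen so far bumps the paragraph number by one
para_id <- cumsum(raw_lines == "")
keep    <- raw_lines != ""

# Paste the lines of each paragraph together with single spaces
paragraphs <- unname(vapply(
  split(raw_lines[keep], para_id[keep]),
  paste, character(1), collapse = " "
))

paragraphs
```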

It looks right. Now I apply the NRC sentiment dictionary to this novel. This creates a data frame that combines the line number of the book with the sentiment scores, keeping only the positive and negative scores for visualization.
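One possible way to build that data frame is with the syuzhet package, whose `get_nrc_sentiment()` returns one column per NRC category. The post does not say which package was used, so treat this as an assumption; `score_nrc` and `to_long` are hypothetical helpers, and the column names follow the plotting code in this post.

```r
library(tidyr)

# Hypothetical helper: scores each paragraph with the NRC lexicon via syuzhet
# and keeps only the positive and negative columns. Requires syuzhet.
score_nrc <- function(paragraphs) {
  scores <- syuzhet::get_nrc_sentiment(paragraphs)
  cbind(
    linenumber = seq_along(paragraphs),
    scores[, c("positive", "negative")]
  )
}

# Hypothetical helper: long format, one row per line number and sentiment
to_long <- function(scores) {
  pivot_longer(
    scores,
    cols      = c(positive, negative),
    names_to  = "sentiment",
    values_to = "value"
  )
}

# pos_neg <- to_long(score_nrc(paragraphs))
```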

library(ggplot2)

# pos_neg has one row per line number and sentiment (negative or positive)
ggplot(data = pos_neg, aes(x = linenumber, y = value, fill = sentiment)) +
  geom_bar(stat = "identity", position = position_dodge()) +  # side-by-side bars
  theme_minimal() +
  ylab("Sentiment") +
  ggtitle("Positive and Negative Sentiment in A Tale of Two Cities") +
  # fill colours follow level order: negative = orange, positive = blue
  scale_fill_manual(values = c("orange", "blue"))

It seems like the positive and negative scores are almost equal overall, which does make sense given the content of the novel.

This was so much fun! Can we do more text analysis with R? Of course! The CRAN Task View on Natural Language Processing collects relevant R packages that support computational linguists in conducting analyses of speech and language on a variety of levels, with a focus on words, syntax, semantics, and pragmatics. I hope to try these machine learning methods on text and share my work with you in future posts.

Changing the world, one post at a time. Sr Data Scientist, Toronto Canada. https://www.linkedin.com/in/susanli/