Text Mining is Fun (with R)!
Charles Dickens wrote fourteen and a half novels. I am going to analyze five of them — “A Tale of Two Cities”, “Great Expectations”, “A Christmas Carol in Prose; Being a Ghost Story of Christmas”, “Oliver Twist” and “Hard Times”.
Project Gutenberg offers over 53,000 free books. I downloaded Dickens’ novels from there as UTF-8 encoded text, using the gutenbergr package developed by David Robinson. In addition, I will be using the following packages for this project:
The downloading of Dickens’ five novels by Project Gutenberg ID numbers looks like so:
dickens <- gutenberg_download(c(98, 1400, 46, 730, 786))
The unnest_tokens function splits each row so that there is one token (word) in each row of the new data frame (tidy_dickens). Stop words are then removed with an anti_join.
tidy_dickens <- dickens %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")  # drop common stop words
After removing the stop words, the following words appear most frequently:
## # A tibble: 19,634 × 2
##      word     n
##     <chr> <int>
## 1    time  1218
## 2    hand   918
## 3   night   835
## 4  looked   814
## 5    head   813
## 6  oliver   766
## 7    dear   751
## 8     joe   718
## 9    miss   702
## 10    sir   697
## # ... with 19,624 more rows
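A frequency table of this shape can be produced with count(). The sketch below is only an illustration: toy_text is a made-up stand-in for the downloaded novels, and toy_counts is a hypothetical name, so the numbers will not match the table above.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the real corpus
toy_text <- tibble(
  gutenberg_id = c(1L, 1L),
  text = c("It was the best of times, it was the worst of times",
           "The spring of hope, the winter of despair")
)

toy_counts <- toy_text %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(word, sort = TRUE)                # most frequent words first

toy_counts
```

Applied to tidy_dickens, the same count(word, sort = TRUE) call should give a table like the one shown.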
Let’s visualize it:
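The plot itself is not reproduced here. As a minimal sketch, assuming a horizontal bar chart drawn with ggplot2 (the chart type, the axis labels and the names top_words and word_plot are assumptions), the ten most frequent words from the table above could be plotted like this:

```r
library(dplyr)
library(ggplot2)

# The ten most frequent words, copied from the table above
top_words <- tibble(
  word = c("time", "hand", "night", "looked", "head",
           "oliver", "dear", "joe", "miss", "sir"),
  n    = c(1218, 918, 835, 814, 813, 766, 751, 718, 702, 697)
)

word_plot <- top_words %>%
  mutate(word = reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                       # horizontal bars, easier to read
  labs(x = NULL, y = "Number of occurrences")

word_plot
```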
The tidytext package contains several sentiment lexicons; I am using “bing” for the following tasks.
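bing_word_counts <- tidy_dickens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the bing lexicon
  count(word, sentiment, sort = TRUE) %>%
  ungroup()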
bing_word_counts
## # A tibble: 3,145 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    miss  negative   702
## 2    poor  negative   350
## 3    dark  negative   299
## 4    hard  negative   223
## 5    dead  negative   218
## 6  strong  positive   203
## 7    love  positive   202
## 8    fell  negative   198
## 9   death  negative   194
## 10   cold  negative   192
## # ... with 3,135 more rows
Here we have the list of words that contribute to the sentiment categories.
The word “miss” is the most frequent negative word here, but in Dickens’ novels it is mostly used as a title for unmarried women. In particular, Miss Havisham is a significant character in “Great Expectations”, and Dickens describes her as looking like “the witch of the place”. So, in this case, “miss” probably should be a negative word.
Oh poor Pip!
Using a word cloud is usually a good idea to identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. In particular, it compares most frequently used positive and negative words.
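The cloud itself is not reproduced here. A minimal sketch of how such a comparison cloud could be drawn, assuming the wordcloud and reshape2 packages: bing_top and cloud_matrix are hypothetical names, the colors are arbitrary, and only the ten rows from the bing table above are used.

```r
library(reshape2)
library(wordcloud)

# Top ten rows of the bing word counts, copied from the table above
bing_top <- data.frame(
  word = c("miss", "poor", "dark", "hard", "dead",
           "strong", "love", "fell", "death", "cold"),
  sentiment = c("negative", "negative", "negative", "negative", "negative",
                "positive", "positive", "negative", "negative", "negative"),
  n = c(702, 350, 299, 223, 218, 203, 202, 198, 194, 192),
  stringsAsFactors = FALSE
)

# Words as rows, one column per sentiment, 0 where a word lacks that sentiment
cloud_matrix <- acast(bing_top, word ~ sentiment, value.var = "n", fill = 0)

# One color per column: negative first, then positive
comparison.cloud(cloud_matrix, colors = c("#F8766D", "#00BFC4"), max.words = 100)
```

With the full bing_word_counts data frame in place of bing_top, the same two steps should apply unchanged.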
Relationships between words
Many interesting text analyses are based on the relationships between words. Pairs of two consecutive words are often called “bigrams”. The following lines of code tokenize the text so that each token represents a bigram:
dickens_bigrams <- dickens %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
dickens_bigrams
## # A tibble: 616,994 × 2
##    gutenberg_id          bigram
##           <int>           <chr>
## 1            46     a christmas
## 2            46 christmas carol
## 3            46        carol in
## 4            46        in prose
## 5            46     prose being
## 6            46         being a
## 7            46         a ghost
## 8            46     ghost story
## 9            46        story of
## 10           46    of christmas
## # ... with 616,984 more rows
After filtering out stop words, what are the most frequent bigrams?
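The filtering step is not shown. One common way to do it, sketched here under the assumption that tidyr is available, is to split each bigram into two columns and drop any pair in which either word is a stop word. The toy sentences and the names toy_bigrams and bigram_counts are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy text standing in for the novels
toy_bigrams <- tibble(
  text = c("Oliver Twist asked for more",
           "the wine shop of the wine shop keeper")
) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # split the pair
  filter(!word1 %in% stop_words$word,                   # drop the pair if either
         !word2 %in% stop_words$word) %>%               # word is a stop word
  count(word1, word2, sort = TRUE)

bigram_counts
```

Running the same pipeline on dickens_bigrams should give the bigram frequencies discussed next.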
Apparently, names are the most commonly paired words in Dickens’ novels. There are also some pairings of common nouns such as “wine shop” from “A Tale of Two Cities” and “oliver twist”.
“A Tale of Two Cities”
To continue, I downloaded the plain text file for “A Tale of Two Cities”, left out the Project Gutenberg header and footer information, and then concatenated the lines into paragraphs. Now let’s have a look at a few lines:
## "I. The Period It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light,"
## "it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
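The preprocessing code is not shown, so the following base R sketch is only one possible way to do it. It assumes the header and footer are delimited by lines starting with "*** START OF" and "*** END OF", and that blank lines separate paragraphs; the chunks printed above are evidently split differently. raw_lines is a made-up miniature of the downloaded file.

```r
# Hypothetical miniature of a Project Gutenberg plain text file
raw_lines <- c(
  "Title: A Tale of Two Cities",
  "*** START OF THE PROJECT GUTENBERG EBOOK ***",
  "",
  "It was the best of times,",
  "it was the worst of times,",
  "",
  "There were a king with a large jaw",
  "and a queen with a plain face.",
  "",
  "*** END OF THE PROJECT GUTENBERG EBOOK ***",
  "Footer text"
)

# Keep only the lines between the start and end markers
start <- grep("^\\*\\*\\* START OF", raw_lines)
end   <- grep("^\\*\\*\\* END OF", raw_lines)
body  <- raw_lines[(start + 1):(end - 1)]

# Each blank line opens a new paragraph; join the lines within a paragraph
para_id <- cumsum(body == "")
keep    <- body != ""
paragraphs <- as.character(
  tapply(body[keep], para_id[keep], paste, collapse = " ")
)

paragraphs
```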
It looks right. Now apply the NRC sentiment dictionary to this novel. It creates a data frame combining the line number of the book with the sentiment score, extracting positive and negative scores for visualization.
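How tale_nrc is built is not shown. A minimal sketch, assuming the syuzhet package and its get_nrc_sentiment() function, with a hypothetical character vector tale holding one chunk of text per element:

```r
library(syuzhet)

# Hypothetical stand-in for the novel: one element per chunk of text
tale <- c(
  "It was the best of times, it was the worst of times,",
  "it was the spring of hope, it was the winter of despair,"
)

# One row per chunk: eight emotion counts plus negative and positive scores
tale_nrc <- cbind(linenumber = seq_along(tale), get_nrc_sentiment(tale))

tale_nrc[, c("linenumber", "positive", "negative")]
```

The resulting data frame has the linenumber, positive and negative columns that the plotting code relies on.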
library(dplyr)
library(reshape2)
library(ggplot2)
library(ggthemes)

# Flip the negative scores so they plot below the x-axis
tale_nrc$negative <- -tale_nrc$negative
pos_neg <- tale_nrc %>%
  select(linenumber, positive, negative) %>%
  melt(id.vars = "linenumber")
names(pos_neg) <- c("linenumber", "sentiment", "value")
ggplot(data = pos_neg, aes(x = linenumber, y = value, fill = sentiment)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  theme_minimal() +
  ggtitle("Positive and Negative Sentiment in A Tale of Two Cities") +
  scale_fill_manual(values = c("orange", "blue"))
It seems like the positive and negative scores are almost equal overall, which does make sense given the content of the novel.
This was so much fun! Can we do more text analysis with R? Of course. The CRAN Task View on Natural Language Processing collects relevant R packages that support computational linguists in conducting analyses of speech and language on a variety of levels, with a focus on words, syntax, semantics, and pragmatics. I hope to try some of these machine learning methods for text and share my work with you in future posts.