
Named Entity Recognition with NLTK and SpaCy

NER is used in many fields in Natural Language Processing (NLP)


Named entity recognition (NER) is probably the first step towards information extraction: it seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER is used in many fields in Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

  • Which companies were mentioned in the news article?
  • Were specified products mentioned in complaints or reviews?
  • Does the tweet contain the name of a person? Does the tweet contain this person’s location?

This article describes how to build a named entity recognizer with NLTK and SpaCy, to identify the names of things, such as persons, organizations, and locations, in raw text. Let’s get started!

NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# One-time downloads of the models the tokenizer and tagger rely on:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

Information Extraction

I took a sentence from The New York Times, “European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices.”

ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

Then we apply word tokenization and part-of-speech tagging to the sentence.

def preprocess(sent):
    # split the sentence into word tokens
    sent = nltk.word_tokenize(sent)
    # tag each token with its part of speech
    sent = nltk.pos_tag(sent)
    return sent

Let’s see what we get:

sent = preprocess(ex)
sent
Figure 1

We get a list of tuples, each containing an individual word from the sentence and its associated part-of-speech tag.

Now we’ll implement noun phrase chunking to identify named entities using a regular expression consisting of rules that indicate how sentences should be chunked.

Our chunk pattern consists of one rule, that a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

pattern = 'NP: {<DT>?<JJ>*<NN>}'

Chunking

Using this pattern, we create a chunk parser and test it on our sentence.

cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)
Figure 2

The output can be read as a tree or a hierarchy, with S as the top level, denoting the sentence. We can also display it graphically.
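
For the graphical view, NLTK trees have a draw() method that opens a pop-up window (a minimal sketch; it requires a local display and was not shown in the original snippet):

# open a window with a graphical rendering of the chunk tree
cs.draw()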

Figure 3

IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format.

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
Figure 4

In this representation, there is one token per line, each with its part-of-speech tag and its named entity tag. Based on this training corpus, we can construct a tagger that can be used to label new sentences, and use the nltk.chunk.conlltags2tree() function to convert the tag sequences back into a chunk tree.
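
As a quick sanity check, we can convert the IOB triples straight back into a tree (a minimal sketch; the equality test relies on the flat chunk structure above):

# round-trip: rebuild the chunk tree from the IOB triples
tree = conlltags2tree(iob_tagged)
print(tree == cs)  # should print True: the conversion is lossless here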

With the function nltk.ne_chunk(), we can recognize named entities using a classifier; the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

# one-time downloads: nltk.download('maxent_ne_chunker'), nltk.download('words')
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Figure 5

Google is recognized as a person. It’s quite disappointing, don’t you think?

SpaCy

SpaCy’s named entity recognition has been trained on the OntoNotes 5 corpus and it supports the following entity types:

Figure 6 (Source: SpaCy)

Entity

import spacy
from spacy import displacy
from collections import Counter

# the small English model, installed with: python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
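
As a quick check, we can ask the loaded pipeline which of those entity labels it actually supports, together with spaCy’s short description of each (a minimal sketch; get_pipe('ner') assumes the default pipeline, and spacy.explain returns None for labels it does not know):

# list the entity labels the model supports, with a short description of each
for label in nlp.get_pipe('ner').labels:
    print(label, '-', spacy.explain(label))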

We are using the same sentence, “European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices.”

One of the nice things about SpaCy is that we only need to apply nlp once; the entire processing pipeline runs in the background and returns a fully annotated document object.

doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])
Figure 7

European is NORP (nationalities or religious or political groups), Google is an organization, $5.1 billion is a monetary value, and Wednesday is a date. They are all correct.

Token

In the above example we were working at the entity level; in the following example we demonstrate token-level entity annotation, using per-token IOB tags to mark the entity boundaries (spaCy’s docs also describe the finer-grained BILUO scheme).

Figure 8 (Source: SpaCy)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Figure 9

"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

Extracting named entities from an article

Now let’s get serious with SpaCy and extract named entities from a New York Times article, “F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired.”

from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # drop script, style and aside elements so only the article text remains
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)

188

There are 188 entities in the article and they are represented as 10 unique labels:

labels = [x.label_ for x in article.ents]
Counter(labels)
Figure 10

The following are the three most frequent entity mentions.

items = [x.text for x in article.ents]
Counter(items).most_common(3)
Figure 11

Let’s randomly select one sentence to learn more.

sentences = [x for x in article.sents]
print(sentences[20])
Figure 12

Let’s run displacy.render to highlight the entities in this sentence.

displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')
Figure 13

One misclassification here is F.B.I. It is a hard case, isn’t it?

Using spaCy’s built-in displaCy visualizer, here’s what the above sentence and its dependencies look like:

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})
Figure 14

Next, we extract the verbatim text, part-of-speech tag, and lemma for each token in this sentence.

[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[20]))
 if not x.is_stop and x.pos_ != 'PUNCT']
Figure 15
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])
Figure 16

The named entities are extracted correctly, except for “F.B.I.”.

print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])
Figure 17

Finally, we visualize the entities of the entire article.
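
The rendering call itself was not shown in the original post; the same displacy.render pattern used above, applied to the whole document, produces it:

# highlight every entity in the article inline in the notebook
displacy.render(article, jupyter=True, style='ent')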

Figure 18

Try it yourself. It was fun! Source code can be found on GitHub. Happy Friday!
