Part of natural language processing (NLP) is determining the grammatical role of each word or token in a body of text. This process is called part-of-speech (POS) tagging. The NLTK package comes with a function, pos_tag(), that makes this job relatively seamless and gives you a good starting point. Your particular corpus may have its own nuances, but this function will get you most of the way there.
Installation and setup
First, you need to install and import nltk, along with a few data dependencies for the functions used here.
!pip install nltk
import nltk
from nltk.tokenize import word_tokenize
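nltk.download('punkt') # Needed for word_tokenize()
nltk.download('averaged_perceptron_tagger') # Needed for pos_tag()
# On recent NLTK releases, a LookupError may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download whichever names the error reports.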
nltk.word_tokenize()
Tokenize Text: Before you can tag words, you need to tokenize the text, which you can also do with nltk functions. You can check out our full post on how to tokenize text using NLTK if you want a more in-depth review.
text = """It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.
During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.
Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy."""
tokenized_text = word_tokenize(text, language = "english")
nltk.pos_tag(tokenized_text)
POS Tagging with NLTK Example: The only argument you actually need to supply is the tokenized text, which is just a list of strings.
tagged_text = nltk.pos_tag(tokenized_text)
tagged_text
Output:
[('It', 'PRP'),
('is', 'VBZ'),
('a', 'DT'),
('period', 'NN'),
('of', 'IN'),
('civil', 'JJ'),
('war', 'NN'),
('.', '.'),
('Rebel', 'NNP'),
('spaceships', 'NNS'),
(',', ','),
('striking', 'VBG'),
('from', 'IN'),
('a', 'DT'),
('hidden', 'JJ'),
('base', 'NN'),
(',', ','),
...
From the output, you can see a variety of abbreviations, one for each token. These are the part-of-speech tags from the Penn Treebank Project, which is the tagset NLTK's default English tagger produces. If you want something simpler and less language-specific, you can use the universal tagset below.
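Because the result is an ordinary list of (token, tag) tuples, standard Python tools are enough to summarize it. The sketch below hard-codes the first 17 pairs from the output above rather than calling NLTK, so it runs without any downloads. The PENN_GLOSS dictionary is a hand-written, illustrative lookup that covers only the tags in that excerpt, not the full Penn Treebank tagset.

```python
from collections import Counter

# First 17 (token, tag) pairs copied from the output above.
tagged_sample = [
    ("It", "PRP"), ("is", "VBZ"), ("a", "DT"), ("period", "NN"),
    ("of", "IN"), ("civil", "JJ"), ("war", "NN"), (".", "."),
    ("Rebel", "NNP"), ("spaceships", "NNS"), (",", ","),
    ("striking", "VBG"), ("from", "IN"), ("a", "DT"),
    ("hidden", "JJ"), ("base", "NN"), (",", ","),
]

# Hand-written glosses for the tags in this excerpt only (illustrative, not exhaustive).
PENN_GLOSS = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "VBG": "verb, gerund or present participle",
}

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Print tags from most to least frequent, with a gloss where one is known.
for tag, count in tag_counts.most_common():
    print(f"{tag:<4} {count}  {PENN_GLOSS.get(tag, 'punctuation')}")
```

The same pattern should work on the full tagged_text list from the example. NLTK also ships its own tag documentation through nltk.help.upenn_tagset(), which needs an additional data download.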
nltk.pos_tag(tokenized_text, tagset = "universal")
NLTK Universal POS Tagset: Make sure you download the "universal tagset" first, and set the tagset argument when you use nltk.pos_tag().
nltk.download("universal_tagset")
tagged_text2 = nltk.pos_tag(tokenized_text, tagset = "universal")
tagged_text2
Output:
[('It', 'PRON'),
('is', 'VERB'),
('a', 'DET'),
('period', 'NOUN'),
('of', 'ADP'),
('civil', 'ADJ'),
('war', 'NOUN'),
('.', '.'),
('Rebel', 'NOUN'),
('spaceships', 'NOUN'),
(',', '.'),
...
The universal tags are a bit coarser, but they are also more human-readable, so which tagset to use depends on what your project needs.
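To make "coarser" concrete, here is a minimal sketch of how several Penn Treebank tags collapse into a single universal tag. The PENN_TO_UNIVERSAL dictionary is a hand-written, partial mapping that covers only the tags in the two outputs above. It is an illustration and not NLTK's internal mapping table, and to_universal() is a hypothetical helper, not an NLTK function.

```python
# Partial, hand-written mapping (assumption: covers only tags seen in the outputs above).
PENN_TO_UNIVERSAL = {
    "PRP": "PRON",
    "VBZ": "VERB", "VBG": "VERB",
    "DT": "DET",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "IN": "ADP",
    "JJ": "ADJ",
    ".": ".", ",": ".",
}

def to_universal(tagged):
    """Map (token, Penn tag) pairs to (token, universal tag) pairs.

    Tags missing from the partial mapping fall back to 'X', the universal
    tagset's catch-all category.
    """
    return [(token, PENN_TO_UNIVERSAL.get(tag, "X")) for token, tag in tagged]

# First 11 pairs from the Penn Treebank output shown earlier.
penn_sample = [
    ("It", "PRP"), ("is", "VBZ"), ("a", "DT"), ("period", "NN"),
    ("of", "IN"), ("civil", "JJ"), ("war", "NN"), (".", "."),
    ("Rebel", "NNP"), ("spaceships", "NNS"), (",", ","),
]

universal_sample = to_universal(penn_sample)
print(universal_sample)

# With the coarser tags, one comparison picks up all three noun variants.
nouns = [token for token, tag in universal_sample if tag == "NOUN"]
print(nouns)
```

Three distinct Penn tags (NN, NNS, NNP) end up as NOUN, which is why filtering for nouns takes a single comparison here but would need three with the default tags.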