Part-of-Speech (POS) Tagging with NLTK pos_tag()

Einblick Content Team - May 24th, 2023

Part-of-speech (POS) tagging is the natural language processing task of labeling each word or token in a body of text with its grammatical role, such as noun, verb, or adjective. The NLTK package comes with a function, pos_tag(), that handles this with a pretrained tagger and gives us a good starting point. Your particular corpus may have nuances that call for further tuning, but this function covers most common cases out of the box.

Installation and setup

First, install and import nltk, then download the resources that the functions we're using depend on.

!pip install nltk

import nltk
from nltk.tokenize import word_tokenize
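nltk.download('punkt') # Needed for word_tokenize()
nltk.download('averaged_perceptron_tagger') # Needed for pos_tag()
# Note: recent NLTK releases may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'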

Tokenize Text: nltk.word_tokenize()

Before we can tag words, we need to tokenize the text, which NLTK can also do with word_tokenize(). You can check out our full post on how to tokenize text using NLTK if you want a more in-depth review.

text = """It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.

During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.

Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy."""

tokenized_text = word_tokenize(text, language = "english")

POS Tagging with NLTK Example: nltk.pos_tag(tokenized_text)

The only argument you need to supply is the tokenized text, which is just a list of strings. The function returns a list of (token, tag) tuples.

tagged_text = nltk.pos_tag(tokenized_text)
tagged_text

Output:

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('period', 'NN'),
 ('of', 'IN'),
 ('civil', 'JJ'),
 ('war', 'NN'),
 ('.', '.'),
 ('Rebel', 'NNP'),
 ('spaceships', 'NNS'),
 (',', ','),
 ('striking', 'VBG'),
 ('from', 'IN'),
 ('a', 'DT'),
 ('hidden', 'JJ'),
 ('base', 'NN'),
 (',', ','),
 ...

In the output, each token is paired with an abbreviated tag. These are the part-of-speech tags from the Penn Treebank Project, which pos_tag() uses by default: for example, NN is a singular noun, NNS is a plural noun, and VBZ is a third-person singular present-tense verb. If you want a simpler, coarser set of categories, you can use the universal tagset below.
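To make the abbreviations easier to read, the tags that appear in the output above can be paired with short descriptions. The following is a minimal sketch: the PENN_TAG_GLOSSARY dictionary and the describe() helper are hypothetical names introduced here for illustration, not part of NLTK, and the glossary covers only the tags shown above rather than the full Penn Treebank tagset.

```python
# Hypothetical glossary covering only the Penn Treebank tags shown in the output above.
PENN_TAG_GLOSSARY = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "VBG": "verb, gerund or present participle",
    ".": "sentence-final punctuation",
    ",": "comma",
}

# First few (token, tag) pairs copied from the pos_tag() output above.
sample_tagged = [
    ("It", "PRP"), ("is", "VBZ"), ("a", "DT"), ("period", "NN"),
    ("of", "IN"), ("civil", "JJ"), ("war", "NN"), (".", "."),
]


def describe(tagged_tokens, glossary=PENN_TAG_GLOSSARY):
    """Return (token, tag, description) triples; unknown tags get a placeholder."""
    return [
        (token, tag, glossary.get(tag, "not in this glossary"))
        for token, tag in tagged_tokens
    ]


# Print one aligned row per token.
for token, tag, description in describe(sample_tagged):
    print(f"{token:<8}{tag:<5}{description}")
```

For the complete list of tags, NLTK also provides nltk.help.upenn_tagset(), which prints the tag documentation once the corresponding help resource has been downloaded (named 'tagsets' in older NLTK releases; the exact resource name may differ in your version).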

NLTK Universal POS Tagset: nltk.pos_tag(tokenized_text, tagset = "universal")

Make sure you download the "universal_tagset" resource first, then set the tagset argument when you call nltk.pos_tag().

nltk.download("universal_tagset")
tagged_text2 = nltk.pos_tag(tokenized_text, tagset = "universal")
tagged_text2

Output:

[('It', 'PRON'),
 ('is', 'VERB'),
 ('a', 'DET'),
 ('period', 'NOUN'),
 ('of', 'ADP'),
 ('civil', 'ADJ'),
 ('war', 'NOUN'),
 ('.', '.'),
 ('Rebel', 'NOUN'),
 ('spaceships', 'NOUN'),
 (',', '.'),
 ...

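The universal tags are coarser but more human-readable, so which tagset to use depends on what your project needs. For example, the universal tagset labels both 'Rebel' and 'spaceships' as NOUN, while the Penn Treebank tags distinguish a proper noun (NNP) from a plural noun (NNS).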
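To see the coarsening concretely, the sketch below maps the Penn Treebank tags from the first output onto the universal tags from the second, then groups the fine-grained tags by the coarse tag they collapse into. The PENN_TO_UNIVERSAL dictionary is a hand-written assumption that covers only the tags shown in this post; it is not NLTK's own mapping table, which pos_tag() applies internally when tagset = "universal" is set.

```python
from collections import defaultdict

# Hand-written mapping (an assumption for illustration), limited to the tags
# that appear in the two outputs above.
PENN_TO_UNIVERSAL = {
    "PRP": "PRON",
    "VBZ": "VERB", "VBG": "VERB",
    "DT": "DET",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "IN": "ADP",
    "JJ": "ADJ",
    ".": ".", ",": ".",
}

# (token, tag) pairs copied from the default pos_tag() output above.
penn_tagged = [
    ("It", "PRP"), ("is", "VBZ"), ("a", "DT"), ("period", "NN"),
    ("of", "IN"), ("civil", "JJ"), ("war", "NN"), (".", "."),
    ("Rebel", "NNP"), ("spaceships", "NNS"), (",", ","),
]

# Convert each Penn Treebank tag to its universal counterpart.
universal_tagged = [(token, PENN_TO_UNIVERSAL[tag]) for token, tag in penn_tagged]

# Group the Penn tags seen in the sample by the universal tag they map to.
collapsed = defaultdict(set)
for _, tag in penn_tagged:
    collapsed[PENN_TO_UNIVERSAL[tag]].add(tag)

for universal_tag, penn_tags in sorted(collapsed.items()):
    print(universal_tag, "<-", sorted(penn_tags))
```

In this sample, three distinct Penn Treebank noun tags (NN, NNS, NNP) all end up as NOUN, which is the kind of detail you give up in exchange for the simpler labels.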

