Part of natural language processing is determining the role of each word or token in a body of text. In NLP, this process is called part-of-speech (POS) tagging. The NLTK package comes with a function, pos_tag(), that makes this job relatively seamless and gives us a good starting point. Your particular corpus may have nuances that the default tagger handles poorly, but this function will get you most of the way there.
Installation and setup
First, you need to install and import nltk, and download a few resources that the functions we're using depend on.
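!pip install nltk

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Needed for word_tokenize()
nltk.download('averaged_perceptron_tagger')  # Needed for pos_tag()
# Note: recent NLTK releases (3.9 and later) look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if you get a LookupError.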
Next, in order to actually tag words, we need to tokenize the text, which you can also do with nltk functions. You can check out our full post on how to tokenize text using NLTK if you want a more in-depth review.
text = """It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, and space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy."""

tokenized_text = word_tokenize(text, language = "english")
POS Tagging with NLTK Example:
The only argument you actually need to supply is the tokenized text, which is just a list of strings.
tagged_text = nltk.pos_tag(tokenized_text)
tagged_text
[('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('period', 'NN'), ('of', 'IN'), ('civil', 'JJ'), ('war', 'NN'), ('.', '.'), ('Rebel', 'NNP'), ('spaceships', 'NNS'), (',', ','), ('striking', 'VBG'), ('from', 'IN'), ('a', 'DT'), ('hidden', 'JJ'), ('base', 'NN'), (',', ','), ...
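Because the result is an ordinary list of (token, tag) tuples, standard Python tools work on it directly. The sketch below hard-codes the first few pairs from the output above so that it runs without NLTK, then counts how often each tag occurs and pulls out the noun tokens. The variable names are illustrative and not part of NLTK.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above.
sample = [
    ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('period', 'NN'),
    ('of', 'IN'), ('civil', 'JJ'), ('war', 'NN'), ('.', '.'),
    ('Rebel', 'NNP'), ('spaceships', 'NNS'), (',', ','),
    ('striking', 'VBG'), ('from', 'IN'), ('a', 'DT'),
    ('hidden', 'JJ'), ('base', 'NN'), (',', ','),
]

# Count how many times each tag appears in the sample.
tag_counts = Counter(tag for _, tag in sample)
print(tag_counts.most_common(3))

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS),
# so a prefix check is enough to collect every noun token.
nouns = [token for token, tag in sample if tag.startswith("NN")]
print(nouns)
```

The same pattern applies to the full tagged_text list returned by pos_tag().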
From the output, you can see a variety of abbreviations, one associated with each token. These are the part-of-speech tags used in the Penn Treebank Project, which is the tagset NLTK's default English tagger produces. If you want fewer, broader categories, you can use the universal tagset below.
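As a quick reference, the tags that appear in the truncated output above can be spelled out in a small lookup table. This is a hand-written sketch covering only those tags, not the full Penn Treebank tagset, and the describe() helper is a hypothetical name used for illustration.

```python
# Hand-written descriptions for the Penn Treebank tags seen in the output above.
# This is a partial list for illustration, not the complete tagset.
PENN_TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "VBG": "verb, gerund or present participle",
}


def describe(tag):
    """Return a readable description, falling back to the raw tag if unknown."""
    return PENN_TAG_DESCRIPTIONS.get(tag, tag)


# Print a few tokens from the output alongside their tag descriptions.
for token, tag in [("Rebel", "NNP"), ("spaceships", "NNS"), ("striking", "VBG")]:
    print(f"{token:<12}{tag:<5}{describe(tag)}")
```

Punctuation tags such as "." and "," are not in the table, so describe() simply returns them unchanged.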
NLTK Universal POS Tagset:
nltk.pos_tag(tokenized_text, tagset = "universal")
Make sure you download the "universal_tagset" resource first, and set the tagset argument when you use pos_tag():
nltk.download("universal_tagset")

tagged_text2 = nltk.pos_tag(tokenized_text, tagset = "universal")
tagged_text2
[('It', 'PRON'), ('is', 'VERB'), ('a', 'DET'), ('period', 'NOUN'), ('of', 'ADP'), ('civil', 'ADJ'), ('war', 'NOUN'), ('.', '.'), ('Rebel', 'NOUN'), ('spaceships', 'NOUN'), (',', '.'), ...
The designations are a bit coarser, but are also more human-readable, so it is up to you what you need for your project.
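To make "coarser" concrete, the two outputs shown earlier can be lined up token by token. The sketch below hard-codes the first eleven pairs from each output, so it runs without NLTK, and groups the Penn Treebank tags under the universal tag each one was replaced by. It only reflects the tags observed in this sample and is not the complete mapping NLTK uses internally.

```python
from collections import defaultdict

# First eleven (token, tag) pairs from the default (Penn Treebank) output.
penn = [
    ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('period', 'NN'),
    ('of', 'IN'), ('civil', 'JJ'), ('war', 'NN'), ('.', '.'),
    ('Rebel', 'NNP'), ('spaceships', 'NNS'), (',', ','),
]

# The same eleven tokens from the universal-tagset output.
universal = [
    ('It', 'PRON'), ('is', 'VERB'), ('a', 'DET'), ('period', 'NOUN'),
    ('of', 'ADP'), ('civil', 'ADJ'), ('war', 'NOUN'), ('.', '.'),
    ('Rebel', 'NOUN'), ('spaceships', 'NOUN'), (',', '.'),
]

# For each universal tag, collect the Penn tags that were folded into it.
collapsed = defaultdict(set)
for (penn_token, penn_tag), (uni_token, uni_tag) in zip(penn, universal):
    assert penn_token == uni_token  # both lists must describe the same tokens
    collapsed[uni_tag].add(penn_tag)

for uni_tag in sorted(collapsed):
    print(uni_tag, "<-", sorted(collapsed[uni_tag]))
```

In this sample, NN, NNS, and NNP all become NOUN, and both "." and "," become ".", so distinctions such as singular versus plural or common versus proper nouns are lost. If your project depends on those distinctions, the default tags are the safer choice.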