Removing stop words is an important step in processing any text data, particularly for tasks like sentiment analysis, where stop words carry little semantic meaning but can bloat your corpus. In this post, we'll go over how to remove and customize stop words using NLTK.
Install nltk and load corpus
You can install and load nltk like any other Python package.
!pip install nltk
import nltk
For our example, we'll pull from one of the built-in corpora that nltk makes available: the entirety of the Shakespeare play, Julius Caesar. The text is available via the gutenberg corpus reader object. To access the corpus, you first need to download it; then you can import the reader.
# Must download first before importing!
nltk.download("gutenberg")
from nltk.corpus import gutenberg
Next, you can see which corpora are available with the fileids() method.
gutenberg.fileids()
Output:
['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
Let's print the raw text:
print(gutenberg.raw("shakespeare-caesar.txt"))
Output:
[The Tragedie of Julius Caesar by William Shakespeare 1599]
Actus Primus. Scoena Prima.
Enter Flauius, Murellus, and certaine Commoners ouer the Stage.
Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
Car. Why Sir, a Carpenter
...
The corpus reader object includes a few handy functions to speed up text processing, such as the words() function, which returns a list of all the words in a given text file. For our purposes, we'll take a subset of the first 500 words of that list:
# Get words from Julius Caesar as a list of strings
caesar = gutenberg.words("shakespeare-caesar.txt")
# Get the first 500 words
caesar_sub = caesar[0:500]
caesar_sub
Output:
['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ...]
From the output, we can already see that there are a few "words" we might want to exclude from certain NLP tasks, such as the square bracket and the words "of" and "the," which have little meaning in isolation.
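As an aside, if you only want to drop punctuation tokens (and not function words like "of" and "the"), Python's built-in str.isalpha() can handle that without any stop word list. A minimal sketch, using a few hardcoded tokens from the start of the corpus:

```python
# A few tokens from the start of the Julius Caesar word list
tokens = ["[", "The", "Tragedie", "of", "Julius", "Caesar"]

# Keep only tokens made up entirely of letters, dropping punctuation
alpha_only = [t for t in tokens if t.isalpha()]
print(alpha_only)  # ['The', 'Tragedie', 'of', 'Julius', 'Caesar']
```

Note that this keeps "of" and "The" — to remove those, we still need a stop word list, which is what we'll load next.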
Load stopwords
Next, we'll download and load the stopwords corpus, much as we did with gutenberg. Then, we'll take a look at the English stop words:
# Again, make sure to download before importing!
nltk.download("stopwords")
from nltk.corpus import stopwords
stopwords.words('english')
Output:
['i',
'me',
'my',
'myself',
...
'd',
'll',
'm',
'o',
're',
...
'won',
"won't",
'wouldn',
"wouldn't"]
Included in the list of stop words are beginnings and endings of contractions, such as 'd' (I'd) and 'won' (won't). Also note that the stop words are all lowercase!
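Because the list is all lowercase, a case-sensitive membership check will silently miss capitalized words. A quick illustration, using a handful of entries hardcoded from the list above:

```python
# A handful of entries from NLTK's English stop word list, hardcoded for illustration
sw_sample = ["i", "me", "my", "the", "d", "won", "won't"]

# "The" is not found, because membership checks are case-sensitive...
print("The" in sw_sample)          # False
# ...so lowercase each word before checking
print("The".lower() in sw_sample)  # True
```

This is exactly why we lowercase each word before filtering in the next section.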
Remove stop words
To remove the stop words, we'll do the following:
- Save the stop words as a list
- Check each word in the Julius Caesar corpus against the list of stop words
- Only save the words NOT in the list of stop words
NOTE: in the code chunk below we convert each word to lowercase before checking it against the list. This is because Python treats the uppercase, titlecase, and lowercase versions of the same word as distinct strings, even though a human reads them as the same word.
# Save English stopwords
sw_eng = stopwords.words('english')
# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]
caesar_no_sw
Output:
['[',
'tragedie',
'julius',
'caesar',
'william',
'shakespeare',
...
':',
'saw',
'chariot',
'appeare',
',',
'haue']
Remove custom stop words
As you can see from the output above, there are still some "words" not in the stop word list that we might want to remove, such as extraneous punctuation marks, "William," and "Shakespeare." We can add these with standard list methods like extend(). Then we'll follow the same steps as above to remove the custom words.
# Add more words to stopwords list, based on context
new_sw = ['[', ']', ':', ',', '.', '!', '?', 'shakespeare', 'william', '(', ')']
sw_eng.extend(new_sw)
# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]
caesar_no_sw
Output:
['tragedie',
'julius',
'caesar',
'1599',
'actus',
...
'rome',
'saw',
'chariot',
'appeare',
'haue']
The list looks better now! Running len(caesar_no_sw) returns 236, indicating that we have removed 264 words from the original 500! Upon closer examination, we could remove more stop words, or move on to further processing steps.
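One last tweak worth considering before scaling up: list membership checks are O(n), so converting the stop word list to a set makes each lookup O(1), which matters for longer texts. A minimal sketch, using a stand-in list in place of the full extended NLTK list:

```python
# Stand-in for the extended English stop word list built above
sw_eng = ["the", "of", "and", "[", "]", ",", "shakespeare", "william"]
sw_set = set(sw_eng)  # set gives O(1) membership checks

words = ["[", "The", "Tragedie", "of", "Julius", "Caesar", "by", "William", "Shakespeare"]
filtered = [w.lower() for w in words if w.lower() not in sw_set]
print(filtered)  # ['tragedie', 'julius', 'caesar', 'by']
```

The filtering logic is unchanged; only the lookup structure differs, so the output matches what the list-based version would produce.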
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.