Removing stop words is an important step in processing any text data, particularly for tasks like sentiment analysis, where stop words carry little semantic meaning but can bloat your corpus. In this post, we'll go over how to remove and customize stop words using nltk.

Install nltk and load a corpus

You can install and load nltk like any other Python package.
```python
!pip install nltk
import nltk
```
For our example, we'll pull from one of the built-in corpora that nltk makes available: the entirety of the Shakespeare play Julius Caesar. The text is available via the gutenberg corpus reader object. To access the corpora, you first need to download the corpus, and then you can import the reader.
```python
# Must download first before importing!
nltk.download("gutenberg")
from nltk.corpus import gutenberg
```
Next, you can see which corpora are available with the fileids() function:

```
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```
Let's print the raw text:
```
[The Tragedie of Julius Caesar by William Shakespeare 1599]

Actus Primus. Scoena Prima.

Enter Flauius, Murellus, and certaine Commoners ouer the Stage.

  Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
  Car. Why Sir, a Carpenter
...
```
The corpus reader object includes a few handy functions to speed up text processing, such as the words() function, which returns a list of all the words in a given text file. For our purposes, we'll then take a subset of the first 500 words of that list:
```python
# Get words from Julius Caesar as a list of strings
caesar = gutenberg.words("shakespeare-caesar.txt")

# Get the first 500 words
caesar_sub = caesar[0:500]
caesar_sub
```
```
['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ...]
```
From the output, we can already see that there are a few "words" we might want to exclude from certain NLP tasks, such as the square bracket and the words "of" and "the," which have little meaning in isolation.
Next, we'll download and load the stopwords corpus, much as we did with gutenberg. Then, we'll take a look at the English stop words:
```python
# Again, make sure to download before importing!
nltk.download("stopwords")
from nltk.corpus import stopwords
stopwords.words('english')
```
```
['i', 'me', 'my', 'myself', ... 'd', 'll', 'm', 'o', 're', ... 'won', "won't", 'wouldn', "wouldn't"]
```
Included in the list of stop words are fragments from the beginnings and ends of contractions, such as 'd' (as in "I'd") and 'won' (as in "won't"). Also note that the stop words are all lowercase!
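Because the removal step below uses plain list membership, case matters. Here's a quick illustration with a small hand-made list standing in for the real stop-word list (assumed for the example, so it runs without loading nltk):

```python
# Hand-made stand-in for nltk's English stop-word list (all lowercase)
sw = ["the", "of", "i", "d", "won"]

# Python string comparison is case-sensitive, so titlecase words slip through
print("The" in sw)          # False: 'The' != 'the'

# Lowercasing first makes the check behave the way a human reader expects
print("The".lower() in sw)  # True
```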
Remove stop words
To remove the stop words, we'll do the following:
- Save the stop words as a list
- Check each word in the Julius Caesar corpus against the list of stop words
- Only save the words NOT in the list of stop words
NOTE: in the code chunk below we convert each word to its lowercase form before checking it against the list. This is because Python treats the uppercase, titlecase, and lowercase versions of the same word as distinct strings, even though a human reads them as the same word.
```python
# Save English stopwords
sw_eng = stopwords.words('english')

# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]
caesar_no_sw
```
```
['[', 'tragedie', 'julius', 'caesar', 'william', 'shakespeare', ... ':', 'saw', 'chariot', 'appeare', ',', 'haue']
```
Remove custom stop words
As you can see from the above output, there are still some "words" that were not in the stop words list that we might want to remove, such as extraneous punctuation marks, "William," and "Shakespeare." We can do this with simple list methods like extend(). Then we'll follow the same steps as above to remove these custom words.
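As a brief aside: extend() adds each element of the new list individually, whereas append() would add the whole list as a single nested element — a common slip when growing a stop-word list. A quick comparison:

```python
sw = ["the", "of"]
sw.extend(["[", "]"])    # adds each item individually
print(sw)                # ['the', 'of', '[', ']']

sw2 = ["the", "of"]
sw2.append(["[", "]"])   # adds the entire list as one nested element
print(sw2)               # ['the', 'of', ['[', ']']]
```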
```python
# Add more words to stopwords list, based on context
new_sw = ['[', ']', ':', ',', '.', '!', '?', 'shakespeare', 'william', '(', ')']
sw_eng.extend(new_sw)

# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]
caesar_no_sw
```
```
['tragedie', 'julius', 'caesar', '1599', 'actus', ... 'rome', 'saw', 'chariot', 'appeare', 'haue']
```
The list looks better now! The command len(caesar_no_sw) returns 236, indicating that we have removed 264 of the original 500 words. Upon closer examination, we could remove more stop words, or move on to further processing steps.
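One way to spot further candidates for the custom stop-word list is to count the most frequent remaining tokens. A sketch with a toy token list (standing in for caesar_no_sw, which requires the nltk corpus to build):

```python
from collections import Counter

# Toy stand-in for the filtered word list
tokens = ["caesar", "haue", "rome", "caesar", "'", "'", "'", "rome", "brutus"]

# The most common tokens are often good candidates for custom stop words
print(Counter(tokens).most_common(3))  # [("'", 3), ('caesar', 2), ('rome', 2)]
```

Here the stray apostrophe tops the list, which is exactly the kind of token you'd add to new_sw on the next pass.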
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.