Removing stop words with NLTK

Einblick Content Team - May 10th, 2023

Removing stop words is an important step in processing any text data, particularly for tasks like sentiment analysis, where stop words carry little semantic meaning but can bloat your corpus. In this post, we'll go over how to remove and customize stop words using NLTK.

Install nltk and load corpus

You can install and load nltk like any other Python package.

!pip install nltk

import nltk

For our example, we'll pull from one of the built-in corpora that nltk makes available: the entirety of Shakespeare's play Julius Caesar. The text is available via the gutenberg corpus reader object. To access the corpora, you first need to download them, and then you can import the reader.

# Must download first before importing!
nltk.download("gutenberg")
from nltk.corpus import gutenberg

Next, you can see which texts are available with the fileids() function.

gutenberg.fileids()

Output:

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Let's print the raw text:

print(gutenberg.raw("shakespeare-caesar.txt"))

Output:

[The Tragedie of Julius Caesar by William Shakespeare 1599]
Actus Primus. Scoena Prima.
Enter Flauius, Murellus, and certaine Commoners ouer the Stage.
  Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
  Car. Why Sir, a Carpenter
  ...

The corpus reader object includes a few handy functions to speed up text processing, such as the words() function, which will return a list of all the words in a given text file. For our purposes, we'll then take a subset of the first 500 words of that list:

# Get words from Julius Caesar as a list of strings
caesar = gutenberg.words("shakespeare-caesar.txt")

# Get the first 500 words
caesar_sub = caesar[0:500]

caesar_sub

Output:

['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ...]

From the output, we can already see that there are a few "words" we might want to exclude from certain NLP tasks, such as the square bracket and the words "of" and "the," which have little meaning in isolation.

Load stopwords

Next, we'll download and load the stopwords much as we did with gutenberg. Then, we'll take a look at the English stop words:

# Again, make sure to download before importing!
nltk.download("stopwords")
from nltk.corpus import stopwords

stopwords.words('english')

Output:

['i',
 'me',
 'my',
 'myself',
 ...
 'd',
 'll',
 'm',
 'o',
 're',
 ...
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

The list of stop words includes fragments of contractions, such as 'd' (as in I'd) and 'won' (as in won't). Also note that the stop words are all lowercase!
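As a quick sanity check, membership tests against the list are case-sensitive. Here's a small sketch using a handful of entries copied from the output above (standing in for the full stopwords.words('english') list):

```python
# A few entries copied from NLTK's English stopword list;
# in practice this would be the full stopwords.words('english')
sw_sample = ['i', 'me', 'my', 'the', 'of', 'd', 'll', 'won', "won't"]

print('the' in sw_sample)          # True  -- lowercase matches
print('The' in sw_sample)          # False -- titlecase does not
print('The'.lower() in sw_sample)  # True  -- so lowercase before checking
```

This is why we'll lowercase each word before comparing it to the stop word list in the next section.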

Remove stop words

To remove the stop words, we'll do the following:

  1. Save the stop words as a list
  2. Check each word in the Julius Caesar corpus against the list of stop words
  3. Only save the words NOT in the list of stop words

NOTE: in the code chunk below, we convert each word to lowercase before checking it against the list. This is because the uppercase, titlecase, and lowercase versions of the same word are distinct strings to Python ('The' != 'the'), even though a human reads them as the same word.

# Save English stopwords
sw_eng = stopwords.words('english')

# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]

caesar_no_sw

Output:

['[',
 'tragedie',
 'julius',
 'caesar',
 'william',
 'shakespeare',
 ...
 ':',
 'saw',
 'chariot',
 'appeare',
 ',',
 'haue']

Remove custom stop words

As you can see from the above output, there are still some "words" that were not in the stop words list that we might want to remove, such as extraneous punctuation marks, "William," and "Shakespeare." We can add these to our list with list methods like extend(). Then we'll follow the same steps as above to filter out the custom words.

# Add more words to stopwords list, based on context
new_sw = ['[', ']', ':', ',', '.', '!', '?', 'shakespeare', 'william', '(', ')']
sw_eng.extend(new_sw)

# Remove stopwords, and save lowercase words
caesar_no_sw = [x.lower() for x in caesar_sub if x.lower() not in sw_eng]

caesar_no_sw

Output:

['tragedie',
 'julius',
 'caesar',
 '1599',
 'actus',
 ...
 'rome',
 'saw',
 'chariot',
 'appeare',
 'haue']

The list looks better now! The command len(caesar_no_sw) returns 236, meaning we have removed 264 of the original 500 words. From here, we can look for more stop words to remove, or move on to further processing steps.
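One practical note: checking membership in a plain list is O(n) per word, so for larger corpora it can help to convert the stop words to a set first. Here is a minimal, self-contained sketch of the same pipeline, with a small hardcoded stopword subset and word list standing in for the real NLTK objects:

```python
# Hardcoded subset standing in for stopwords.words('english');
# using a set gives O(1) membership lookups
sw_eng = {'the', 'of', 'a', 'is', 'this', 'you', 'not'}

# Context-specific additions (set union instead of list.extend())
custom = {'[', ']', ',', '.'}
sw_all = sw_eng | custom

# A tiny sample in place of gutenberg.words("shakespeare-caesar.txt")
words = ['The', 'Tragedie', 'of', 'Julius', 'Caesar', '[', 'this', 'is']

# Same filtering step as above: lowercase, then drop stop words
filtered = [w.lower() for w in words if w.lower() not in sw_all]

print(filtered)                    # ['tragedie', 'julius', 'caesar']
print(len(words) - len(filtered))  # 5 words removed
```

The logic is identical to the list-based version above; only the lookup structure changes, which matters once you process the full play rather than a 500-word sample.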

About

Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
