Tokenization in the context of natural language processing is the process of breaking up text, such as essays and paragraphs, into smaller units that can be more easily processed. These smaller units are called tokens. In this post we'll review two functions from the NLTK tokenize module, word_tokenize() and sent_tokenize(), so you can start processing your text data.
Install nltk and load a corpus
You can install and load nltk like any other Python package. For our example, we'll pull the Shakespeare play Julius Caesar from the various corpora available directly through nltk.
To access the corpora, you first need to download the corpus reader, gutenberg, and then you can import it.
```python
!pip install nltk

import nltk

# Must download first before importing!
nltk.download("gutenberg")
from nltk.corpus import gutenberg
```
Next, we'll save the text of the play as a string called caesar.
```python
caesar = gutenberg.raw("shakespeare-caesar.txt")
print(caesar)
```
[The Tragedie of Julius Caesar by William Shakespeare 1599] Actus Primus. Scoena Prima. Enter Flauius, Murellus, and certaine Commoners ouer the Stage. Flauius. Hence: home you idle Creatures, get you home: Is this a Holiday? What, know you not (Being Mechanicall) you ought not walke Vpon a labouring day, without the signe Of your Profession? Speake, what Trade art thou? ...
Next, we'll import the nltk.tokenize module and see what functions are available to us.
```python
import nltk.tokenize

dir(nltk.tokenize)
```
['BlanklineTokenizer', 'LegalitySyllableTokenizer', 'LineTokenizer', 'MWETokenizer', 'NLTKWordTokenizer', 'PunktSentenceTokenizer', 'RegexpTokenizer', 'ReppTokenizer', ... 'string_span_tokenize', 'texttiling', 'toktok', 'treebank', 'util', 'word_tokenize', 'wordpunct_tokenize']
As you can see, there are a number of tokenizers available. For this post, we'll focus on two: word_tokenize() and sent_tokenize().
First, we'll import these functions:
```python
from nltk.tokenize import word_tokenize, sent_tokenize
```
word_tokenize(text, language) Example
The word_tokenize() function takes in text and a language, and returns a list of "words" by breaking up the text based on whitespace and punctuation. The language parameter names the Punkt corpus model NLTK should use; the default is "english".
```python
# Need to download punkt to run word_tokenize()
nltk.download("punkt")

word_tokenize(caesar, language="english")
```
['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ... 'Enter', 'Flauius', ',', 'Murellus', ',', 'and', 'certaine', ...
Based on the output, you can see that the text has been broken up into words, but there are still punctuation marks included in the list that will likely need to be removed later. You can remove punctuation by filtering against Python's string.punctuation, and filter out common words using the stopwords corpus from NLTK, customizing as needed.
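As a quick illustration, here is one way to drop punctuation tokens. This is a sketch on a small hardcoded token list (mirroring the word_tokenize() output above) rather than the full play, using only the standard library:

```python
import string

# A small sample of tokens like those returned by word_tokenize()
tokens = ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ',', ']']

# Keep only tokens that are not pure punctuation marks
words = [t for t in tokens if t not in string.punctuation]
print(words)  # ['The', 'Tragedie', 'of', 'Julius', 'Caesar']
```

Note that this only removes single-character punctuation tokens; multi-character tokens like "..." would need an extra check.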
sent_tokenize(text, language) Example
The sent_tokenize() function takes in text and a language, and returns a list of "sentences" by breaking up the text based on punctuation. The language parameter names the Punkt corpus model NLTK should use; the default is "english".
```python
sent_tokenize(caesar, language="english")
```
['[The Tragedie of Julius Caesar by William Shakespeare 1599]\n\n\nActus Primus.', 'Scoena Prima.', 'Enter Flauius, Murellus, and certaine Commoners ouer the Stage.', 'Flauius.', 'Hence: home you idle Creatures, get you home:\nIs this a Holiday?', 'What, know you not\n(Being Mechanicall) you ought not walke\nVpon a labouring day, without the signe\nOf your Profession?', 'Speake, what Trade art thou?', 'Car.', 'Why Sir, a Carpenter\n\n Mur.', 'Where is thy Leather Apron, and thy Rule?', 'What dost thou with thy best Apparrell on?', 'You sir, what Trade are you?', 'Cobl.', ...
As you can see from the output, some of these "sentences" are not complete sentences. This is likely due to characteristics of the original text, such as the abbreviated speaker names ("Flauius.", "Car.") that end in periods. Further investigation and processing are required to properly prepare the text for NLP tasks.
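As a rough first pass, you could drop the speaker labels by filtering out very short "sentences". This is only a sketch on a few of the strings above, using word count as a crude heuristic, not a complete cleaning pipeline:

```python
# A few "sentences" from the sent_tokenize() output above
sents = [
    'Scoena Prima.',
    'Flauius.',
    'Hence: home you idle Creatures, get you home:\nIs this a Holiday?',
    'Car.',
]

# Drop strings with fewer than two words -- a crude filter for speaker labels
cleaned = [s for s in sents if len(s.split()) >= 2]
print(cleaned)  # keeps 'Scoena Prima.' and the longer line; drops 'Flauius.' and 'Car.'
```

A real pipeline would likely need a more careful rule, since legitimate one-word sentences would also be removed by this filter.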
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.