Tokenization in the context of natural language processing is the process of breaking up text, such as essays and paragraphs, into smaller units that can be more easily processed. These smaller units are called tokens. In this post we'll review two functions from the NLTK tokenize module, word_tokenize() and sent_tokenize(), so you can start processing your text data.
Install nltk and load a corpus
You can install and load nltk like any other Python package. For our example, we'll pull the Shakespeare play Julius Caesar from the various corpora available directly through nltk.
To access the corpora, you first need to download the corpus reader, gutenberg, and then you can import it.
```python
!pip install nltk

import nltk

# Must download first before importing!
nltk.download("gutenberg")
from nltk.corpus import gutenberg
```
Next, we'll save the text of the play as a string called caesar.
```python
caesar = gutenberg.raw("shakespeare-caesar.txt")
print(caesar)
```
[The Tragedie of Julius Caesar by William Shakespeare 1599] Actus Primus. Scoena Prima. Enter Flauius, Murellus, and certaine Commoners ouer the Stage. Flauius. Hence: home you idle Creatures, get you home: Is this a Holiday? What, know you not (Being Mechanicall) you ought not walke Vpon a labouring day, without the signe Of your Profession? Speake, what Trade art thou? ...
Next, we'll import the nltk.tokenize module and see what functions are available to us.
```python
import nltk.tokenize

dir(nltk.tokenize)
```
['BlanklineTokenizer', 'LegalitySyllableTokenizer', 'LineTokenizer', 'MWETokenizer', 'NLTKWordTokenizer', 'PunktSentenceTokenizer', 'RegexpTokenizer', 'ReppTokenizer', ... 'string_span_tokenize', 'texttiling', 'toktok', 'treebank', 'util', 'word_tokenize', 'wordpunct_tokenize']
As you can see, there are a number of tokenizers available. For this post, we'll focus on two: word_tokenize() and sent_tokenize().
First, we'll import these functions:
```python
from nltk.tokenize import word_tokenize, sent_tokenize
```
word_tokenize(text, language) Example
The word_tokenize() function takes in text and a language, and returns a list of "words" by breaking up the text based on whitespace and punctuation. The language parameter names the Punkt corpus model NLTK should use; the default is "english".
```python
# Need to download punkt to run word_tokenize()
nltk.download("punkt")

word_tokenize(caesar, language="english")
```
['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ... 'Enter', 'Flauius', ',', 'Murellus', ',', 'and', 'certaine', ...
Based on the output, you can see that the text has been broken up into words, but there are still punctuation marks included in the list that will likely need to be removed later. You can remove punctuation by filtering against Python's string.punctuation, and filter out common words using the stopwords corpus from NLTK, customizing as needed.
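As a quick illustration, here is one way to drop punctuation tokens. This is a sketch on a small hardcoded token list (mirroring the word_tokenize() output above) rather than the full play, using only the standard library:

```python
import string

# A small sample of tokens like those returned by word_tokenize()
tokens = ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ',', ']']

# Keep only tokens that are not pure punctuation marks
words = [t for t in tokens if t not in string.punctuation]
print(words)  # ['The', 'Tragedie', 'of', 'Julius', 'Caesar']
```

Note that this only removes single-character punctuation tokens; multi-character tokens like "..." would need an extra check.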
sent_tokenize(text, language) Example
The sent_tokenize() function takes in text and a language, and returns a list of "sentences" by breaking up the text based on punctuation. The language parameter names the Punkt corpus model NLTK should use; the default is "english".
```python
sent_tokenize(caesar, language="english")
```
['[The Tragedie of Julius Caesar by William Shakespeare 1599]\n\n\nActus Primus.', 'Scoena Prima.', 'Enter Flauius, Murellus, and certaine Commoners ouer the Stage.', 'Flauius.', 'Hence: home you idle Creatures, get you home:\nIs this a Holiday?', 'What, know you not\n(Being Mechanicall) you ought not walke\nVpon a labouring day, without the signe\nOf your Profession?', 'Speake, what Trade art thou?', 'Car.', 'Why Sir, a Carpenter\n\n Mur.', 'Where is thy Leather Apron, and thy Rule?', 'What dost thou with thy best Apparrell on?', 'You sir, what Trade are you?', 'Cobl.', ...
As you can see from the output, some of these "sentences" are not complete sentences. This is likely due to characteristics of the original text, such as the abbreviated speaker names ("Flauius.", "Car.") that end in periods. Further investigation and processing are required to properly prepare the text for NLP tasks.
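As a rough first pass, you could drop the speaker labels by filtering out very short "sentences". This is only a sketch on a few of the strings above, using word count as a crude heuristic, not a complete cleaning pipeline:

```python
# A few "sentences" from the sent_tokenize() output above
sents = [
    'Scoena Prima.',
    'Flauius.',
    'Hence: home you idle Creatures, get you home:\nIs this a Holiday?',
    'Car.',
]

# Drop strings with fewer than two words -- a crude filter for speaker labels
cleaned = [s for s in sents if len(s.split()) >= 2]
print(cleaned)  # keeps 'Scoena Prima.' and the longer line; drops 'Flauius.' and 'Car.'
```

A real pipeline would likely need a more careful rule, since legitimate one-word sentences would also be removed by this filter.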
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.