Get started with spaCy's Tokenizer()

Einblick Content Team - May 17th, 2023

Tokenization is a critical part of preprocessing text data for a wide range of natural language processing tasks. As in our prior post, which focused on tokenization in NLTK, we'll do a similar walkthrough for spaCy, another popular NLP package in Python. We'll go through a few different ways you can tokenize your text, as well as additional commands you can use to get more information about each token.

Read on for the walkthrough, which covers the different tokenization options we show.

Install package and import functions

Install spaCy and import the relevant classes. spaCy offers support for many different languages; we'll use English for the following examples.

!pip install spacy
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

Tokenization Version 1: Tokenizer()

Initialize

First set the language so that the Tokenizer() has a vocabulary to pull from. In this case, we pass in only the shared vocabulary (nlp.vocab) and no punctuation rules, so the resulting tokenizer is not sensitive to punctuation. We'll also use part of the opening crawl of Star Wars Episode IV: A New Hope for our text data.

# Initialize Tokenizer()
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
text = "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire."

Now that we've initialized the Tokenizer object, we can use it on our text, and print the tokens.

Tokenize

We'll use the tokenizer's explain() method to see how each token was produced and labeled. Some of the possible labels are PREFIX, SPECIAL-1, SPECIAL-2, TOKEN, and SUFFIX.

# Split sentence into tokens
tokens = tokenizer(text)

print(type(tokens))

# Print each token
for token in tokens:
    print(token)

# Print tokenizer pattern
for token in tokenizer.explain(text):
    print(f"{token[1]}: {token[0]}")

Output:

<class 'spacy.tokens.doc.Doc'>
It
is
a
period
of
civil
war.
Rebel
spaceships,
striking
from
a
hidden
base,
...
It: TOKEN
is: TOKEN
a: TOKEN
period: TOKEN
of: TOKEN
civil: TOKEN
war.: TOKEN
Rebel: TOKEN
spaceships,: TOKEN
striking: TOKEN
from: TOKEN
a: TOKEN
hidden: TOKEN
base,: TOKEN
...

As we can see from the output, since the tokenizer was built from only a vocabulary, with no punctuation rules, the punctuation stays attached to neighboring words rather than being split into separate tokens. Additionally, every token has been categorized as a TOKEN.
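If you want to keep constructing the Tokenizer() yourself but have it split punctuation off, you can pass in the English defaults' prefix, suffix, and infix rules. The snippet below is a minimal sketch of that approach (it isn't part of the original walkthrough), using spaCy's compile_prefix_regex, compile_suffix_regex, and compile_infix_regex helpers:

from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

# Compile the English defaults' punctuation rules into regexes
nlp = English()
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

# Build a Tokenizer() that also splits prefixes, suffixes, and infixes
tokenizer_punct = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

for token in tokenizer_punct("Rebel spaceships, striking from a hidden base."):
    print(token)

With these rules in place, commas and periods come out as their own tokens, which is essentially what the default tokenizer in the next section gives us out of the box.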

Tokenization Version 2: English().tokenizer

Initialize

In the second case, we'll use the default tokenizer that is created when a Language object is instantiated. This tokenizer comes with the language's punctuation rules, so it handles punctuation as well as words.

# Initialize tokenizer
nlp = English()
tokenizer2 = nlp.tokenizer
text = "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire."

Then we'll call the same methods as we did before:

# Split sentence into tokens
tokens = tokenizer2(text)

# Print each token
for token in tokens:
    print(token)

# Print tokenizer pattern
for token in tokenizer2.explain(text):
    print(f"{token[1]}: {token[0]}")

Output:

It
is
a
period
of
civil
war
.
Rebel
spaceships
,
...
It: TOKEN
is: TOKEN
a: TOKEN
period: TOKEN
of: TOKEN
civil: TOKEN
war: TOKEN
.: SUFFIX
Rebel: TOKEN
spaceships: TOKEN
,: SUFFIX

In this case, we can see that the punctuation marks are split off from the words and labeled as SUFFIX rather than TOKEN.
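Because the tokenizer returns a Doc of Token objects, each token also carries built-in attributes you can use to get more information about it. As a small sketch (not part of the original code), attributes such as is_alpha, is_punct, and is_stop from spaCy's Token API let you flag words, punctuation, and stop words:

# Inspect a few built-in attributes of each token
for token in tokens:
    print(token.text, token.is_alpha, token.is_punct, token.is_stop)

For the sentence above, the standalone "." is flagged as punctuation, while the words are flagged as alphabetic.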

Tokenization Special Rules: tokenizer.add_special_case("word", case)

For this example, we'll use a stanza from the song "Don't Stop Believin'" by Journey to adjust how the tokenizer handles the word "believin'".

Before add_special_case:

text = """Don't stop believin'
Hold on to that feelin'
Streetlights, people"""

# Split sentence into tokens
tokens = tokenizer2(text)

for token in tokens:
    print(token)
    
for token in tokenizer2.explain(text):
    print(f"{token[1]}: {token[0]}")

Output:

Do
n't
stop
believin
'
Hold
on
to
that
feelin
'
Streetlights
,
people
Do: SPECIAL-1
n't: SPECIAL-2
stop: TOKEN
believin: TOKEN
': SPECIAL-1
Hold: TOKEN
on: TOKEN
to: TOKEN
that: TOKEN
feelin: TOKEN
': SPECIAL-1
Streetlights: TOKEN
,: SUFFIX
people: TOKEN

From the output, we can see that the apostrophes after "believin'" and "feelin'" have been stripped from the words and labeled as separate SPECIAL-1 entries. If this is not the behavior we want (for example, if we want to recognize that "believin'" is just the participial form of "believe"), we can add a special case.

After add_special_case("string", case):

We have to import the attributes ORTH and NORM: ORTH sets the exact text of each resulting token, and NORM sets its normalized form. We define a list of case dictionaries, whose ORTH values must join back up to form the original string, and pass it into the add_special_case() method.

In the example below, we want to split "believin'" into two tokens: "believ" and "in'".

from spacy.attrs import ORTH, NORM

text = """Don't stop believin'
Hold on to that feelin'
Streetlights, people"""

# Set case rules
case = [{ORTH: "believ", NORM: "believe"}, {ORTH: "in'", NORM: "ing"}]

# Add case to tokenizer
tokenizer2.add_special_case("believin'", case)

Now that we've done the setup, we can check that it's worked as expected:

# Split sentence into tokens
tokens = tokenizer2(text)

for token in tokens:
    print(token)
    
for token in tokenizer2.explain(text):
    print(f"{token[1]}: {token[0]}")

Output:

Do
n't
stop
believ
in'
Hold
on
to
that
feelin
'
Streetlights
,
people
Do: SPECIAL-1
n't: SPECIAL-2
stop: TOKEN
believ: SPECIAL-1
in': SPECIAL-2
Hold: TOKEN
on: TOKEN
to: TOKEN
that: TOKEN
feelin: TOKEN
': SPECIAL-1
Streetlights: TOKEN
,: SUFFIX
people: TOKEN

From the output, we can see that "believ" is now labeled SPECIAL-1 rather than TOKEN, and "in'" is labeled SPECIAL-2. Note that "feelin'" is tokenized the same as before, since the special case only applies to the exact string "believin'".
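To round this out (this extension isn't in the original walkthrough), you could register a matching special case for "feelin'" and confirm that the NORM values carry through by printing each token's norm_ attribute:

# Hypothetical extension: a matching special case for "feelin'"
case_feelin = [{ORTH: "feel", NORM: "feel"}, {ORTH: "in'", NORM: "ing"}]
tokenizer2.add_special_case("feelin'", case_feelin)

# norm_ shows each token's normalized form
for token in tokenizer2(text):
    print(token.text, "->", token.norm_)

With both cases registered, "believ" normalizes to "believe" and each trailing "in'" normalizes to "ing".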

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
