Text processing with the Keras Tokenizer class, methods, and attributes

Einblick Content Team - June 5th, 2023

Keras is an open-source deep learning API that runs on top of TensorFlow. Keras is designed to be user-friendly and modular, built for developers interested in building machine learning models and apps. One of the preprocessing functionalities that Keras provides for natural language processing (NLP) is tokenization. This post provides a quick start for the Keras Tokenizer class. If you're interested in NLP but use other libraries like spaCy or NLTK, check out our other Python tutorials.

Installation and Setup

To get started, make sure that both keras and tensorflow are installed, and then import the Tokenizer class, as below.

# Installation and setup
!pip install tensorflow
!pip install keras
from keras.preprocessing.text import Tokenizer
# In recent TensorFlow releases the same class is also available as:
# from tensorflow.keras.preprocessing.text import Tokenizer

Next, save the corpus as a list of strings. In this case, we're using the opening crawl text from the three Star Wars trilogies.

# Save corpus as list of documents
corpus = ["""Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.
...
Meanwhile, Supreme Leader KYLO REN rages in search of the phantom Emperor, determined to destroy any threat to his power...."""]

Initialize and fit Tokenizer()

Next, initialize the Tokenizer and call its fit_on_texts() method. fit_on_texts() takes a single argument: a list of strings, in this case the variable corpus.

# Initialize and fit
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
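The Tokenizer() constructor also accepts optional arguments that change how fitting and encoding behave. A minimal sketch with illustrative values (the toy corpus below is not from this post):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the quick brown fox", "the lazy dog"]

tokenizer = Tokenizer(
    num_words=10,        # keep only the 10 most frequent words when encoding
    oov_token="<OOV>",   # map out-of-vocabulary words to this token
    lower=True,          # lowercase all text before tokenizing (the default)
)
tokenizer.fit_on_texts(corpus)

print(tokenizer.word_index)  # the OOV token is assigned index 1
```

Setting oov_token is useful when you later encode documents containing words the tokenizer never saw during fitting; without it, unseen words are silently dropped.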

Explore Tokenizer Instance Attributes: word_index, word_docs, word_counts

A Tokenizer instance has several interesting attributes that can help you explore your documents initially.
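Beyond the three attributes covered below, a fitted Tokenizer also exposes document_count (how many documents it was fit on) and index_word (the reverse of word_index). A quick sketch on a toy corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the force is strong", "the dark side"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

print(tokenizer.document_count)  # number of documents seen during fitting: 2
print(tokenizer.index_word)      # reverse mapping: integer index -> word
```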

word_index

The word_index attribute provides a dictionary mapping each unique word in the corpus to an integer index. Indices start at 1 and are assigned in order of descending frequency, so the most common word ("the" here) gets index 1.

# Get corpus word index
word_index = tokenizer.word_index

word_index

Output:

{'the': 1,
 'of': 2,
 'to': 3,
 'has': 4,
 'a': 5,
 ...
 'rages': 323,
 'search': 324,
 'phantom': 325,
 'determined': 326,
 'any': 327}

word_docs

The word_docs attribute provides a dictionary mapping each word in the corpus to the number of documents in which it appears.

# Get dictionary of words by number of documents in which they appear
word_docs = tokenizer.word_docs

sorted(word_docs.items())

Output:

[('a', 8),
 ('aboard', 1),
 ('absence', 1),
 ('across', 1),
 ('against', 3),
 ...
 ('with', 5),
 ('won', 1),
 ('world', 1),
 ('young', 1),
 ('“but', 1)]

word_counts

The word_counts attribute provides a dictionary mapping each word to its total frequency across the entire corpus.

# Get dictionary of words by frequency of appearance in corpus
word_counts = tokenizer.word_counts

sorted(word_counts.items())

Output:

[('a', 16),
 ('aboard', 1),
 ('absence', 1),
 ('across', 1),
 ('against', 3),
 ...
 ('with', 5),
 ('won', 1),
 ('world', 1),
 ('young', 1),
 ('“but', 1)]

Encoding Text for NLP Tasks: tokenizer.texts_to_matrix(corpus, mode)

Now that we have explored the data a bit, we can convert the texts to a matrix for further NLP tasks like sentiment analysis and topic modeling. The texts_to_matrix() function accomplishes this goal. Alternatively, you can chain two functions, texts_to_sequences() and sequences_to_matrix(), to achieve the same result.
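To make the two-step alternative concrete, here is a sketch on a toy corpus showing that chaining texts_to_sequences() and sequences_to_matrix() yields the same matrix as calling texts_to_matrix() directly:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the force is strong", "the dark side is strong"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# One-step route
one_step = tokenizer.texts_to_matrix(corpus, mode="count")

# Two-step route: first encode each document as a list of word indices,
# then convert the index sequences into a document-term matrix
sequences = tokenizer.texts_to_sequences(corpus)
two_step = tokenizer.sequences_to_matrix(sequences, mode="count")

print(np.array_equal(one_step, two_step))  # True
```

The two-step route is handy when you want to inspect or manipulate the integer sequences (e.g. padding them for a neural network) before building the matrix.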

There are four available modes for both texts_to_matrix() and sequences_to_matrix():

  • "binary" (default): does the document contain the given word? (1 = yes, 0 = no)
  • "count": how many times does the given word appear in the document?
  • "tfidf": what is the TF-IDF value for the given word-document pair?
  • "freq": what is the count of each word divided by the length of the sequence for that document?

Below we provide examples for mode = "tfidf" and mode = "binary".

tokenizer.texts_to_matrix(corpus, mode = 'tfidf')

# Convert texts to matrix of TF-IDF values
tfidf_matrix = tokenizer.texts_to_matrix(corpus, mode = 'tfidf')

tfidf_matrix

Output:

array([[0.        , 2.18095229, 1.79190166, ..., 0.        , 0.],
       [0.        , 2.18095229, 1.67487786, ..., 0.        , 0.],
       [0.        , 1.97655152, 0.64185389, ..., 0.        , 0.],
       ...,
       [0.        , 1.89084388, 1.0867531 , ..., 0.        , 0.],
       [0.        , 2.05215102, 1.34700245, ..., 0.        , 0.],
       [0.        , 1.97655152, 1.53165231, ..., 1.70474809, 1.70474809]])

tokenizer.texts_to_matrix(corpus, mode = 'binary')

# Convert texts to matrix of binary values
binary_matrix = tokenizer.texts_to_matrix(corpus, mode = 'binary')

binary_matrix

Output:

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 1., 1., 1.]])
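For completeness, here is a sketch of the remaining two modes, "count" and "freq", on a small single-document corpus where the outputs are easy to verify by hand:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["to be or not to be"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# "count": raw occurrences of each word in the document
count_matrix = tokenizer.texts_to_matrix(corpus, mode="count")

# "freq": each word's count divided by the document's sequence length (6 words)
freq_matrix = tokenizer.texts_to_matrix(corpus, mode="freq")

print(count_matrix)  # "to" and "be" each appear twice
print(freq_matrix)   # their entries are 2/6; "or" and "not" get 1/6
```

Note that column 0 of every matrix is reserved (word indices start at 1), which is why each row has one more column than there are words in the vocabulary.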

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.