Keras is an open-source deep learning API that runs on top of TensorFlow. Keras is designed to be user-friendly and modular, built for developers who want to build machine learning models and apps quickly. One of the preprocessing functionalities that Keras provides for natural language processing (NLP) is tokenization. This post is a quick-start guide to Keras' Tokenizer() class. If you're interested in NLP but use other libraries like spaCy or NLTK, check out our other Python tutorials.
Installation and Setup
To get started, make sure that both keras and tensorflow are installed, and then import the Tokenizer class, as below.
# Installation and setup
!pip install tensorflow
!pip install keras
from keras.preprocessing.text import Tokenizer
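Depending on your TensorFlow and Keras versions, the Tokenizer class may also (or only) be importable through the tensorflow.keras namespace; the line below is an equivalent import in TensorFlow 2.x.
# Equivalent import via the tensorflow.keras namespace (TensorFlow 2.x)
from tensorflow.keras.preprocessing.text import Tokenizer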
Then, save the corpus as a list of documents. In this case, we're using the opening crawl text from the three Star Wars trilogies.
# Save corpus as list of documents
corpus = ["""Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.
...
Meanwhile, Supreme Leader KYLO REN rages in search of the phantom Emperor, determined to destroy any threat to his power...."""]
Tokenizer(): Initialize and fit
Next, initialize the Tokenizer and use the fit_on_texts() function. The fit_on_texts() function takes only one argument: your list of strings, in this case the variable corpus.
# Initialize and fit
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
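As a side note, the Tokenizer constructor accepts optional keyword arguments such as num_words (cap the vocabulary size), oov_token (reserve an index for out-of-vocabulary words), and lower (control lowercasing). A minimal sketch, with illustrative values not taken from the original example:
# Optional: configure the Tokenizer (illustrative values)
custom_tokenizer = Tokenizer(num_words = 1000, oov_token = '<OOV>', lower = True)
custom_tokenizer.fit_on_texts(corpus)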
Tokenizer Instance Attributes: word_index, word_docs, word_counts
A Tokenizer instance has several interesting attributes that can help you explore your documents initially.
word_index
The word_index attribute provides a dictionary that maps each word in the corpus to a unique integer index.
# Get corpus word index
word_index = tokenizer.word_index
word_index
Output:
{'the': 1,
'of': 2,
'to': 3,
'has': 4,
'a': 5,
...
'rages': 323,
'search': 324,
'phantom': 325,
'determined': 326,
'any': 327}
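If you need the reverse mapping, from integer index back to word, the fitted tokenizer also exposes an index_word attribute; for example:
# Reverse lookup: integer index to word
index_word = tokenizer.index_word
index_word[1]  # 'the'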
word_docs
The word_docs attribute provides a dictionary mapping each word to the number of documents in which it appears.
# Get dictionary of words by number of documents in which they appear
word_docs = tokenizer.word_docs
sorted(word_docs.items())
Output:
[('a', 8),
('aboard', 1),
('absence', 1),
('across', 1),
('against', 3),
...
('with', 5),
('won', 1),
('world', 1),
('young', 1),
('“but', 1)]
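Because word_docs behaves like an ordinary dictionary, you can also sort it by value to see which words appear in the most documents, for example:
# Words ranked by the number of documents they appear in
sorted(word_docs.items(), key=lambda item: item[1], reverse=True)[:10]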
word_counts
The word_counts attribute provides a dictionary of the frequency of each word in the entire corpus.
# Get dictionary of words by frequency of appearance in corpus
word_counts = tokenizer.word_counts
sorted(word_counts.items())
Output:
[('a', 16),
('aboard', 1),
('absence', 1),
('across', 1),
('against', 3),
...
('with', 5),
('won', 1),
('world', 1),
('young', 1),
('“but', 1)]
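Similarly, you can surface the most frequent words in the corpus by treating word_counts as a Counter:
# Ten most frequent words across the entire corpus
from collections import Counter
Counter(word_counts).most_common(10)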
tokenizer.texts_to_matrix(corpus, mode): Encoding Text for NLP Tasks
Now that we have explored the data a bit, we can convert the texts to a matrix for further NLP tasks like sentiment analysis and topic modeling. The texts_to_matrix() function accomplishes this in a single step. Alternatively, you can use two functions, texts_to_sequences() and sequences_to_matrix(), to achieve the same result; a short sketch of that two-step approach appears after the binary example below.
There are four available modes for both texts_to_matrix() and sequences_to_matrix():
- "binary" (default): does the document contain the given word? (1 = yes, 0 = no)
- "count": how many times does the given word appear in the document?
- "tfidf": what is the TF-IDF value for the given word-document pair?
- "freq": what is the count of each word divided by the length of the sequence for that document?
Below we provide examples for mode = "tfidf" and mode = "binary".
tokenizer.texts_to_matrix(corpus, mode = 'tfidf')
# Convert texts to matrix of TF-IDF values
tfidf_matrix = tokenizer.texts_to_matrix(corpus, mode = 'tfidf')
tfidf_matrix
Output:
array([[0. , 2.18095229, 1.79190166, ..., 0. , 0.],
[0. , 2.18095229, 1.67487786, ..., 0. , 0.],
[0. , 1.97655152, 0.64185389, ..., 0. , 0.],
...,
[0. , 1.89084388, 1.0867531 , ..., 0. , 0.],
[0. , 2.05215102, 1.34700245, ..., 0. , 0.],
[0. , 1.97655152, 1.53165231, ..., 1.70474809, 1.70474809]])
tokenizer.texts_to_matrix(corpus, mode = 'binary')
# Convert texts to matrix of binary values
binary_matrix = tokenizer.texts_to_matrix(corpus, mode = 'binary')
binary_matrix
Output:
array([[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
...,
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 1., 1., 1.]])
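As mentioned above, the same matrices can be built in two steps: texts_to_sequences() first converts each document to a sequence of word indices, and sequences_to_matrix() then expands those sequences into a document-term matrix using the same mode argument. A minimal sketch:
# Two-step alternative: texts -> integer sequences -> document-term matrix
sequences = tokenizer.texts_to_sequences(corpus)
binary_matrix_alt = tokenizer.sequences_to_matrix(sequences, mode = 'binary')
binary_matrix_alt.shape  # (number of documents, vocabulary size + 1)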