Processing text data with scikit-learn's TfidfVectorizer()

Einblick Content Team - June 1st, 2023

Scikit-learn's TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF (term frequency-inverse document frequency) features. It is equivalent to running CountVectorizer followed by TfidfTransformer.

The resulting numerical representation of the textual data describes the importance of words within each document relative to all other documents in a corpus. These values can then be used to perform various NLP tasks such as text classification and topic modeling.
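As a quick check of the claim that the vectorizer combines counting and TF-IDF weighting, the sketch below runs both routes with default settings and compares the results. The three short documents are made up for illustration and are not the article's corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents, assumed purely for illustration
docs = [
    "the rebels strike the empire",
    "the empire strikes back",
    "the jedi return",
]

# Two-step route: raw term counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two matrices should agree
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If you already have a count matrix from CountVectorizer, the two-step route lets you reuse it; otherwise the one-step route is shorter.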

Setup: import class and corpus

TfidfVectorizer is a class rather than a function, and it is the only import you need. It lives in scikit-learn's feature_extraction.text module.

For our sample corpus, we used the opening crawls of the nine films in the three Star Wars trilogies.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["""Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute...the guardians of peace and justice in the galaxy, to settle the conflict....""",
...
          , """The dead speak! The galaxy has heard a mysterious broadcast, a threat of REVENGE in the sinister voice of the late EMPEROR PALPATINE...Meanwhile, Supreme Leader KYLO REN rages in search of the phantom Emperor, determined to destroy any threat to his power...."""]

Initialize and Transform

To use the TF-IDF vectorizer, you only need two lines of code and the corpus as input.

# Initialize tf-idf vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform based on corpus
X = vectorizer.fit_transform(corpus)

Now we can take a look at the features that were extracted.

# Get list of features (tokens)
tokens = vectorizer.get_feature_names_out()
tokens

Output:

array(['aboard', 'absence', 'across', 'against', 'agents', 'alarming',
       'all', 'ally', 'although', 'amidala', 'an', 'and', 'any', 'are',
       'armored', 'army', 'as', 'ashes', 'assist', 'attacks', ... 
       'vanished', 'victory', 'vile', 'voice', 'vote', 'war', 'weapon',
       'when', 'where', 'whereabouts', 'while', 'will', 'with', 'won',
       'world', 'young'], dtype=object)

If, however, we look at the matrix itself, we see that it is a sparse matrix. This makes sense: there are far more tokens than documents, and most tokens appear in only a few documents, so most entries are zero.

print(X.shape)
X

Output:

(9, 323)
<9x323 sparse matrix of type '<class 'numpy.float64'>'
	with 531 stored elements in Compressed Sparse Row format>
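In the output above, 531 of the 9 × 323 = 2,907 cells are stored, which is about 18% of the matrix. The sketch below shows one way to compute that density and to read the non-zero entries of a single row directly from the Compressed Sparse Row attributes. It uses a made-up toy corpus, so its numbers differ from the article's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, assumed purely for illustration
docs = [
    "the rebels strike the empire",
    "the empire strikes back",
    "the jedi return",
]
X = TfidfVectorizer().fit_transform(docs)

n_docs, n_tokens = X.shape
stored = X.nnz  # number of explicitly stored values
density = stored / (n_docs * n_tokens)
print(f"{stored} of {n_docs * n_tokens} cells stored ({density:.0%})")

# CSR layout: the entries of row i sit between indptr[i] and indptr[i + 1]
start, end = X.indptr[0], X.indptr[1]
for col, weight in zip(X.indices[start:end], X.data[start:end]):
    print(col, round(float(weight), 3))
```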

Convert to DataFrame

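To make the matrix easier to read, we can convert it to a DataFrame using the pandas library. Note that X.toarray() builds a dense array, which is fine for a small corpus like this one.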

import pandas as pd

# Convert sparse matrix to a DataFrame with one row per episode
df_tfidfvect = pd.DataFrame(
    data=X.toarray(),
    index=["Ep1", "Ep2", "Ep3", "Ep4", "Ep5", "Ep6", "Ep7", "Ep8", "Ep9"],
    columns=tokens,
)
df_tfidfvect

Output:

	aboard	absence	across	against
Ep1	0.000000	0.000000	0.000000	0.000000
Ep2	0.000000	0.000000	0.000000	0.000000
Ep3	0.000000	0.000000	0.000000	0.000000
Ep4	0.123809	0.000000	0.000000	0.090921
Ep5	0.000000	0.000000	0.122771	0.000000
Ep6	0.000000	0.000000	0.000000	0.000000
Ep7	0.000000	0.121165	0.000000	0.000000
Ep8	0.000000	0.000000	0.000000	0.094784
Ep9	0.000000	0.000000	0.000000	0.098088
9 rows × 323 columns

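Each value is the TF-IDF weight of a token in a given document. The weight is higher when the token appears often in that document and in few other documents, and it is 0 when the token does not appear in the document at all.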
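To make the weights less of a black box, here is a sketch that recomputes them by hand under scikit-learn's default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The toy documents are made up for illustration; non-default settings such as sublinear_tf or a different norm would need different arithmetic.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, assumed purely for illustration
docs = [
    "the rebels strike the empire",
    "the empire strikes back",
    "the jedi return",
]

# Term frequency: raw count of each token in each document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each token
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Multiply, then scale each row to unit L2 length (the default, norm="l2")
weights = counts * idf
weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)

expected = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(weights, expected)
print(matches)
```

Because each row is scaled to unit length, weights are comparable across documents of different lengths and always fall between 0 and 1.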

Additional Arguments

You can adjust the vectorizer with the following arguments:

  • stop_words: either 'english', to use scikit-learn's built-in English stop word list, or a custom list of stop words to exclude from the vocabulary.
  • max_df: ignores tokens that appear in more than the given number of documents (if an integer) or proportion of documents (if a float between 0.0 and 1.0). This is useful for excluding words that appear throughout your collection of documents but add little meaning (for example, "the" or "a").
  • min_df: the counterpart to max_df. It ignores tokens that appear in fewer than the given number of documents (if an integer) or proportion of documents (if a float between 0.0 and 1.0).
  • max_features: keeps only the top n tokens, ranked by term frequency across the corpus.
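The sketch below shows how each of these arguments changes the learned vocabulary. It uses a made-up toy corpus, and the helper function vocab() is hypothetical, written only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, assumed purely for illustration
docs = [
    "the rebels strike the empire",
    "the empire strikes back",
    "the jedi return to the empire",
    "the rebels find the jedi",
]

def vocab(**kwargs):
    """Hypothetical helper: fit a vectorizer and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)

default_vocab = vocab()
no_stop_words = vocab(stop_words="english")  # drops "the", among others
max_df_vocab = vocab(max_df=0.5)   # drops tokens found in more than half the documents
min_df_vocab = vocab(min_df=2)     # keeps tokens found in at least 2 documents
top_two = vocab(max_features=2)    # keeps the 2 most frequent tokens overall

print(len(default_vocab), default_vocab)
print(no_stop_words)
print(max_df_vocab)
print(min_df_vocab)   # ['empire', 'jedi', 'rebels', 'the']
print(top_two)        # ['empire', 'the']
```

On this corpus, max_df=0.5 removes "the" and "empire" because they appear in more than two of the four documents, which is the same kind of filtering a stop word list does, but driven by the corpus itself.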
