Before running any data science model, whether it's a linear regression, decision tree, random forest, or any other model, it's important to properly prepare your data. In this post, we'll go over the process of one-hot encoding categorical variables using scikit-learn's OneHotEncoder class.
The data we're using comes from a paper called "Using Data Mining to Predict Secondary School Student Performance" by P. Cortez and A. Silva, published in 2008. We've used a subset that only includes information about students' math scores.
Basic Syntax: OneHotEncoder().fit(X)
Before processing the data, we subset for the X variables of interest. Since we're predicting math scores, we include demographic information about the students: school, sex, address (binary: 'U' - urban or 'R' - rural), and family size (binary: 'LE3' - less than or equal to 3 or 'GT3' - greater than 3).
X = df[["school", "sex", "address", "famsize"]]
X.head()
Output:
  school sex address famsize
0     GP   F       U     GT3
1     GP   F       U     GT3
2     MS   F       U     LE3
3     GP   F       U     LE3
4     GP   F       U     GT3
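The post does not show how df is loaded, so here is a minimal sketch for following along, assuming you only want to reproduce the five rows printed above. The DataFrame is a hand-built stand-in, not the full dataset.

```python
import pandas as pd

# Stand-in for the student data: only the five rows shown in the output above.
# This is an assumption for illustration, not the full dataset from the paper.
df = pd.DataFrame(
    {
        "school": ["GP", "GP", "MS", "GP", "GP"],
        "sex": ["F", "F", "F", "F", "F"],
        "address": ["U", "U", "U", "U", "U"],
        "famsize": ["GT3", "GT3", "LE3", "LE3", "GT3"],
    }
)

# Same column subset as in the post
X = df[["school", "sex", "address", "famsize"]]
print(X.shape)
print(X.nunique())
```

Note that these five rows contain only one value each for sex and address, so an encoder fit on this stand-in would produce fewer columns than one fit on the full data.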
Now we can create a OneHotEncoder() instance. We've written out the default values for a few parameters, but you can create an instance without passing any arguments.
categories: the categories for each feature. These can be determined automatically, or specified as a list where categories[i] contains the categories in the ith column.
drop: whether and how to drop one of the categories per feature. Options are 'first', 'if_binary', and None. You can also pass in an array that specifies, for each feature, which category to drop.
handle_unknown: how to handle unknown categories. Options are 'error', 'ignore', and 'infrequent_if_exist'.
After you initialize the instance, you need to fit the object to X.
from sklearn.preprocessing import OneHotEncoder
# Initialize OneHotEncoder, and fit to X
enc = OneHotEncoder(categories="auto", drop=None, handle_unknown="error")
enc.fit(X)
# Print results
print("Categories:")
print(enc.categories_)
print("Feature names:")
print(enc.get_feature_names_out())
# One-hot encode X array -> returns a sparse matrix
enc.transform(X)
Output:
Categories:
[array(['GP', 'MS'], dtype=object), array(['F', 'M'], dtype=object), array(['R', 'U'], dtype=object), array(['GT3', 'LE3'], dtype=object)]
Feature names:
['school_GP' 'school_MS' 'sex_F' 'sex_M' 'address_R' 'address_U' 'famsize_GT3' 'famsize_LE3']
<395x8 sparse matrix of type '<class 'numpy.float64'>'
with 1580 stored elements in Compressed Sparse Row format>
We've printed the categories_ attribute and the result of the get_feature_names_out() method to learn more about how the OneHotEncoder worked. The encoder identified 4 categorical variables with 2 categories each, which resulted in 8 columns.
enc.transform(X).toarray()
Next, we can transform the original X DataFrame using the OneHotEncoder. As we saw above, this returns a sparse matrix, so we call the toarray() method to get a dense NumPy array. We then convert the array to a DataFrame for easier manipulation. This last step is optional.
import pandas as pd
# Convert sparse matrix to array, and then DataFrame
X_ohe = enc.transform(X).toarray()
X_ohe = pd.DataFrame(X_ohe)
X_ohe.head()
Output:
0 1 2 3 4 5 6 7
0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
1 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
2 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0
3 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
4 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
Now our categorical features have been converted into several numeric binary columns, which can be processed by your data science or machine learning model of choice.
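As one possible way to hand the encoded columns to a model, the encoder can be chained with an estimator in a scikit-learn pipeline. This is only a sketch: the feature rows and the target values below are made up for illustration and are not taken from the student dataset, and LinearRegression is just one example of a model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features; NOT rows from the student dataset
X_toy = pd.DataFrame(
    {
        "school": ["GP", "MS", "GP", "MS"],
        "famsize": ["GT3", "LE3", "LE3", "GT3"],
    }
)
# Made-up target values, used only so the model has something to fit
y_toy = [10.0, 12.0, 11.0, 13.0]

# The pipeline one-hot encodes the features, then fits the regression
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
model.fit(X_toy, y_toy)

preds = model.predict(X_toy)
print(preds)
```

Wrapping both steps in a pipeline means the same fitted encoder is applied automatically whenever you call predict on new data.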
Data Source
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.