Data cleaning with Python: pandas, numpy, visualizations, and text data [Updated 2023]

Becca Weng - November 11th, 2022

Data cleaning can be a daunting task. Forbes reported that data scientists spend 80% of their time preparing their data, and that article is still a top search result in 2022. Data cleaning, as part of data preparation, can involve many steps, tools, time, and resources. In this article, we’ll simplify the data cleaning process and focus on how to clean data in Python using popular packages and built-in string methods.

Although working with data is grounded in technical expertise, as Gartner posited at their annual Data and Analytics Summit in 2022, as data practitioners, we need to balance both science AND art. There is no one-size-fits-all solution or approach to data cleaning. There are general guidelines and steps to follow, but exactly how to execute these steps will come with experience, mentorship, and trial-and-error. Read on to learn more about data cleaning with Python.

What is data cleaning?

Data cleaning with Python

Put simply, data cleaning, sometimes called data cleansing, data wrangling, or data scrubbing, is the process of getting data ready for further analysis. As the field of data science continues to evolve and change, these terms are likely going to solidify in meaning, but for now, it is important to understand that data cleaning is a preparatory step for data analysis, data science, and machine learning.

Data cleaning can include removing duplicate data, handling missing data, handling outliers, standardizing formatting, transforming variables, and more. As you clean data, you may discover more information that leads you to other data cleaning steps. Even when you have begun exploratory data analysis, sometimes it is necessary to return to tasks that would fall under the data cleaning umbrella to ensure that your final models are built with the highest quality data possible. After all, even a great machine learning pipeline cannot save dirty data. Worse, confidence in a machine learning model without awareness of the quality of the underlying data can result in misleading insights and the spread of inaccurate conclusions.

How to clean data?

Data cleaning is an iterative process. Based on the goals of various stakeholders, and what you uncover as you work with the data, you may need to repeat steps before moving on to further analysis and machine learning. For example, say you’re working with customer data and realize that email addresses have been formatted incorrectly. Once you standardize the email addresses, you may notice an unusual number of transactions coming from a handful of emails. Then you might uncover that there are duplicate rows of data representing the same transaction multiple times.

As important as data cleaning is, checking that your data cleaning has been successful, to the best of your ability, is just as important. Check in with domain experts and with those who may be more informed about real-world events that could have affected the data, such as data breaches, service outages, or new business partnerships. Below, we’ve provided a summary of some of the steps you might take when cleaning data. If you want to read more about data cleaning, check out our comprehensive data cleaning guide.

The main parts of data cleaning we’ll touch upon here are:

  1. Data profiling
  2. Ensuring consistent formatting
  3. Type conversion
  4. Removing duplicate data
  5. Handling missing data
  6. Handling outliers

Data cleaning in Einblick

All of the steps outlined above and described below in this article can be done quickly and natively in Einblick, which provides an intuitive and visual collaborative data science canvas. Einblick works to speed up the data cleaning process, which can be tedious and repetitive. We have created cells, which package commonly used code snippets and processes, so that data scientists can spend less time cleaning data, and more time extracting meaningful insights, and driving business decisions. These cells can all be found in the toolbar on the top of your Einblick canvas. For example:

  1. Profiler: immediately see the distributions of and relevant statistics about your dataset, as well as information about missing data, and unique values.
  2. Chart: create dynamic visualizations, including scatter plots, histograms, heat maps, and line charts, to help you spot outliers.
  3. Filter: filters your data based on various boolean conditions.

If you are a programming purist, however, Einblick also allows you to import Jupyter notebooks, and code in Python cells right on the expansive canvas. Working in a canvas removes the constraints of a linear notebook, where you have to scroll through dozens of Python cells to compare model outputs or data visualizations. Einblick also allows you to easily explore your data, run statistical tests, and build and tune machine learning models using AutoML, and various other features so that you can do your entire data science workflow in an unbounded canvas. If you're already familiar with parts of data cleaning, feel free to skip ahead to view the relevant Python packages and related functions.

Try Einblick for Free

Data profiling

As a first step in data cleaning, it is important to profile your data. Data profiling is the process of getting a summary of your data: key descriptive statistics, the count of observations, the types of data stored in each column, whether there are any missing values, and whether any data seems abnormal. Once you have your data profile, you’ll already start noticing things about your data that will help guide you to your next steps or next lines of code. It may also be important at this step to consult a domain expert to check whether what you’re noticing is normal or not.

Ensure consistent formatting

Consistent formatting of variables can be an issue if you are bringing data from multiple sources or tables together. For example, if you’re analyzing small businesses in a metropolitan area based on customer reviews as well as sales data, the customer review data may include addresses in one format, while the sales data may include addresses in another format. If you want to combine these two datasets, you’ll need to standardize the formatting of the addresses. You may need to break up the addresses into multiple columns, one for each of the components, such as building number, street name, city, state, and zip code. Then you could bring the components back together in a consistent way across both datasets, and finally merge the datasets together.
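As a rough sketch of what this might look like in pandas (the column names and address format here are made up for illustration), you could split an address string into its components and trim stray whitespace:

import pandas as pd

# Hypothetical review data with full addresses stored as single strings
reviews = pd.DataFrame({
    "business": ["Cafe A", "Bakery B"],
    "address": ["12 Main St, Springfield, IL, 62701",
                "34 Oak Ave, Springfield, IL, 62704"],
})

# Split each address into components on the comma delimiter
parts = reviews["address"].str.split(",", expand=True)
reviews[["street", "city", "state", "zip"]] = parts

# Strip stray whitespace so the components match across datasets
for col in ["street", "city", "state", "zip"]:
    reviews[col] = reviews[col].str.strip()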

Type conversion

Another common task when cleaning data is type conversion. For example, if you’re still working on this small business sales data, you may be interested in understanding factors that contribute to sales. If that’s the case, you’ll want to check the columns involved in measuring sales, for example, number of transactions, revenue, or other variables. When examining the data, you may find that the revenue variable includes characters denoting the currency, such as $. This would indicate that the revenue is stored as a string, so you will likely need to strip the $ character off each value in the column, and then convert the numbers from string values to int or float values.
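A minimal sketch of this kind of type conversion in pandas, assuming a hypothetical revenue column stored as strings such as "$1,200.50", might look like this:

import pandas as pd

# Hypothetical sales data with revenue stored as strings
sales = pd.DataFrame({"revenue": ["$1,200.50", "$870.00", "$2,315.75"]})

# Strip the currency symbol and thousands separator, then convert to float
sales["revenue"] = (
    sales["revenue"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(sales.dtypes)  # revenue is now float64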

Remove duplicate data

As you’re examining your data, you may also notice that there are some duplicate observations. For example, you might notice transactions that have the exact same timestamp, for the same business, and the same exact product or service. If you’re working in e-commerce, this is possible, as two people can purchase from a website at the same time, but if you’re only looking at transactions in a physical store, this is highly unlikely. If you’re lucky, each observation may have a unique ID associated with it, but if not, you’ll have to determine which subset of columns to use to identify unique observations.
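In pandas, one way to approach this (the transaction data below is invented for illustration) is to check for duplicates across the columns that should uniquely identify an observation, and then drop the extra rows:

import pandas as pd

# Hypothetical in-store transactions with one repeated row
transactions = pd.DataFrame({
    "timestamp": ["2022-11-01 10:15:00", "2022-11-01 10:15:00", "2022-11-01 11:02:00"],
    "business": ["Cafe A", "Cafe A", "Cafe A"],
    "product": ["latte", "latte", "bagel"],
})

# Count rows flagged as duplicates of an earlier row
print(transactions.duplicated(subset=["timestamp", "business", "product"]).sum())

# Keep the first occurrence of each transaction and drop the rest
deduped = transactions.drop_duplicates(subset=["timestamp", "business", "product"])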

Handle missing data

As you’re exploring your data, it is always important to check if you have any missing values in your dataset. There are many potential causes of missing data–data was never recorded, data was corrupted, certain questions in a survey are automatically skipped based on prior answers, data was incorrectly joined or merged. Depending on the variable and the use-case, you might want to remove all observations with missing values, or you might want to impute or replace values based on a rule. For example, you can replace missing values with the mean or median of a continuous variable. For categorical variables, you can replace missing values with the most common value in the column. Sometimes it is best not to drop or replace with another value, but indicate that the value is missing, for example with a value of “Unknown” or “0.”

Handle outliers

When going through your data, you will sometimes identify values that seem extreme; these are called outliers. It is important to handle outliers properly because they can affect your analysis. For example, outliers will affect the mean, and thus the standard deviation, which in turn is used for things like linear regression. We expand on how to check for outliers in our data cleaning guide.

How to clean data with Python

Python library logos

One of the most popular programming languages in the data science and machine learning spaces is Python. Python is open source, versatile, flexible, and has a robust community that can help support your team’s work. Python also has a number of packages that offer great functionality in the data science workflow. You can use various packages, which get updated regularly, to create data visualizations, clean data, and build machine learning models.

In this post, we’ll be covering some commonly used functions from pandas, numpy, matplotlib, seaborn, plotly, and built-in string methods. The list is not exhaustive, and does not include all possible arguments in each function, but is hopefully more approachable and compact than documentation can be. Refer to the official documentation for more specific use-cases.

We’ve embedded an Einblick canvas so you can explore some of the code directly via our Python cells, and we’ve linked to documentation below. The canvas includes the classic iris dataset, as well as a Netflix dataset downloaded from Kaggle. Happy cleaning!

Pandas: profiling, subsetting, and handling missing values

import pandas as pd

pandas, referenced via the alias pd, is a free, open source Python library that is key to the data analysis and data science workflow. Using pandas, you manipulate DataFrame objects, commonly seen as df, which are 2-D structures that behave similarly to a table. You can manipulate labeled columns, and index by row number. Unlike NumPy arrays, you can store multiple different kinds of data in one DataFrame object, making them a popular choice for storing data while performing data transformations or other data cleaning tasks. Below, we offer a quick run-through of commonly used pandas functions, what they return, and use-cases. You can check the links to documentation for more details on the available arguments and syntax.

Data profiling with pandas

  • df.shape: returns the dimensions of the given DataFrame, df, as a tuple
  • df.head(): returns the first n rows of the given DataFrame, df. The default value of n is 5.
    • This is a useful function right when you load in your data, so you can get a quick sense of what your data looks like, and what the data types may be.
  • df.describe(): returns the descriptive statistics, such as count, mean, standard deviation, quantiles, minimum, and maximum of all the continuous variables in the given DataFrame, df
  • df["column"].value_counts(): returns a Series with the counts of unique values in the given column
  • df.dtypes: returns the data types of each column in the DataFrame, df
  • df.groupby(): splits the DataFrame, df, applies a function that you specify, and then combines the results again
    • For example, df.groupby(["col1"]).mean() will group the DataFrame by the categories in col1, and each row of the result will show the mean of each numeric column for that category.
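Putting these together, a quick profiling pass over the iris dataset (loaded here via seaborn as one convenient way to get a pandas DataFrame) might look like this:

import seaborn as sns

# Load the classic iris dataset as a pandas DataFrame
df = sns.load_dataset("iris")

print(df.shape)                         # (150, 5)
print(df.head())                        # first 5 rows
print(df.dtypes)                        # data type of each column
print(df.describe())                    # descriptive statistics for continuous columns
print(df["species"].value_counts())     # counts of each species
print(df.groupby(["species"]).mean())   # mean of each numeric column per species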

Subsetting data with pandas

The pandas documentation provides an in-depth tutorial for subsetting pandas DataFrames in different ways. We highlight two below. Subsetting data can be a way of identifying outliers. If you want to learn more options, check out the official pandas write-up.

  • df[["col1", "col4", "col9"]]: use double square brackets [[]] to select specific columns from a DataFrame
    • You can use single brackets to select one column
      • df["column"]
  • df[df["column"] == "value"]: use boolean selection to select certain rows from your data according to a particular condition. You can use different comparison operators, like ==, !=, >, <, >=, and <=.
    • This is a convenient line of code if you need to separate a large dataset into different smaller datasets, or if you’re only interested in running analysis on a particular group
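For example, using the same iris DataFrame as above, you can select columns, filter rows, and flag potential outliers with a common rule of thumb such as 1.5 times the interquartile range (IQR):

import seaborn as sns

df = sns.load_dataset("iris")

# Select specific columns with double square brackets
subset = df[["sepal_length", "petal_length", "species"]]

# Boolean selection: keep only one species
setosa = df[df["species"] == "setosa"]

# Boolean selection can also flag potential outliers, here using the 1.5 * IQR rule
q1 = df["sepal_width"].quantile(0.25)
q3 = df["sepal_width"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["sepal_width"] < q1 - 1.5 * iqr) | (df["sepal_width"] > q3 + 1.5 * iqr)]
print(outliers)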

Handling missing values

  • df.isna(), df.isnull(): returns an object of the same dimension as the DataFrame, df, indicating where there are missing values. NA values, such as None or numpy.nan, will be mapped to True.
  • df.notna(): the inverse of df.isna() and df.isnull() such that NA values will be mapped as False.
  • df.dropna(): drops rows or columns with missing values based on certain constraints.
    • For example, the how argument has two options: "any" drops a row or column if any of its values are NA, while "all" drops a row or column only if ALL of its values are NA.
  • df.fillna(): fill missing values in a particular way.
    • You can fill all missing values with the same value, for example 0, or 9999, or “Unknown”. You can also use a dictionary to fill missing values based on the column name, or you can propagate existing values forward or backward, among other ways.
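As a small sketch, with an invented survey table containing missing values, you could count, impute, and drop missing data like this:

import numpy as np
import pandas as pd

# Hypothetical survey data with missing values
survey = pd.DataFrame({
    "age": [25, np.nan, 34, 41],
    "city": ["Boston", "Chicago", None, "Boston"],
})

print(survey.isna().sum())  # count of missing values per column

# Impute the continuous variable with its median
survey["age"] = survey["age"].fillna(survey["age"].median())

# Replace missing categories with an explicit "Unknown" label
survey["city"] = survey["city"].fillna("Unknown")

# Alternatively, drop any row that still contains a missing value
survey = survey.dropna(how="any")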

Transforming and augmenting your data

  • df.replace(): replace values in a particular way
    • The use-case is a bit different from df.fillna(). For example, if you want to recode t-shirt sizes (S, M, L) into numbers (1, 2, 3), you would use df.replace() instead of df.fillna().
  • df[“new_column”] = some expression: create a new column in a DataFrame simply by using square brackets, naming the column, and setting that new column equal to some expression. You can also create a new column based on an existing column.
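For example, with a made-up order table, recoding t-shirt sizes and deriving a new column might look like this:

import pandas as pd

# Hypothetical order data with t-shirt sizes stored as strings
orders = pd.DataFrame({"size": ["S", "M", "L", "M"], "price": [15.0, 18.0, 21.0, 18.0]})

# Recode categories into numbers with replace()
orders["size_code"] = orders["size"].replace({"S": 1, "M": 2, "L": 3})

# Create a new column based on an existing column
orders["price_with_tax"] = orders["price"] * 1.08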

NumPy: scientific computing with Python

import numpy as np

NumPy, known via the alias np, is a key Python library that is free, open source, and built for scientific computing. It was first released in 2005, built off of the Numeric and Numarray libraries. Although NumPy skews towards the field of mathematics, it is a fundamental part of the Python ecosystem, and therefore any data scientist should be familiar with the basics of numpy. The NumPy library augments Python with the ndarray object, which is an n-dimensional array, as well as highly efficient methods to operate on arrays and matrices. In brief, NumPy arrays take up less memory than a similar data structure using Python lists would, and many functions are a lot faster when using NumPy arrays. Other popular libraries such as pandas, matplotlib, and scikit-learn make use of NumPy’s core offerings.

Here are just a few of the functions from the numpy library that may be useful in your data cleaning process:

  • np.log(): takes an array-like input, and returns an array consisting of the natural logarithm of each element in the input
    • This function is useful if you need to transform a variable that follows a power law into a more normal distribution, adding validity to the results of any analysis. If you log transform your data, remember that any insights or conclusions drawn are based on the natural logarithm of the initial variable. For example, if you’re running a simple linear regression model, and you have log-transformed the y variable, a one-unit increase in the x variable is associated with some increase in the logarithm of the y variable.
    • Keep in mind that you cannot take the logarithm of 0, so if you have any values that are 0, you may need to drop those values or handle them in some other way.
  • np.isnan(): returns a boolean array indicating which elements are NaN. This functions similarly to df.isna() or df.isnull() but for NumPy objects.
  • np.where(): returns certain values when a condition is met, and other values when a condition is not met. np.where() functions similarly to an if-else statement applied to a given data structure.
    • This function may be useful if you are trying to replace values under certain conditions, or if you noticed something odd about data collection, or a data entry error that must be corrected.
    • You can also use this kind of conditional to identify outlier data.
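Here is a brief sketch of these functions together, using a small array of invented revenue values:

import numpy as np

# Hypothetical revenue values, including a zero and a missing value
revenue = np.array([120.0, 0.0, 980.0, np.nan, 15500.0])

# Flag missing values
print(np.isnan(revenue))

# Replace zeros before log-transforming, since the logarithm of 0 is undefined
cleaned = np.where(revenue == 0, np.nan, revenue)
log_revenue = np.log(cleaned)
print(log_revenue)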

Visualizations: `matplotlib`, `seaborn`, `plotly`

Data visualizations are a crucial part of data cleaning, as they can help you glean insights via plots that would be much harder to extract just by looking at summary statistics or rows and rows of data. There are many Python packages dedicated to or predominantly used for creating data visualizations. Depending on the purpose of your visualizations–sharing with teammates versus presenting to stakeholders–one package may be more useful than another. In this section of the article, we’ll focus on matplotlib, seaborn, and plotly. All offer the ability to create many different kinds of plots, from scatterplots, histograms, and bar charts to boxplots, pie charts, and heat maps. There are other more advanced packages that focus on more specific kinds of visualizations, such as geoplot and geopandas for geospatial data, and networkx for network visualizations.

Matplotlib functions and syntax

import matplotlib.pyplot as plt

  • plt.scatter(x, y): produces a scatter plot of x vs. y, where x and y are continuous variables. You can tune the plot by selecting the color and shape of the marker, width of the line, and other aspects of the plot.
  • plt.bar(x, height): produces a bar plot, used for categorical variables, where x gives the bar positions or categories and height gives the bar heights. You can customize the plot using different arguments, such as the width of the bars, color, and tick labels, among other aspects.
  • plt.barh(y, width): produces a horizontal bar plot, used for categorical variables. You can customize the plot like you can the traditional vertical bar plot.
  • plt.hist(x, bins): plots a histogram, showing the distribution of x, based on the bins provided. If the bins argument is not specified, the data will be binned automatically.
  • plt.boxplot(x): creates a box-and-whiskers plot.
  • plt.pie(x): creates a pie chart. Although not typically used in traditional statistical analysis, pie charts can be illustrative in presentations, or for sharing work.
  • plt.xlabel(), plt.ylabel(): sets the label for the x-axis, and y-axis of a plot respectively.
    • NOTE: there is slightly different syntax when creating multiple subplots in matplotlib. We’ll go over that syntax below.
  • plt.title(): sets the title for a plot.
    • NOTE: there is slightly different syntax when creating multiple subplots in matplotlib. We’ll go over that syntax below.
  • plt.show(): shows any open plots.
  • fig, ax = plt.subplots(): creates a figure, with a set of subplots, which you can individually customize. You can specify the number of rows and columns that you want in your figure.
    • Ex: fig, axs = plt.subplots(2, 2) creates 4 plots in a 2 x 2 grid.
    • You can then use axs[0,0] to refer to the subplot in the upper left, and axs[0,1] to refer to the subplot in the upper right, and so on. For example, axs[0,0].scatter(x, y) would create a scatter plot in the upper left.
    • You can also unpack the Axes objects into separate variable names. For a grid, the unpacking is nested, for example, fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2); for a single row, fig, (ax1, ax2) = plt.subplots(1, 2) works.
    • There are a ton of different functions you can use to customize your subplots. You can find a comprehensive list in the documentation on the Axes class. Here are some basic functions for labels:
    • fig.suptitle(): adds a title for the entire group of subplots
    • fig.tight_layout(): adjusts spacing around subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Load the classic iris dataset as a pandas DataFrame
# (one convenient way to get the same data as the embedded canvas)
df = sns.load_dataset("iris")

fig, (ax1, ax2) = plt.subplots(1, 2)

# Create scatter plot in first Axes object
ax1.scatter(x = df.sepal_length, y = df.petal_length)
ax1.set_title("Scatter Plot of Sepal Length vs. Petal Length")
ax1.set_xlabel("Sepal Length")
ax1.set_ylabel("Petal Length")

# Create histogram in second Axes object
ax2.hist(x = df.sepal_length)
ax2.set_title("Histogram of Sepal Length")
ax2.set_xlabel("Sepal Length")
ax2.set_ylabel("Count")

fig.tight_layout()
plt.show()

Seaborn functions and syntax

The seaborn library is based on matplotlib, and creates graphs that can appear a bit more polished.

import seaborn as sns

Common arguments among seaborn functions

  • data specifies the full dataset, which is typically a pandas DataFrame or a numpy array
  • x specifies which attribute from the data variable will be on the x-axis
  • y specifies which attribute, if any, from the data variable will be on the y-axis
  • hue specifies the color of a data point or part of the plot based on the value in a given column. Ex: sns.scatterplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species") creates a scatterplot of penguin flipper length (mm) vs. bill length (mm), and the color of each data point will represent the species of the penguin

Common seaborn functions

  • sns.scatterplot(data, x, y): creates a scatterplot of x vs. y.
  • sns.pairplot(data): creates a grid of plots showing the pairwise relationships of continuous variables in a given dataset. This is helpful when profiling your data in the beginning to get a snapshot of the overall distributions.
  • sns.barplot(data, x, y): creates a bar plot. If vertical, x represents the categories on the x-axis, and y the continuous values whose mean (by default) determines the height of each bar.
  • sns.histplot(data, x, bins): plots a univariate or bivariate histogram. A histogram is a common way to illustrate the distribution of one or more continuous variables.
    • You can also customize the bins your histogram uses, and a number of other plot attributes.
  • sns.displot(data, x, y, kind): creates a distribution plot.
    • There are three kinds of plots available via the kind argument: histograms, kernel density estimates (KDEs), and empirical cumulative distribution functions (ECDFs). You can also create bivariate histograms and KDE plots if you input values for both x and y.
  • sns.lineplot(data, x, y): draw a line plot showing the relationship between x and y.
  • sns.boxplot(data, x, y): draw a box-and-whiskers plot to show distributions of a continuous variable for different groups. Either x or y can contain the categorical or continuous data.
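As a quick sketch with the iris dataset (which ships with seaborn), a scatterplot colored by species and a pairwise grid of plots could look like this:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")

# Scatterplot colored by species via the hue argument
sns.scatterplot(data=df, x="sepal_length", y="petal_length", hue="species")
plt.show()

# Pairwise relationships between the continuous variables
sns.pairplot(data=df)
plt.show()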

Plotly Express functions and syntax

Plotly is a popular package for more interactive visualizations, so in the event you are preparing visualizations for an interactive data application, or for a presentation, you may want to consider plotly.express. The syntax for Plotly Express functions resembles seaborn syntax quite a bit, except the data_frame argument replaces the data argument, and the color argument replaces the hue argument for assigning color to data in plots.

import plotly.express as px
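For example, here is a minimal sketch of an interactive scatter plot using the iris dataset that ships with Plotly Express; note color playing the role of seaborn's hue:

import plotly.express as px

# Load the iris dataset bundled with Plotly Express
df = px.data.iris()

# Interactive scatter plot, colored by species
fig = px.scatter(data_frame=df, x="sepal_length", y="petal_length", color="species")
fig.show()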

Text data: string methods

Traditional data cleaning and visualizations are great when working with categorical and continuous data, but increasingly, we need to be able to handle text data. For example, in the case of analyzing small business data, ratings that go from 1 to 5 stars can give us some information, but there is a lot more information that can be extracted from written customer reviews. To handle this kind of text data, we can use some methods native to Python to help us clean and format data so it is ready for natural language processing and machine learning algorithms.

Built-in Python string methods and syntax

One of Python’s built-in data types is textual data, referred to as strings, and abbreviated as str. If you have data stored as a str object, then you can manipulate that data with a variety of string functions. You can refer to the full list of string methods on the Python website, but we have pulled out a few functions that may be particularly useful when cleaning data.

  • str.strip([chars]): returns a copy of the given string with the specified leading and trailing characters removed. If no arguments are given, the function will default to removing all leading and trailing whitespace.
    • Ex: "     space    ".strip() returns "space"
    • This function can be particularly useful if there is extraneous information at the beginning or end of strings from how the data was collected.
  • str.split(sep): returns a list of the words in the string, using the sep argument as the delimiter string. If no sep value is provided, any whitespace is treated as one separator.
    • This function can be useful in cases where you want to separate out information from one string into several variables. For example, if customer names are stored as “last name, first name”, you can use this function to extract just the first names and just the last names.
  • str.splitlines(): returns a list of the lines in the string, breaking at line boundaries. The full list of newline characters is in the documentation.
  • str.join(iterable): concatenates the strings in the provided iterable, such as a list or a tuple, and returns the result as a string.
    • You can also concatenate strings using the + operator. For example, "abc" + "def" returns "abcdef".
  • Changing case: there are several ways to change the case of your text.
    • These can be useful when you are standardizing formatting or trying to check for uniqueness in your data. For example, if you are looking at book data, book titles may be entered differently due to human error, or different editions. You may want to use one of the earlier functions to remove all punctuation, and then change all the strings to lowercase first before checking the number of unique books in the dataset.
    • str.lower(): returns a copy of the string all in lowercase
    • str.upper(): returns a copy of the string all in uppercase
    • str.swapcase(): returns a copy of the string with every lowercase character converted to uppercase, and every uppercase character converted to lowercase.
  • str.replace(old, new): returns a copy of the string with all the old substrings replaced by the new substring.
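Putting a few of these together, here is a small sketch with an invented customer name stored as "last name, first name" with stray whitespace:

# Hypothetical customer name stored as "last name, first name" with stray whitespace
raw = "  Smith, Ada  "

cleaned = raw.strip()                  # "Smith, Ada"
last, first = cleaned.split(", ")      # "Smith", "Ada"
full_name = " ".join([first, last])    # "Ada Smith"
print(full_name.lower())               # "ada smith" -- handy when checking for uniqueness
print(full_name.replace("Ada", "A."))  # "A. Smith"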

Frequently asked questions

What is data cleaning?

Data cleaning, sometimes called data cleansing, data wrangling, or data scrubbing, is the process of getting data ready for further analysis.

How do you clean data in Python?

Typically, you will use libraries like pandas and numpy to aid in your data cleaning efforts. Data cleaning can include, in no particular order:

- Data profiling
- Removing duplicate data
- Handling missing data
- Type conversion
- Handling outliers
- Standardizing formatting
- Transforming variables, and more.

Each dataset will require slightly different iterations of the data cleaning process. As you clean data, you may discover more information that leads you to other data cleaning steps.

Is Python good for data cleaning?

Yes, Python is a high-level, general-purpose programming language, and is a favorite among data scientists, data analysts, and other data professionals. Python has an extensive ecosystem of libraries, packages, and modules that make not only data cleaning easier, but also the rest of the data science process, from creating visualizations to building predictive models. The most foundational packages for data cleaning include pandas and numpy. You may also use visualization packages such as matplotlib, seaborn, or plotly.

Why use Python for data cleaning?

By using Python, you gain access to a variety of packages, such as pandas and numpy, that can help you clean your data, which is the multi-step, iterative process of getting data ready for further analysis.

Which Python libraries are used for data cleaning?

There are several Python libraries, packages, and modules used for data cleaning. Two of the most popular and commonly used are pandas and numpy. As data cleaning is iterative, you may also need to visualize your data using packages like matplotlib, seaborn, or plotly, among others.

Why is data cleaning important?

Your models and predictions are only as good as the data they were trained on. Even a great machine learning pipeline cannot save dirty data. Confidence in a machine learning model without awareness of the quality of the underlying data can result in misleading insights and the spread of inaccurate conclusions.

What is the difference between clean data and dirty data?

Clean data is the result of a process called data cleaning, in which data is prepared for further analysis. Dirty data is the initial or raw data collected from various sources. Data cleaning is a multi-step, iterative process that can include type conversion, variable transformation, handling duplicates, handling outliers, and handling missing data, among other steps.

Start using Einblick

Pull all your data sources together, and build actionable insights on a single unified platform.

  • All connectors
  • Unlimited teammates
  • All operators