At Einblick we aren't just obsessed with making great data science software, we are also determined to provide our users with the very best information about all things data science. Whether it's data preparation or going in-depth on the best steps to take to transform your data into something actionable, we have you covered. With that in mind let’s go over a comprehensive review of data cleaning.
What is data cleaning?
Data cleaning is an essential process for any business that relies on data to make decisions. Data cleaning is the process of identifying and correcting inaccuracies and inconsistencies in your data. This process is essential to improve the quality of data used for decision making, analytics, and reporting. Data cleaning is also sometimes referred to as data scrubbing or data wrangling.
Data quality assessment
The first thing we want to do is evaluate the state of the data we collect. No need to reinvent the wheel here because there are already tried and true approaches to data evaluation. The same approach is going to work locally or if you are data cleaning online. Let’s review some standard approaches that aren’t mutually exclusive.
Data type checks
Data type checks might be the most obvious method to check for errors but sometimes it's easy to gloss over. You want to make sure all the data you are expecting to be numerical is in fact numerical and not stored as text. Sometimes data can be stored as text when it really should be numerical, which can cause problems downstream if not dealt with.
The same goes for dates, you want to check that the data type is in fact date and not stored as text. This can cause all sorts of problems, especially if you are trying to do time-based analysis.
Even if you are new to data science, you can probably guess that descriptive statistics are statistics that describe the data. Running descriptive statistics on your dataset and eyeballing it to see if anything stands out as being obviously wrong is a great first step no matter what kind of data you are working with. Calculating descriptive statistics is often a part of data profiling. Descriptive statistics can also be used to identify patterns and trends in data that you cannot identify simply by looking at a large table of data.
Here are some descriptive statistics that are typically calculated for numerical data, along with their descriptions.
Standard descriptive statistics
- Count (n): the number of observations in the dataset
- Minimum: smallest value in the dataset
- Maximum: largest value in the dataset
- Measures of central tendency: describe the typical value in a dataset
- Mean: sum of all data points divided by the total number of data points or observations
- Median: middle value once all observations are ordered
- Mode: value that occurs the most frequently in a dataset
- Measures of spread: describe how similar or different the values of a variable are in a dataset
- Variance: sum of the squared difference from the mean, divided by n
- Standard deviation: the square root of the variance
- Range: difference between the smallest and largest values in a dataset
- Interquartile range (IQR): difference between the 1st and 3rd quartiles or the 25th and 75th percentile of the dataset
Advanced descriptive statistics
- Percentiles and quartiles: used to split up a dataset into even groups
- Skewness: a measure of asymmetry in the data, a perfectly symmetrical distribution will have a skewness measure of 0. Data can be positively or negatively skewed.
- Kurtosis: a measure of the “heaviness” of the data’s tails or outliers
Whether you just need that standard values or want to get extra fancy, getting in the habit of running descriptive statistics every time you prepare your data means that ideally you have an automated workflow setup to do that automatically. Einblick has you covered there. Our profiler cell is a great first step to understanding the basic characteristics of all the numerical data in any given dataset.
Beyond the basic statistics, you can use our Python cell and relevant Python packages to calculate any more advanced descriptive statistics you may need.
Check for duplicates
Sometimes bad data can make it into your dataset through human error or data entry issues. Duplicate observations are one example of this, which can really throw off your results if you aren't careful. Identifying and dealing with duplicate observations is an important step in any data cleaning process.
If your dataset has an ID variable for each observation, you can use that variable to quickly identify duplicate entries, for example in a SQL dataset. If two observations have the same ID then they are also likely duplicates and should be removed from the dataset. One way to get at this information is to use SQL or Python functions to find data with duplicate key values.
But a lot of the times, identifying duplicates is not so easy. Once you find a pair of observations that have exactly the same values for all variables, this could indicate a larger systemic issue about redundancy in your data.
However you choose to check for duplicates, make sure to rerun your descriptive statistics on that subset of the data to see if any major differences come about. Data cleaning is often recursive in that you don’t just apply various data cleaning methods as a blanket approach but often clean data by nesting approaches and iterating on the process.
Identify and deal with missing values
Missing values are another type of bad or incomplete data that can make its way into your dataset. Missing values can be caused by a variety of things, such as errors in data entry or data collection, or issues when migrating data between applications or data storage solutions, or simply missing information because it doesn’t exist.
Missing values can cause problems with data analysis and modeling, especially if there are a lot of them. Additionally, it is important to determine if missing values are occurring systematically, for example, if you are running a project about water quality in a particular geographic region, there could be one part of the region that tends to have missing data. This could indicate a larger issue, and could vastly skew the results of any analysis. Regardless of how they got there, missing values can cause problems with your analysis and should be dealt with as part of your cleaning data for effective data science.
Sometimes missing data is not just a blank cell in a table, but may be coded as a null value, a
NaN or something else. If you have not personally collected the data, it is also important to check for any existing data that is actually missing. For example, if examining survey data, you may encounter values that fall outside a normal range. Let’s say you’re looking at ages of animals in a zoo, and you see a lot of animals with a listed height of 9999 or 0–this would seem odd given the context. Perhaps the zoo does not know how old these animals are. Even though there is a value available in the dataset, including these values in your statistical analysis would likely be inaccurate.
Depending on the severity of your missing data problem, you can remove any observations with missing values, code them in a particular way, or replace the values with synthetic data, among other solutions.
Check for outliers
An outlier is an observation that is extremely distant from other observations. They can be caused by errors in data entry or data collection, or simply by the fact that they are genuinely unusual observations; what constitutes an outlier is specific to each situation because it is relative to the distribution of the values of that variable.
Outliers can cause problems with data analysis and modeling, and should be identified and dealt with as part of your data cleaning process. Mainly you need to determine if you will label an outlier as real or remove it from the dataset.
Typically you can throw out some extreme values but it depends on the situation. For example, in B2B SaaS sales you might have lots of free and individual users and a few enterprise accounts. In that case you might make a case that the outliers point to two distinct groups of data that shouldn’t be pooled together but definitely not ignored or thrown out completely.
Another option is to transform them, such as by taking the log of the values. This can be a good option if the outliers are due to errors in data entry or data collection, or if, in some cases, you know that your data is distributed in a skewed manner. For example, looking at wealth distribution, you typically see a handful of extremes, and typically when handling wealth as a variable, or any variable that follows a power-law distribution, you will see data analysts and data scientists taking the logarithm. In that case, it is important to keep in mind when interpreting the results of your analysis that your variable is no longer measured in currency, but in the logarithm of the currency.
What’s the best way to identify outliers?
Even though outlier status is somewhat subjective, there are a few different standardized ways to start identifying outliers. One of the most common is to use standard deviation. For example, based on your data, if an observation is more than 3 standard deviations from the mean, you can label it as an outlier. Another method is to use percentiles, if an observation is more than 1.5 times the interquartile range smaller than the 25th percentile or 1.5 times the interquartile range larger than the 75th percentile, you can label it an outlier. A great way to visualize these kinds of outliers is with the use of box plots.
If you have a multivariate model, you may want to determine outliers based on multiple dimensions. The most common method for identifying outliers based on multiple variables is Mahalanobis distance. You can also alter a method like the Isolation Forest algorithm to improve performance on high-dimensional data.
Data visualization is a great method for assessing data quality and the last type of quality assessment check we’ll go over. Data visualization allows you to see your data in a way that is not possible with other methods, such as descriptive statistics. Data visualization can also be used to identify patterns and trends in data that might not be obvious using other methods.
There are a variety of different ways to visualize data, but one of the most common is to use a histogram. A histogram is a graphical representation of the distribution of a variable. Histograms can be used to identify outliers and other patterns in data. You can also use box plots, scatterplots, and heat maps to learn more about your data visually. If you have Einblick you are going to be able to save that chart into a dashboard, and schedule the visualization to update with new data after a given time interval.
Standardize your data cleaning process
So you might have gone into this as a seasoned data scientist thinking there wasn't much you could take away from the data quality assessment section. If you skipped ahead it might be worth a visit. Data quality assessment is important but it can be time-consuming. The good news is that it doesn't have to be if you have a process in place and automate as much of it as possible.
The goal here is to have a data cleaning process that you can consistently apply to every dataset that you work with. This process should be repeatable and scalable so that you can easily apply it to datasets of different sizes.
A good data cleaning process will help to ensure that your data is of high quality and free of errors, which is essential for accurate analysis and decision making.
Here's a quick overview of what a data cleaning process might look like:
1. Import the data into a dataframe.
2. Calculate summary descriptive statistics.
3. If anything looks suspicious, pay more attention to those variables.
4. Identify and deal with duplicates.
5. Identify and deal with missing values.
6. Check for and handle outliers.
7…n-2. Iterate on earlier steps.
n-1. Final check of data quality.
n. Save the dataframe.
Data cleaning software
There are a lot of data cleaning software options and while we may go through and do a comprehensive review of data cleaning tools at some point in the future, this section is just a jumping off point to consider how to best automate the data cleaning process and what software program you could use to do that data cleaning.
Data cleaning in Tableau
Tableau is a popular dashboard tool, and has a tool called Tableau Prep made specifically for data cleaning, so if Tableau is already integrated into your company, it can make sense if your data team is just starting out.
- Tableau can handle multiple data sources, including big data
- There’s a bit of a learning curve to using Tableau for data preparation
If you or your company is already using Tableau for analysis, it might make sense to just use the data preparation features that are built in. If you're not using Tableau or you want something that is focused solely on data cleaning, consider one of the other options in this article.
Data cleaning in Excel
Excel is a popular choice for data cleaning because it's familiar to many, even for those with a less technical background.
- Excel is widely used and may be more comfortable as a bridge into a more technical data space for those just starting out
- Excel is not designed for data preparation and can be slow when dealing with large datasets or complex data types, such as text data
If you're already using Excel or other Microsoft Office programs for other purposes, it might make sense to use it for data preparation as well. If you're not using Excel or you want something that is built for data cleaning, and can manage more complex data, consider one of the other options below.
Data cleaning in Alteryx
Alteryx is a popular choice for data preparation because it's designed specifically for data preparation as well as process automation, saving a lot of time on tasks that need to be done repeatedly.
- Alteryx is designed specifically for analytics automation and includes a wide range of solutions for individuals as well as teams
- Alteryx can be cost prohibitive, especially for smaller teams
It can be difficult to decide which Alteryx solution is best, especially with cost being such a large factor for many, if this is the case, consider some of the cheaper solutions or more familiar solutions on this list.
Data cleaning in Trifacta
Trifacta also known as Trifacta Wrangler or Designer Cloud Powered by Trifacta is a popular choice for data preparation because it's designed specifically for data cleaning, data exploration, and transformation and includes a wide range of features.
- Trifacta Wrangler has been in the data space for over a decade, and was acquired by Alteryx in 2022, so there are other integrations that can be leveraged
- Trifacta Wrangler can be expensive, even starting at their first paid tier.
Like Alteryx, with cost being such a large factor for many, if Trifacta is not cost-effective for your team, consider some of the cheaper solutions or more familiar solutions on this list.
Data cleaning in Talend
Talend is a popular choice for data preparation because Talend offers solutions specifically for data quality. You can use Talend’s solutions to profile and clean your data, as well as to standardize formatting.
- Talend was built to handle data, and it offers two main products: Talend Data Fabric and Stitch, but also offers an open source solution called Talend Open Studio
- Talend Open Studio requires a lot of disk space (20GB+), and its other solutions may offer functionality that is redundant in your existing workflow.
If Talend Open Studio takes up too much space, you may want to consider a cloud solution, like Einblick’s SaaS offering.
Data cleaning in Informatica
Informatica offers a solution that specializes in data quality, which lets you profile your data, as well as reuse common data quality checks throughout different workflows.
- Informatica offers many data solutions, from data quality to data governance to a data marketplace where you can publish machine learning models and data pipelines
- Informatica’s pricing is a bit more opaque as you purchase software capacity based in Informatica Processing Units, and this may be harder to pitch to a team or company as an investment
If you're already using Informatica for other purposes, it might make sense to use it for data preparation as well. Otherwise, some of these other solutions that have more transparent pricing may be an easier solution to implement for your team.
Give Einblick a try
As a data scientist, it’s important to get a snapshot of your data as quickly as possible so you can start digging into the data with exploratory data analysis, necessary data transformations, and tuning machine learning models. To ease the process, at Einblick we’ve built some core cells:
- The profiler cell gets you summary statistics on all your numerical variables like mean, minimum, and maximum, as well as counts for the top values of your categorical variables.
- The filter cell lets you subset data quickly and easily. You can then run further analysis on various subsets of the data.
An added benefit of Einblick is that you can also easily create visualizations used in the exploratory data analysis process like histograms, box plots, scatterplots, and heatmaps in the same canvas that you’re building and tuning models. The canvas gives you more freedom than in linear legacy tools to work through your analysis, iterate on code, and share your work with collaborators. You can even import an existing Python notebook into an Einblick canvas to resume work on current projects and leverage Einblick’s unique capabilities, like our progressive computation engine.
Good luck implementing your data cleaning solutions, and let us know if you try out Einblick!
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.