What is data wrangling?
The term "data wrangling" has been used informally in the software industry for many years, but has only recently been defined formally. Data wrangling is the process of transforming and mapping data from one "raw" format or structure into another. Data wrangling, sometimes called data munging, is a common data preparation task in data analysis, machine learning, and artificial intelligence applications.
Part of the data wrangling process is to cleanse, aggregate, or otherwise manipulate data in preparation for analysis, visualization, or storage in a database. But it isn’t synonymous with data cleaning, although the two terms are sometimes used interchangeably: data wrangling can also involve converting data from one database or file format to another. There are many different tools and techniques for accomplishing these tasks.
Data wrangling strategies
The data wrangling tasks you perform will depend on the requirements of the target application or system. Some common tasks include data cleansing, data aggregation, data normalization, and data transformation. We’ll go over each of these briefly in this article.
The purpose of data wrangling
Data wrangling is performed to achieve three main goals:
- Tidy the dataset
- Transform the data
- Communicate the results visually
Tidy up the dataset
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types of observational units. The principles of tidy data provide a standard way to organize data values within a dataset.
The rules can be summarized as:
- Each column is a variable.
- Each row is an observation.
- Each type of observational unit forms a table.
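As a small sketch using pandas, a common Python data wrangling library (the column names and values here are invented for illustration), a tidy table looks like this:

```python
import pandas as pd

# Hypothetical measurements: one row per observation,
# one column per variable, one table for this unit.
tidy = pd.DataFrame({
    "subject": ["a", "a", "b", "b"],
    "visit": [1, 2, 1, 2],
    "heart_rate": [72, 75, 68, 70],
})

# Each column is a variable; each row is an observation.
print(tidy.shape)  # (4, 3)
```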
That being said, the tidy data standard is not always the most practical way to organize your data. In some cases, it may make more sense to keep your data in a messy format. The tidy data standard is a guideline, not a rule.
But let’s talk about data transformation assuming the most common case and come back to messy data later.
Transform the data
Data transformation is the process of converting data from one format or structure into another. It is often used to cleanse, aggregate, or otherwise manipulate data in preparation for analysis, visualization, or storage in a database, and it can also be used to convert data from one database or file format to another.
The specific transformations you perform, and the tools and techniques you use to perform them, will depend on the requirements of the target application or system. Some of the most common are described below.
Data cleansing: This is the process of identifying and correcting errors and inconsistencies in data, either manually or automatically using software tools. Common data cleansing techniques include:
- Data validation: verifying that data is accurate and complete.
- Data normalization: transforming data into a consistent format.
- Data imputation: filling in missing values in data.
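Here is a hedged pandas sketch of these cleansing techniques (the column names and plausible-range check are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 200],
    "score": [10.0, 20.0, None, 40.0],
})

# Validation: flag ages outside a plausible range (missing values fail too).
valid = df["age"].between(0, 120)

# Imputation: fill missing scores with the column median.
df["score"] = df["score"].fillna(df["score"].median())

# Normalization: rescale score to the 0-1 range (min-max).
df["score_norm"] = (df["score"] - df["score"].min()) / (
    df["score"].max() - df["score"].min()
)
```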
Data aggregation: This is the process of combining multiple datasets into a single dataset, either manually or automatically using software tools. Common data aggregation techniques include:
- Data concatenation: This is the process of combining multiple datasets into a single dataset, usually based on rows or columns.
- Data merging: This is the process of combining multiple datasets into a single dataset, based on shared values across multiple datasets.
- Data summarization: This is the process of creating a summary of data. Data summarization can be performed manually or automatically using software tools.
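As a sketch of concatenation and summarization in pandas (the store names and sales figures are invented):

```python
import pandas as pd

q1 = pd.DataFrame({"store": ["A", "B"], "sales": [100, 150]})
q2 = pd.DataFrame({"store": ["A", "B"], "sales": [120, 130]})

# Concatenation: stack the two quarters row-wise into one dataset.
combined = pd.concat([q1, q2], ignore_index=True)

# Summarization: total sales per store.
totals = combined.groupby("store", as_index=False)["sales"].sum()
```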
- Data filtering: This is the process of selecting a subset of data from a larger dataset, either manually or automatically using software tools.
- Data sorting: This is the process of ordering data based on one or more variables. Sorting can reveal the breakdown of your data, such as whether certain groups are larger than others.
- Data joining: This is the process of combining two or more datasets by matching values on common variables. There are several different kinds of joins, including inner, outer, left, and right joins, which can be performed manually or automatically using software tools.
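These three operations can be sketched in pandas as follows (the customer and order data are invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Ben", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 3, 4], "amount": [50, 25, 75, 10]})

# Filtering: keep only orders above a threshold.
big_orders = orders[orders["amount"] > 20]

# Sorting: order customers by name, descending.
by_name = customers.sort_values("name", ascending=False)

# Joining: an inner join keeps only ids present in both tables;
# how="left", "right", or "outer" give the other join types.
joined = customers.merge(orders, on="id", how="inner")
```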
Show off your results
Data visualization is the process of creating visual representations of data in a variety of formats, including charts, maps, and tables. It is a powerful tool for understanding, communicating, and discovering insights in data, which makes it an important part of data wrangling.
There are many different types of data visualizations, and the right one depends on the data being visualized. Some common types include:
- Bar charts: One of the most common data visualizations, used to show the distribution of a categorical variable.
- Line charts: Used to visualize how a value changes according to another continuous variable; for example, tracking trends or comparing values across different time periods.
- Pie charts: Used to visualize proportions, such as the share each category takes up in a categorical variable.
- Maps: Used to visualize data in a spatial context, such as showing the distribution of data or comparing values across geographical areas.
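As a minimal sketch using matplotlib, a common Python plotting library (the categories and counts are invented), a bar chart can be created like this:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical category counts.
categories = ["red", "green", "blue"]
counts = [12, 7, 5]

fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_xlabel("color")
ax.set_ylabel("count")
fig.savefig("bar_chart.png")
```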
Messy data: pros and (mostly) cons
Messy data comes in many forms, depending on the application or system. It can be caused by errors in data entry, data collection, or data processing, or by changes in the structure or format of the data.
Messy data can be difficult to work with and can often lead to incorrect results, which is why so much of data wrangling is devoted to cleansing, aggregating, and transforming data before analysis, visualization, or storage.
Here are a few kinds of messy data that we’ll go over:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in multiple columns.
Column headers are values, not variable names
This type of messy data is often seen in data that has been exported from a database or spreadsheet. In this type of data, the column headers are values, not variable names. This can make it difficult to understand what the data represents.
To fix this type of messy data, you will need to rename the columns to match the variable names.
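Using pandas, this fix is often a `melt()` call (the country and year data below are invented for illustration):

```python
import pandas as pd

# Hypothetical export where the year column headers are values, not variables.
messy = pd.DataFrame({
    "country": ["US", "CA"],
    "2021": [10, 20],
    "2022": [15, 25],
})

# melt() moves the column headers into a "year" variable.
tidy = messy.melt(id_vars="country", var_name="year", value_name="count")
```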
Multiple variables are stored in one column
This type of messy data is often seen in data collected from surveys or other forms, or when datasets from different sources are aggregated. It can make it difficult to understand what the data actually represents.
To fix this type of messy data, you will need to split the column into multiple columns, each containing one variable.
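In pandas, this split can be sketched with `str.split` (the combined city-and-state column is an invented example):

```python
import pandas as pd

# Hypothetical export with city and state crammed into one column.
df = pd.DataFrame({"location": ["Boston, MA", "Austin, TX"]})

# Split the combined column into two separate variables.
df[["city", "state"]] = df["location"].str.split(", ", expand=True)
```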
Variables are stored in multiple columns
This type of messy data is often seen in data collected from surveys or other forms. For example, if a survey question is multiple choice, sometimes the answers are all stored in one column as categorical data, but other times they are stored as multiple columns, one for each category. When running certain models, it is important to use one-hot encoding or label encoding to disaggregate categorical variables, but you should still double-check the data to make sure you have a clear understanding of what each column represents.
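One-hot encoding can be sketched in pandas with `get_dummies` (the survey answers below are invented for illustration):

```python
import pandas as pd

# Hypothetical multiple-choice answers stored in one categorical column.
df = pd.DataFrame({"answer": ["yes", "no", "maybe", "yes"]})

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df, columns=["answer"])
```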
Messy data pros
In the rare cases where messy data is better, it’s usually because messy data takes up less space than tidy data. In big datasets, every byte counts when it comes to storage space and processing time. So if you know that your data will never be used for anything other than storage and retrieval, or if you lack time and resources, learning how to work with messy data may be a good enough temporary solution.
Data wrangling methodology
Start with data discovery
Data discovery is the process of determining what data is available and how it can be used to answer business questions.
During this phase, you will want to take a close look at your data sources and determine what information they contain that can be used for your analysis. You may also want to consider what other data sources might be helpful to include in your analysis.
Do the data prep
Data preparation is an umbrella term that can refer to many steps, including, but not limited to, cleaning and formatting your data so it can be used for analysis. This phase is important because it ensures that your data is accurate and ready to be processed by your chosen analytical tool.
During data preparation, you will want to remove any invalid or incorrect data points, as well as any duplicate records. You may also want to standardize the format of your data so it is consistent across all of your data sources. You may also want to perform data profiling early on in your data preparation process so that you get a clear understanding of what your data looks like.
Clean the data
Data cleaning is the task that eliminates everything that’s not useful or relevant to your data analysis. This might include anything from duplicate records to invalid or missing values.
One of the issues that can arise during data cleaning is inconsistent data. Inconsistent data is a type of bad data that doesn’t match up with other data in the dataset and needs to be removed or reformatted or replaced.
For example, if you are collecting a list of customer names, you shouldn’t have any duplicates. But sometimes, customer names aren’t consistent across all of the data sources you’re using; for instance, one source may use a first name then last name format while another might use a last name, first initial format. This makes it more difficult to identify duplicates and incorrect values since they are all recorded differently in each source.
When you’re cleaning your data, it’s important to examine all of the fields in your dataset and make sure they are consistent so every value can be accurately read and processed.
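Standardizing a field before checking for duplicates can be sketched in pandas like this (the customer names are invented for illustration):

```python
import pandas as pd

# The same customer recorded differently across sources.
names = pd.Series(["John Smith", "  john smith ", "Jane Doe"])

# Standardize whitespace and case so duplicates become detectable.
standardized = names.str.strip().str.lower()
deduped = standardized.drop_duplicates()
```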
Another issue you may come across during data cleaning is missing values. Missing values are empty fields in your dataset that need to be filled with some sort of default value so the dataset can be used for analysis later on. It’s important to identify these fields early on so you can make sure they are filled with an appropriate value or handled before moving on to your analysis; otherwise, the missing values will skew your results and cause inaccurate insights to be generated from your data.
Now enrich your data
Once you have your data cleaned up, it will be easier to enrich it with additional data that can help to improve your analysis. This step can involve things like adding more fields to your dataset or merging two or more datasets together.
For example, if you have a dataset with customer names and addresses, you may want to enrich it with data from a third-party service that includes customer purchase history or credit scores. This additional data can help you to better understand your customers and target them with more relevant marketing messages.
Enriching your data can also involve adding timestamps to your dataset so you can track changes over time or adding geo-location data so you can see where your data is coming from.
Show them what you got
The final step of data wrangling is to present your findings to your audience. This step can involve creating reports, visualizations, presentations, or dashboards that help to illustrate your findings in a clear and concise way.
It’s important to consider who your audience is and what type of information they are looking for when you’re designing your presentation. You may also want to consider what type of format would be most appropriate for your data; for instance, a data visualization might be more effective than a report for conveying certain types of information.
Once you have your data wrangling process down, you’ll be able to ensure that your data is clean, accurate, and ready to be used for analysis. This will help you to avoid any errors in your analysis and make it easier to generate accurate insights from your data.
Data wrangling vs. Extract Transform Load (ETL)
So after all of that, you might be wondering how ETL differs from data wrangling and when to use each. Data wrangling has typically involved messier data and more ad hoc use cases. ETL has more typically involved structured or semi-structured data, such as CSV or Excel files, and tends to be a more systematic process.
Data wrangling is for when you want to prepare your data for analysis and usually involves tasks such as data cleansing, data aggregation, and data transformation. It has typically been done with programming languages and software tools such as SQL, R, or Python. For example, suppose you have a dataset that contains customer names, addresses, and credit scores. You want to clean the data by removing duplicates and invalid values, and you want to standardize the format of the data so it is consistent across all of your data sources.
ETL, on the other hand, is for when you want to prepare your data for storage in a database and usually involves tasks such as data extraction, data transformation, and data loading. ETL is typically performed using software tools such as Informatica, Talend, or Pentaho.
If you were doing ETL with the dataset from the previous example, you would extract the data from the source as raw data, transform it into a structured format with important metadata, and load it into a database so it can be used for analysis later on.
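A minimal ETL sketch in Python, using pandas and SQLite (the customer records are invented, and a real pipeline would extract from files or an API rather than an in-code DataFrame):

```python
import sqlite3
import pandas as pd

# Extract: hypothetical raw records (in practice, read from a source system).
raw = pd.DataFrame({
    "name": ["Ada ", "ada", "Ben"],
    "credit_score": ["700", "700", "650"],
})

# Transform: standardize names, fix types, drop duplicates.
raw["name"] = raw["name"].str.strip().str.title()
raw["credit_score"] = raw["credit_score"].astype(int)
clean = raw.drop_duplicates()

# Load: write the structured result into a database table.
conn = sqlite3.connect("customers.db")
clean.to_sql("customers", conn, if_exists="replace", index=False)
conn.close()
```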
Data wrangling in Einblick
At Einblick, we’re focused on making data science accessible and efficient, removing repetitive and mundane tasks such as painful environment setup, and speeding up the process of tuning machine learning models, among other improvements to the data science workflow. As you can see in our example canvas below, Einblick offers great code, as well as no-code options for you and your team to speed up and tailor different parts of the data science process.
With Einblick, you get everything you love about Python notebooks, such as a shared kernel, a shared Python environment, and keyboard shortcuts, in addition to a smoother collaboration process and a more visual medium. You can do everything in the data science workflow. You can connect with a number of data sources, from different file types to Amazon S3 buckets, BigQuery, Databricks, Snowflake, and more. In Einblick’s visual canvas, you can also wrangle, clean, and explore your data, as well as build machine learning models, and automate your workflows through our schedules.
Our built-in operators make it easy to perform parts of the data wrangling process. For example:
- Our join cell lets you create different kinds of joins on your datasets easily, quickly, and intuitively.
- Our filter cell lets you subset your data according to different conditions.
As you iterate through the data wrangling steps, you can also use Einblick’s other core functionality to create interactive visualizations, such as histograms, box plots, and scatter plots, instantly using our chart cell. Once you’re satisfied with your data wrangling and data exploration, you can use other Einblick tools, like AutoML, to build and tune machine learning models. Try out Einblick, and let us know what you think.
Frequently asked questions
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.