You’re reading this page because you already have a sense of how important data preparation is, and will continue to be. So how can you harness the value of data preparation?
We cover everything you need to know to get started with data preparation, so whether you’re new to the field, already a data scientist, or researching particular sub-areas, this post is for you. Read on for Einblick’s in-depth overview of data preparation.
Raw data is everywhere, and it’s messy, overwhelming, and useless until you, or someone else, sees its value, sorts it and labels it.
Data preparation is a series of steps used to identify, gather, clean and make data available to other stakeholders–from business intelligence (BI) teams to other analytical teams to the executive suite–ensuring every person has the data supply they need to make key decisions.
We develop more robust and successful business propositions and analyses of what’s happening in the world when we can successfully combine relevant datasets. Of course, the data may be highly relevant and complementary across different datasets. Still, when we bring them together, there are gaps and mismatches aplenty–alternative or erroneous spellings, numbers written as words, different formatting standards, and missing data, for example, where participants in a survey have left a question unanswered.
In order to be most useful, data needs to be synchronized and standardized. Even a single dataset can have missing inputs and errors that need to be resolved before the data becomes analyzable and user-friendly, whether for someone interpreting data, like a data scientist or analyst, or an Online Analytical Processing (OLAP) tool.
So, an example of data preparation is as follows:
Geoffrey wants to learn about the relationship between a person’s height and their daily caloric intake. So, he makes an online questionnaire and sends it to participating nutritionists in various countries.
As the results start to come in, his initial excitement soon gives way to despair: the numbers are nonsensical. In some countries, participants have used feet and inches; in others, meters and centimeters. Some have entered numerals while others have written the numbers out in words, with inconsistent use of commas throughout.
Geoffrey updates the questionnaire, reducing ambiguity by specifying that future participants must use numbers from a drop-down list identifying meters as the required units. He then cleans the data he has collected by converting any imperial units to the metric system and standardizing the input format to match the new, more clearly defined information. Geoffrey is now relieved to see that his data makes more sense and starts to produce some exciting correlations that form the basis of his academic paper.
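Geoffrey’s cleanup step can be sketched in a few lines of Python. This is a minimal, hypothetical example–the entries, the parsing rules, and the centimeters-vs-meters heuristic are all illustrative assumptions, not a general-purpose unit parser:

```python
def to_meters(value: str) -> float:
    """Normalize one raw height entry to meters."""
    value = value.strip().replace(",", ".")
    if "'" in value:  # imperial entry, e.g. 5'11"
        feet, _, inches = value.partition("'")
        inches = inches.replace('"', "").strip() or "0"
        total_inches = int(feet) * 12 + int(inches)
        return round(total_inches * 0.0254, 2)
    number = float(value)
    # Heuristic: a "height" above 3 is assumed to be in centimeters.
    return number / 100 if number > 3 else number

# Hypothetical raw survey entries in mixed units and formats
raw_heights = {"A": "5'11\"", "B": "1.8", "C": "180", "D": "1,75"}
clean_heights = {who: to_meters(h) for who, h in raw_heights.items()}
```

A real pipeline would also flag values the heuristic cannot confidently classify, rather than silently converting them.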
The following key terms are often misused or used differently. Let’s review them so you can effectively develop and communicate your data preparation strategy.
Data scientists spend most of their time wrangling data, a process sometimes also referred to as data cleaning or data munging. This can mean a myriad of things: drawing together relevant datasets from different sources, creating new variables, filling in missing data, and transforming the data through various techniques and operations to make it more valuable.
Data cleansing or preprocessing is often overlooked because people can’t wait to get their hands on the data and start analyzing it. But preprocessing should take place before any end-user interaction: by removing and correcting all sorts of errors up front, we dramatically improve the outputs of any analysis and reduce the time and energy that analysis takes. For example, this might involve removing irrelevant columns, recognizing where data is incomplete, and other actions that highlight the further steps required to make the analysis more robust and meaningful.
When engaging multiple datasets, we need to normalize them so that they are comparable and can interact. This involves selecting the desired format for a specific field and then implementing processes to ensure that all datasets conform to that desired format. Once achieved, the data will have a similar appearance and be easily comparable.
Through data vectorization, we draw out specific information from raw data, turning it into numerical vectors, known as feature variables or features in machine learning. These features can then be used by programming languages such as Python or R, or an engine like Apache Spark to build machine learning models and to help us interpret the meaning of the source data.
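A minimal sketch of what vectorization can look like for text, assuming a simple bag-of-words representation. In practice you would reach for a library featurizer rather than rolling your own; the documents and vocabulary here are made up:

```python
# Toy corpus for illustration
docs = ["data preparation saves time", "clean data saves money"]

# Shared vocabulary across all documents, in a fixed order
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc: str) -> list[int]:
    """Turn a document into a count vector over the shared vocabulary."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocab]

# Each document becomes a numerical feature vector of equal length
vectors = [vectorize(d) for d in docs]
```

Once every document is a fixed-length numerical vector, it can be fed into any standard machine learning model.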
Imputing data is the process of replacing missing data with reasonable, substituted values. We may have empty fields due to missing observations or inputs, or errors in data collection that have made the data unusable. But, if we need a value, we can impute one by drawing on surrounding data to create a reasonable substitute. The more relevant data we draw on, the more accurate and reliable the imputation, and therefore the analysis, will be. For example, suppose we were studying weather in England and were missing a value for the amount of rainfall in London in 2015. We could impute a figure by taking the average rainfall for previous and following years, or the average rainfall for 2015 from nearby towns.
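The rainfall example can be sketched as a simple mean imputation in Python–the figures below are made up for illustration:

```python
# Annual rainfall in mm; 2015 is missing (values are invented)
rainfall_mm = {2013: 602.0, 2014: 588.0, 2015: None, 2016: 610.0}

# Impute the gap with the mean of the known years
known = [v for v in rainfall_mm.values() if v is not None]
imputed = round(sum(known) / len(known), 1)

filled = {year: (v if v is not None else imputed)
          for year, v in rainfall_mm.items()}
```

Mean imputation is only one option; depending on the data, a median, an interpolation between neighboring years, or values from nearby towns might give a more defensible substitute.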
We included this section to help you explain the benefits of data preparation to others. We cannot overstate the value of investing in proper data preparation. Well-prepared data ultimately saves time and money for all stakeholders. In the long term, it also creates easily accessible data libraries for as-yet-undefined purposes, and templated data processing pipelines that can handle future data as a steady flow or in one-off wrangling exercises.
Most of the time, the data we need for our current activity is found within a load of other data. Through thorough preparation, we draw out only the information we need, making it much quicker, cheaper, and easier to analyze–a more extensive dataset takes proportionally longer to process. This liberates resources for other activities, but it also means that our end users get their data quicker because we took the time to prepare it well in the first place.
The cost and accessibility of storage can significantly impact project budgets, so we want to ensure we have a sound data library. While storage is a total mystery to most end users, it’s crucial that they also understand the importance and implications of managing data correctly. Imagine a library full of books not organized by the Dewey Decimal system, or genre, or author, or any system at all. It would be chaos. Just like books, data needs to be organized, labeled, and stored intuitively–on both a macro and micro level. With this analogy, anyone can appreciate the value that sound data management immediately creates.
What are your table names? What are your column names? Is everyone working from the same dataset? I know a few of you will want to fire off a quick email just now to check! Have you experienced an embarrassing disagreement over calculations with colleagues? Different versions of data can lead to confusion as analyses will differ, and this could detract from the focus of a meeting, and waste time.
Uncover missing data and resolve errors before these problems appear in your presentation, email newsletter, or data library. Poor data can be hastily forgotten until a hapless end user draws it out a few years later, only to find themselves unhappy and dissatisfied with your services. Proper data preparation gives our end users the confidence to access and use data, ideally with the help of an in-house data scientist.
Create processes that are stored as templates in your software of choice to maximize the utility of your work. These saved data preparation processes can then be applied to historical data or new data as it comes in. Furthermore, processes can be automated, and as new data becomes available, it can be automatically drawn through your data preparation pipeline, and the clean data can be stored or basic dashboards and visualizations can be created quickly. The automation will save hours of work for your BI teams.
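One way to sketch such a reusable template in plain Python is an ordered list of preparation steps that every new batch of data flows through. The step names and row format here are illustrative assumptions, not a prescribed design:

```python
from typing import Callable

Row = dict
Step = Callable[[list[Row]], list[Row]]

def drop_incomplete(rows: list[Row]) -> list[Row]:
    """Remove rows with any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def normalize_city(rows: list[Row]) -> list[Row]:
    """Standardize the city field's whitespace and capitalization."""
    return [{**r, "city": r["city"].strip().title()} for r in rows]

# The saved "template": an ordered list of preparation steps
PIPELINE: list[Step] = [drop_incomplete, normalize_city]

def run_pipeline(rows: list[Row]) -> list[Row]:
    """Apply every saved step, in order, to a fresh batch of data."""
    for step in PIPELINE:
        rows = step(rows)
    return rows
```

Because the pipeline is just data plus functions, a scheduler can call `run_pipeline` on each new batch with no manual intervention.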
Working with data is usually an iterative process; it is both an art and a science. Sometimes you have to repeat a step, or revisit data transformations after you’ve already built some models or created some visualizations. While there is some variation in terminology or process prioritization, based on industry requirements, datasets, or other factors, the following relatively consistent series of steps will help get you started in preparing data for analysis, building ML models, and extracting valuable insights.
Once a project has been scoped out, you first have to identify relevant datasets or data sources that you believe will help you answer stakeholders’ questions. You can check in with the BI team or other appropriate teams and domain experts to ensure that the data will be a good fit for the project requirements before committing to drawing data from the source. Perhaps you only need data for a particular date range or from a particular geographical region to provide meaningful results through analysis. There may also be constraints on the insights you can make based on the original data collection method–for example, was the data collected from a survey or in-depth interviews? If you have survey data, many questions may have been answered on some kind of Likert scale. Perhaps not all the questions were mandatory, so maybe only a small percentage of the participants answered some of the questions. This may also impact the data scientists’ estimates of the timeline and cost of collecting the identified datasets.
We spoke earlier about data cleaning or data cleansing, also referred to as preprocessing. How you cleanse the data will differ based on the variables or features, the kind of model you’re building, and the end user’s purpose for the data. The data required may have a narrow focus where accuracy is paramount, or it may be less sensitive to specific details but require sufficient depth of data over time to be credible. If you’re building a linear regression model, you might need to take the logarithm of a variable that follows a power law, such as income. If you’re analyzing text data or doing any kind of natural language processing (NLP), you’ll need to use a text featurizer to convert the text into numbers the model can understand.
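The log transform takes one line with the standard library. A quick sketch with made-up income figures showing why it helps a skewed variable:

```python
import math

# Made-up incomes with a power-law-style outlier
incomes = [28_000, 45_000, 62_000, 1_200_000]
log_incomes = [math.log(x) for x in incomes]

# The raw values span a ~43x range; the log values span under 4 units,
# which sits far better with a linear model's assumptions.
raw_spread = max(incomes) / min(incomes)
log_spread = max(log_incomes) - min(log_incomes)
```

After the transform, the outlier no longer dominates the scale of the variable, so a linear fit is far less distorted by it.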
Once data has been cleaned to the desired specification, it can be integrated with other datasets. Integration not only makes the data available to BI and other teams, it also adds value by combining datasets within a particular shared context. This context may be explicit across the newly integrated datasets, for example, place names or customer names, or it may come from references to integrated lists gathered through data mining. You might see data integration discussed alongside the terms ETL (extract, transform, load) and ELT (extract, load, transform). ETL is the process of taking data from sources, cleaning or modifying it as needed, and then saving it in a data repository, such as a data warehouse or data lake. Data integration is commonly implemented by BI software platforms.
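A common way to integrate two cleaned datasets is a key-based join. A minimal sketch with pandas, where the table and column names are purely illustrative:

```python
import pandas as pd

# Two cleaned datasets sharing a key (all names and values are invented)
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["London", "Leeds", "Bath"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [20.0, 35.0, 12.5]})

# Left join keeps every customer, even those with no orders yet
combined = customers.merge(orders, on="customer_id", how="left")
```

The choice of join matters: a left join surfaces customers with no matching orders as missing values, which is itself useful information for the next cleaning pass.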
Graphs, charts, and tables are vital at various stages of the process for anyone running analytics. For example, when performing exploratory data analysis (EDA), visualizations help us determine which models are appropriate and give us preliminary insights. After analysis and model building, key visualizations help convey meaning to end users and to technical and nontechnical stakeholders. Data visualizations can appear in dashboards, presentations, brochures, websites, and a variety of other places. To create good visualizations, we need to understand the characteristics of the audience, what engages them, and what information they need. A compelling visualization can help the target audience understand what the data means, make decisions accordingly, and communicate that meaning to customers or colleagues.
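A basic EDA visualization takes only a few lines. This sketch uses matplotlib with toy data for the height/calorie question from earlier; the values and filename are invented:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy data for illustration
heights_m = [1.60, 1.70, 1.75, 1.85]
calories = [1900, 2100, 2300, 2600]

fig, ax = plt.subplots()
ax.scatter(heights_m, calories)
ax.set_xlabel("Height (m)")
ax.set_ylabel("Daily caloric intake (kcal)")
ax.set_title("Height vs. caloric intake")
fig.savefig("height_vs_calories.png")
```

Even a plain scatter plot like this quickly reveals whether a linear model is a sensible next step or whether the relationship looks more complicated.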
Analysis is a very broad term that can describe simple processes like EDA as well as building complex models such as neural networks or implementing a k-means clustering algorithm. Whatever analysis you’re doing, it helps you shine a light on the data to see what’s going on. You can look for correlations or patterns that tell you something interesting or valuable. This could mean combining multiple datasets in the same analysis, for example, finding a relationship between what food people buy in a given community and how many people in that community have prescriptions for obesity-related illnesses. The ultimate goal of analysis is to tease value from a dataset to shape business strategy, make business decisions, and communicate with customers and suppliers.
We use automation to complete repetitive tasks more efficiently, freeing up our data scientists’ time. For example, we could schedule an automated task to bring in new data once a week from a particular source, apply a saved process to prepare it, and produce a dashboard of crucial features for our BI team. Another automated task could draw in important written content in a range of languages and standardize it through a natural language processing system to produce a report in the desired language.
Ok, so we all know data preparation is important, but there are some challenges inherent in this complicated process. First, it can be highly specialized, and the field of data science is dynamic and rapidly changing. Therefore, it can be hard to keep up and invest sufficiently to stay ahead of industry developments. Data scientists must have strong programming and statistical skills and an ability to communicate their highly technical specialty with various stakeholders.
Data preparation has a reputation for being repetitive or mundane, so it can be demoralizing. The field of data preparation is still developing rapidly, and it’s of direct and central importance to any medium or large business. It’s typical for data scientists to find themselves overwhelmed by demand within their organizations, and short-term projects can be prioritized over longer-term automation projects or more experimental projects that could save time and money once they’re systematized.
Should data preparation be its own team or part of the IT team? Should it be in-house or outsourced to specialists? As the data preparation industry grows, companies and public sector organizations are grappling with how best to manage resources for scalability. Arguably, data preparation is separate enough from IT, in that its focus is specifically on providing data and ensuring its quality. A centralized data preparation team could manage a data library, responding to requests from across the organization to source and prepare data. However, as data services can seem mysterious, or like a black box, precise project management is essential, with transparency and clear communication about timescales, deliverables, progress, and expectations.
Further data preparation challenges relate to the authority or bias that different data sources may have. People tend to trust data from within their organization more, a social phenomenon observed intensely in the public sector where public, academic, or privately funded data may receive different statuses. This can make combining data sources challenging or restricted. Some may not be open to engaging with datasets from specific sources, particularly where they seem ideologically misaligned. Some people will blindly disregard data from sources they do not trust, or blindly trust data from sources without thinking critically about each case. Either instance can be damaging to the quality of the data analysis.
Sometimes there is data publicly available, but the origins of the data aren’t clear; the challenge then is determining the source of the data to properly assess credibility. This can be a huge source of frustration for colleagues trying to find data to support a business case or other persuasive or analytical function. In addition, returning to update original data with references or additional information can be time-consuming, as links can change, web pages can go down, or data can become unavailable after a certain timeframe.
As complex as data preparation can seem, there are a number of data science solutions, such as Einblick, geared to make the upfront work easier, more intuitive, and more streamlined for data scientists. For example, with Einblick, you can work in an expansive visual environment, utilizing operators that allow you to reuse common code snippets in the data preparation process. By streamlining tedious tasks like data cleaning, data scientists can spend less time looking up specific code syntax, and more time extracting insights.
Data preparation is a rapidly growing specialty, and IT, BI, and other core teams are increasingly integrating data scientists as part of a strategic and proactive approach to data management. As a result, data libraries, NoSQL databases, data lakes and more are already becoming formalized within the infrastructure of organizations of all sizes, along with the processes surrounding their management and function, including regular automated data updates and preprocessing.
From these central data repositories and spaces, data engineers, data scientists, and analysts can use self-service interfaces to access data and make it available almost immediately to their particular team, department, or end users.
As an emphasis on using data becomes more common, there is, and will continue to be, increasing demand from end users to access data, design visualizations, and build models or dashboards themselves. The goal is cross-organizational access to an appropriate level of data, without the ability to affect the data store, so that we maintain confidence that the clean datasets in our in-house library are up-to-date and available to anyone at any time.
Automation is a rapidly growing area and presents a massive opportunity for analytics software vendors, amongst others, to offer incredibly efficient functionality. However, the key to this development is the ability to easily tailor or create automations to suit the specific needs of an organization or individuals within it. Beyond this, AI is cutting edge, with algorithms such as self-learning neural networks opening up new possibilities. However, these AI and deep learning solutions may become challenging to monitor and understand in their entirety because of the incredible speed at which they develop and their potential to impact such a broad range of industries and activities.
As legislation struggles to keep up with technology, business, and data science, privacy issues concern many. There is a high level of mistrust amongst internet users. Essentially we all exist as linked trails of data on the internet. While privacy policies aim to protect PII (personally identifiable information), the myriad of logins and browser records inevitably create opportunities for data breaches or other security concerns.
It is best practice to only make de-identified data available, but sometimes PII can still be unearthed with enough time and resources. Unfortunately, this creates the risk that if the data is accessed somehow, personal information can be used or sold illegally or with malintent.
It is particularly important to maintain security as data is moved around and as users are invited to sign up for service after service. Regular public data breaches worldwide have highlighted this, with state-sponsored hacking showing us just how valuable data is and how security violations can impact individuals, companies, and even nations.
Due to this sensitivity, regulatory frameworks have been enacted in some cases, but these can feel burdensome for data practitioners. For example, some data may be restricted to the point of losing its value. Where there are two complementary forms of data, one tracking broader trends and the other supplying specific detail, the specific detail may be restricted to ensure that individuals do not have their privacy infringed. The effect is that end users may not adequately understand the available data’s limitations, making analysis more difficult and inconsistent with the context of the data. Sometimes, the expense of separating out private or sensitive data renders datasets commercially unviable, and their potential utility is completely lost.
The first stage is to engage a data scientist, whether a consultant or a full-time team member, to assess the organizational need for data preparation and the costs associated with those processes, considering both outsourced and in-house data management. This assessment should lay out possible pathways to proactively engage data management over the next 5 to 10 years.
Ideally, the data scientist would be able to identify what you need from your team of data scientists and analysts and whether they will be integrated with your BI team, IT team, or another department. Then you have to determine what specific deliverables are meaningful, and how those will be monitored.
An effective plan should consider what skillsets you need to bring in to create a sustainable, scalable, and adequate data preparation infrastructure, what skills you already have in-house, and how you can use the former to develop the latter, and to develop your organizational intelligence regarding data.