Intuition doesn’t always work when it comes to data analysis because no dataset is 100% error-free. It’s probably fair to say that most websites are complicated enough and have enough pathways to a business goal that your team is not 100% confident in every part of an analytics implementation.
The term “data profiling” has actually been around since the dawn of information technology. Back in the 1970s, it was first understood to be the analysis of existing data to uncover relationships that can be used to create predictive models and make decisions regarding specific definitions or classifications.
So how can we summarize data like customer demographics, purchases, or retention with a high degree of confidence? You can apply techniques that anticipate these problems, help you identify them, and show you how to fix the data.
What is data profiling?
Data profiling is the process of examining data from various sources and collecting statistics or summaries about the data. This process can help you check if you have the right kind of data for your problem, as well as ensure data quality. There are lots of analytical tools built to extract, describe, and recognize patterns in the data based on different important characteristics. The end result of this effort is a data profile. A practical application of data profiling usually looks like a dashboard or report calling out important variables to the business.
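To make the idea of "collecting statistics or summaries" concrete, here is a minimal sketch of a column-level profile using only the Python standard library. The function names and the sample rows are hypothetical and purely illustrative. It is not how any particular profiling tool works, just one simple way to compute the kind of numbers a data profile usually contains.

```python
import statistics

def profile_column(values):
    """Summarize one column; None and empty strings count as missing."""
    present = [v for v in values if v is not None and v != ""]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    # Only report numeric statistics when every present value is a number.
    if numeric and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.mean(numeric)
        # Sample standard deviation needs at least two values.
        summary["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
    return summary

def profile_table(rows):
    """Profile every column of a list-of-dicts table."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}

# Hypothetical sample data, for illustration only.
customers = [
    {"age": 34, "plan": "pro"},
    {"age": 51, "plan": "free"},
    {"age": None, "plan": "free"},
    {"age": 28, "plan": ""},
]

profile = profile_table(customers)
for column, stats in profile.items():
    print(column, stats)
```

In this sketch, numeric columns get minimum, maximum, mean, and standard deviation, while text columns only get counts. For real datasets you would likely reach for a dataframe library or a dedicated profiler instead of hand-rolled loops.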
A summary report of a company’s data, correctly implemented, can be the difference between company growth and stagnation. Data profiling can be used to identify areas where the business may be underperforming, problems in processes, and capacity in areas where there is high demand.
Data science can be an intimidating field because of how technical it is. One easy (and free) way to get started is to try out the data profiler at Einblick. At Einblick, our goal is to remove barriers to the data analysis process by making tedious tasks as easy as possible so data scientists and analysts can focus their time and energy on extracting insights. With Einblick you can use our expansive canvas to explore your data easily with just a few clicks.
Our data profiler is one of our core operators. With just a click and drag, you can immediately see the distribution of each of the variables in your dataset, as well as key statistics like minimum, maximum, missing values, standard deviation, number of unique values, and more. Once in our canvas, you can use our other operators and tools to make the rest of your data cleaning, exploratory data analysis (EDA), and machine learning model building process fast and collaborative.
Before jumping into best practices and common challenges, it helps to put data profiling in the context of the overall data process and clarify what effective profiling depends on. (Or feel free to jump ahead.)
Data profiling vs. data wrangling
Data profiling is not the same as data wrangling. As mentioned above, data profiling is the process of summarizing relevant data and ensuring data quality. You can think of data profiling as a crucial preventative step to ensure your data cleaning, predictive and prescriptive analysis, machine learning models, and more are as reliable as possible. Data wrangling, by contrast, is the process of transforming raw data into a more usable format.
Whoever the lucky data profiler is, they will probably assume, rightly or wrongly, that they have all the data they need to speak to the problem they are trying to solve or KPI they are trying to measure.
Ideally, the data wrangler and the data profiler are the same person, so that they neither bring in information that is irrelevant to the business problem or KPI being measured nor remove information that is critical to the process. The data wrangler should also be cognizant of all possible business reasons for why, how, and when data might be changed. Data profiling can be a part of the data wrangling process. As data wrangling occurs, you need to check that the transformations from raw data to formatted data make sense based on the data profile.
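One way to check that a transformation "makes sense based on the data profile" is to profile a column before and after the wrangling step and look at what changed. The sketch below assumes a hypothetical column of country codes and a hypothetical cleanup step, and the helper names are made up for illustration.

```python
def quick_profile(values):
    """Row count, missing count, and distinct count for one column."""
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }

def compare_profiles(before, after):
    """Return the metrics whose values changed, as (before, after) pairs."""
    return {
        metric: (before[metric], after[metric])
        for metric in before
        if before[metric] != after[metric]
    }

# Hypothetical raw column: country codes typed inconsistently.
raw = ["US", "us", " DE", "de", None, "FR"]

# Hypothetical wrangling step: trim whitespace and upper-case each code.
cleaned = [v.strip().upper() if v is not None else None for v in raw]

changes = compare_profiles(quick_profile(raw), quick_profile(cleaned))
print(changes)  # {'unique': (5, 3)}
```

Here only the distinct count changes, which is what a normalization step should do. If the row count or the number of missing values had moved as well, that would be a signal to look more closely at the transformation.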
Data profiling vs. data cleaning
Data cleaning, like data wrangling, is closely related to data profiling but is not the same thing. It is the process of identifying and fixing potentially problematic information or formatting that could cause the data to be wrong or unusable. If it’s helpful, you can think of data profiling as a prerequisite of data cleaning.
Data profiling vs. data integration
Data integration is the process of bringing disparate pieces of data together in a coherent database and ensuring that they are compatible with the data the business already uses. Data integration is not the same as data profiling, but it can make data profiling a lot easier: software can automate rules that are applied consistently every time important data is pulled and displayed.
9 common data profiling mistakes to avoid
Data profiling, which focuses on summarizing data, is not a trivial exercise and can require some effort and critical thinking. Some common mistakes that you need to avoid when developing a data profiling solution are:
- Avoid reinventing the wheel when it comes to processing and organizing data in a meaningful way. Instead, look for existing services that already understand your requirements and can provide the right solution.
- Do not start building your data profiling solution before you have a clear idea of what you need it to do and why. If you don’t go into the data with a well-defined goal, it’s easy to experience “analysis paralysis” and fail to see the forest for the trees.
- Narrow the scope of your question to something manageable. It’s better to focus on one thing in your data rather than on all the things that could be important.
- Consider what data sources you need to pull together before you go out to find solutions. For example, if you have customer demographic data in one place and purchase data somewhere else, make sure you know how to join those together.
- Build a clear and consistent process for collecting and organizing data before you begin digging into it and asking it questions. Analytics software is only as smart as the process it’s integrated into.
- Document your process. In the short term, it is going to give everyone confidence that they are comparing apples to apples for different time periods. It will also shorten the onboarding process for the next person to do the same job when you’ve moved on to bigger and better things.
- Don’t get too hung up on the technical details; focus on the data at hand. Thinking about the questions you want to ask the data, and how accurate the answers are, is a big enough job on its own. Learn to lean on your IT or development team when it comes to getting data out of trickier data sources, especially as you scale up your data engineering and data science teams.
- Set a regular schedule to keep track of which data profiles you need to update and when. This can be as simple as marking dates on your calendar, but if you are using a project management tool like Trello, use its built-in due dates to help you and your colleagues keep track.
- Lastly, don’t be afraid to ask for help. Ask your colleagues for feedback on your process and be open to making changes based on their suggestions.
Data profiling red flags to watch out for
No matter how careful you are, making mistakes is inevitable. Especially when you are dealing with data sources too unwieldy to visually inspect, you’ll need to keep an eye out for signs that your data profiling efforts are going sideways. When you start out data profiling, you might rely on more manual processes, but as you and your company progress, automated data profiling solutions are likely more scalable and sustainable for growth and accuracy. There are three main kinds of data profiling: structure discovery, content discovery, and relationship discovery. These can be a little bit abstract, so the following examples will help illustrate issues in your data that different data profiling techniques can help you uncover.
Incorrect data format: structure discovery
Data profiling can be your first line of defense: you need to understand your data sources and what they can give you. Just as human data entry is imperfect, so are automated processes. If you let a robot be your only guide through a massive dataset, things like sloppy whitelisting of data can go unnoticed. Even worse, an automated collection and aggregation process can easily report how “normal” your dataset looks while still treating things like duplicate emails as separate entries, since they do not fall under any clear pattern.
Structure discovery is one kind of data profiling that focuses on determining whether your data is consistent and formatted correctly. This kind of data profiling uses basic statistics to inform you about the validity of your data.
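As a rough illustration of structure discovery, the sketch below measures what share of a column matches an expected format. The column, the date pattern, and the function name are assumptions made for this example. Your own data will call for its own patterns.

```python
import re

def format_conformity(values, pattern):
    """Split values into those that fully match the pattern and those that do not."""
    regex = re.compile(pattern)
    valid, invalid = [], []
    for value in values:
        if isinstance(value, str) and regex.fullmatch(value):
            valid.append(value)
        else:
            invalid.append(value)
    rate = len(valid) / len(values) if values else 0.0
    return rate, invalid

# Hypothetical column that is supposed to hold ISO-style dates (YYYY-MM-DD).
signup_dates = ["2023-01-15", "2023-02-30", "15/03/2023", "2023-04-01", ""]

# This pattern checks shape only, so an impossible date such as 2023-02-30 still passes.
rate, invalid = format_conformity(signup_dates, r"\d{4}-\d{2}-\d{2}")
print(f"{rate:.0%} conform; non-conforming values: {invalid}")
```

A conformity rate like this is one of the "basic statistics" that structure discovery leans on. Note the limitation called out in the comment: a value can have the right shape and still be wrong, which is where content discovery comes in.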
Missing and duplicate entries: content discovery
If the same data pops up across multiple entries, that can indicate an erroneous or duplicate entry somewhere. There are several ways to identify this kind of problem, but the simplest is to check the data manually in a spreadsheet tool like Excel or Google Sheets. As you become more advanced, you can use programming languages like Python to handle missing data and duplicate entries. If you find several discrepancies between your manual inspection and the process your software is running, it’s probably best to ask for help translating your process into code.
Content discovery is another kind of data profiling. It focuses on the quality of the data based on the kind of data you are using. For example, if you are collecting email addresses and some of them have an unusual domain name, that could be an indication of poor data quality.
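Here is a small sketch of what these checks might look like in Python, assuming a hypothetical list of email addresses. It counts missing entries, finds duplicates after trimming and lower-casing, and tallies domains so that unusual ones stand out. The function names and sample values are illustrative only.

```python
from collections import Counter

def normalize_email(raw):
    """Trim and lower-case an address; return None for missing values."""
    if raw is None or not raw.strip():
        return None
    return raw.strip().lower()

def email_content_report(emails):
    """Count missing entries, duplicates after normalization, and domain frequencies."""
    normalized = [normalize_email(e) for e in emails]
    present = [e for e in normalized if e is not None]
    counts = Counter(present)
    duplicates = {e: n for e, n in counts.items() if n > 1}
    # rpartition keeps everything after the last "@" as the domain.
    domains = Counter(e.rpartition("@")[2] for e in present)
    return {
        "missing": len(normalized) - len(present),
        "duplicates": duplicates,
        "domains": domains,
    }

# Hypothetical sample, for illustration only.
emails = [
    "ana@example.com",
    "Ana@Example.com ",
    "bo@example.com",
    None,
    "cy@exmaple.com",   # likely typo in the domain
    "",
]

report = email_content_report(emails)
print(report["missing"])
print(report["duplicates"])
# Domains seen only once are candidates for a manual check, not proof of bad data.
rare_domains = [d for d, n in report["domains"].items() if n == 1]
print(rare_domains)
```

Normalizing before counting matters: without it, "ana@example.com" and "Ana@Example.com " would be treated as two different customers, which is exactly the kind of duplicate that slips past an automated process.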
Data silos: relationship discovery
Data silos exist everywhere from different departments within a business to different countries. Data silos occur when data becomes inaccessible to different departments, data engineers, data scientists, business analysts, or other members of the company. The silos then can lower overall data quality as there may be inconsistencies across the data silos.
Relationship discovery is a kind of data profiling that identifies the connections between different datasets.
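A simple starting point for relationship discovery is checking whether the key that is supposed to link two datasets actually lines up. The sketch below assumes two hypothetical tables, customers and purchases, joined on a hypothetical `customer_id` column.

```python
def key_overlap(parent_rows, child_rows, key):
    """Compare the key values found in two list-of-dicts tables."""
    parent_keys = {row[key] for row in parent_rows}
    child_keys = {row[key] for row in child_rows}
    return {
        "matched": sorted(parent_keys & child_keys),
        "parent_only": sorted(parent_keys - child_keys),  # e.g. customers with no purchases
        "child_only": sorted(child_keys - parent_keys),   # e.g. purchases with no known customer
    }

# Hypothetical tables and column names, for illustration only.
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
    {"customer_id": 3, "region": "US"},
]
purchases = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 15.5},
    {"customer_id": 4, "amount": 99.0},
]

overlap = key_overlap(customers, purchases, "customer_id")
print(overlap)
```

Keys that appear on only one side are worth investigating. In a siloed organization, a purchase that points at an unknown customer often means the two departments are working from different versions of the customer list.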
Data segmentation is how you break up all of the data you have into more logical and manageable groups or segments for various business purposes, from marketing to operations. How you segment your data really depends on what you plan to do with it. Too broad a grouping can lead to missing information and to records that fall into several different categories. If, for example, you are trying to find relationships between personal attributes and purchases, you might want to calculate the correlation between several variables. Identifying customers who make the same kinds of transactions can then give you an idea of which segments they fall into. Data profiling can help you determine if the different data segments you’ve come up with are high quality.
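To illustrate calculating the correlation between several variables, here is a minimal sketch that computes the Pearson correlation coefficient for every pair of columns. The customer attributes and their values are hypothetical, and Pearson correlation is just one common choice. It only captures linear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical customer attributes, for illustration only.
columns = {
    "age": [22, 35, 41, 58, 63],
    "visits_per_month": [9, 7, 6, 3, 2],
    "monthly_spend": [20.0, 45.0, 50.0, 80.0, 95.0],
}

# Print the coefficient for each unordered pair of columns.
names = list(columns)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {pearson(columns[a], columns[b]):+.2f}")
```

Values near +1 or -1 suggest a strong linear relationship and values near 0 suggest little or none. Either way, treat the result as a prompt for further questions about your segments rather than as a conclusion.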
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.