At its core, data science is about extracting insights from data. This data can come from a variety of sources, including social media, sensors, transactions, and more. Data science is a combination of art and science, and the best data scientists are those who are able to think creatively about data.
What is data exploration?
Data exploration is the process of analyzing datasets to find patterns and relationships, and is sometimes more formally referred to as exploratory data analysis (EDA). Exploring data helps you develop hypotheses about how different variables are related, identify which variables are most important in predicting a particular outcome, and build intuition about how the data behaves.
You can think of data exploration as a task of excavation; you might have some idea of what you hope to find, but you’ll likely find all sorts of interesting statistics, observations, and unexpected treasures along the way. Data exploration requires a sense of curiosity and desire to get to know your data better. This can take a variety of different forms from traditional statistics to visualizations. Even if you don’t use each statistic or visualization in your final model, everything that you learn about your data will improve the quality and completeness of your ultimate analysis.
As you’re exploring your data, you want to be able to move quickly as you generate questions and examine different ideas and trains of thought. As a result, you want a tool that lets you move freely and nimbly. At Einblick, our goal is to remove barriers for data scientists, and a key part of this is making data exploration and EDA as easy as possible. With our uniquely visual and collaborative canvas, users can use our chart cell to create histograms, scatter plots, bar plots, box plots, and other visualizations with just a drag-and-drop, so data scientists can spend less time on these repetitive tasks and more time tuning models and extracting insights.
In Einblick, you can also use our profiler cell to get quick summary statistics about each variable in your dataset and use our Python cell to create custom Python code and visualizations. With the shared kernel, you can save variables, create visualizations, and build models all in the same canvas. The 2-D space allows you to drag-and-drop and compare visualizations side-by-side very easily, as well as iterate on different versions of code snippets quickly. You can then feed in any datasets you create into other operators like our key driver analysis and AutoML operators to quickly get results via our progressive computation engine.
If you’re new to data exploration or would like a tutorial, check out our in-depth blog post about EDA in Python.
Data exploration techniques
Let’s go over some common techniques and topics to help you understand your data and communicate insights. These techniques can help you develop a more intuitive understanding of your data, which in turn allows a more effective explanation of the story the data is telling. We can break data exploration into two main categories that have some overlap: visualizations and summary statistics. Let’s start with visualizations.
Visual data exploration
The first and perhaps easiest way to explore data is visual data exploration. You wander through the data looking for patterns, hoping that your intuition and previous experience with other data exploration techniques and data analytics approaches will yield positive results. This can mean looking at tables as you sort or filter the data in different ways. It can also mean creating simple charts without manipulating the data, for example, box plots, histograms, or scatterplots to show the distribution of continuous variables. You can also create bar plots to summarize categorical data. When creating visualizations, remember to always include labels and reasonable axes so that your visualizations can be interpreted accurately and easily by other stakeholders, and by yourself if you ever need to revisit your work at a later time.
High definition gradients (HDG)
One way to visually explore your data is by using high definition gradients (HDGs) in your plots. With a high definition gradient, a color gradient can represent the distribution of a variable or set of variables. This allows for quick identification of trends and patterns in data.
High definition gradients are especially useful for visualizing large datasets, as it can be difficult to spot patterns in a table when there are many variables or data points present. You can also use charts to easily identify outliers in data, as these will be represented by points with high gradient values.
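As a sketch of the idea, here is how a color gradient can encode a third variable in a scatterplot using matplotlib; the data and the density variable are hypothetical, and the Agg backend is used so the script also runs in headless environments:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display window needed)
import matplotlib.pyplot as plt

# Hypothetical data: color each point by a third variable to form a gradient
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 6, 7]
density = [0.1, 0.3, 0.9, 0.5, 0.7, 0.2]

fig, ax = plt.subplots()
# c= maps each point to a color along the "viridis" gradient
scatter = ax.scatter(x, y, c=density, cmap="viridis")
fig.colorbar(scatter, label="density")  # legend for the gradient
fig.savefig("gradient_scatter.png")
```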
There are many popular visualization packages depending on your programming language or tool of choice. These packages allow you to tailor your visualizations as necessary, and you can control a variety of details in the plots you create, from axes and chart labels to the shape of the data points to the color(s) of the lines and points. A popular visualization package in the open-source language R is called ggplot2. There are several popular visualization packages in Python, another open-source programming language, including matplotlib, seaborn, and plotly, which we cover below.
Data notebook exploration
Doing data exploration in a notebook is another down-and-dirty approach to data analysis. It’s essentially visual data exploration combined with code and explanatory text all in one place. Python notebooks are a long-established tool that has become a staple of the data science workflow.
However, there are a few things to keep in mind when doing data exploration in a notebook:
- Keep it organized: Notebooks can quickly become a mess of code, output, and text. As you’re working, take the time to organize your thoughts and keep your notebook as clean and tidy as possible.
- Document as you go: When working in a notebook, it is important to document your work as you’re doing it. This includes adding comments to your code, adding explanatory text, and creating visualizations. It can be hard to revisit projects or share your work without proper documentation.
- Don’t forget about reproducibility: When you’re sharing your notebook with others, make sure that all of the code is reproducible and clear. This means that you should avoid using hard-coded values, use random seeds, and save your data so that others can easily recreate your results.
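For instance, the random-seed advice above can be illustrated with Python’s standard library; the seed value 42 is arbitrary:

```python
import random

# Fixing the seed makes "random" results repeatable across notebook runs
random.seed(42)
sample_a = [random.randint(0, 100) for _ in range(5)]

random.seed(42)  # reset to the same seed...
sample_b = [random.randint(0, 100) for _ in range(5)]

print(sample_a == sample_b)  # ...and you get identical draws
```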
Although Python notebooks have been inherited by modern-day data scientists, there are a number of challenges when working in notebooks that are exacerbated by current workplace environments. Notebooks are not an agile tool: sometimes the kernel crashes, and you have to scroll up and down to compare visualizations even if the changes between them are minimal. Visualizations have to be created using code, which can be alienating for less technical team members, or those still skilling up in data science techniques. In addition, notebooks were not built to be a collaborative tool, and it is difficult to share work, even though the data science process requires inputs and outputs from multiple groups of stakeholders. With Einblick, you can create multiple visualizations quickly and share your work live, a necessity within an increasingly remote-first work culture. You can import any existing Python notebook right into Einblick and start benefiting from our visual operators and 2-D canvas immediately.
Measures of central tendency
Moving on to numbers rather than visuals, we can calculate summary statistics that help us get a better sense of the data. These are all statistics that can help you understand your data better without doing any sort of manipulation of the data. These numbers can be visualized in some of the ways we’ve discussed so far, but it is helpful to include clear statistics to contextualize visualizations. Measures of central tendency are one set of summary statistics. There are three main measures of central tendency: mean, median, and mode. Measures of central tendency can also indicate if there are any outliers or anomalies in your data that you need to investigate further. The mean, also known as the “average,” is the sum of the observed values of a continuous variable divided by the number of observations. The mean is sensitive to outliers. The median is the middle value when all the observed values are ordered. The median is not sensitive to outliers. The mode is the most commonly occurring value. So if your dataset is [1, 3, 3, 3, 7, 15, 15, 19, 35], the mean is approximately 11.2, the median is 7, and the mode is 3.
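These three measures can be computed directly with Python’s built-in statistics module, using the example dataset above:

```python
import statistics

# The example dataset from above
data = [1, 3, 3, 3, 7, 15, 15, 19, 35]

mean = statistics.mean(data)      # sum of values divided by the count
median = statistics.median(data)  # middle value when sorted
mode = statistics.mode(data)      # most frequently occurring value

print(round(mean, 1), median, mode)
```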
Variance and standard deviation
Variance, in the field of statistics, is the dispersion or spread of a dataset: specifically, how far the data is from the mean. Variance and standard deviation, which is calculated as the square root of the variance, are two common summary statistics you can report about variables in your dataset. When performing exploratory data analysis (EDA), you will likely need to report summary statistics, including variance and standard deviation. Beyond just reporting the numbers, box plots are a popular way to visualize the variance of variables. If you have a categorical variable in your data, you can also create a series of box plots to compare the variance of a continuous variable across different categories, for example, comparing the variance of income across industries.
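As a quick sketch using Python’s statistics module and the same hypothetical dataset as above:

```python
import statistics

data = [1, 3, 3, 3, 7, 15, 15, 19, 35]

# Sample variance: average squared distance from the mean (n - 1 denominator)
variance = statistics.variance(data)
# Standard deviation is the square root of the variance
std_dev = statistics.stdev(data)

print(variance, std_dev)
```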
Data exploration Python libraries
There are many different Python libraries with built-in capabilities to aid in your data exploration. Every library has its relative strengths and weaknesses, depending on the kind of data and analysis you plan on doing. For example, one library may be efficient at computing summary statistics, another may excel at creating visualizations, and another might be useful for handling special kinds of data like text or geographical data.
pandas
The pandas library is a popular Python library for manipulating and examining data in the form of a DataFrame, a data structure that represents data as tables. In pandas, commonly abbreviated using the alias pd, you can quickly calculate summary statistics and inspect your data using functions like describe(), head(), and more. Beyond the scope of data exploration, you can also use the pandas library to manipulate and clean your data by removing duplicate data, dropping missing data, replacing values, and renaming columns.
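A minimal sketch of these pandas functions, using a small hypothetical customer table:

```python
import pandas as pd

# Hypothetical customer data for illustration
df = pd.DataFrame({
    "industry": ["tech", "retail", "tech", "tech", "retail"],
    "income": [90000, 45000, 90000, 72000, 51000],
})

print(df.head())      # first rows of the DataFrame
print(df.describe())  # summary statistics for the numeric columns

# Basic cleaning: drop exact duplicate rows and rename a column
df = df.drop_duplicates().rename(columns={"income": "annual_income"})
```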
Matplotlib and seaborn
The matplotlib.pyplot library, usually imported under the alias plt, is a basic plotting library. With matplotlib, you can create histograms, bar plots, box plots, scatterplots, and many other fundamental visualizations. You can customize the colors based on category or create a color gradient as mentioned earlier. You can change axis labels and chart titles, as well as the size and shape of your data points.
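For example, a labeled histogram in matplotlib might look like this; the values are hypothetical, and the Agg backend is used so the script also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display window needed)
import matplotlib.pyplot as plt

# Hypothetical continuous variable for illustration
values = [2, 3, 3, 5, 7, 8, 8, 8, 10, 12]

fig, ax = plt.subplots()
ax.hist(values, bins=5, color="steelblue")
ax.set_xlabel("Value")      # always label your axes
ax.set_ylabel("Frequency")
ax.set_title("Distribution of values")
fig.savefig("histogram.png")
```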
Matplotlib is very powerful but not the most visually appealing, so some libraries have been built on top of matplotlib, such as seaborn, another popular library for creating data visualizations.
Plotly
Another popular graphing library in Python is plotly, which specializes in interactive visualizations. For example, if you create a basic scatterplot in plotly, you can hover over each data point and customize the data that appears in the tooltip as you hover.
NLTK (Natural Language Toolkit)
NLTK is a powerful library that provides a wide range of functions for analyzing text data, including tokenization, part-of-speech tagging, and sentiment analysis. There are other libraries available as well that specialize in text analysis. In order to create solid analysis, picking the right tools and libraries is critical, so make sure you do your research and consider what’s best for your specific project.
Data exploration and modeling data
Data exploration is a crucial precursor for data scientists, data analysts, business analysts, and anyone else who plans on doing further analysis on the data. Typically data exploration happens before any models are built or formal predictive analytics can occur. Sometimes the data exploration or exploratory data analysis (EDA) steps will need to be revisited after models are built. This may happen if a model produces a surprising result, or if you want to apply the model to a different subset of the data. There are also certain predictive models, like linear regression, that require certain relationships to exist within the data. If those relationships do not exist, then the model may return results, but those results would be misleading, inaccurate, and ultimately could lead to poor business decisions. Thoroughly understanding the context and makeup of your data through EDA is therefore critical before building any models.
There are many reasons why modeling the data to make predictions or recommendations is important. One reason is that it can help you better understand the data and how it relates to other variables. Modeling the data can also help you make more accurate predictions or recommendations to stakeholders, customers, or clients, and it can reveal patterns and trends that you would not be able to see if you were just creating simple bar charts or histograms. Below is a brief summary of some common techniques associated with predictive modeling and machine learning. Remember that not all machine learning is predictive in nature.
Predictive modeling is an umbrella term that encompasses many different supervised techniques that use observed or existing data to make predictions about unseen data. For example, you might want to forecast or predict revenue changes throughout the year, or to predict customer behavior: will a customer remain active, or will they churn? By creating models with your data, you can better anticipate future events and customer behavior to mitigate or capitalize on circumstances.
There are two main buckets of predictive modeling, and they are centered around the two main kinds of data: continuous and categorical variables. If you are trying to predict a continuous variable, such as revenue, you could use some kind of linear regression. If you are trying to predict a categorical outcome, say whether a customer churns or not, you could use logistic regression or another kind of classification model. Classification models are a range of techniques, such as logistic regression and naive Bayes, that help you predict what group or category each data point will fall into.
Regression analysis
Broadly speaking, regression analysis allows you to quantify and estimate the relationship between variables. There are two main kinds of regression, linear regression and logistic regression, and each requires you to have some set of independent variables, or X variables, and one dependent variable, or Y variable. In the realm of predictive modeling, the Y variable is the continuous variable you are estimating or the categorical label you are predicting based on the set of X variables. Regression has its roots in traditional statistics, but is still a widely used and powerful tool in the field of data science. Linear regression can help you predict monthly revenue or number of customers, while logistic regression can help you predict mortality, customer churn, or subscription tier based on usage.
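As a sketch using scikit-learn (an assumption; any regression library would do), with hypothetical usage and churn data:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: X is weekly usage minutes, y_cont is monthly revenue,
# and y_cat is whether the customer churned (1) or stayed (0)
X = [[5], [10], [20], [40], [60], [80]]
y_cont = [12, 20, 40, 78, 122, 160]
y_cat = [1, 1, 1, 0, 0, 0]

linreg = LinearRegression().fit(X, y_cont)   # continuous outcome
logreg = LogisticRegression().fit(X, y_cat)  # categorical outcome

print(linreg.predict([[30]]))  # estimated revenue at 30 minutes of usage
print(logreg.predict([[30]]))  # predicted churn label at 30 minutes
```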
Association rule mining
Another example of data science modeling is association rule mining, also called association rule learning. Association rule mining is the process of finding relationships between variables in a dataset. For example, if you work for an ecommerce company, you might be interested in understanding the buying patterns of your customers. Association rule mining helps you determine what kinds of additional products customers buy if products A, B, and C are already in their shopping cart. This kind of information can help you target advertisements and promotional bundles.
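The core quantities behind association rules, support and confidence, can be sketched in a few lines of plain Python; the shopping carts below are hypothetical:

```python
# Hypothetical shopping carts for illustration
carts = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

def support(itemset):
    """Fraction of carts containing every item in the itemset."""
    return sum(itemset <= cart for cart in carts) / len(carts)

def confidence(antecedent, consequent):
    """How often the consequent appears in carts that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {A} -> {B}: how often does B appear when A is already in the cart?
print(confidence({"A"}, {"B"}))
```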
Time series analysis
Time series analysis is a type of modeling used to analyze data that is collected over time. Time series data has to be handled differently because data collected at time A can bias or affect data collected at time B. There are so many applications of time series analysis that it is important to call out specifically. For example, if you are a beverage company, you might want to understand when and by how much sales spike throughout the year so that your company can manage inventory and production in a sustainable way. You can then use time series forecasting to predict when the spikes in sales will occur, and prescribe changes to the cadence of production as necessary.
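A simple moving average is one way to smooth a series so seasonal spikes stand out; the monthly sales figures below are hypothetical:

```python
# Hypothetical monthly beverage sales with a summer spike
sales = [100, 105, 110, 130, 180, 240, 260, 250, 190, 140, 115, 105]

def moving_average(series, window=3):
    """Smooth out month-to-month noise to make seasonal spikes easier to see."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

smoothed = moving_average(sales)
peak_window = smoothed.index(max(smoothed))  # where the seasonal spike sits
print(peak_window)
```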
Decision trees
Decision trees are a type of model that predicts the value of a target variable through a series of decision nodes, each of which tests whether a particular criterion is met; the terminal “leaf nodes” hold the final predictions. You can think of a decision tree as a flowchart. For example, let’s say you are trying to predict customer churn again. There are two final outcomes to consider: the customer returns, or the customer churns. One decision node could be minutes of product usage per week. If the customer uses the product for less than 10 minutes a week, the decision tree branches off in one direction of the “flowchart.” If the customer uses the product for 10 minutes or more a week, the tree branches off in the other direction. Each of these two branches could then split again based on other criteria related to customer behavior. Based on the collective criteria, you can predict whether a given customer is likely to churn or return to the product.
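The flowchart idea can be written out as nested conditionals; real decision trees (e.g., scikit-learn’s DecisionTreeClassifier) learn such thresholds from data, but the 10-minute cutoff and tenure threshold below are hypothetical:

```python
def predict_churn(minutes_per_week, months_subscribed):
    """A hand-written 'flowchart' mirroring the churn example above."""
    if minutes_per_week < 10:
        # Low-usage branch: tenure decides which leaf we land in
        return "churn" if months_subscribed < 6 else "return"
    # High-usage branch: heavy users tend to stay
    return "return"

print(predict_churn(5, 2))   # low usage, new customer
print(predict_churn(30, 1))  # heavy usage
```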
Clustering
Clustering is a technique of grouping data points together based on similar features or characteristics. There are a number of machine learning algorithms that can automate the creation of these data clusters, such as k-means clustering. It is important to note that unlike regression analysis and decision trees, clustering does not require labeled data. Instead, clustering helps you uncover patterns in your data that can help you create labels, such as customer segments.
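A minimal k-means sketch using scikit-learn (an assumption), with two obvious hypothetical groups of points; note that no labels are provided, only a cluster count:

```python
from sklearn.cluster import KMeans

# Two obvious hypothetical groups, e.g. light users vs. power users
points = [[1, 2], [2, 1], [1, 1], [9, 10], [10, 9], [10, 10]]

# k-means needs the number of clusters up front, but no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # cluster assignment for each point
```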
Ensemble methods
Ensemble methods combine multiple algorithms to create better predictions or a better fit. For example, a random forest is an ensemble method that averages the predictions of several decision trees in the hopes of getting a better prediction. Clustering ensembles combine different sets of clusters with the goal of finding a set of clusters that better matches the underlying data.
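A random forest sketch using scikit-learn (an assumption), on hypothetical churn data:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical churn data: feature is [minutes_per_week], label 1 = churn
X = [[2], [4], [6], [8], [40], [50], [60], [70]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# A random forest averages many decision trees, each trained on a
# random subset of the data, to produce a sturdier prediction
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict([[5], [55]]))  # low usage vs. heavy usage
```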
Data exploration requires true collaboration
As data scientists, we have a central mission to communicate meaningful data-driven insights. We collect and process data but eventually we have to tell a story about what our analysis uncovered. Data exploration is one of the preliminary steps necessary to tell a meaningful story. Even if you have an incredible predictive model, numbers alone cannot make a story. The observations you make in the early stages of your data science process can be critical to adding color and vibrancy to the final presentation or narrative you present to stakeholders. Domain expertise and knowledge are also a critical component of effective data exploration.
Data science, and data exploration by extension, is a collaborative process, and it requires expertise in many areas to be successful. For example, a data scientist might use programming to extract data or to write helper functions to automate part of the process. Data scientists can then use statistical methods like hypothesis testing and regression analyses to understand the relationship between different variables in a dataset. Finally, they would use their domain expertise to interpret the results and communicate them to stakeholders. But data science cannot and does not occur in a vacuum. Data science is a tool, and is part of an overall organization’s or business’s strategy and goals. As a result, collaboration is key to properly leveraging the benefits of data science and machine learning. From the data engineer who mines the data, transforms it, and loads it into a database, to the data analyst who builds a dashboard, to the data scientists who build machine learning models to predict customer behavior, to the managers and executives who are looking at company-wide objectives, everyone needs some access to the data science process and an understanding of its steps. A data science platform like Einblick, which centers on collaboration, will help your team move faster in this ever-changing data-driven landscape.
About Einblick
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.