Exploratory data analysis in Python: what, why, and how

Becca Weng - November 18th, 2022

Exploratory data analysis (EDA) is one of the most fun and rewarding parts of the data science process. EDA is also one of the most important as it can be highly generative of crucial insights that can act as a springboard for later machine learning models or pipelines. As a data scientist, you have to juggle several skill sets that are both technical and nontechnical. You need to code in Python, SQL, R, or your language of choice, understand the context of your data and the questions you’re trying to answer, and you need to communicate your findings throughout to various parties. You’re also managing expectations from stakeholders, project timelines, and business goals.

During EDA, you’re putting together the puzzle pieces of the data story without knowing the full picture yet. You can imagine yourself as a detective or an archaeologist trying to piece together a story. As exciting as EDA can be, it can also feel overwhelming and confusing.

In this post, we’ll be breaking down the what, why, and how of exploratory data analysis. We’ll start with a brief overview of what exploratory data analysis is, why it’s important, a high-level approach, and then we’ll dig into a specific example of EDA using a Goodreads dataset available on Kaggle. Throughout the example, we’ll cover some of the fundamental libraries for you to be successful as a data explorer. All of the code presented in the post is available in the embedded canvas. With that, let’s start digging into EDA.

What is exploratory data analysis?

Renowned American mathematician and statistician, John Tukey, famously promoted the concept of exploratory data analysis in opposition to what he called confirmatory data analysis in his 1977 book, Exploratory Data Analysis. For Tukey, the spirit of exploratory data analysis is letting data lead you to the hypotheses, rather than determining hypotheses and then running statistical tests to confirm them.

EDA falls into the category of descriptive analytics–you’re describing what is happening in the data without too much manipulation or complex models. The core principles of exploratory data analysis can be broken into the following:

  1. Answer the question: “what is going on here?” Note that we are asking questions without making assumptions about what they are.
  2. Visualize your data, don’t just look at numbers and tables.
  3. Be curious–ask questions and remove any assumptions you have about what your results SHOULD look like. If something looks off, look at it more closely.
  4. Be flexible–you won’t always know what the next step is, and you need to be ready to pivot and try a different approach or revisit something you’ve already done.
  5. EDA, just like the rest of data science, is iterative, so you might have to run the same sets of code or create the same visualizations with some tweaks a couple times.

Why is exploratory data analysis important?

A common understanding in the data science community is that the best model cannot save bad data. This is certainly true, and if you want to learn more about the earlier stages of the data science process, such as data profiling, data preparation, and data transformation, we have posts on those concepts too. You can think of exploratory data analysis as the connective tissue between data preparation and running statistical tests and building models.

Data science process

We can break the data science process down roughly into:

  1. Problem context: business, industry, company interests, goals, and projects
  2. Data gathering: whether you’re extracting or mining raw data or querying existing databases, your data needs to come from somewhere before you can start looking at it
  3. Data preparation: putting the data in the right format, augmenting existing datasets, profiling the data, etc.
  4. Exploratory data analysis: what this post is about!
  5. Data cleaning: type conversion, handling missing data, handling duplicate data, transforming data for a specific model, filtering and subsetting your data
  6. Statistical tests and machine learning models: testing your hypotheses, predictive and prescriptive analytics.
  7. Presenting results: reports, dashboards, presentations, and data-driven decisions given to stakeholders, answering any questions others have about your process and conclusions

As stated earlier, data science is iterative, so the steps of the process are a bit fluid. But EDA is what lets you really understand your data and generate interesting insights rooted in the data. It is also important to note that EDA and data cleaning are highly connected. As you explore your data, you get a sense of how you need to filter, subset, and transform your data in preparation for future analysis, based on the hypotheses you have generated. In the example we go through later in this post, you will see that we switch between tasks that seem more exploratory and tasks that seem more like data cleaning. This is to be expected, and part of the fun of data science. You never know what you’ll find until you start asking questions.

EDA in Einblick

At Einblick, we’re working to shorten time to insight for data scientists by operationalizing mundane and repetitive tasks, and by building an intuitive, innately collaborative data science canvas. Although the data science workflow has traditionally been done in Python notebooks, such as Jupyter or Google Colab, the linearity of notebooks is too restrictive, particularly for an expansive exercise like EDA. When performing EDA, you have to find and tug at different analytical threads. Your work should match your mental model, and as you do EDA, several interconnected branches of work are more natural than a single linear flow.

A few of the key features that make Einblick great for EDA:

  • Freedom with organization: expansive, highly visual canvas, and you can use bookmarks and data zones to stay organized, and click through your canvas like a slide deck
  • Flexibility of code: Python cell and SQL operators let you code as you would in a notebook, in a more freeing environment
  • Collaborative and approachable: Einblick is made for collaboration (check out live mode!). EDA particularly benefits from domain experts, who may be nontechnical; Einblick’s code-optional operators, like the Chart, give them the space to ask questions and understand the data better
  • Progressive computation engine: built on 6 years of research at MIT and Brown University, Einblick’s unique backend gives you results on terabytes of data in seconds
  • Reusable data science snippets: Einblick operators are code optional so you can save time as you profile, filter, and scale your data, extract text features, and build models using AutoML, among other processes

We offer different solutions for any size team, but you can get started for free. Let us know if you have any questions. We’re always happy to offer customer support.

Try Einblick for Free

Exploratory data analysis: a high-level approach

Now that we’ve covered at a high level what EDA is and why it’s important, we’ll provide an overview of how to think about exploratory data analysis. We’ll break up the approaches into three main dichotomies:

  1. Continuous vs. Categorical Data: What kind of data do you have? What kind of data are you interested in?
  2. Non-graphical vs. Graphical EDA: Are you using numbers or visualizations? What would help guide your next step?
  3. Univariate vs. Multivariate EDA: Are you analyzing one variable or the relationship between multiple variables?
Exploratory data analysis: continuous vs. categorical data

Continuous vs. Categorical Data

This dichotomy is less talked about online than the other two divisions of EDA, but arguably it is the core of any sort of data exploration–what kind of data do you have? Remember that the data holds the story, and we are just trying to unearth it. We cannot make the data tell a story that it does not already have within itself. Certain plots are only appropriate for continuous variables. Some plots only make sense for categorical data, and some plots need both continuous and categorical variables to give you the most information.

Understanding what type each variable should be stored as is critical, and converting your data to a different type can unlock a lot more analysis. For example, in many research surveys, it is common to ask questions on a Likert scale. The usual answer options are “Strongly Agree”, “Agree”, “Disagree”, and “Strongly Disagree”. Should the data on this survey question be coded as numerical: 1, 2, 3, and 4? Or should it be categorical, just using the answer options? Or should it be ordinal, preserving the order of the categories?
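As a quick sketch of the ordinal option, pandas lets you encode such responses as an ordered categorical, so sorting and comparisons respect the scale (the responses below are made up for illustration):

```python
import pandas as pd

# Hypothetical survey responses on a Likert scale
responses = pd.Series(["Agree", "Strongly Agree", "Disagree", "Agree"])

# Encode as an ordered categorical: comparisons and sorting follow the scale
likert = pd.Categorical(
    responses,
    categories=["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"],
    ordered=True,
)

# Integer codes fall out for free if you later need a numeric encoding
print(list(likert.codes))  # [2, 3, 1, 2]
```

Each representation supports different analyses, which is why this decision is worth making consciously during EDA.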

Non-graphical vs. Graphical EDA

A common division in EDA is whether you’re using a non-graphical or graphical approach. In general, you will need a combination to extract the most information, and to present your results in the most impactful way. When we say “non-graphical EDA,” we’re referring to descriptive statistics, for example, measures of central tendency (mean, median, mode), measures of dispersion (standard deviation, IQR, range, variance), minimum, maximum, and number of observations. Having the hard numbers helps us to understand the context of the data. For example, an outlier is defined by the context. We can’t know if 5 is very big or very small unless we have a grasp on the descriptive statistics.
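In pandas, most of these summary numbers are one-liners; here is a minimal sketch on a made-up series of ratings:

```python
import pandas as pd

# Made-up ratings, just to illustrate the summary statistics
s = pd.Series([3.5, 4.0, 4.0, 4.5, 5.0])

print(s.mean(), s.median(), s.mode()[0])    # central tendency
print(s.std(), s.var(), s.max() - s.min())  # dispersion
print(s.quantile(0.75) - s.quantile(0.25))  # IQR
print(s.min(), s.max(), s.count())          # extremes and sample size
```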

As powerful as numbers can be, graphs and plots can also give us a lot of information. Graphical EDA can range from scatter plots and box plots to histograms and heat maps. Especially with Python, there are a number of ways to adjust each visualization to make sure you’re extracting the right story from your data. Plots that you create during exploratory data analysis can always be polished for more formal reporting later as well.

Univariate vs. Multivariate EDA

Another common way to categorize EDA is univariate vs multivariate. Essentially, in the process of EDA, you will explore variables independently of each other, and in concert. How you tune the visualizations you make will depend on whether or not you’re currently looking into just one variable or many at the same time. Linear regression is one case in which variable relationships are particularly important. Linear regression is a statistical technique for predicting the value of one variable based on the value of another. You may have heard of linear regression in the context of “for every 1-unit change in X, we expect a ____-unit change in Y.” Multivariate graphical EDA can be particularly useful in determining whether the use of linear regression is appropriate or not.
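One quick graphical check, sketched here on synthetic data, is seaborn’s regplot, which overlays a fitted regression line on a scatter plot so you can eyeball whether a linear fit is plausible:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data with a roughly linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
demo = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 1, 200)})

# Scatter plot with a fitted regression line overlaid
ax = sns.regplot(data=demo, x="x", y="y")
plt.savefig("regplot_check.png")
```

If the points curve away from the line or fan out as x grows, a plain linear regression is probably not appropriate.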

Exploratory data analysis in Python

Data science is one of the fields where there truly is no substitute for hands-on experience. Real data, particularly interesting data, is messy. For the last part of the article, we’ll walk you through the process our team took when exploring a Goodreads dataset from Kaggle. You can follow along on the following canvas. We’ve created different, labeled data zones, added in explanatory annotations, as well as bookmarks. We broke down the canvas into the following data zones:

  • Red: initial EDA on Goodreads data, includes importing relevant packages
  • Yellow: initial EDA on continuous variables
  • Green: univariate EDA on continuous variables and data cleaning
  • Blue: univariate EDA on categorical variables and data cleaning
  • Purple: multivariate EDA

In the red zone, we started by importing a number of important packages. The list is not exhaustive, but is typically a good starting point for EDA.

import pandas as pd
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

# Sets the theme of the plots
sns.set_theme()
pandas, numpy, seaborn, matplotlib, Einblick logos

Python statistics packages

The two Python statistics packages we always start with are pandas and numpy.

  • pandas: a free, open source Python library. Using pandas, you can manipulate DataFrame objects, commonly seen as df, which are 2-D structures that behave similarly to a table, where you can store different kinds of data.
  • NumPy: a free, open source, Python library built for scientific computing. The NumPy library augments Python with the ndarray object, which is an n-dimensional array, as well as highly efficient methods to operate on arrays and matrices.

Additionally, we also import the datetime library, which is helpful for manipulating dates and times.

Python visualization libraries

The two Python visualization libraries we started with are matplotlib and seaborn. There is debate in the community about which visualization packages are best for data science. Our main focus in this post will be on seaborn, but you can read more about others in our post on data cleaning.

  • Matplotlib: a foundational Python visualization library, on which many other libraries and packages are based.
  • Seaborn: based on matplotlib, and creates graphs that can appear a bit more polished.

Goodreads: exploratory data analysis example

Goodreads is a popular social site for cataloging one’s reading habits, replete with reviews, ratings, and annual reading challenges. The Goodreads dataset from Kaggle was built using the now-retired Goodreads API.

Red: Initial exploratory data analysis

Since we’ve already imported our packages, we can start doing some initial EDA, still in the red data zone. For our very first few lines of code, we’ll stick with the pandas package, and just get some descriptive information about our dataset. We used four functions:

  • df.shape to get the dimensions of the dataset
  • df.head() to view the first n rows of the dataset (default is 5 rows)
EDA with Python, df.head() function

From df.head(), we can see that each observation is a book. This raises the question of the uniqueness of each row: for different editions of the same book, are they all grouped into one entry, or are there separate entries? How do we want to handle this?

  • df.info() to summarize data types, column names, and missing data
  • df.describe() to get some descriptive statistics about the continuous variables in the dataset
df.info() and df.describe() functions in Einblick canvas

We found the results of df.info() particularly interesting, as we had expected the num_pages and average_rating columns to be numeric rather than object. We already know a couple of places where we can dig further. There also doesn’t appear to be any missing data, since the non-null count matches the row count from df.shape. But missing data can sometimes be hidden, as missing values can be coded in different ways. We’ll have to verify later.
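One way to check for hidden missing values is to count sentinel strings directly; this sketch uses a toy frame and an assumed list of sentinels:

```python
import pandas as pd

# Toy frame standing in for the Goodreads data
toy = pd.DataFrame({
    "num_pages": ["312", "0", "NaN", "198"],
    "publisher": ["Scholastic", "", "Penguin", "N/A"],
})

# Sentinel values that often encode "missing" without being null
sentinels = ["", "0", "NaN", "N/A", "null"]
hidden = toy.isin(sentinels).sum()
print(hidden)
```

A column with a non-null count equal to the row count can still be full of these placeholders, so the sentinel list is worth tailoring to the dataset at hand.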

Unsurprisingly, there are some books that have 0 ratings and 0 text reviews, based on the minimum values of ratings_count and text_reviews_count.

Yellow: initial EDA on continuous variables

sns.pairplot(df)
Seaborn pairplot in Einblick

One of our favorite seaborn functions is sns.pairplot(df), which creates a grid of histograms showing the distribution of each continuous variable, as well as a scatter plot of each pair of continuous variables. It’s a great snapshot so you can get a sense of anything worth exploring further. Since we only have three continuous variables at the moment, we’ll likely re-run this plot again after we’ve cleaned up the data a bit more.

You’ll notice that the histograms of ratings_count and text_reviews_count are really hard to see. So let’s look into these two variables further using individual box plots and histograms.

# Create subplots via matplotlib
fig, axs = plt.subplots(2, 2, figsize = (10, 10))

# Create each histogram in one of the four axes
sns.histplot(df, x = "ratings_count", ax = axs[0,0])
sns.histplot(df[df["ratings_count"] < 100000], x = "ratings_count", ax = axs[0,1])
sns.histplot(df[df["ratings_count"] < 10000], x = "ratings_count", ax = axs[1,0])
sns.histplot(df[df["ratings_count"] < 3000], x = "ratings_count", ax = axs[1,1])

# Space out graphs and show plots
fig.tight_layout()
plt.show()

Box plots are a great way to understand outliers in continuous variables. Histograms are great to get a clearer picture of the distribution of continuous variables. Both of these plots will come up later when we look at multivariate EDA as well. For both variables, we created several iterations of each plot with different subsets of the data, and used the plt.subplots() function from matplotlib to create a compact layout. From these plots, we can definitely tell there is a very skewed distribution for both of these variables. We may need to transform the variables at a later time.
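The box-plot versions follow the same subsetting pattern; here is a sketch on synthetic right-skewed counts standing in for ratings_count:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic right-skewed counts standing in for ratings_count
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    {"ratings_count": rng.lognormal(mean=6, sigma=2, size=1000).astype(int)}
)

# Full data on the left, a zoomed-in subset on the right
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=demo, x="ratings_count", ax=axs[0])
sns.boxplot(data=demo[demo["ratings_count"] < 10000], x="ratings_count", ax=axs[1])
fig.tight_layout()
fig.savefig("boxplots.png")
```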

Green: univariate EDA on continuous variables into data cleaning

In the green zone, we start exploring some additional variables that are continuous, namely publication_date, average_rating, and num_pages, and also transform the ratings_count and text_reviews_count columns.

In [11]: type(df.publication_date[0])
Out[11]:
str

When we use the type() function, we determine that publication_date is stored as a string rather than a datetime object.

In  [12]: df["publication_datetime"] = [datetime.strptime(date, "%m/%d/%Y" ) for date in df["publication_date"]]
Out [12]:
ValueError                                Traceback (most recent call last)
...
ValueError: day is out of range for month

But when we try to convert the data using the datetime.strptime() function, we get an error saying that the "day is out of range for month." We’ll need to investigate further. To do this, we’ll handle the exception using the try and except keywords.

publication_datetime = []
count_none = 0

# Convert to datetime, and handle ValueError exception temporarily
for i in range(len(df["publication_date"])):
    try:
        publication_datetime.append(datetime.strptime(df["publication_date"][i], "%m/%d/%Y"))
    except ValueError:
        publication_datetime.append(pd.NaT)
        count_none += 1

print(count_none)

df["publication_datetime"] = publication_datetime

In the case of a ValueError, we’ll append pd.NaT or “not a time” to the list of converted values, and then add to our count of missing values. We can then use pd.isna() to subset the dataframe to check out what’s happening in the rows with missing publication dates.

Pandas df.isna() function in Einblick canvas

It seems that the data has been shifted across columns for about 6 books, so more than one column is affected.
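The subsetting step looks like this; the toy frame below stands in for the real data, with one failed conversion stored as NaT:

```python
import pandas as pd

# Toy frame: one row's conversion failed and was stored as NaT
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "publication_datetime": [
        pd.Timestamp("2001-06-26"), pd.NaT, pd.Timestamp("1999-09-01")
    ],
})

# Keep only the rows where the conversion failed, to inspect them
bad_rows = toy[toy["publication_datetime"].isna()]
print(bad_rows)
```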

Next, we converted the average_rating column using the same code.
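That pattern looks like the following sketch on a toy frame (the garbled value below is a hypothetical stand-in for one of the shifted rows):

```python
import numpy as np
import pandas as pd

# Toy frame; "Jr./Sam Ita" is a hypothetical stand-in for a shifted row
toy = pd.DataFrame({"average_rating": ["4.57", "3.90", "Jr./Sam Ita", "4.22"]})

avg_rating_float = []
count_bad = 0

# Same try/except pattern as the datetime conversion
for value in toy["average_rating"]:
    try:
        avg_rating_float.append(float(value))
    except ValueError:
        avg_rating_float.append(np.nan)  # flag unparseable entries as missing
        count_bad += 1

toy["avg_rating_float"] = avg_rating_float
print(count_bad)
```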

Next, we saved the logarithm of the ratings_count and text_reviews_count columns as two new columns. We used the NumPy library to get the logarithm. To handle the large number of books that have 0 ratings or 0 reviews, we used the warnings package to treat warnings as errors, so we can use the try and except statements again.

import warnings
warnings.filterwarnings("error") # Turns RuntimeWarning into an error

log_ratings_count = []
log_text_reviews_count = []

# Constant used in place of 0 so the logarithm is defined
log_c = 1

# Count missing data
count_ratings_none = 0
count_reviews_none = 0

for i in range(len(df["ratings_count"])):

    # Get log of ratings_count
    try:
        log_ratings_count.append(np.log(df["ratings_count"][i]))
    except RuntimeWarning:
        log_ratings_count.append(np.log(log_c))
        count_ratings_none += 1

    # Get log of text_reviews_count
    try:
        log_text_reviews_count.append(np.log(df["text_reviews_count"][i]))
    except RuntimeWarning:
        log_text_reviews_count.append(np.log(log_c))
        count_reviews_none += 1

print("Books without ratings: " + str(count_ratings_none))
print("Books without reviews: " + str(count_reviews_none))


# Add new columns
df["log_ratings_count"] = log_ratings_count
df["log_reviews_count"] = log_text_reviews_count

# Drop missing publication date and rating data
df = df[df["publication_datetime"].notna() & df["avg_rating_float"].notna()]

# Return warnings to default
warnings.filterwarnings("default")

# Convert num_pages to int
df["num_pages"] = pd.to_numeric(df["num_pages"])

There are different ways to handle log(0) in cases where you need to transform your data AND don’t want to drop any rows. The technique we’re using here is to substitute a constant, log_c, for 0 and take the logarithm of that (since log_c = 1, these books get a log value of 0). Then we drop all books with missing publication date and average rating data.

Books without ratings: 80
Books without reviews: 624

Based on our code, we can see that there were 80 books without any ratings, and 624 books without any reviews.
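If you want to avoid the warning handling entirely, NumPy’s log1p, which computes log(1 + x), is a common alternative, at the cost of shifting every value by 1 before the transform:

```python
import numpy as np
import pandas as pd

# Toy counts including a zero
counts = pd.Series([0, 9, 99, 999])

# log1p(x) = log(1 + x): defined at 0, monotone, no RuntimeWarning
log_counts = np.log1p(counts)
print(log_counts.tolist())
```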

Blue: univariate EDA on categorical variables into data cleaning

To visualize the distribution of categorical variables, we’ll use the sns.countplot() function and stylize it a bit so that the tick labels are more legible. You can read more about how to customize seaborn plots in our post about data cleaning, as well as in their official documentation.

Seaborn sns.countplot() in Einblick canvas
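A minimal version of that styling, on made-up language codes, might look like:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up language codes standing in for the Goodreads column
demo = pd.DataFrame(
    {"language_code": ["eng"] * 8 + ["en-US"] * 3 + ["spa"] * 2 + ["fre"]}
)

ax = sns.countplot(data=demo, x="language_code")
ax.tick_params(axis="x", rotation=45)  # rotate tick labels for legibility
plt.tight_layout()
plt.savefig("countplot.png")
```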

We can also use df["column"].value_counts() to get a count of values for each category in a given column. We can sort the results by value using the sort_values() function as well.

In  [15]: print(df["language_code"].value_counts().sort_values())
Out [15]:
nl          1
ale         1
ara         1
wel         1
nor         1
glg         1
srp         1
tur         1
gla         1
msa         1
rus         2
swe         2
lat         3
enm         3
ita         5
en-CA       7
por        10
grc        11
zho        14
mul        19
jpn        46
ger        99
fre       143
en-GB     214
spa       218
en-US    1408
eng      8907
Name: language_code, dtype: int64

From both the graphical and non-graphical EDA, we can see that the language_code column separates English into 4 categories: en-CA, en-GB, en-US, and eng, but most of the English books are just tagged as eng.

df["language"] = df["language_code"].replace({'en-CA': 'eng', 'en-GB': 'eng', 'en-US': 'eng'})

Given that, we decided to code all English books using the generic eng tag. We did this using the df["column"].replace() function. Given the use case, we used a dictionary to map the values, but there are a variety of ways to use this function to replace values.

In  [17]: pd.set_option('display.max_rows', None)  
          df["publisher"].value_counts().sort_index()
          
Out [17]:
...
SparkNotes                                    11
Speak                                          7
Specialty Press                                1
Spectra                                       22
Spectra Books                                 11
Spectra/Bantam Books (NYC)                     1
Sphere                                         4
Springer                                       5
...

Next, we explored the publisher column. Based on initial exploration, we can see that there are 2,289 “unique” publishers, but upon further digging, some publishers seem to be duplicated in the dataset. For example, "Spectra" and "Spectra Books" are two separate values. For the purposes of this article, we won’t dive into how to handle text data, but this would be a good opportunity to consult a domain expert, who may have more knowledge of which publishers actually have distinct subsidiaries versus publishers that should just count as the same entity.

df[df["title"].str.contains("Potter") & df["authors"].str.contains("Rowling")].sort_values('publication_datetime')
Subset of Goodreads data in table format

Earlier, we had also posed the question about different editions of books, so we took a popular book series to see how it was represented in the data. We subsetted for books with "Potter" in the title and "Rowling" in the authors column. Although each individual book in the 7-book series is not represented more than once, we do see different translations of the books. For further models, it might be important to consider whether subsetting for just English books makes more sense, since English dominates the dataset.

Goodreads dataset, authors column

Additionally, if we check out the authors column, we can see that multiple authors are separated by a /. Is this the best way to store the data? Should we create columns called author1 and author2? Likely author2 would be a really sparse column. Should the authors column be a column of lists? Authorship is important as certain authors may be more famous, thus drawing more reviews and ratings.
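One option, sketched on a toy frame, is to split the column into lists so the number of authors is easy to query:

```python
import pandas as pd

# Toy authors column; multiple authors are separated by "/"
toy = pd.DataFrame({"authors": ["J.K. Rowling/Mary GrandPré", "Douglas Adams"]})

# Store each entry as a list of authors, then count them
toy["author_list"] = toy["authors"].str.split("/")
toy["n_authors"] = toy["author_list"].str.len()
print(toy[["author_list", "n_authors"]])
```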

Purple: multivariate EDA

Lastly, to wrap up EDA, we’ll examine some multivariate relationships. First, we’ll re-create the sns.pairplot() now that we’ve created new columns, and converted some columns from text to numeric data.

Seaborn sns.pairplot() in Einblick canvas

We can see our new columns incorporated. It seems that the distributions still look a bit odd, even after log transforming the ratings_count and the text_reviews_count columns. As a result, we might not be able to run certain statistical models that have assumptions about the underlying data distributions.

# Get the top 5 publishers by book count
top_five = df["publisher"].value_counts().sort_values().tail(n = 5).index

# Keep only books from the top 5 publishers
df = df[df["publisher"].isin(top_five)]

Next, there are tons of publishers, but a handful publish far more books than the rest, so we filtered for just the top five publishers.

Multivariate EDA, box plot and scatter plot in Einblick canvas

Then we can create color-coded box plots and scatter plots to see if there’s anything interesting to observe. For sns.boxplot() we can do this by specifying the x and y arguments. For sns.scatterplot() we can select the hue of the points to be the publishers.
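A sketch of both plots, using synthetic data and hypothetical publisher names:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: ratings for three hypothetical publishers
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "publisher": rng.choice(["Publisher A", "Publisher B", "Publisher C"], size=300),
    "avg_rating_float": rng.normal(3.9, 0.3, size=300),
    "log_ratings_count": rng.normal(7, 2, size=300),
})

fig, axs = plt.subplots(1, 2, figsize=(12, 4))
# Box plot split by publisher via the x and y arguments
sns.boxplot(data=demo, x="publisher", y="avg_rating_float", ax=axs[0])
# Scatter plot with points colored by publisher via hue
sns.scatterplot(data=demo, x="log_ratings_count", y="avg_rating_float",
                hue="publisher", ax=axs[1])
fig.tight_layout()
fig.savefig("multivariate_sketch.png")
```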

Nothing immediately jumps out, but there are a number of combinations of variables that we can explore now that we have a better understanding of our data. We hope this was a helpful walkthrough! Let us know if you try out Einblick.

About

Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include BMW, Cisco, Covestro, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
