Notebooks Aren’t Great for Data Science, We Just Got Used to Them

Paul Yang - September 27th, 2023
Python notebooks only allow linear workflows, but the way we actually work and think is non-linear. Only a canvas gives you the grammar to express more complex relationships in a workflow and a visual way to lay out results.

I am messy when I build a notebook because notebooks are rigid in structure, while our analytical minds are not. Here are some things I notice, all the time, in my notebooks:

  • Scattered code chunks I haven’t cleaned up, kept around because “hey, maybe I’ll need this someday.”
  • A multitude of print statements used to preview data, debug, and show results after each step.
  • `Run All Cells` is a myth – there’s no way I can execute the whole notebook in just one go.
  • Finding a specific block of code is hard, with Ctrl-F as the main navigation tool.
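That last “Run All Cells” failure is easy to reproduce: Jupyter keeps one shared namespace, so cells run interactively out of order can leave the saved notebook unrunnable top to bottom. A minimal simulation (the cell contents here are hypothetical):

```python
# A toy "notebook": each cell is a source string exec'd against one shared
# namespace, the same way a Jupyter kernel keeps state in memory.
cells = [
    "total = sum(data)",  # cell 1 sits near the top, but reads `data`
    "data = [1, 2, 3]",   # cell 2 defines `data` further down the page
]

def run(order, cells=cells):
    """Execute cells in the given order against a fresh namespace."""
    ns = {}
    for i in order:
        exec(cells[i], ns)
    return ns

# Interactively, the author ran cell 2 first, then cell 1 -- it "worked":
print(run([1, 0])["total"])  # 6

# But "Run All Cells" replays top to bottom, and fails:
try:
    run([0, 1])
except NameError as e:
    print("Run All fails:", e)  # name 'data' is not defined
```

The notebook was saved in an order that can never be replayed in one go, even though every cell ran fine when the author wrote it.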

The Key Problem: Data Science Involves Mixed Modalities

The mix of content in a notebook is diverse.

  • 10% necessary boilerplate: package imports and data loads. These steps must be done first, but don’t require much thought.
  • 50% exploration, which is iterative and chaotic: notebooks end up messy because a vast chunk of data science work is iteratively messing around with the data, and there is no linear order to it.
  • 30% process, which benefits from an execution DAG: joins, transformations, train-test splits, and model training must happen in a particular order, and each is a prerequisite of later cells.
  • 10% final outputs that need organized highlighting: cell outputs from every stage of the analysis, including elements of exploration and processing, plus specific cells that describe results, like SHAP explainer charts.
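The “process” slice has exactly the shape of a DAG: each step is a prerequisite of the next. As a sketch (the step names are illustrative, not any tool’s actual schema), Python’s standard-library graphlib can compute a valid execution order from the dependencies:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each step to the steps it depends on, mirroring the list above:
# joins -> transformations -> train/test split -> model training.
deps = {
    "transformations": {"joins"},
    "train_test_split": {"transformations"},
    "model_training": {"train_test_split"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
# ['joins', 'transformations', 'train_test_split', 'model_training']
```

A linear notebook encodes this ordering only implicitly, by where the cells happen to sit on the page.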

These steps relate to each other in multiple ways – each cell is linked to other cells in several senses, because even though we use computers and programming, we do not think and work the way a computer executes. For example, when we consider a Python notebook in its entirety, we can think of its dozens of cells in:

  • Execution order: how code must be run in order for the correct results to occur.
  • Analysis order: how the outputs must be viewed by a human. For instance, EDA is frequently reviewed next to model explainability.
  • Presentation order: how results must be shared with another user in the discussion, live or asynchronously.
  • Debugging order: how components relate to each other throughout the entire notebook, and the effects they have downstream.

However, Jupyter notebooks force everything into ONE linear order, while we as data scientists are actually moving across all of these modalities. Outside of that single defined sequence, there is no way to depict the other relationships.

A Canvas Makes (Much) More Sense

Putting my notebook into canvas form gives us many different ways to depict relationships between code cells. Let’s take a look at just a few of the organizational and execution features available to us.

Two-dimensional layout

Code that sits close together is more related than code that sits far apart. Two-dimensionality gives me a chance to physically relate four or more cells or objects to any given cell, not just the ones above and below. After I read in a CSV, I can put print/display statements beside the read() block and continue downwards with the next analysis step.

We can easily and cleanly lay out our workspace so that each step is logically contained and visually separated.

Annotations

Markdown cells in a notebook allow us to label code and sections with rich text.

However, unlike markdown cells, I can have multiple annotations laid out in two dimensions to represent different types of commentary. It’s easy to lay out a canvas so one markdown cell titles a section, another describes an individual cell, and yet another summarizes insights.

We can leave one or more rich text annotations anywhere on the canvas.

Colored zones

Zones are two-dimensional sections of analysis. In traditional Python notebooks (linear notebooks), you could create a section with a big markdown cell to summarize insights or signal that we’ve moved from one part of the workflow to another. However, Zones build on this concept and have three benefits that more appropriately accommodate the non-linearity of data science:

  • They are visually clear: I can zoom out and understand the layout of my canvas quickly, and coloration encodes extra information.
  • They can overlap: cells do not need to belong to a single section; they can visually be arranged into multiple.
  • They can be nested: logical building blocks can be created out of groupings that aggregate up into larger zones.

We can choose to interlock the zones for results and forecast generation, with green representing results and red representing forecast creation.

Directed Graphs

It’s not controversial to represent execution flows as DAGs. In notebooks, there is an implicit graph: a linear, top-to-bottom timeline. Within a canvas-based notebook like Einblick, we automatically try to infer execution order and draw lines connecting cells to the cells that depend on them. You can also, as the user, pin down execution order by adding dependency links. This feature explicitly tries to ensure that “Run Flow” causes the intended effects, whereas in notebooks, making “Run All Cells” work is traded off against ordering cells logically for reading and thinking.

DAGs are automatically created based on the packages used and on variable names. You can also manually add sources, or remove linkages if you choose. The `Run Flow` button then runs the DAG leading into a cell.
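Einblick’s actual inference isn’t spelled out here, but a toy version of “link cells by variable names” can be sketched with Python’s ast module: a cell depends on whichever earlier cell last assigned a name it reads. (The helper functions and the sample cells below are mine, for illustration only.)

```python
import ast

def names(src):
    """Return (assigned, read) variable names in one cell's source."""
    assigned, read = set(), set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                read.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:  # imports also bind names
                assigned.add((alias.asname or alias.name).split(".")[0])
    return assigned, read

def infer_edges(cells):
    """Edge (i, j): cell j reads a name most recently assigned by cell i."""
    last_writer, edges = {}, set()
    for j, src in enumerate(cells):
        assigned, read = names(src)
        for name in read:
            if name in last_writer:
                edges.add((last_writer[name], j))
        for name in assigned:
            last_writer[name] = j
    return edges

cells = [
    "import pandas as pd",
    "df = pd.read_csv('data.csv')",
    "clean = df.dropna()",
    "print(clean.describe())",
]
print(sorted(infer_edges(cells)))  # [(0, 1), (1, 2), (2, 3)]
```

A real system has to handle mutation, attribute writes, and re-execution of edited cells, which is why manually adding or removing linkages remains useful.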

Bookmarks

In a traditional linear notebook, all a user has to navigate with is section headers within markdown cells; the differently sized headers visually break up the dozens of notebook cells. But once an analysis is spatially laid out, zoned, and annotated, navigation becomes much more dynamic and meaningful. This takes the form of bookmarks within the canvas, which save a particular X-Y location at a particular zoom level. You can copy direct links to bookmarks when embedding or sharing the canvas externally. This also enables a non-linear presentation mode, as each bookmark is akin to a PowerPoint slide of outputs.

The location and zoom level of this cell is recorded in the bookmarks tab. Scrolling between different marked regions is made easy by clicking through the bookmarks list.

Dashboards, Embeds, and More

PowerPoint will remain the de facto presentation layer of data science, and BI dashboards the most easily shared data product. But plenty of good outputs are already visible in a Python notebook. By plucking a few outputs out of the overall workflow, and hiding the complexity of code and execution, a notebook can easily be promoted into a lightweight display layer. These results will never fall out of sync with the source (because they are the source!).

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
