Why data scientists still can't have it all - the downside of notebooks

Cynthia Leung - August 2nd, 2022

Introduction

Organizations have worked hard to build teams of data specialists to tackle tough problems, but the tools these teams use still resemble the code-heavy integrated development environments (IDEs) of yesteryear. Computational notebooks come in many flavors (Jupyter, Colab, SageMaker, Databricks, and more) and remain the data scientist's primary tool.

IDEs are a legacy of traditional software engineering, but data science is a very different exercise. Data scientists switch domains rapidly and must wrangle data, produce analyses, share results, and productionize models largely on their own. That workflow involves many steps and many stakeholders, so it should come as no surprise that notebooks, which are optimized for individual coding, cannot satisfy every aspect of this multi-persona exercise.

Recently, we read a 2020 paper from researchers at Microsoft and their academic collaborators and wanted to share the results with you. The authors conducted a mixed-methods study, interviewing and surveying professional data scientists to examine whether computational notebooks meet their needs. While notebooks offer a quick, interactive space for data science, the study concluded that they fall short in several key areas, including setup, exploratory data analysis, reliability, sharing, and reproducibility / reusability. Let’s dive in.

Loading and cleaning data can be a frustrating, manual process

Users can face complications at the very start of the process, when importing multiple datasets from outside sources. Sometimes this requires downloading data from the cloud to a local machine and re-uploading it later. Some libraries alleviate the problem, but data scientists must knowingly add extra steps to their workflow to use them. Notebooks may also struggle when datasets are too large, and extra engineering is needed to handle large volumes of data.
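
To make those extra steps concrete, here is a minimal sketch of what loading data from the cloud often looks like in a notebook, assuming pandas plus the optional s3fs dependency; the bucket path and column name are placeholders, not part of the study.

```python
import pandas as pd

# Hypothetical S3 path; reading it directly requires the optional s3fs package,
# an extra dependency the data scientist has to know about and install.
S3_PATH = "s3://example-bucket/events.csv"

# Small file: one call pulls it straight from the cloud into memory.
df = pd.read_csv(S3_PATH)

# Larger-than-memory file: the workflow gains another manual step, streaming
# the file in chunks and keeping only the rows needed downstream
# ("status" is a placeholder column name).
chunks = pd.read_csv(S3_PATH, chunksize=500_000)
df = pd.concat(chunk[chunk["status"] == "ok"] for chunk in chunks)
```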

In these initial stages, users must also be able to clean their dataset easily. Data cleaning is a tedious process, and notebooks offer little to ease it: users end up copying and pasting the same lines with small changes, leaving room for human error.
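
One common workaround is to fold the repeated cleaning steps into a single function instead of re-editing pasted cells; the sketch below assumes pandas, and the column names and example values are placeholders.

```python
import pandas as pd

# Instead of copy-pasting the same cleaning cell and editing the column name
# by hand, the repeated steps live in one function (column names are placeholders).
def clean_numeric(df: pd.DataFrame, column: str) -> pd.DataFrame:
    out = df.copy()
    cleaned = (
        out[column]
        .astype(str)
        .str.replace(",", "", regex=False)   # strip thousands separators
        .replace({"": None, "N/A": None})    # normalize common missing markers
    )
    out[column] = pd.to_numeric(cleaned, errors="coerce")
    return out

raw = pd.DataFrame({"revenue": ["1,200", "N/A", "950"], "costs": ["300", "", "1,100"]})
clean = clean_numeric(clean_numeric(raw, "revenue"), "costs")
```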

‘Quick’ exploration and analysis can be unnecessarily time-consuming

The modeling process is often complex and time-consuming, and it can occasionally crash the kernel. Notebooks also do not let users evaluate alternatives concurrently, make quick adjustments, or get immediate feedback. Visualizations, meanwhile, are difficult to customize and tweak in notebooks, leading to constant copying and pasting; they are constrained by the notebook layout and may render unexpectedly when imported or exported. Both modeling and visualization require iteration, yet there is no easy way to iterate efficiently.
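
As an illustration of how iteration turns into copy-paste, here is one small sketch of keeping the shared plot styling in a helper so each "what if" is a one-line call rather than a duplicated cell; it assumes matplotlib and pandas, and the column names and data are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Shared styling lives in one helper; only the parameters that actually change
# between iterations are passed in (column names here are placeholders).
def scatter(df, x, y, log_y=False, title=None):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x], df[y], s=10, alpha=0.6)
    if log_y:
        ax.set_yscale("log")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(title or f"{y} vs {x}")
    return ax

df = pd.DataFrame({"age": [22, 35, 41, 58], "spend": [120, 640, 310, 2500]})
scatter(df, "age", "spend")                                   # first look
scatter(df, "age", "spend", log_y=True, title="log scale")    # quick tweak, no re-paste
```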

Large datasets cause inconsistencies in notebooks

As datasets get larger, the chance of a kernel crash grows. If an operation fails mid-execution, the notebook or its data can be left in an inconsistent state: some changes may have been applied to part of the data but not all of it, and it is difficult to tell where in the process the notebook left off when it crashed. To handle data at this scale, teams often deploy heavy-duty systems like Spark, but that adds another layer of infrastructure, and another set of skills, to the requirements.
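
For a sense of what that extra layer looks like, here is a minimal PySpark sketch; the paths, column name, and app name are placeholders, and the cluster and S3 credentials still have to be configured outside the notebook.

```python
from pyspark.sql import SparkSession

# The Spark fallback brings its own machinery: a session, a different DataFrame
# API, and cluster configuration that lives outside the notebook itself.
spark = SparkSession.builder.appName("large-dataset-example").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events/")          # placeholder path
daily_counts = events.groupBy("event_date").count()                  # placeholder column
daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/daily_counts/")
```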

No support for easy collaboration on notebooks

Collaboration comes in many flavors, but notebooks generally do not support sharing robustly. At its narrowest, sharing a notebook just means sharing code, which is often useless without also sharing the underlying data and/or environment settings. That requires extra instructions and steps beyond sending the notebook itself, though cloud-based systems have gotten better at being environment agnostic.

In today's remote working environment, live side-by-side collaboration can be extremely important. Even when a collaborator knows how to program, comments and code cannot replace a live walkthrough. Worse still, data scientists need to present their findings to stakeholders or customers who do not share their data background, and computational notebooks make this very difficult.

Reproducing results and reusing code are not easy tasks to complete on notebooks

For someone else to reproduce a notebook's results, they must ensure that environment settings and customizations are exactly the same, which can fail if they lack the extensions the original author used. Reusing and adapting code is also difficult, because it can be hard to recreate the dependencies that were set up when the notebook was first written.
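
A lightweight workaround, sketched below with standard-library calls, is to snapshot the environment a notebook actually ran in alongside the notebook itself; the output filename is a placeholder, and the snapshot still misses notebook extensions and OS-level settings, which have to be documented by hand.

```python
import sys
import importlib.metadata as metadata

# Record the Python version and every installed package with its exact version,
# so whoever tries to reproduce the results knows what to install first.
with open("environment_snapshot.txt", "w") as f:
    f.write(f"python {sys.version.split()[0]}\n")
    dists = sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "").lower())
    for dist in dists:
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```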

There is also a more advanced form of reuse: turning boilerplate code and processes into reusable templates. Data scientists frequently rerun the same analysis on new data or follow the same workflow across multiple projects. In those situations, the ability to save previously executed code as reusable fragments dramatically increases efficiency.
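
As a rough sketch of that "reusable fragment" idea: once an analysis has stabilized, the notebook cells can be collapsed into a parameterized function that is imported and rerun on each new extract. It assumes pandas, and the file, column, and function names are placeholders.

```python
import pandas as pd

# A stabilized analysis collapsed into one parameterized function
# (file and column names are placeholders).
def churn_report(path: str, cutoff_days: int = 30) -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["last_active"])
    df["inactive_days"] = (pd.Timestamp.today() - df["last_active"]).dt.days
    df["churned"] = df["inactive_days"] > cutoff_days
    return df.groupby("plan")["churned"].mean().reset_index()

# Next month's rerun becomes a single call instead of a re-edited copy of the notebook.
summary = churn_report("customers_2022_09.csv")
```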

Conclusion

Beyond these, computational notebooks have several other issues, particularly around managing code, preserving code history within and between notebooks, maintaining data security and access control, and deploying to production. Notebooks are still quite useful, but even modern implementations leave data scientists with plenty of problems. Addressing these pain points can dramatically improve the user experience and cut down on the number of tools professional data scientists have to juggle every day.

Tools such as Einblick can address these pain points, all within the same environment.

  • Einblick provides collaboration through live whiteboarding and presentation modes, so data scientists can easily work with domain experts, stakeholders, and each other on the same interface.
  • Einblick supports analysis of large datasets through progressive computational engines and connections to databases such as Snowflake, Oracle, Amazon S3, and more. It can meet the computational needs of data scientists and integrate well within large organizations.
  • Einblick provides an estimated result for models and visualizations within 5 seconds (regardless of dataset size), so data scientists can quickly build and compare multiple models at the same time and iterate without delay.
  • Einblick lets users duplicate ML pipelines, save workflows as shareable functions, and reuse no-code operators. Reproducibility is made easy, and no repetitive coding is needed.

As data becomes more prevalent and more important, the tools data scientists use must be re-evaluated. More versatile tools will be needed, and future iterations of notebooks will have to become collaborative, scalable, and efficient. Improving a data scientist's quality of life can have an outsized effect on the company as a whole.

Citation

Souti Chattopadhyay, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. 2020. What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ‘20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376729

Start using Einblick

Pull all your data sources together, and build actionable insights on a single unified platform.
