Three innovations to bring humans back into the loop
Drag-and-drop interfaces let users chain analytic operators into pipelines that produce descriptive and predictive analytics. However, while many data analysts have used these tools successfully, they are far from perfect. Through extensive user studies, Einblick identified three major opportunities to improve data science workflow builders and make our software better suited to modern analytics challenges:
- Don’t hide data: Analytics is about storytelling, and it should be easy to build a narrative as we step through the analysis. No-code tools can hide data behind too many small blocks and turn you into a button clicker. Good tools make it easy to examine the data at every step and apply your intuition.
- Work smarter, not harder: Data has gotten much bigger, and most analysts now contend with GB-sized datasets. Consequently, existing tools take longer to complete, and every iteration costs even more time. Modern techniques like progressive sampling mitigate the overhead that large datasets create, prioritizing a short time to first insight and failing fast when something is wrong.
- Make data science a group activity: Good decisions are made when solid analytic outputs meet discussion from a diverse group of perspectives. But today, data science is not truly collaborative: analysis is almost always an individual activity, results are shared as static visuals in PowerPoint or dashboards, and iteration happens asynchronously. The whole organization should have a shared space where everyone can build and react to analysis together.
Looking at data helps you see the story in the data
The workflow-based platform’s modular nature decouples “seeing data” from the operations performed on it. Each analysis step is a building block that exposes only configuration buttons, not the data flowing through it. But effective data analysis benefits from human reasoning; analysts must see tables and charts to uncover patterns and chase them. In fact, this is part of why Excel is so easy to use – the data is inherently transparent, so even non-technical users can be confident they are on the right track.
This is ultimately because the human brain is extraordinarily well trained to detect patterns, and you, the user, have contextual knowledge that machines do not. A common but impactful example comes from outlier detection: suppose a hospital records a patient’s weight as -1 when it is unknown. With visual feedback, an analyst spots this right away, intuitively. But with opaque operator blocks, the fact is never surfaced, and the analyst could keep going for several iterations. They might remember to explicitly add a “check outliers” block, or they might miss it altogether.
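The sentinel-weight example can be made concrete. The kind of at-a-glance summary a visual tool surfaces automatically is enough to expose impossible values; here is a minimal sketch in plain Python with synthetic data (the dataset and the -1 convention are illustrative assumptions, not a real hospital schema):

```python
from collections import Counter

def summarize(values):
    """Return min, max, and the most common values -- the at-a-glance
    summary a visual tool would show next to every operator."""
    counts = Counter(values)
    return {
        "min": min(values),
        "max": max(values),
        "top": counts.most_common(3),
    }

# Synthetic patient weights: unknown values recorded as the sentinel -1.
weights = [70.5, 82.0, -1, 65.3, -1, 91.2, 77.8, -1, 58.4]

summary = summarize(weights)
print(summary["min"])    # a minimum of -1 jumps out: weights can't be negative

# Rows that would silently skew any downstream average or model.
suspicious = [v for v in weights if v <= 0]
print(len(suspicious))
```

A human glancing at the summary catches the negative minimum instantly; a pipeline of opaque blocks would happily average the sentinels into the result.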
At Einblick, we named our approach “visual data computing,” since every node in the workflow is both a processing step and a live visual representation of the data.
When working with large datasets, tools need to be smarter
Last-generation workflow platforms lack interactive speed. They compute workflows sequentially, which requires that every upstream operation be fully executed at each step. This is not a problem for simple operations or small datasets, but it becomes extremely taxing with larger datasets and more complex operations.
Suppose we are working on a 1GB customer dataset – not big data, but still sizable. We set up the processing pipeline for a market basket analysis: data ingestion, data manipulation, and then data mining. After a slightly long 10 minutes of waiting, we see the results and realize we made a mistake: several columns had the wrong data type. Well, that’s not too bad; our dataset is only medium-sized. We can just go back, re-execute the pipeline, and wait another 10 minutes. But what if we ran the same analysis on a 10GB or even a 1TB dataset? And should results take 10 minutes (or 10 hours) in the first place? Traditional tools do not scale, and “pretty big” data has become widespread; the answer cannot be “just buy a bigger machine” or “give AWS more money.”
Einblick takes a different approach: all computations in Einblick are progressive. Einblick first returns an initial result over a random sample of the data in under five seconds, then refines and updates that answer in the background with larger and larger samples. Downstream operators can start working too, receiving intermediate results from the upstream ones and immediately crunching numbers incrementally, so you can create and iterate on workflows at the speed of thought.
Even if your dataset is >1TB, you will be able to reach the endpoint of your pipeline in a matter of seconds and monitor its progress in real-time. If something doesn’t look right, you can immediately intervene and swiftly iterate on your pipeline.
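The core idea behind progressive computation can be illustrated with a small sketch: compute an aggregate over exponentially growing random samples, emitting a refined estimate at each step and the exact answer at the end. This is a toy illustration of the general technique, not Einblick’s actual engine; the function name and growth schedule are assumptions.

```python
import random

def progressive_mean(data, start=1_000, growth=4, seed=0):
    """Yield successively refined estimates of the mean, each computed
    over a larger random sample, ending with the exact answer."""
    rng = random.Random(seed)
    n = start
    while n < len(data):
        sample = rng.sample(data, n)
        yield n, sum(sample) / n              # fast, approximate answer
        n *= growth
    yield len(data), sum(data) / len(data)    # final, exact answer

# Simulate a large numeric column.
rng = random.Random(42)
column = [rng.gauss(100, 15) for _ in range(1_000_000)]

for sample_size, estimate in progressive_mean(column):
    print(f"n={sample_size:>9,}  mean ~ {estimate:.2f}")
```

The first estimate arrives after touching only 1,000 rows, so a wrong data type or a sentinel value shows up in seconds rather than after a full pass over the dataset.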
Collaboration improves decision making, so let’s make data science collaborative
Analytics tools create outputs that ultimately become the inputs to decision making. That conversation involves a variety of personas: engineers, data scientists, product managers, and business line owners, each with diverse skill sets and unique expertise. But there is no shared tool to create analysis, and no shared tool to discuss it.
Rather than creating a single community that works toward the best answer, current tooling only lets analytics happen in silos. Data science notebooks are not accessible to no-coders, while no-code workflow platforms are viewed as limiting and tedious by coders. As a result, there is no shared playing field for creating insights.
Then, once analysis is complete, PowerPoint over a Zoom call is the gold standard of collaboration. While this enables discussion, there is no easy way to investigate new thoughts live. Calls end with a pile of “let me take that back” and “we’ll follow up,” rather than a flexible, performant way to explore new questions right away.
Einblick’s answer is a multiplayer data whiteboard that supports hybrid code/no-code workflows. Teams with different levels of technical expertise can contribute to data exploration in a shared environment. Then, bring in the wider stakeholder group to discuss results in real time. Anyone can make the next cut or add a new insight directly on the canvas.