Progressive Computation Engine (PCE)
Why is progressiveness essential?
No one likes to wait: once datasets grow past the terabyte scale, traditional engines can no longer keep up. Waiting minutes for a bar chart, or hours for a quick draft ML model, is not a workable workflow for data scientists and analysts.
It doesn't interrupt the train of thought: the best data scientists iterate fast, starting simple and shipping an alpha version as soon as possible. Bad data scientists start with the most advanced models they know and iterate slowly. Einblick’s progressive engine is designed to mimic the behavior of the best data scientists. It returns a first approximate answer on your data in a matter of seconds, so your train of thought keeps going at full speed and you can keep following the thread in your data. While you work on your hypothesis, Einblick's PCE keeps refining the first results by automatically scaling up until the requested computation has run on the whole dataset.
It enables remote real-time collaboration: progressiveness is a prerequisite for real-time collaboration. Imagine having an exploration session with a colleague and finding yourself waiting 20 minutes for a simple visualization to render, or hours or days for a machine learning model to return. Progressive and approximate answers are the key ingredients for fast iterations between several concurrent users.
Progressive v. Exhaustive Computation: a concrete example
Einblick's computation engine achieves rapid responsiveness through progressive sampling within the platform: it quickly returns an initial answer on a random sample of the data, then continues to refine that answer in the background.
In the example below, we filter an 80M-row dataset containing flight data: on the left, Einblick's progressive engine is turned on; on the right, it is switched off. The goal of this simple filtering operation is to plot a bar chart of the flight destinations sorted by their absolute counts.
- On the left, you can see the interactivity achieved by Einblick's progressive engine. It returns a first approximate result in less than a second, immediately providing a good approximation of the underlying distribution, and converges to the final result in about 24 seconds.
- On the right, the progressive engine is turned off: we now have to wait for the computation to run on the whole dataset before getting any answer. The whole computation takes 6 minutes and 12 seconds, which in practice rules out real-time collaboration and disrupts the user's train of thought.
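The progressive behavior described above can be illustrated with a minimal sketch. It is not Einblick's implementation: the function name `progressive_counts`, the batch size, and the small synthetic list of destination codes are all assumptions made for illustration. The idea is simply that, if rows arrive in random order, the counts seen so far can be scaled up to estimate the full-dataset counts after every batch.

```python
import random
from collections import Counter


def progressive_counts(rows, total_rows, batch_size):
    """Yield (fraction_seen, estimated_counts) after each batch.

    Hypothetical helper: assumes `rows` is already in random order, so
    every prefix of the data is a random sample of the whole dataset.
    """
    seen = 0
    counts = Counter()
    for start in range(0, total_rows, batch_size):
        batch = rows[start:start + batch_size]
        counts.update(batch)
        seen += len(batch)
        # Scale the partial counts up to the size of the full dataset.
        scale = total_rows / seen
        yield seen / total_rows, {key: n * scale for key, n in counts.items()}


# Synthetic stand-in for the flight data (illustrative values only).
random.seed(0)
destinations = ["ATL"] * 5000 + ["ORD"] * 3000 + ["DFW"] * 2000
random.shuffle(destinations)

estimates = list(
    progressive_counts(destinations, len(destinations), batch_size=1000)
)
first_fraction, first_estimate = estimates[0]    # approximate, after 10% of rows
final_fraction, final_estimate = estimates[-1]   # exact, after all rows
```

A caller would redraw the bar chart each time a new estimate is yielded. The first estimate is available after a single batch, and the last one coincides with the exhaustive result.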
How does progressiveness work in Einblick?
Einblick is powered by Davos, a progressive computation engine that works over data streams and enables Einblick’s interactive speed. Every Einblick deployment is composed of several Einblick containers, each powered by its own instance of Davos. Once a user is assigned to an Einblick container, their jobs are executed by that container's Davos instance, which starts pulling data from the specified data source.
Where possible, Davos pushes sampling predicates down to the underlying data source; otherwise, it scans through the data and computes a reservoir sample over the data source. Every operator in Einblick works over these data streams, and users receive an updated version of the workload outputs every time an operator finishes executing over a batch.
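For readers unfamiliar with reservoir sampling, the sketch below shows the textbook single-pass version (often called Algorithm R). It keeps a uniform random sample of fixed size `k` from a stream whose length is not known in advance. How Davos actually implements its reservoir sample is not described here, so the function name `reservoir_sample` and its parameters are assumptions for illustration only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Textbook Algorithm R, shown as an illustration; not Davos's code.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 100 row ids from a stream of one million without
# materializing the stream in memory.
sample = reservoir_sample(iter(range(1_000_000)), k=100, rng=random.Random(42))
```

Because the sample is maintained in a single pass, it fits the scan-based fallback described above: the data source is read once, and memory use stays bounded by `k` regardless of the dataset size.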