Python

The Python operator is a built-in notebook for writing Python snippets that manipulate data with Pandas. It gives you access to a full Python run-time environment. To write a snippet, drag the Python operator onto the canvas, enter your code in the available space, and run it by pressing the play button in the bottom left. The resulting dataframe is output in the bottom right.

Using Pandas

Pandas is automatically available for use in the Python operator. Invoke Pandas functions using the identifier pd.

Working With One Dataframe

The input dataframe is accessible using the name df. You can then manipulate the dataframe using normal Pandas syntax, e.g. df['time']. The output dataframe will be df after any changes, so there is no need to return anything.

In the following example, we use the Python operator to add a column named large city to our dataframe to indicate whether a city has a population of more than 7000.

Single dataframe input
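
As a minimal sketch of the snippet described above: the column names city and population and the sample rows are assumptions for illustration, and the first lines only stand in for what the operator already provides (pd and df), so the sketch can run on its own.

```python
import pandas as pd

# Stand-in for the operator's input. Inside the Python operator, pd and df
# already exist, so only the last line would be needed there.
# The column names and values here are hypothetical.
df = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma"],
    "population": [12000, 7000, 350],
})

# Flag cities whose population is strictly greater than 7000.
df["large city"] = df["population"] > 7000
```

Because the change is made on df itself, the modified df becomes the output without an explicit return.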

Working With Two Dataframes

To access the input dataframes, use dfs[0] and dfs[1]. Because two dataframes are now involved, there is no single df to fall back on, so you must explicitly specify the dataframe to return as the output.

In the following example, we use the Python operator to combine two dataframes and compute a new column, Score_change, based on their respective Score columns.
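
A sketch of how such a snippet could look is below. The join key Team, the suffixes, and the sample values are assumptions, and the dfs list is built by hand here only so the code runs outside the operator. How the result is handed back inside the operator is not shown; combined is simply the dataframe you would specify as the output.

```python
import pandas as pd

# Stand-ins for the two inputs. Inside the operator these arrive as
# dfs[0] and dfs[1]. The key column "Team" and the values are hypothetical.
dfs = [
    pd.DataFrame({"Team": ["A", "B", "C"], "Score": [10, 20, 30]}),
    pd.DataFrame({"Team": ["A", "B", "C"], "Score": [15, 18, 30]}),
]

# Join on the assumed key; the suffixes keep the two Score columns apart.
combined = dfs[0].merge(dfs[1], on="Team", suffixes=("_before", "_after"))

# New column computed from the two Score columns.
combined["Score_change"] = combined["Score_after"] - combined["Score_before"]
```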

Options: UDF vs. UDA

For large datasets, you may want to tell Einblick's progressive engine what kind of operation your Python snippet performs. Applying the UDF designation can speed up the computation, but it can only be safely applied to certain kinds of operations.

The UDA option tells the progressive engine that the entire dataset must be considered in the computation, ensuring that upon receiving a new sample, the calculation is performed on the entire set of samples encountered so far. This is important when the result of the calculation is dependent on the entire dataset and cannot be simplified by only considering one sampled batch at a time.

An example of a snippet requiring the UDA indication is one that calculates the average of an entire column. In this case, the entire column, across multiple batches, must be considered to get an accurate result. Using UDF here would calculate the average over only the final batch considered, producing biased results drawn from a much smaller subset of the data.
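
The difference can be illustrated in plain Pandas. This is only a simulation: the two-batch split and the column name value are assumptions, and it does not reproduce how the progressive engine actually samples data.

```python
import pandas as pd

# Hypothetical dataset, split into two batches by hand to mimic sampling.
full = pd.DataFrame({"value": [10, 20, 30, 40, 50, 600]})
batches = [full.iloc[:4], full.iloc[4:]]

# UDF-like behavior: the average only sees the last batch.
mean_last_batch = batches[-1]["value"].mean()

# UDA-like behavior: the average sees every sample encountered so far.
mean_all_seen = pd.concat(batches)["value"].mean()
```

Here mean_last_batch is 325.0 while mean_all_seen is 125.0, which is why an aggregate such as an average needs the UDA option.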

The UDF option indicates that the operation will succeed when considering just one batch at a time.

An example of a snippet where the UDF indication is reasonable is a snippet adding a new column that is the sum of two other columns. In this case, each batch can be considered in isolation, so an increase in speed can be achieved by indicating this optimization to the progressive engine.
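
As a rough check of why this is safe, the sketch below applies the same row-wise operation batch by batch and to the whole dataset, then compares the results. The columns a and b, the helper add_total, and the hand-made batch split are all hypothetical and only simulate batching.

```python
import pandas as pd

# Hypothetical dataset and batch split.
full = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
batches = [full.iloc[:2], full.iloc[2:]]

def add_total(df):
    """Add a column that is the row-wise sum of two other columns."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out

# Process each batch in isolation, then stitch the results together.
per_batch = pd.concat([add_total(b) for b in batches])

# Process the whole dataset at once.
whole = add_total(full)

# Each row depends only on itself, so both routes give the same dataframe.
same_result = per_batch.equals(whole)
```

Since same_result is True, no information from other batches is needed, which is the property that makes the UDF designation reasonable for this kind of operation.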