AutoML

May 2023 Update

The built-in AutoML feature was removed in version 5.0.0. All existing AutoML cells will remain as read-only cells in your existing canvases. You can no longer re-run those cells or create new AutoML cells; however, you can still download the pickle file and Python script.

AutoML Alternative

In lieu of our AutoML cell, we recommend auto-sklearn, an automated machine learning library built on top of the popular Python library scikit-learn. auto-sklearn aims to automatically select the best machine learning model and hyperparameters for a given dataset and problem.

Get started with the following code snippets:

!pip install auto-sklearn

import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the example breast_cancer dataset and split it into train/test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Initialize the AutoSklearnClassifier
automl = autosklearn.classification.AutoSklearnClassifier()

# Train the classifier
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Print a pandas table of results for all evaluated models
print(automl.leaderboard())
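
Once training finishes, you can, for example, score the best pipelines found on the held-out split (a minimal sketch, assuming the X_test/y_test split from the snippet above):

from sklearn.metrics import accuracy_score

# Predict on the held-out data and report accuracy
y_pred = automl.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))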

Read more on the auto-sklearn website.

Explainer Alternative

explainerdashboard is an open-source Python library that allows you to quickly build interactive dashboards to explain machine learning models. Their suite of pre-built dashboard templates can be used to:

  • Explain different types of models, including linear regression, random forest, and XGBoost.
  • Visualize the impact of different features on the model's predictions via partial dependence plots, permutation importance, and SHAP values.
  • Deploy your dashboards to the web, making them accessible to others without requiring any knowledge of Python or machine learning.

First install and import necessary packages:

!pip install explainerdashboard

from explainerdashboard import ClassifierExplainer, ExplainerDashboard, InlineExplainer

Then write a dictionary of feature descriptions, and build your model (a minimal sketch of the model follows the dictionary):

feature_descriptions = {
    "Sex": "Gender of passenger",
    "Gender": "Gender of passenger",
    "Deck": "The deck the passenger had their cabin on",
    "PassengerClass": "The class of the ticket: 1st, 2nd or 3rd class",
    "Fare": "The amount of money people paid",
    "Embarked": "The port where the passenger boarded the Titanic: either Southampton, Cherbourg or Queenstown",
    "Age": "Age of the passenger",
}
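
The model itself is out of scope here, but as a minimal sketch it could be a scikit-learn classifier trained on a one-hot encoded Titanic DataFrame. The df variable and its columns are assumptions for illustration; model, X_test, y_test, and test_names are the names the explainer code below expects:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical Titanic-style DataFrame df with a 'Survived' target column
X = pd.get_dummies(df.drop(columns=["Survived"]))  # one-hot encode categoricals
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
test_names = X_test.index  # passenger identifiers, used below as idxs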

Then create the ClassifierExplainer:

# Create the Explainer
explainer = ClassifierExplainer(
    model, X_test, y_test,
    cats=['Deck', 'Embarked',
          {'Gender': ['Sex_male', 'Sex_female', 'Sex_nan']}],
    cats_notencoded={'Embarked': 'Stowaway'},  # defaults to 'NOT_ENCODED'
    descriptions=feature_descriptions,  # adds a table and hover labels to the dashboard
    labels=['Not survived', 'Survived'],  # defaults to ['0', '1', etc.]
    idxs=test_names,  # defaults to X.index
    index_name="Passenger",  # defaults to X.index.name
    target="Survival",  # defaults to y.name
)

And the dashboard:

# Create the ExplainerDashboard from the Explainer
db = ExplainerDashboard(
    explainer,
    title="Titanic Explainer",  # defaults to "Model Explainer"
    shap_interaction=False,  # you can switch off tabs with bools
    mode='inline',
)

Finally, display the dashboard with the following commands:

# Get the HTML code of the dashboard
html_code = db.to_html()

# Render the HTML code within the cell
from IPython.display import HTML
HTML(html_code)

Get more information via the explainerdashboard docs.

AutoML makes it easy to build predictive models and saves data scientists hours of time by removing the need to manually pre-process data or search for the best model parameters.

In short, if you provide a target and some factors that predict the target, Einblick will quickly find the best ML model pipeline (e.g. LabelEncoder + StandardScaler + XGBoost), show you explainability information (which factors matter most), and apply the model to a new dataset for predictions.

Inputs

Users must provide the training data, the column to predict, and all the features used to predict the outcome.

  • Target: the attribute we want to predict (e.g. sales)
  • Features: the attributes the model uses to predict the target (e.g. location, month)
  • Training Set: the data which establishes the relationship between the target and features
  • Test Set (optional): data which can be used to evaluate how well the trained model performs on unseen data.
    • The target is either absent or ignored in the test set so that the model can make predictions without "looking at the answer key."

Tasks

Einblick has two main types of modeling tasks:

  • Regression: Use input feature variables to predict a numeric value. Questions like "how much/many" are generally regression tasks, where the ML model attempts to predict a quantity.
  • Classification: Use input feature variables to predict a label. This is used when the predicted outcome falls into one of several distinct categorical classes; the ML model uses patterns in the data to estimate the likelihood of each class and returns the label that best fits the outcome.

Metrics

A scoring metric is used to evaluate a model's performance on data. It is essentially a formula that determines how much to penalize a model when it is incorrect.

For example, let's say we have a dataset with 90 False values and 10 True values. A model that guesses False for all values will have 90% accuracy (one type of metric), since it guesses correctly for 90% of the data. However, the F1-score (a different metric) will be 0, since none of the True values were predicted correctly.
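
You can reproduce this contrast with scikit-learn; here is a minimal sketch of the 90/10 example above:

from sklearn.metrics import accuracy_score, f1_score

# 90 False values (0) and 10 True values (1); the model always guesses False
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.9 -- 90% of guesses are correct
print(f1_score(y_true, y_pred))        # 0.0 -- no True value was predicted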

Different metrics allow you to change what is prioritized when models are evaluated. Some of the most common metrics include:

Regression

  • RMSE [root mean squared error]
    • a standard regression metric that penalizes outliers strongly
  • MAE [mean absolute error]
    • useful when outliers should not be penalized strongly (see the sketch after this list)
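
As a quick sketch of the difference, compare both metrics on predictions with a single large miss (the numbers are made up for illustration):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [10, 12, 11, 13, 10]
y_pred = [10, 12, 11, 13, 40]  # one prediction misses by 30

print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE ~= 13.4: the outlier dominates
print(mean_absolute_error(y_true, y_pred))          # MAE  == 6.0: the outlier is averaged away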

Classification

  • Accuracy: Useful if all labels should be treated equally, as it is a simple statement of what % of predictions were correct.
  • Precision: Among the population that we predict to be in class "X," precision asks what % are actually "X." For instance, if we are trying to identify targets for a high-cost treatment, we want to come close to guaranteeing that each identified positive is a true positive before investing.
  • Recall: Among the population that is actually class "X," recall asks what % have been identified through the model's predictions. This is used when we need to capture as large a % of a given class as possible, usually because treatment is cheap (email marketing) or because any missed observation carries a huge cost (deadly disease). See the sketch after this list.
  • F1, F1-macro, F1-micro: A blend of precision and recall, these are useful when some classes are more common than others (e.g. fraud, rare disease detection)
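
And a sketch contrasting precision and recall on one set of predictions (the labels are made up for illustration):

from sklearn.metrics import precision_score, recall_score

# 1 = class "X", 0 = everything else
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # 2/3: of the 3 predicted "X", 2 really are "X"
print(recall_score(y_true, y_pred))     # 1/2: of the 4 actual "X", 2 were found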

If you need help choosing a metric when using Einblick's prediction operator, you can consult the prediction wizard, which will guide you to a suitable metric based on your situation.

Trained Pipelines

AutoML surfaces better models as they are found, and these models can be dragged from the circles onto the canvas to see more details about the pipelines.

Each pipeline has:

  • Metrics: for both the specified metric and all other metrics available in Einblick, you will see the model's performance on either the holdout set or the cross-validated set
  • Validation: a description of the sampling approach used (% holdout or cross-validated + % sample of training used in training)
  • Steps: a description of the sklearn components that comprise the model
  • Versions: You can retrain the model pipeline on new data by bringing a new dataframe to the top-left and hitting Run

Explainer

To obtain more granular information about a pipeline, drag out the explainer tab at the right of the pipeline operator onto the canvas.

The features tab

The features tab is the default tab of the pipeline explainer. The accompanying chart shows SHAP (i.e. feature importance) values calculated for each attribute used in the pipeline. Features with higher SHAP values have a greater impact on the decisions of the pipeline than those with lower values.

The samples tab

The samples tab allows you to drill into a row-by-row prediction view, so you can see how each row's feature values affected its prediction.

Executor

To execute a pipeline, drag the executor tab at the right of the pipeline cell onto the canvas; a new executor cell will appear. This executor wraps a pickle of the trained model pipeline.

You can then specify an input dataframe in the executor, which will output a dataframe containing the model's predictions, appended as a new column with the prefix predicted_{original target}.
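
Conceptually, the executor behaves like the following sketch. The pickle path, input file, and "sales" target are assumptions for illustration; the actual executor handles all of this for you on the canvas:

import pickle
import pandas as pd

# Load the trained model pipeline (hypothetical path to the downloaded pickle)
with open("pipeline.pkl", "rb") as f:
    pipeline = pickle.load(f)

# Score a new dataframe; predictions are appended with the predicted_ prefix
new_df = pd.read_csv("new_data.csv")  # hypothetical input dataframe
new_df["predicted_sales"] = pipeline.predict(new_df)  # target assumed to be "sales"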

How Does it Work? A Deep Dive

Alpine Meadow (AM), Einblick’s AutoML engine, was designed end-to-end to mimic an expert data scientist’s behavior, enabling even the least technical users to build state-of-the-art ML pipelines.

Overview of the Alpine Meadow optimization process to select the best predictive pipeline: (1) search space definition, (2) pipeline selection, (3) pipeline parameter selection, (4) incremental evaluation, (5) iterative refinement.

AM’s optimization process starts with the Search Space Definition:

Step 1: Search Space Definition. AM first defines a search space of pipelines, where a pipeline is a Directed Acyclic Graph (DAG) of fundamental ML operators. A node in this DAG can be a preprocessor (e.g., scalers, imputers, encoders) or a predictive model (e.g., SVM, Random Forest). The search space is created according to the user’s problem specification and the characteristics of the data. For example, if a regression problem is given, only regressors are added to the search space; if a categorical variable is in the data, variable-encoding operators are added. Note that the parameters of those operators are not selected at this point.

This step is best compared to a data scientist listing all the operators compatible with the predictive task they want to solve and asking themselves: “Which feature encodings, scalings, and selections should I try based on the data? Which prediction models should I use?”
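
As an illustrative sketch (not Alpine Meadow’s actual code), you can picture the search space as a set of candidate sklearn pipeline skeletons whose operators, but not parameters, are fixed; define_search_space here is a hypothetical helper:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

def define_search_space(problem_type, has_categoricals):
    # Preprocessing operators are chosen from the data's characteristics
    steps = [("impute", SimpleImputer())]
    if has_categoricals:
        steps.append(("encode", OneHotEncoder(handle_unknown="ignore")))
    steps.append(("scale", StandardScaler(with_mean=False)))
    # Predictive operators are chosen from the problem specification
    if problem_type == "regression":
        models = [Ridge(), RandomForestRegressor()]
    else:
        models = [SVC(), RandomForestClassifier()]
    # Every operator's parameters are still at defaults -- not selected yet
    return [Pipeline(steps + [("model", m)]) for m in models]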

After the search space is defined, AM proceeds to select the most promising pipelines:

Step 2: Pipeline Selection. The search space of all pipelines is massive, owing to the combinatorial explosion from combining all possible operators. AM selects the most promising pipelines using a cost/quality trade-off learned from past experiments. In the beginning, the cost/quality model favors fast pipelines to ensure that a first model is quickly returned to the end-user, providing better interactivity. AM also leverages the similarities of the current dataset/problem to previous runs and picks areas of the search space that contain pipelines that worked well on similar problems.

This step is best compared to a data scientist asking: “What should I try first?” After taking a quick look at the data, a data scientist will try a few good general options; for example, normalizing all features and using a boosted decision tree as a start.

Step 3: Pipelines’ Parameter Selection (aka Pipeline Arm Selection). After selecting a promising pipeline, which defines the structure of the end-to-end solution, AM fixes the parameters of each of its operators, chosen via Bayesian Optimization. Each pipeline is associated with a quality-cost model that trades off accuracy against training time; the cost model is used to find promising configurations. Now the pipeline is fully defined, its parameters fixed, and it is ready to be evaluated.

This step is best compared to asking the data scientist, “How many trees should I use in this Random Forest?”
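
Alpine Meadow answers this with Bayesian Optimization; as a rough stand-in, the same question looks like this with scikit-learn’s RandomizedSearchCV (random search swapped in for the Bayesian optimizer, with X_train/y_train assumed to be your own training data):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# "How many trees should I use in this Random Forest?"
param_distributions = {
    "n_estimators": randint(10, 500),
    "max_depth": randint(2, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(), param_distributions, n_iter=20, cv=3
)
search.fit(X_train, y_train)  # X_train/y_train: your own training data
print(search.best_params_)    # the pipeline is now fully defined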

An example of pipeline with fixed parameters in Alpine Meadow

Step 4: Incremental Evaluation. It is beneficial, especially for large datasets, to run a pipeline on a smaller sample first and, if the results look promising, to try it on a larger portion of the dataset. Therefore, AM first evaluates each pipeline on a sub-sample of the whole data and, if the results are promising, re-runs the pipeline over larger and larger subsets. This mechanism guarantees that the system focuses resources on promising pipelines early on and gets good results quickly, which can then be streamed back to the user with a short response time.

This is similar to a data scientist building a model over a sample of the data before using all available data (crucial when the data is big).
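
A minimal sketch of the idea follows; the fixed promotion threshold here is a made-up constant (AM instead compares pipelines against each other), and X/y are assumed to be numpy arrays:

import numpy as np
from sklearn.model_selection import cross_val_score

def incremental_evaluation(pipeline, X, y, fractions=(0.01, 0.1, 0.5, 1.0), threshold=0.7):
    """Evaluate on growing samples; abandon the pipeline early if it looks weak."""
    rng = np.random.default_rng(0)
    score = 0.0
    for frac in fractions:
        n = min(len(X), max(10, int(len(X) * frac)))
        idx = rng.choice(len(X), size=n, replace=False)
        score = cross_val_score(pipeline, X[idx], y[idx], cv=3).mean()
        if score < threshold:  # not promising: stop and free resources
            break
    return score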

Step 5: Iterative Refinement. AM constantly improves by learning from previous runs. By evaluating different pipelines, it gains experience on the current dataset that it can use in later runs, updating the cost and quality models it uses to select new pipelines and the Bayesian models it uses to select their parameters.

This step can be best compared to the iterative refinements that a data scientist performs after observing the results from a tested model.