AutoML is not enough to productionalize ML – You still need humans in the loop

Paul Yang - May 20th, 2021

Let’s start by affirming the power of AutoML tools. Any user, regardless of technical capability, can now set up model creation in a few minutes when it previously took expert data scientists hundreds of lines of Python. AutoML accelerates the process of stepping through feature engineering, trying many different algorithms, tuning parameters, and ultimately identifying an accurate model. AutoML has become a crucial pillar to democratizing data science, as it abstracts away the coding and algorithmic function calls from the real prize of a productionalizable model. At Einblick, we’ve observed firsthand how our AutoML tools have empowered non-technical analysts and operations managers to start replacing “gut feel” with accurate models.

But for us, AutoML represents a tooling enhancement to achieve the goal of accelerating model building and democratizing data science. It is not, however, a magic wand that can be waved to instantly create citizen data scientists. A more realistic analogue might be that AutoML tools are electric can openers. They’re hands-free to use and accomplish the job faster and more cleanly than manual cranking. But a can opener, no matter how fancy, does not replace the chef.

So an important reminder to organizational leadership is not to overinvest in AutoML solutions themselves. Rather, the investment should be in both people and process. Continuing with the cooking analogy, a full end-to-end analysis depends more on preparing the right ingredients and producing consumable results than on the tool used in between.

1. Domain Knowledge Improves Inputs

Raw datasets are frequently unclean and are rarely structured in a way that best exposes predictive variables to a modeling approach. A classic example is the relationship between weight and heart attacks. While [weight] correlates positively with [heart attacks], a better predictor might be [weight] / [height], since someone who is both very tall and heavy may still be healthy. At the same time, we know that human heights and weights fall within reasonable bounds, so records with implausible heights or weights should be removed.
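As a rough illustration of the two ideas above, here is a minimal Python sketch that drops implausible records and then adds the ratio feature. The field names, sample records, and plausibility bounds are all assumptions made up for this example; a domain expert would choose the real bounds.

```python
# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"weight_kg": 70.0, "height_cm": 175.0},
    {"weight_kg": 110.0, "height_cm": 200.0},   # very tall and heavy
    {"weight_kg": 95.0, "height_cm": 160.0},    # short and heavy
    {"weight_kg": 700.0, "height_cm": 170.0},   # likely a data-entry error
    {"weight_kg": 80.0, "height_cm": 17.5},     # likely a misplaced decimal
]

# Assumed plausibility bounds for adults (not taken from any standard).
WEIGHT_BOUNDS_KG = (30.0, 300.0)
HEIGHT_BOUNDS_CM = (120.0, 230.0)


def is_plausible(row):
    """Return True when both measurements fall inside the assumed bounds."""
    low_w, high_w = WEIGHT_BOUNDS_KG
    low_h, high_h = HEIGHT_BOUNDS_CM
    return low_w <= row["weight_kg"] <= high_w and low_h <= row["height_cm"] <= high_h


def add_ratio(row):
    """Return a copy of the row with the engineered weight/height feature."""
    return {**row, "weight_per_height": row["weight_kg"] / row["height_cm"]}


# Filter first, then engineer the feature on the surviving rows.
cleaned = [add_ratio(row) for row in patients if is_plausible(row)]

for row in cleaned:
    print(row)
```

In this toy data the tall, heavy person ends up with a lower ratio than the short, heavy one, which is the distinction raw weight alone cannot make.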

Basic feature engineering and data cleansing tools do come baked into most AI / ML tools. Leading AutoML platforms, including Einblick, include a similar set of candidate transformations: one-hot encoding (categorical variables to 1/0), imputation, scaling, ratios, and NLP text feature extraction. However, these are “see-what-sticks” approaches. Brute force helps to highlight the transformations that add predictive value, especially when the transformations are simple and creativity is not needed.
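For readers who have not seen these transformations written out, the following is a minimal standard-library Python sketch of three of them: one-hot encoding, mean imputation, and scaling. It is not how Einblick or any other platform implements them; the column names and sample rows are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical customer rows; None marks a missing value.
rows = [
    {"segment": "retail", "income": 40000.0},
    {"segment": "business", "income": None},
    {"segment": "retail", "income": 60000.0},
    {"segment": "private", "income": 80000.0},
]


def one_hot(values):
    """Turn a categorical column into one 1/0 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation.

    Assumes the column is not constant; a constant column has zero spread.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


encoded = one_hot([r["segment"] for r in rows])
imputed = impute_mean([r["income"] for r in rows])
scaled = standardize(imputed)

print(encoded[0])
print(imputed)
print([round(v, 3) for v in scaled])
```

Each function is generic and needs no knowledge of what the column means, which is why such steps are easy to automate and try in bulk.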

Human based domain knowledge has a few comparative advantages, which augment automatic feature engineering, including the following:

  • Detection of real-world motivated changes to pattern: A human might recognize that a shift in a dataset corresponds to a nameable event. Examples include the organization launching a new initiative, a strategic shift, a natural disaster, or a financial crisis. The model can only infer from the low-level statistical patterns in the data, whereas human intuition relies on a vast repository of additional knowledge to interpret data.
  • Outlier identification based on expectation: An AutoML algorithm might identify values that lie more than 3 standard deviations from the mean and eliminate them. However, similar to the above, understanding whether values are legitimate is a human task. Take a retail bank: a credit score of 900 seems feasible, but it is not within the possible range of 300-850 for standard scoring. By contrast, a million-dollar checking account is rare and far above the average, but we know immediately that it is possible. Domain knowledge is what allows an analyst to classify whether outlying values are legitimate.
  • Intelligent transformation of data fields: Modeling tools can tell you that [weight]/[height] is better than either [weight] or [height] alone at predicting heart attacks, but subject matter expertise can tell you that squaring the denominator yields Body Mass Index, which is a commonly used variable. Brute forcing all possible transformations within and across each candidate driver produces too many candidate variables, and the variables it produces are not interpretable.
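The retail bank and Body Mass Index examples above can be sketched in a few lines of standard-library Python. The sample data are invented for illustration, and the helper names (zscore_outliers, out_of_range, bmi) are hypothetical; the point is only to contrast a purely statistical rule with a rule that encodes domain knowledge.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Statistical rule: flag values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


def out_of_range(values, low, high):
    """Domain rule: flag values outside a range known to be valid."""
    return [v for v in values if not (low <= v <= high)]


def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / height_m ** 2


# Invented sample data for a hypothetical retail bank.
credit_scores = [620, 680, 700, 710, 720, 740, 750, 780, 810, 900]
balances = [1_000 + 250 * i for i in range(19)] + [1_000_000]

# The statistical rule misses the impossible score and flags the legitimate balance.
print("z-score flags, scores:  ", zscore_outliers(credit_scores))
print("z-score flags, balances:", zscore_outliers(balances))

# The domain rule does the opposite: scores must lie in 300-850, balances need only be >= 0.
print("domain flags, scores:   ", out_of_range(credit_scores, 300, 850))
print("domain flags, balances: ", out_of_range(balances, 0, float("inf")))

print("BMI for 70 kg at 1.75 m:", round(bmi(70.0, 1.75), 1))
```

In this toy data the 3-standard-deviation rule keeps the impossible score of 900 and would discard the million-dollar account, while the range check gets both cases right because someone supplied the valid range.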

2. Iteration and Interaction Improve Outputs

Models are only helpful when they are implemented. A simple recommendation can be better than an AI-generated model, if action is taken as a result of that simple analysis. Real value is tied to action or changed decisions, and so it is tied to the ability to generate buy-in. This buy-in is won through being able to clearly communicate what a model is doing, address questions, and resolve any points of disagreement about input drivers and output implications.

Recognizing this need, AutoML tooling development today focuses on AI explainability tools. These range from charts that show SHAP values and the importance of variables to advanced features that let users perturb data to observe how outcomes might change. AI / ML tools have added this functionality to shake off the impression that they are inscrutable black boxes.

However, many of these outputs are more suitable for someone with a technical background to dig into. AutoML tools boast Partial Dependence and Individual Conditional Expectation plots, but these are only helpful for an expert data scientist who would otherwise use Python to create the same outputs. What these views do not do is relate the predictions back to the generating dataset. Relating predicted outcomes to the known inputs is the model equivalent of “seeing is believing.”
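For context on what the two plot types named above actually compute, here is a minimal sketch in plain Python. The predict function is a hypothetical stand-in for a fitted model, and the rows and grid are invented. A conditional expectation (ICE) curve varies one feature for a single row while holding the rest fixed, and the partial dependence curve is the pointwise average of those curves.

```python
from statistics import mean


def predict(row):
    """Hypothetical stand-in for a fitted model (an assumption for this sketch).

    Risk rises with BMI, and rises twice as fast for smokers.
    """
    slope = 0.02 if row["smoker"] else 0.01
    return min(1.0, max(0.0, slope * (row["bmi"] - 18.0)))


# Invented rows standing in for the dataset the model was trained on.
rows = [
    {"bmi": 22.0, "smoker": False},
    {"bmi": 27.0, "smoker": True},
    {"bmi": 31.0, "smoker": False},
]


def ice_curves(model, data, feature, grid):
    """One curve per row: predictions as `feature` sweeps the grid, other fields fixed."""
    return [[model({**row, feature: value}) for value in grid] for row in data]


def partial_dependence(curves):
    """Average the per-row curves at each grid point."""
    return [mean(points) for points in zip(*curves)]


grid = [20.0, 25.0, 30.0, 35.0]
curves = ice_curves(predict, rows, "bmi", grid)
pdp = partial_dependence(curves)

for row, curve in zip(rows, curves):
    print(row, [round(p, 2) for p in curve])
print("partial dependence:", [round(p, 3) for p in pdp])
```

Note that every point on these curves is a prediction for a synthetic row, not for a record that exists in the data, which is one way to see why such views do not tie predictions back to the generating dataset.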

Ultimately, the analyst should be able to identify what changes might be needed, and rapidly see what happens after a change. An ideal workflow must include the ability to facilitate iteration over the analyst’s own ideation and stakeholder feedback:

  • Fast iteration to help evaluate the impact of feedback: If a stakeholder feels less confident in the inclusion or exclusion of a particular driver, or remembers something new to add, iteration should be fast. The first few runs of models can be viewed as prototypes or beta versions to quickly collect feedback. Agility matters here, so any AutoML tool that takes a day to return a result is antithetical to the idea of rapid iteration in the face of new hypotheses.
  • Dynamic exploration of prediction results: While helpful, the pre-packaged outputs of even the best model explainability tools require either faith in the process without understanding, or potentially too much prior data science knowledge. Instead, tooling should allow users to easily take predicted results and stress-test them. Do the values make sense? Do the segmentations exist as I expect? Are there any inexplicable patterns? If a variable is said to be important and predictive, then the easiest way to convince the audience is to show, on the fly, a comparative histogram of the targeted response variable broken out by the driver.
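The comparative histogram described in the last bullet can be sketched without any plotting library. The records, the driver name on_promo, the response name predicted_spend, and the bin edges below are all hypothetical; an interactive tool would draw this as a chart rather than print it.

```python
# Hypothetical scored records: a binary driver and a predicted response.
records = [
    {"on_promo": True, "predicted_spend": 82.0},
    {"on_promo": True, "predicted_spend": 95.0},
    {"on_promo": True, "predicted_spend": 71.0},
    {"on_promo": True, "predicted_spend": 64.0},
    {"on_promo": True, "predicted_spend": 88.0},
    {"on_promo": False, "predicted_spend": 35.0},
    {"on_promo": False, "predicted_spend": 48.0},
    {"on_promo": False, "predicted_spend": 52.0},
    {"on_promo": False, "predicted_spend": 29.0},
    {"on_promo": False, "predicted_spend": 61.0},
]


def histogram(values, bin_edges):
    """Count values per bin; bins are [low, high), except the last, which includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def compare_by_driver(data, driver, target, bin_edges):
    """Build one histogram of `target` for each level of `driver`."""
    groups = {}
    for row in data:
        groups.setdefault(row[driver], []).append(row[target])
    return {level: histogram(vals, bin_edges) for level, vals in groups.items()}


edges = [0, 25, 50, 75, 100]
comparison = compare_by_driver(records, "on_promo", "predicted_spend", edges)

# Print the two histograms side by side as text bars.
for level, counts in comparison.items():
    print(f"on_promo={level}")
    for low, high, n in zip(edges, edges[1:], counts):
        print(f"  {low:>3}-{high:<3} {'#' * n}")
```

If the driver really matters, the two groups should visibly separate, as they do in this invented data; if the histograms overlap almost entirely, that is a prompt to question the claimed importance.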

Seeing Is Believing

Practically, what we have described means that citizen data scientists need descriptive analytic tools that are as strong as their predictive ones. Visual analysis and flexible interactivity both allow analysts to interrogate data and unlock the human intuition that feeds into the predictive modeling process. The ability to transform and visualize data prior to modeling unlocks additional candidate features for the model, especially ones informed by expert priors. Meanwhile, the ability to visualize predictions in a flexible way allows analysts to confirm that the model produces sensible, expected outcomes.

And equally critically, interactivity with outputs is the easiest way to make both analysts and stakeholders more familiar with outputs, and thus more confident. For instance, while our clients find it meaningful that Einblick’s AutoML has success beating expert solutions, citing past research is rarely sufficient to win buy-in from organizational leadership. And it shouldn’t – proof should not be an appeal to past success or brand. Instead, a clear visual analysis can help viewers more quickly understand what diagnostic statistics mean, and to match up their expectations about outcomes to the predictions of those same outcomes.

Finally, no model is generated with a single shot. This means that a complete citizen data scientist workflow requires users to be able to seamlessly move between exploratory analysis and modeling. Validation of modeling results should lead to more hypotheses about candidate drivers, which should cycle back into new modeling runs. We have abstracted away the need for coding in order to make this process easier, and more accessible as a whole – not just at the model creation stage.

Do not only focus on the automatic model and ignore the need for human interactivity with data after the model is created. Many AutoML workflows are implicitly asserting “trust us.” If the statistics about the model look good, and it was generated by a smart tool, then surely it makes sense to implement! Data science democratization does not imply that users should give up on creating well-understood explainable models.