Let’s start by defining machine learning in brief. Machine learning algorithms try to predict an outcome by learning a mapping from a set of known input data to a target. We know the value of the target during the model training phase, but it is unknown in practice. For instance:
- Will a customer buy? Inputs: customer demographics and past behavior
- Is this picture a bird? Inputs: pixels in the image
- How long before my system fails? Inputs: sensor readings
- Is this patient at risk of a disease? Inputs: biomarker panel
- Is this a pedestrian? Inputs: LIDAR and cameras
To generate this mapping, an algorithm needs to be chosen to maximize the accuracy measure we have picked for the problem, and there is quite a menu here: from the simple yet trusty linear regression taught in Stats 101 to any number of algorithms developed by research teams at your favorite university or technology company. In a manual analysis, you might pick one or a few to try fitting the data. AutoML functionality, like what we offer at Einblick, generally starts by searching across several dozen candidate models.
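As a rough illustration of searching across candidate models, the sketch below fits three very simple models to made-up data and keeps the one with the lowest error on held-out points. The data, the three candidates, and the function names are all assumptions chosen for illustration; a real AutoML search covers far more algorithms and is not implemented this way.

```python
from statistics import mean

# Hypothetical toy data: one input feature and a numeric target.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1]
valid_x = [1.5, 3.5, 5.5]
valid_y = [3.0, 7.0, 11.0]

def fit_mean(xs, ys):
    """Baseline: always predict the average training target."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Ordinary least squares with a single input feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_nearest_neighbor(xs, ys):
    """Predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mean_squared_error(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

candidates = {
    "mean baseline": fit_mean,
    "linear regression": fit_line,
    "nearest neighbor": fit_nearest_neighbor,
}

# Fit every candidate on the training data, score it on held-out data.
scores = {
    name: mean_squared_error(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(best_name, scores[best_name])
```

Scoring on held-out points rather than on the training data matters here: the nearest-neighbor candidate reproduces its training data perfectly, so training error alone would make it look unbeatable.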
We also should decide what “prediction accuracy” means. Sometimes, we want to make sure that our model captures as many true cases of the target outcome as possible, even at the cost of false alarms (self-driving car collision avoidance prefers false positives to missed obstacles). Other times, we want to prioritize avoiding false positives (detecting nuclear missile launches really should not have false positives). Picking this is a contextual choice, and it should be clear ahead of time both which outcome we are predicting and how we will know we succeeded.
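The two priorities above correspond to the standard metrics recall (the share of true cases that were caught) and precision (the share of alarms that were real). A minimal sketch with made-up labels, where `cautious` and `strict` are hypothetical models invented for this example:

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up ground truth: three real positives out of eight cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]

# A "cautious" model flags liberally: it catches every positive, with false alarms.
cautious = [1, 1, 1, 1, 1, 0, 0, 0]

# A "strict" model flags only when very sure: no false alarms, but it misses positives.
strict = [1, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall(y_true, cautious))  # precision 0.6, recall 1.0
print(precision_recall(y_true, strict))    # precision 1.0, recall about 0.33
```

Neither model is better in the abstract. The collision-avoidance case would favor the cautious one, and the missile-detection case would favor the strict one.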
Then, there are additional knobs and buttons that control how the model learns, and they need to be configured for each type of algorithm. These are called hyperparameters (to distinguish them from parameters, which the algorithm itself chooses during training). The “low-tech” approach is a grid search that tries out every combination of hyperparameter values, which is painful. So a number of smarter enhancements exist to try to shorten that process.
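To make the distinction concrete, here is a minimal grid search sketch on a toy problem. The slope `w` is a parameter (learned from data), while `learning_rate` and `n_steps` are hyperparameters (set by us before training). The data and grid values are made up for illustration.

```python
from itertools import product

def fit_slope(xs, ys, learning_rate, n_steps):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data, split into training and validation sets.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
valid_x, valid_y = [5.0, 6.0], [10.1, 11.9]

# The grid: every combination of these values gets tried.
grid = {"learning_rate": [0.001, 0.01, 0.05], "n_steps": [10, 100]}

results = []
for lr, steps in product(grid["learning_rate"], grid["n_steps"]):
    w = fit_slope(train_x, train_y, lr, steps)
    results.append(((lr, steps), mse(w, valid_x, valid_y)))

best_config, best_error = min(results, key=lambda r: r[1])
print(len(results), "configurations tried; best:", best_config)
```

Even this tiny grid needs 3 × 2 = 6 training runs, and each extra hyperparameter multiplies that count, which is why exhaustive search quickly becomes painful.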
But wait, there’s more! There’s a whole host of “optional” transformations that can be applied to data to make it better suited to modeling. This might mean kicking out outliers, capping extreme values, normalizing data, or even advanced techniques that extract features from text or reduce dimensionality by creating abstract representations. While some data manipulations and transformations are contextual and must be handled manually, many of these more tactical transformations can be applied automatically and can greatly enhance model accuracy.
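Two of the simpler transformations mentioned above, capping extreme values and normalizing, can be sketched in a few lines. The sensor readings and the cap limits below are made up for illustration; in practice the limits would come from domain knowledge or from percentiles of the data.

```python
from statistics import mean, pstdev

def cap(values, lower, upper):
    """Clamp each value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - mu) / sigma for v in values]

# Hypothetical sensor readings with one extreme value.
readings = [10.0, 12.0, 11.0, 13.0, 250.0]

capped = cap(readings, lower=0.0, upper=20.0)
scaled = standardize(capped)

print(capped)  # [10.0, 12.0, 11.0, 13.0, 20.0]
print(scaled)
```

Capping before standardizing keeps the single extreme reading from dominating the mean and standard deviation that the other values are scaled by.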
When you combine all four of these choices, the number of possibilities multiplies easily into the millions of combinations, of which only a small set makes sense. To search efficiently, we must deploy a smart tool to search for us. So the function of AutoML is to select the feature engineering steps, algorithm, and hyperparameters that return the most accurate result for the problem we are trying to solve.
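A back-of-the-envelope count shows how quickly the combinations multiply. The numbers below are illustrative assumptions, not measurements of any particular tool: six optional on/off transformations, three dozen candidate algorithms, and four hyperparameters with five candidate values each.

```python
from math import prod

# Illustrative, assumed sizes for each part of the search space.
search_space = {
    "preprocessing pipelines": 2 ** 6,        # 6 optional transformations, each on or off
    "candidate algorithms": 36,               # "several dozen" models
    "hyperparameter settings": 5 ** 4,        # 4 knobs with 5 values each
}

# Every choice can be paired with every other, so the sizes multiply.
total = prod(search_space.values())
print(f"{total:,} combinations")  # 1,440,000 combinations
```

Under these assumptions, even at one second per fit, trying every combination would take more than two weeks of compute, which is why an exhaustive search is not practical.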
Why Should I Use AutoML?
Even for the most seasoned R or Python user, AutoML represents a significant productivity enhancement by automating iteration. Data scientists, chasing after a possibly better iteration, can overinvest in searching for a better algorithm due to “pipeline FOMO (fear of missing out).” For less technical users, AutoML represents the ability to go through a few simple wizards and end up with a meaningful model. While we, and many others, talk about why AutoML alone does not replace human intervention, it is important not to dismiss the value of partial automation. To wit, it makes no sense to dismiss automatic transmissions because the car isn’t self-driving.
Given a fixed amount of bandwidth and resources, it is far more important to invest in formulating a problem correctly, checking input quality, and testing outputs for accuracy out-of-sample or in the wild than in iterating over algorithms by hand.
To enable this, good AutoML tools must also provide pipelining, explainability, and collaboration. Pipelining features allow raw data to be put into a usable format. Explainability creates human-interpretable versions of the ML outputs, allowing qualitative learnings to be extracted and results to be sanity checked. And finally, decision making is a group activity, so sharing and collaboration both ensure that insights are actually operationalized and serve as a final check against unreasonable conclusions.