So What’s the Problem?
If I told a million people “you won’t have a heart attack next year,” I would be 99.99% accurate. Of course, this is neither helpful nor diagnostic. A good doctor would instead recommend diet, exercise, or blood pressure pills to a group that is likely too large but includes the few folks who will have a heart attack. The risk of dying is much higher than the downside of skipping a few steak dinners.
In data science, our task is to ensure that we do not train machine learning models which are 99%-accurate-but-completely-useless. Fortunately, solving this problem starts with applying common sense.
To start at the very beginning: the class imbalance problem occurs in classification problems. Classification is when we try to predict an outcome from a finite set of possible outcomes, like “does my customer churn” or “which team should this support ticket route to” (as opposed to regression, which predicts a continuous quantity).
The class imbalance problem arises when one (or more) of the predicted outcomes happens much less frequently than the others in our data. These rare outcomes are called minority classes. Since machine learning algorithms are asked to minimize errors, they may simply return predictions that completely ignore the rare occurrences, since guessing wrong on them so infrequently is not very punishing.
Unfortunately, the rare event is frequently the most important one. From the heart attack example above (or disease detection generally), to identifying when systems fail for predictive maintenance, to predicting which visitors will convert out of all the web traffic that will not, data scientists in every industry will face this problem.
In order to solve this problem, we need to 1) refine our idea of what “accuracy” means, 2) grapple with probabilities rather than simple yes-or-no answers, and 3) go back to the dataset and make some transformations.
Do You Really Mean Accuracy?
Accuracy has a very specific meaning in classification problems: it is the ratio of correct predictions to total observations. If we total up how many predictions our model gets right and divide by the row count, we get accuracy.
But we can decompose it into four categories of “correct” or “wrong”:
- True Positive / True Negative: our predictions matched the reality
- False Positive: We guessed it would be in a category, but it was not
- False Negative: We thought it would not be in a category, but it was
From these four categories, we can derive other metrics that represent “correctness” besides accuracy:
- Precision: What fraction of our affirmative guesses actually end up being what we guessed?
- [True Positive] / ([True Positive] + [False Positive])
- Recall: What fraction of all occurrences of the event do we correctly identify?
- [True Positive] / ([True Positive] + [False Negative])
- F1: A “mix” between the two (there’s actually quite a few flavors here)
- There’s many other “correctness” measures too (AUC, F2, etc…), so this is not an exhaustive list.
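To make these definitions concrete, here is a minimal pure-Python sketch. The dataset is hypothetical (a 1%-positive population echoing the heart attack example), and it shows how a model that always predicts “no” scores 99% accuracy yet zero precision, recall, and F1:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from parallel label lists (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# A model that always predicts "no heart attack" on a 1%-positive dataset:
y_true = [1] * 1 + [0] * 99
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.99 -- looks great...
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0) -- ...but is useless
```

In practice you would use a library implementation (e.g. the equivalent functions in scikit-learn) rather than rolling your own, but the arithmetic is exactly this.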
So when we have a very infrequent event, business logic should dictate what metric to use:
- If I built a nuclear missile warning device, I’d want to make sure that I maximized precision. Don’t guess wrong, you could end the world!
- If I built a self-driving car collision detection system, I’d want to maximize recall. Slow down, but try to never hit a pedestrian!
Applied to class imbalance problems, this means that rather than focusing on overall accuracy, we might select for models that perform well on some other metric. Most modern data science tools that support Automatic Machine Learning let you select which metric you want to prioritize and then try hundreds of candidate models to find the ones that perform best for your goal. However, understanding these concepts helps you pick the right one.
Predictions Exist on a Spectrum from 0% to 100%
While outcomes are discrete classes, the prediction algorithm is actually estimating probabilities, then discretizing at the end to turn each probability into a guess of a specific outcome.
For instance, maybe we have an algorithm that reads biomarkers in blood and tries to predict whether a patient has a disease. Since the resulting action is just “go run a diagnostic test,” maybe we want this algorithm to return “Yes” whenever the predicted chance exceeds 60%.
But if the diagnostic test gets cheaper, we might want to lower that threshold to just 40%.
Going hand in hand with the previous example: the higher the probability threshold we pick, the more precise our results tend to be; the lower the threshold, the higher our recall, as we sweep up most folks whether or not they are likely cases.
Ultimately, tools may set a threshold algorithmically to maximize the target “correctness” metric (e.g. set the threshold at the percentage that maximizes F1 score), or you can use business context to say that a rare but deadly disease should be flagged even if there’s just a 5% chance a patient has it. This is where domain and technical expertise must go hand in hand to effectively solve a use case.
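The thresholding step above can be sketched in a few lines of plain Python. The probabilities and cutoffs here are hypothetical, but they show how tightening or loosening the same model’s threshold trades precision against recall:

```python
def classify(probabilities, threshold):
    """Turn predicted probabilities into yes/no calls at a given cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted disease probabilities for five patients:
probs = [0.05, 0.35, 0.45, 0.62, 0.90]

print(classify(probs, 0.60))  # [0, 0, 0, 1, 1] -- strict: fewer flags, favors precision
print(classify(probs, 0.40))  # [0, 0, 1, 1, 1] -- loose: more flags, favors recall
```

Note that the model itself never changes; only the business decision about where to cut the probability does.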
Other Quick and Easy Techniques
If the classes are very imbalanced, we can simply force them into balance. The simplest, least technical approach is to downsample the majority class(es) and end up with a relatively more balanced overall population. Other techniques exist too – a common one is SMOTE (Synthetic Minority Oversampling Technique), which creates “simulated” data points based on the real ones to provide more minority examples. However, in all cases, artificially inflating the minority population can inflate the proportions and probabilities in the resulting model. Imagine we took Batman from crime-ridden Gotham City and moved him to watch over a wealthy suburb: he’d see far more crime than actually exists. In the same way, we might end up with a predictive model that is too sensitive to the minority class – more recall, less precision.
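As a sketch of the simplest approach, downsampling the majority class needs nothing more than random sampling. The dataset (990 “no churn” rows vs. 10 “churn” rows) and the 4:1 target ratio here are hypothetical; SMOTE-style synthetic oversampling would require a dedicated library such as imbalanced-learn and is not shown:

```python
import random

random.seed(0)  # reproducible sampling for the example

# Hypothetical imbalanced dataset: 990 majority rows, 10 minority rows
majority = [("no churn", i) for i in range(990)]
minority = [("churn", i) for i in range(10)]

# Downsample the majority class to a 4:1 ratio against the minority
target_size = 4 * len(minority)
balanced = random.sample(majority, target_size) + minority

print(len(balanced))  # 50 rows: 40 majority + 10 minority
```

Remember the Batman caveat: a model trained on this 4:1 sample will see class proportions that do not match the real 99:1 world, so its predicted probabilities need recalibrating before they are taken literally.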
We can also tell our algorithms to weight the possible classes unevenly. This is already part of the intuitive decision making we do every day: a medical diagnostic test might cost $300, while the cost of missing the disease is major surgery down the line at $300,000, so of course we lean on the side of caution. In data science terms, this is known as cost-sensitive learning – getting the minority class wrong costs more than getting the majority class wrong, so we ask the ML algorithm to penalize those errors more heavily.
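The $300-test-versus-$300,000-surgery trade-off above can be sketched as a simple expected-cost calculation. The dollar figures and probabilities are hypothetical; real cost-sensitive training would pass such weights to the learning algorithm itself (e.g. via a class-weight parameter), but the underlying arithmetic is this:

```python
# Hypothetical costs: a false negative (missed disease) leads to $300,000
# of surgery down the line; a false positive just costs a $300 test.
COST_FN = 300_000
COST_FP = 300

def expected_cost(p_disease, flag_for_testing):
    """Expected dollar cost of a decision, given the predicted disease probability."""
    if flag_for_testing:
        return COST_FP           # we pay for the test either way
    return p_disease * COST_FN   # we risk the full surgery cost

# Even at just a 1% predicted probability, testing is the cheaper decision:
p = 0.01
print(expected_cost(p, True))   # 300
print(expected_cost(p, False))  # 3000.0
```

Under these numbers, testing is worthwhile whenever the predicted probability exceeds 300 / 300,000 = 0.1% – a far lower threshold than any accuracy-driven model would pick on its own.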
Or, just design an experiment. Industrial operators will run machines or systems at the redline to try to force more failures to happen. And medical centers run ads to recruit for studies of rare diseases to get more data points. If we need to build a good model, we can go make sure we have the evidence we need.
This is only an introduction to the class imbalance problem, but it offers the core thinking required to tackle it, a few practical solutions, and the right context to continue learning. Ultimately, it’s all about domain context. If we can articulate the business problem, relate it to the cost of action and the cost of inaction, and frame our objectives in terms of true/false positives and negatives, then the rest becomes rather easy.
And if you have any questions, you’re always welcome to send us a quick message within the Einblick platform or at firstname.lastname@example.org.