How to Split Train / Test Correctly (And 4 Things That Can Go Wrong)

Part of our Data Science 201 Series. Make sure your models perform in production the way you expect them to based on model training.

Why Split Out a Test Group?

In short, we are trying to make sure our model's accuracy holds up in production when it makes predictions on new, unseen samples. By holding out a portion of our dataset for evaluation, we obtain a more reliable and meaningful measure of our model's performance, because we replicate the "in production" setup: the accuracy that results once the model is applied to new data.

On a conceptual level, a model is a process that maps your features to your target variable. For any target phenomenon, some portion of that linkage is "signal" (the good stuff!) and some portion is "noise" (say, an error the post office made when it changed your country from "US" to "UZ" for Uzbekistan). We want to accurately learn the signal, but many of the algorithms we use cannot tell the difference between signal and noise. The model minimizes the difference between its predictions and the values in the training set, and in doing so, fits the noise as well. This is what we mean by an overfit model.

As an analogy, we would never have a practice quiz that had the exact same questions as the final exam, and neither should we evaluate model accuracy on the same dataset that was used to train it.

How to Split Out a Test Group?

There's no hard and fast rule defining exactly how much data to hold out as a test group. Practically, models are much more computationally intensive to "train" than to "test," so larger test groups cost us very little. Conventionally, the test group ranges from around 10% to 40% of the overall data. Generally, a 70% train / 30% test split will work great.

Einblick automatically holds out observations for Test

But what if we get really lucky (or unlucky) and draw a Test group that is especially easy to predict? In that case, we can use cross validation, a family of techniques that evaluates every model on multiple different Test groups. The most commonly practiced approach is k-fold cross validation, where the data is divided randomly into k subgroups. The model is trained k times; for each run, one subgroup serves as the Test set and the remaining subgroups form the Train set.
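As a minimal sketch of k-fold cross validation with scikit-learn (the dataset and model here are illustrative placeholders, not from the article):

```python
# 5-fold cross validation: the model is trained 5 times, each time
# holding out a different fifth of the data as the Test fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```

Because every observation lands in a Test fold exactly once, the average score is far less sensitive to one lucky or unlucky split.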

Practically, it’s worth noting that when samples are taken from most reasonably-sized datasets, the ~30% Test will be representative of the overall population. In these cases, there is little incremental benefit to cross validating compared to just picking a single holdout. 

I Know All That, What Can Still Go Wrong?

1. Not a Subgroup, a Random Subgroup

Holdout validation works great when both Test and Train are representative of the underlying phenomenon we are modeling. This is easily achieved by shuffling the data before splitting it. Though this seems like an innocuous requirement, a careless data scientist can fall afoul of it and end up with a non-representative dataset through incorrect sampling. Out of laziness, we might select the top or bottom N rows as the Test sample and assume that's good enough. However, most datasets pulled from a database or loaded into a spreadsheet are not in random order. This is really dangerous!

The easiest way to avoid this problem? Don't reinvent the wheel, and don't perform splits by hand. There is a caveat: certain situations will not allow for random splits; we will cover that in a later discussion.

sklearn helps you split Test and Train with a simple function call
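For instance, sklearn's train_test_split shuffles the rows by default before splitting, so no hand-rolled slicing is needed (the data here is a placeholder):

```python
# train_test_split shuffles by default, avoiding the "top N rows" trap.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # placeholder features
y = np.arange(50)                  # placeholder target

# 30% of rows are held out as Test; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 35 15
```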

2. Splitting is the Last Thing You’ll Do

Yet another innocuous-looking mistake is to pre-process the Test and Train groups separately. This can leave Train and Test dissimilar to each other through the very transformations we apply. Split first, and then be careful not to fit separate preprocessing steps on the train and test data.

Common pre-processing steps that are obviously affected by this include scaling and one-hot encoding (turning categories into a series of 1/0 columns). For scaling, if the preprocessing happens separately, Test and Train may by chance have different means or standard deviations, and as a result a different operation is applied to each population. For one-hot encoding, some categories may occur in only one subset. In summary, fit and transform your preprocessing operators on the train data, but only apply the same transformation to the test data.
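A minimal sketch of the scaling case, using sklearn's StandardScaler on placeholder data:

```python
# Fit the scaler on Train only, then reuse the SAME mean/std on Test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 1))  # placeholder feature
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from Train only
X_test_scaled = scaler.transform(X_test)        # apply Train's mean/std to Test
```

Calling fit_transform on Test as well would re-estimate the mean and standard deviation from Test's rows, silently applying a different operation to each population.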

3. Class Imbalances Trip Up Model Accuracy

When the minority class is small, splitting or folding the data (for cross validation) can easily over-represent or under-represent it in one of the splits. A "good" model might appear "amazing" or "terrible" on different splits simply because a few observations disproportionately shift the apparent class rate. To avoid this issue, use stratified techniques.

Stratification ensures that every split (or fold) has a similar proportion of each class to the overall population. Most splitting functions will also let you specify stratification. Of course, when classes are very imbalanced, there are other techniques we will want to apply to ensure our model accurately reflects our desired qualitative outcome; we will discuss those at length in a subsequent guide.
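As a sketch, train_test_split accepts a stratify argument that preserves class proportions (the 90/10 dataset below is an illustrative placeholder):

```python
# A stratified split on an imbalanced placeholder dataset.
import numpy as np
from sklearn.model_selection import train_test_split

# 90 samples of class 0, 10 of class 1: a 10% minority class.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(100, 1)  # placeholder features

# stratify=y keeps the 90/10 class ratio in both Train and Test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print((y_test == 1).sum())  # 3 of the 30 Test rows are the minority class
```

Without stratify, an unlucky shuffle could put most (or none) of the 10 minority samples into Test, distorting the measured accuracy.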

4. The Only Real Test is the “Test of Time”

Obviously, the only way to ensure that your model works in production is to actually run it in production. There, we need processes to test, and continually retest, the model over time and measure its accuracy. This is not a simple task.

From my 401K fund disclaimer.

You can consider setting up ongoing "champion" vs. "challenger" assessments of your model, following the scientific method of experimentation. Rather than rolling a new model out to the entire population, roll out any treatment strategy to a random subgroup. Compare the results of operations based on the model against the baseline (whether an older model, or no model at all), and if it is clearly better, the model has safely cleared both holdout validation and real-world validation.