Avoiding leakage in cross-validation when using SMOTE

SMOTE is a popular way to handle imbalanced datasets, but it can cause information leakage if it is applied before the cross-validation folds are split.

Yijie Wang
2 min read · Jan 14, 2022
Imbalanced classification is everywhere. Photo by Dan Meyers on Unsplash

What is SMOTE

Imbalanced datasets are very common in data science problems like fraud detection or customer identification. If not handled properly, class imbalance can sabotage the performance of a machine learning model. There are several ways to mitigate its influence; one is to balance the dataset by down-sampling the majority class or over-sampling the minority class.

Over-sampling balances the dataset by increasing the number of minority instances, and different algorithms can be used to enlarge the minority group. SMOTE is a popular over-sampling approach: it generates new synthetic minority samples by interpolating between existing minority instances and their nearest minority neighbors. Compared with down-sampling, over-sampling keeps all of the majority-class data and therefore often gives better predictive performance, but it can also encourage over-fitting, because the synthetic points sit very close to the observations they were generated from.
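The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_sample` and its parameters are hypothetical:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic points by interpolating between each
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors per point
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                       # pick a minority sample
        b = nn[a, rng.integers(min(k, n - 1))]    # pick one of its neighbors
        gap = rng.random()                        # interpolation factor in [0, 1)
        synth[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synth
```

Each synthetic point lies on the line segment between two real minority observations, which is exactly why the new points are near-duplicates of the originals.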

Leakage problem when using SMOTE

Cross-validation is a powerful and widely adopted way to check for over-fitting. Below is a straightforward approach to an imbalanced classification task with SMOTE and cross-validation:

  1. Split the data randomly into a training set and a test set
  2. Use SMOTE to balance the training set
  3. Train a classification model (logistic regression, random forest, etc.) on the training set
  4. Run cross-validation on the training set to check for over-fitting
  5. Apply the model to the test set

Unfortunately, information leakage happens in this straightforward process because SMOTE is applied before the cross-validation split: synthetic points interpolated from a minority observation can land in the training folds while that observation (or other near-duplicates of it) lands in the validation fold, so the validation scores are optimistically biased.
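The bias is easy to demonstrate. The sketch below (assuming scikit-learn is available) uses plain duplication, the limiting case of SMOTE's near-duplicate synthetic points, and a 1-nearest-neighbor classifier to make the memorization explicit. The features are pure noise, so an honest estimate of minority recall should be near chance:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Pure-noise features: there is no real signal to learn.
X = rng.normal(size=(220, 5))
y = np.array([0] * 200 + [1] * 20)  # 200 majority, 20 minority

# WRONG ORDER: oversample the minority class BEFORE cross-validation.
# Duplication stands in for SMOTE here; SMOTE's interpolated points
# are near-duplicates and leak in the same way.
X_dup = np.vstack([X, np.tile(X[y == 1], (9, 1))])
y_dup = np.concatenate([y, np.ones(180, dtype=int)])

clf = KNeighborsClassifier(n_neighbors=1)
leaky = cross_val_score(clf, X_dup, y_dup, cv=5, scoring="recall").mean()
honest = cross_val_score(clf, X, y, cv=5, scoring="recall").mean()
# Every validation minority point has exact copies in the training folds,
# so the leaky estimate looks (near-)perfect while the honest one does not.
print(f"leaky recall={leaky:.2f}, honest recall={honest:.2f}")
```

The leaky pipeline reports excellent minority recall on data that contains no signal at all, which is exactly the over-fitting the cross-validation was supposed to catch.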

The right way to conduct cross-validation

To solve this problem, SMOTE has to be applied separately inside each fold of cross-validation, after that fold's training and validation sets have been split. sklearn's handy cross-validation helpers cannot resample within folds on their own, so we have to build the process manually.
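A minimal sketch of the manual loop, assuming scikit-learn is available. The `oversample` helper is a stand-in (random duplication) for a call like `imblearn.over_sampling.SMOTE().fit_resample(X_tr, y_tr)`; the essential point is that resampling only ever sees the training fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def oversample(X, y, rng):
    """Stand-in for SMOTE: duplicate minority rows until classes balance.
    In practice, swap in imblearn.over_sampling.SMOTE().fit_resample(X, y)."""
    n_needed = (y == 0).sum() - (y == 1).sum()
    idx = rng.choice(np.flatnonzero(y == 1), size=n_needed, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

rng = np.random.default_rng(0)
X = rng.normal(size=(220, 4))
y = np.array([0] * 200 + [1] * 20)

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    # Resample the TRAINING fold only; the validation fold stays untouched.
    X_tr, y_tr = oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(recall_score(y[val_idx], model.predict(X[val_idx])))
print(f"mean per-fold minority recall: {np.mean(scores):.2f}")
```

If you would rather not write the loop yourself, imbalanced-learn also provides a `Pipeline` that applies samplers only during `fit`, so passing `Pipeline([("smote", SMOTE()), ("clf", LogisticRegression())])` to `cross_val_score` achieves the same fold-wise resampling.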

Now we can check for over-fitting without worrying about information leakage.

Thanks for reading my first post on Medium. This is an amazing place to record progress and share knowledge.


Yijie Wang

Business analytics student at Fuqua, Duke University