Handling Imbalanced Datasets, Explained!

Snega S
3 min read · Jul 19, 2023

Handling class imbalance is a crucial step when working with real-life datasets. In this article, we will explore the concepts of balanced and imbalanced data, as well as two common techniques used to balance imbalanced datasets: SMOTETomek and RandomOverSampler.

Balanced Dataset:
A dataset is considered balanced when its two or more categories are equally or nearly equally represented. For instance, imagine a dataset where we aim to predict whether a person belongs to category A or B based on their food habits. Let’s assume that people who consume junk food belong to category A, and those who prefer vegetables belong to category B. In a balanced dataset of 1000 records, 500 would belong to category A and the other 500 to category B, a class ratio of 1:1. Even a 40:60 or 60:40 split is generally still considered balanced.

Imbalanced Dataset:
On the other hand, a dataset is imbalanced when the class ratio heavily favors one category over the other, for example 80:20 or 90:10. In such cases, the skew leads to issues during model training and to misleading accuracy rates.
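
To make this concrete, here is a minimal sketch that builds a synthetic dataset with a roughly 90:10 class split and inspects its distribution. The use of scikit-learn’s make_classification and the variable names x and y are illustrative choices (and x and y are reused in the later examples):

from collections import Counter
from sklearn.datasets import make_classification

# Create a synthetic binary dataset with ~90% class 0 and ~10% class 1.
x, y = make_classification(
    n_samples=1000,
    n_classes=2,
    weights=[0.9, 0.1],
    random_state=42,
)

# Counter shows the per-class counts, roughly 900 vs. 100.
print(Counter(y))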

The Impact of Imbalanced Data:
When training a model on an imbalanced dataset, especially during supervised learning, the model may become biased towards the majority class due to its prevalence. This can lead to poor generalization, where the model tends to predict the majority class more often, even for samples belonging to the minority class, resulting in reduced accuracy and predictive performance.
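
To see why accuracy alone can be misleading, consider a baseline that always predicts the majority class: on the 90:10 data above it scores close to 90% accuracy without ever identifying a single minority sample. A quick sketch using scikit-learn’s DummyClassifier as that baseline:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Hold out a stratified test set from the imbalanced data above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=42, stratify=y
)

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_train, y_train)
y_pred = baseline.predict(x_test)

print("accuracy:", accuracy_score(y_test, y_pred))       # close to 0.9
print("minority recall:", recall_score(y_test, y_pred))  # 0.0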

Handling Imbalanced Datasets:
To address the challenges posed by imbalanced datasets, there are two common approaches:

1. Downsampling:
Downsampling involves reducing the number of instances in the majority class (e.g., category A) to balance the dataset. In the previous example, if there were 800 instances of category A and 200 of category B, downsampling would mean randomly removing 600 instances from category A. However, downsampling discards a large amount of data, including information that may be valuable for accurate predictions, which makes it the less preferred way of handling imbalance.
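
As a quick sketch, downsampling can be done with imblearn’s RandomUnderSampler (one option among several; it is not covered in detail here), which randomly drops majority-class instances until the classes match:

from imblearn.under_sampling import RandomUnderSampler

# Randomly discard majority-class samples to equalize the classes.
rus = RandomUnderSampler(random_state=0)
x_down, y_down = rus.fit_resample(x, y)
print(x_down.shape, y_down.shape)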

2. Oversampling:
Oversampling is the process of increasing the number of instances in the minority class to balance the dataset. In the previous example, oversampling category B would duplicate its instances until they match the number of instances in category A. Both categories then have equal representation, which can improve the model’s ability to learn from the minority class.
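
In its simplest form, oversampling is plain duplication: draw minority-class samples with replacement until their count matches the majority class. A rough NumPy sketch of that idea (the index-based resampling here is just an illustration):

import numpy as np

rng = np.random.default_rng(0)

# Split indices by class (class 1 is the minority in the data above).
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Draw minority indices with replacement until the counts match.
resampled_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
balanced_idx = np.concatenate([majority_idx, resampled_idx])

x_bal, y_bal = x[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))  # equal counts for both classes

This duplication by hand is essentially what RandomOverSampler, shown below, automates.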

Techniques to Perform Oversampling:
There are several techniques available for oversampling, two of which are SMOTETomek and RandomOverSampler.

SMOTETomek:
SMOTETomek is a combined technique that pairs the Synthetic Minority Over-sampling Technique (SMOTE) with Tomek links. SMOTE creates synthetic samples for the minority class, while Tomek links identify and remove noisy samples and instances near the decision boundary, making the resampled data more robust and informative.

Example using SMOTETomek:

from imblearn.combine import SMOTETomek

# Combine SMOTE oversampling with Tomek-link cleaning.
smt = SMOTETomek(random_state=0)  # random_state for reproducibility
x_new, y_new = smt.fit_resample(x, y)
print(x_new.shape, y_new.shape)

RandomOverSampler:
RandomOverSampler, as the name suggests, randomly duplicates instances from the minority class to balance the dataset.

Example using RandomOverSampler:

from imblearn.over_sampling import RandomOverSampler

# Named 'ros' rather than 'os' to avoid shadowing Python's os module.
ros = RandomOverSampler(random_state=0)
x_new, y_new = ros.fit_resample(x, y)
print(x_new.shape, y_new.shape)

In both cases, the `fit_resample` method is used to perform the oversampling process, resulting in an updated dataset with balanced classes.

In conclusion, handling imbalanced datasets is essential to ensure reliable model performance. Balancing the dataset using techniques like SMOTETomek and RandomOverSampler can improve the model’s ability to learn from minority class samples and lead to more accurate predictions across all classes.

If you find my article useful, consider showing your support by

  • subscribing to the newsletter and
  • giving it a clap 👏.

Your feedback and engagement mean a lot to me!

