Handling Imbalanced Data

When handling imbalanced data, it’s essential to use techniques that help the model learn from the minority class effectively. Here’s an in-depth explanation of each technique, with example syntax.

  1. Oversampling and SMOTE
  2. Undersampling
  3. Class Weights

Note: in addition to sklearn’s resample utility used in the examples below, the imblearn library provides RandomOverSampler for random oversampling; a short sketch appears after the manual oversampling example in section 1A.

Summary of When to Use Each Technique

| Technique | Best Used When… | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Oversampling and SMOTE | You need to preserve all data and improve minority class representation | Maintains dataset size; SMOTE reduces overfitting | Risk of overfitting with oversampling; SMOTE can create unrealistic data |
| Undersampling | You have a large dataset and can afford to lose some majority class data | Reduces data size, faster training | Loses majority class information, may underfit |
| Class Weights | You want a quick fix without changing the dataset | No dataset alteration, widely supported | May not work for extreme imbalance |

Use this guide to match the technique with the nature of your dataset and the model requirements, ensuring you get a balanced and effective training setup.

1. Resampling Techniques

A. Oversampling the Minority Class

  • This involves creating duplicate instances of the minority class to balance the data.
  • It can be achieved using the resample utility from sklearn.utils.

import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
majority_class = df[df['target'] == 0]  # assuming 0 is the majority class
minority_class = df[df['target'] == 1]  # assuming 1 is the minority class

# Oversample the minority class
minority_oversampled = resample(minority_class,
                                replace=True,       # sample with replacement
                                n_samples=len(majority_class),  # match majority class size
                                random_state=42)    # for reproducibility

# Combine majority class with oversampled minority class
df_balanced = pd.concat([majority_class, minority_oversampled])

# Check new class distribution
print(df_balanced['target'].value_counts())
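
The imblearn library’s RandomOverSampler (mentioned in the note above) gives the same result without manual concatenation. A minimal sketch, assuming df has feature columns and a binary 'target' column as above:

from imblearn.over_sampling import RandomOverSampler

# Separate features and target (assumes the same df as above)
X = df.drop('target', axis=1)
y = df['target']

# Randomly duplicate minority-class rows until both classes have equal counts
ros = RandomOverSampler(random_state=42)
X_oversampled, y_oversampled = ros.fit_resample(X, y)

# Check new class distribution
print(y_oversampled.value_counts())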

B. Under-sampling the Majority Class

  • This technique reduces the number of samples in the majority class to match the minority class size.
  • Useful when the majority class is significantly larger and oversampling the minority class could lead to overfitting.
# Undersample the majority class
majority_undersampled = resample(majority_class,
                                 replace=False,    # sample without replacement
                                 n_samples=len(minority_class),  # match minority class size
                                 random_state=42)  # for reproducibility

# Combine undersampled majority class with minority class
df_balanced = pd.concat([majority_undersampled, minority_class])

# Check new class distribution
print(df_balanced['target'].value_counts())
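
Likewise, imblearn offers RandomUnderSampler as an alternative to manual undersampling. A minimal sketch under the same assumptions about df:

from imblearn.under_sampling import RandomUnderSampler

# Separate features and target (assumes the same df as above)
X = df.drop('target', axis=1)
y = df['target']

# Randomly drop majority-class rows until both classes have equal counts
rus = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = rus.fit_resample(X, y)

# Check new class distribution
print(y_undersampled.value_counts())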

2. SMOTE (Synthetic Minority Over-sampling Technique)

  • SMOTE creates synthetic samples by interpolating between existing samples of the minority class.
  • The imblearn library provides an implementation of SMOTE.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Check new class distribution
print(y_train_smote.value_counts())
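
Because SMOTE is fit only on the training split, the test set keeps its original distribution. If you also use cross-validation, one common pattern (a sketch, not the only way) is to wrap SMOTE and the model in an imblearn Pipeline so synthetic samples are generated only inside each training fold:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE runs only when the pipeline is fit, so validation folds are never resampled
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])

# F1 is usually more informative than accuracy on imbalanced data
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(scores.mean())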

3. Class Weights

  • Some machine learning models, like logistic regression and decision trees, allow you to set class_weight='balanced', which automatically adjusts weights inversely proportional to class frequencies.
  • For algorithms like Random Forest, Logistic Regression, and SVM, setting class weights can make the model pay more attention to the minority class.
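
To see what 'balanced' actually assigns, sklearn exposes compute_class_weight; the balanced heuristic is n_samples / (n_classes * count of each class). A minimal sketch, assuming the y_train from the SMOTE section above:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weight for each class = n_samples / (n_classes * class count)
classes = np.unique(y_train)
balanced_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
print(dict(zip(classes, balanced_weights)))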

Example with Logistic Regression and Random Forest

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Logistic Regression with balanced class weights
log_reg = LogisticRegression(class_weight='balanced', random_state=42)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log_reg))

# Random Forest with balanced class weights
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

  • Manual Class Weights: For more control, you can assign custom class weights. For instance, if the minority class is very underrepresented, you might use a higher weight.
# Custom class weights
weights = {0: 1, 1: 10}  # Higher weight for the minority class

# Logistic Regression with custom class weights
log_reg_custom = LogisticRegression(class_weight=weights, random_state=42)
log_reg_custom.fit(X_train, y_train)
y_pred_log_reg_custom = log_reg_custom.predict(X_test)
print("Logistic Regression with Custom Weights Classification Report:")
print(classification_report(y_test, y_pred_log_reg_custom))

Choosing the Right Technique

  1. Oversampling and SMOTE: Useful if you want to increase the presence of the minority class without removing information from the majority class.
  2. Undersampling: Good for large datasets with a significant majority class but can lead to loss of information from the majority class.
  3. Class Weights: A quick and effective solution that works well with many models without altering the dataset itself.

By balancing your dataset with these methods, you help ensure the model doesn’t favor the majority class, leading to more accurate predictions across all classes.