📚

Target Encoding (Mean Encoding)

Category Encoders

Description

Replaces category with mean of target.

Relevant Class/Implementation

category_encoders.TargetEncoder

Scikit-learn

What is Target Encoding?

Target Encoding (also called Mean Encoding) replaces each category in a categorical feature with the mean of the target variable corresponding to that category.

It’s particularly useful in supervised learning when categorical features are related to the target and you want to extract that information.

Why Use Target Encoding?

✅ Great for High-Cardinality Features (e.g., 1,000+ city names)

✅ Captures Relationship Between Feature and Target

✅ Keeps Model Input Compact (unlike One-Hot Encoding)

🚫 Caution:

  • Can cause data leakage if not handled properly: when the means are computed on the same rows they encode, each row's own target value leaks into its feature.
  • Must be used with cross-validation or out-of-fold encoding.
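
To make the leakage risk concrete, here is a minimal sketch. The column names `user_id` and `target` and the random data are made up for illustration. Every category appears exactly once, so the naive per-category mean is simply the row's own target, and the encoded feature looks perfectly predictive even though the ID carries no information.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: an ID-like column where every category appears once,
# and a target that is pure noise (unrelated to the ID).
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(100)],
    "target": rng.normal(size=100),
})

# Naive target encoding, computed on the same rows it is applied to.
naive = df.groupby("user_id")["target"].transform("mean")

# The encoded feature reproduces the target exactly: pure leakage.
leak_corr = np.corrcoef(naive, df["target"])[0, 1]
print(f"correlation with target: {leak_corr:.3f}")
```

A model trained on `naive` would appear to fit the training data perfectly and then fail on new rows, which is why the encoding should be computed out-of-fold.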

How Target Encoding Works

  1. Group the dataset by the categorical feature.
  2. Calculate the mean target value for each category.
  3. Replace the category with its corresponding mean value.

Example

Let’s say we’re predicting sales and have a City column:

Input Dataset

| City | Sales |
| --- | --- |
| London | 200 |
| Paris | 250 |
| New York | 300 |
| London | 220 |
| Paris | 270 |

Step 1: Compute Mean Target per Category

  • London → (200 + 220) / 2 = 210
  • Paris → (250 + 270) / 2 = 260
  • New York → 300

Step 2: Replace Category with Mean Target

| City | Sales | City_Encoded |
| --- | --- | --- |
| London | 200 | 210 |
| Paris | 250 | 260 |
| New York | 300 | 300 |
| London | 220 | 210 |
| Paris | 270 | 260 |
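
The two steps above can be reproduced with plain pandas. This is a minimal sketch using the same five rows; the variable name `city_means` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "Paris", "New York", "London", "Paris"],
    "Sales": [200, 250, 300, 220, 270],
})

# Step 1: mean of the target for each category.
city_means = df.groupby("City")["Sales"].mean()

# Step 2: replace each category with its mean.
df["City_Encoded"] = df["City"].map(city_means)

print(df)
```

This computes the means on the same rows it encodes, which is fine for illustrating the arithmetic but leaks the target if used as-is for training.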

Implementing Target Encoding in Python

✅ Using category_encoders library

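import pandas as pd
import category_encoders as ce

# Sample data
df = pd.DataFrame({
    'City': ['London', 'Paris', 'New York', 'London', 'Paris'],
    'Sales': [200, 250, 300, 220, 270]
})

# Initialize encoder. TargetEncoder applies smoothing by default, which pulls
# each category mean toward the global mean, so the output will generally
# differ from the raw means (210, 260, 300) computed above.
encoder = ce.TargetEncoder(cols=['City'])

# Fit and transform. fit_transform returns a DataFrame, so select the column.
# Fitting and transforming the same rows is for illustration only;
# see "Handling Data Leakage" for encoding training data safely.
encoded = encoder.fit_transform(df[['City']], df['Sales'])
df['City_Encoded'] = encoded['City']

print(df)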

Handling Data Leakage

To avoid data leakage, especially during model validation:

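  • Use out-of-fold encoding for training data: encode each row using means computed from the other folds only
  • Fit the encoding on training data only, then apply that fitted mapping to validation/test sets (never recompute it from their targets)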
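
One way to encode held-out data without touching its target is sketched below. The train/test split and the unseen city "Berlin" are hypothetical, and falling back to the global training mean is one common choice rather than the only one.

```python
import pandas as pd

# Hypothetical split: "Berlin" appears only in the test rows.
train = pd.DataFrame({
    "City": ["London", "Paris", "New York", "London", "Paris"],
    "Sales": [200, 250, 300, 220, 270],
})
test = pd.DataFrame({"City": ["Paris", "Berlin", "London"]})

# Learn the mapping from training rows only.
city_means = train.groupby("City")["Sales"].mean()
global_mean = train["Sales"].mean()

# Apply the learned mapping; unseen categories fall back to the global mean.
test["City_Encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

Here the test set never needs a `Sales` column at all, which makes it impossible for its target values to influence the encoding.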

✅ Using Cross-Validation Encoding

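from sklearn.model_selection import KFold
import numpy as np

df['City_Encoded'] = np.nan
# The 5-row sample supports at most 5 folds; use more rows and fewer folds in practice
kf = KFold(n_splits=5, shuffle=True, random_state=42)
encoded_col = df.columns.get_loc('City_Encoded')

for train_idx, val_idx in kf.split(df):
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]

    # Means come from the training fold only, so no validation target is used
    means = train_df.groupby('City')['Sales'].mean()
    # Categories absent from the training fold fall back to that fold's global mean
    encoded = val_df['City'].map(means).fillna(train_df['Sales'].mean())
    # kf.split yields positional indices, so assign by position with iloc
    df.iloc[val_idx, encoded_col] = encoded.to_numpy()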

Comparison with Other Encoders

| Encoding Method | Preserves Target Relationship? | Risk of Leakage | Suitable for High Cardinality |
| --- | --- | --- | --- |
| One-Hot Encoding | ❌ No | ❌ No | ❌ No |
| Label Encoding | ❌ No | ❌ No | ✅ Yes |
| Frequency Encoding | ❌ Not direct | ❌ No | ✅ Yes |
| Target Encoding | ✅ Yes | ⚠️ Yes | ✅✅ Highly Recommended |
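
To illustrate the "Suitable for High Cardinality" column, the sketch below compares how many columns each method produces. The data is synthetic (1,000 made-up city labels with random sales), and the target encoding here is the naive version, used only to show the output shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical high-cardinality column: 1,000 distinct city labels.
n_rows = 5_000
cities = np.array([f"city_{i}" for i in range(1_000)])
df = pd.DataFrame({
    # Every label appears at least once; the remaining rows are sampled at random.
    "City": np.concatenate([cities, rng.choice(cities, size=n_rows - 1_000)]),
    "Sales": rng.normal(loc=250, scale=30, size=n_rows),
})

# One-hot: one column per distinct category.
one_hot = pd.get_dummies(df["City"])
# Label: a single column of arbitrary integer codes.
label = df["City"].astype("category").cat.codes
# Frequency: a single column holding each category's share of rows.
frequency = df["City"].map(df["City"].value_counts(normalize=True))
# Target (naive, for shape only): a single column of per-category means.
target = df["City"].map(df.groupby("City")["Sales"].mean())

print("one-hot columns:", one_hot.shape[1])
print("label, frequency, and target encoding: 1 column each")
```

One-hot encoding grows with the number of categories, while the other three stay at a single column; of those three, only target encoding carries information about the target.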

When to Use Target Encoding

✅ When categorical feature is strongly related to the target

✅ When dealing with high-cardinality columns

✅ In regression or classification tasks

🚫 Avoid when:

  • Your test data has many unseen categories and you have no fallback value (such as the global target mean) for them
  • You're not using proper validation techniques

Conclusion

Target Encoding is a powerful technique for handling categorical features with many unique values. It captures the relationship between categories and the target, but must be used carefully to prevent leakage. Always validate properly. 🚀

  • What It Does: Replaces categories with the mean of the target variable for each category.
  • Best For: High-cardinality categorical features.
  • Caution: Risk of data leakage; apply carefully with validation techniques.
  • Example:
    • For a "city" feature with values ["London", "Paris", "New York"], calculate the mean target value for each city.
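import pandas as pd
# Illustrative data: the target values are assumed, chosen to be consistent with the output below
df = pd.DataFrame({'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'], 'target': [1, 0, 1, 0, 1, 0, 1]})
mean_encoding = df.groupby('city')['target'].mean().to_dict()
df['city_target_mean'] = df['city'].map(mean_encoding)
print(df[['city', 'city_target_mean']])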
| city | city_target_mean |
| --- | --- |
| London | 1.0 |
| Paris | 0.0 |
| New York | 0.5 |
| Paris | 0.0 |
| London | 1.0 |
| New York | 0.5 |
| London | 1.0 |