✅ | Replaces category with mean of target. | category_encoders.TargetEncoder | ❌ |
Target Encoding (Mean Encoding)
What is Target Encoding?
Target Encoding (also called Mean Encoding) replaces each category in a categorical feature with the mean of the target variable corresponding to that category.
It’s particularly useful in supervised learning when categorical features are related to the target and you want to extract that information.
Why Use Target Encoding?
✅ Great for High-Cardinality Features (e.g., 1,000+ city names)
✅ Captures Relationship Between Feature and Target
✅ Keeps Model Input Compact (unlike One-Hot Encoding)
🚫 Caution:
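- Can cause data leakage if not handled properly (e.g., a row's own target value contributes to its encoded feature, so the model sees information it will not have at prediction time).
- Should be computed out-of-fold (for example, inside cross-validation) so that no row is encoded using its own target.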
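To make the leakage risk concrete, here is a minimal sketch on made-up noise data. The names (df_noise, cat, y) and the setup of 500 categories with two rows each are assumptions for illustration only. The target is generated independently of the category, so an honest encoding should carry no signal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical noise data: 500 categories with 2 rows each,
# and a target that is independent of the category
df_noise = pd.DataFrame({
    "cat": np.arange(n) // 2,
    "y": rng.normal(size=n),
})

# Naive encoding: each row's own target is part of its category mean
naive = df_noise.groupby("cat")["y"].transform("mean")
naive_corr = np.corrcoef(naive, df_noise["y"])[0, 1]

# Leak-free encoding: means learned on one half, applied to the other half
train = df_noise.iloc[::2]
test = df_noise.iloc[1::2]
train_means = train.groupby("cat")["y"].mean()
held_out = test["cat"].map(train_means)
held_out_corr = np.corrcoef(held_out, test["y"])[0, 1]

print(f"naive: {naive_corr:.2f}, held-out: {held_out_corr:.2f}")
```

The naive encoding correlates strongly with the target even though the feature is pure noise, while the held-out encoding stays close to zero. That gap is the leakage.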
How Target Encoding Works
- Group the dataset by the categorical feature.
- Calculate the mean target value for each category.
- Replace the category with its corresponding mean value.
Example
Let’s say we’re predicting Sales and have a City column:
Input Dataset
City | Sales |
London | 200 |
Paris | 250 |
New York | 300 |
London | 220 |
Paris | 270 |
Step 1: Compute Mean Target per Category
- London → (200 + 220) / 2 = 210
- Paris → (250 + 270) / 2 = 260
- New York → 300
Step 2: Replace Category with Mean Target
City | Sales | City_Encoded |
London | 200 | 210 |
Paris | 250 | 260 |
New York | 300 | 300 |
London | 220 | 210 |
Paris | 270 | 260 |
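The two steps above can be reproduced with plain pandas. This is a minimal sketch; the variable names sales_df and city_means are chosen here for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "City": ["London", "Paris", "New York", "London", "Paris"],
    "Sales": [200, 250, 300, 220, 270],
})

# Step 1: mean target per category
city_means = sales_df.groupby("City")["Sales"].mean()

# Step 2: replace each category with its mean target
sales_df["City_Encoded"] = sales_df["City"].map(city_means)

print(sales_df)
```

Note that this in-sample version is only meant to mirror the arithmetic of the table; it uses every row's own target and is therefore subject to the leakage caution above.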
Implementing Target Encoding in Python
✅ Using the category_encoders library
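import pandas as pd
import category_encoders as ce

# Sample data
df = pd.DataFrame({
    'City': ['London', 'Paris', 'New York', 'London', 'Paris'],
    'Sales': [200, 250, 300, 220, 270]
})

# Initialize the encoder for the City column
encoder = ce.TargetEncoder(cols=['City'])

# Fit on the feature and target, then keep the encoded City column.
# TargetEncoder smooths toward the global mean by default, so the values
# can differ from the raw per-city means computed above.
df['City_Encoded'] = encoder.fit_transform(df[['City']], df['Sales'])['City']
print(df)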
Handling Data Leakage
To avoid data leakage, especially during model validation:
- Use out-of-fold encoding for training data
- Fit the encoding on the training data only, then apply that mapping to validation/test sets (never compute it from their targets)
✅ Using Cross-Validation Encoding
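import numpy as np
from sklearn.model_selection import KFold

# Assumes df from the previous example (columns: City, Sales)
df['City_Encoded'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in kf.split(df):
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]
    # Means are computed from the training fold only
    means = train_df.groupby('City')['Sales'].mean()
    # Cities absent from the training fold fall back to that fold's overall mean
    encoded = val_df['City'].map(means).fillna(train_df['Sales'].mean())
    # kf.split yields positions, so convert them to index labels for .loc
    df.loc[df.index[val_idx], 'City_Encoded'] = encoded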
Comparison with Other Encoders
Encoding Method | Preserves Target Relationship? | Risk of Leakage | Suitable for High Cardinality |
One-Hot Encoding | ❌ No | ❌ No | ❌ No |
Label Encoding | ❌ No | ❌ No | ✅ Yes |
Frequency Encoding | ❌ Not directly | ❌ No | ✅ Yes |
Target Encoding | ✅ Yes | ⚠️ Yes | ✅✅ Highly Recommended |
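As a rough illustration of the table, the sketch below applies all four encodings to the City column from the earlier example using plain pandas. The variable names (df_cmp, one_hot, label, frequency, target) are chosen here for illustration, and the target encoding is computed in-sample purely to compare shapes and values.

```python
import pandas as pd

df_cmp = pd.DataFrame({
    "City": ["London", "Paris", "New York", "London", "Paris"],
    "Sales": [200, 250, 300, 220, 270],
})

# One-hot: one column per distinct city
one_hot = pd.get_dummies(df_cmp["City"])

# Label: an arbitrary integer code per city (alphabetical order here)
label = df_cmp["City"].astype("category").cat.codes

# Frequency: share of rows belonging to each city
frequency = df_cmp["City"].map(df_cmp["City"].value_counts(normalize=True))

# Target: mean Sales per city (in-sample, so it leaks; for comparison only)
target = df_cmp.groupby("City")["Sales"].transform("mean")

print("one-hot columns:", one_hot.shape[1])
print("label:", label.tolist())
print("frequency:", frequency.tolist())
print("target:", target.tolist())
```

One-hot encoding grows by one column per distinct city, while the other three stay at a single column; only the target encoding carries information about Sales.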
When to Use Target Encoding
✅ When categorical feature is strongly related to the target
✅ When dealing with high-cardinality columns
✅ In regression or classification tasks
🚫 Avoid when:
- Your test data has unseen categories and you have no fallback value for them (such as the overall target mean)
- You're not using proper validation techniques
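One common way to cope with unseen categories is to fall back to the overall target mean of the training data. The sketch below assumes a hypothetical train/test split and a made-up unseen city ("Tokyo"); it is one possible fallback, not the only one.

```python
import pandas as pd

# Hypothetical training data and a test set containing an unseen city
train = pd.DataFrame({
    "City": ["London", "Paris", "London", "Paris"],
    "Sales": [200, 250, 220, 270],
})
test = pd.DataFrame({"City": ["London", "Tokyo"]})

# Learn the mapping from the training data only
city_means = train.groupby("City")["Sales"].mean()
global_mean = train["Sales"].mean()

# Unseen categories map to NaN, which is replaced by the training global mean
test["City_Encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

Here London receives its training mean of 210, and Tokyo falls back to the training global mean of 235.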
Conclusion
Target Encoding is a powerful technique for handling categorical features with many unique values. It captures the relationship between categories and the target, but must be used carefully to prevent leakage. Always validate properly. 🚀
- What It Does: Replaces categories with the mean of the target variable for each category.
- Best For: High-cardinality categorical features.
- Caution: Risk of data leakage; apply carefully with validation techniques.
- Example: For a "city" feature with values ["London", "Paris", "New York"], calculate the mean target value for each city.
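# Assumes df is a pandas DataFrame with 'city' and 'target' columns
# Mean target per city, as a {city: mean} dictionary
mean_encoding = df.groupby('city')['target'].mean().to_dict()
# Map each row's city to its mean target
df['city_target_mean'] = df['city'].map(mean_encoding)
print(df[['city', 'city_target_mean']])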
city | city_target_mean |
London | 1.0 |
Paris | 0.0 |
New York | 0.5 |
Paris | 0.0 |
London | 1.0 |
New York | 0.5 |
London | 1.0 |