Column 1
Replaces categories with their frequency count.
Column 2
❌
Column 3
✅
Column 4
category_encoders.CountEncoder
Frequency Encoding
What is Frequency Encoding?
Frequency Encoding (also called Count Encoding) replaces categorical values with their relative frequency (or count) in the dataset. This means categories that appear more frequently are assigned higher values, while less common categories receive lower values.
Why Use Frequency Encoding?
- Efficient for High-Cardinality Features (categories with many unique values)
- Reduces Dimensionality (compared to One-Hot Encoding)
- Useful when Category Frequency Affects Target Variable (e.g., common product categories may impact sales)
- May introduce bias in imbalanced datasets (e.g., if one category dominates).
- Not useful for nominal (unordered) categories where frequency has no meaning.
🚫 Caution:
How Frequency Encoding Works
- Count occurrences of each category.
- Convert counts to relative frequencies (percentage of dataset).
- Replace categorical values with their corresponding frequencies.
Example
Color Column
Red
Blue
Green
Blue
Red
Step 1: Compute Frequencies
- Red appears 2 times →
2/5 = 0.4
- Blue appears 2 times →
2/5 = 0.4
- Green appears 1 time →
1/5 = 0.2
Step 2: Replace with Frequencies
Color | Frequency |
Red | 0.4 |
Blue | 0.4 |
Green | 0.2 |
Blue | 0.4 |
Red | 0.4 |
Implementing Frequency Encoding in Python
Using pandas
import pandas as pd
# Sample Data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
# Compute frequency
freq_encoding = df['Color'].value_counts(normalize=True)
# Apply frequency encoding
df['Color_Encoded'] = df['Color'].map(freq_encoding)
print(df)
mathematica
CopyEdit
Color Color_Encoded
0 Red 0.4
1 Blue 0.4
2 Green 0.2
3 Blue 0.4
4 Red 0.4
Comparison with Other Encoding Techniques
Encoding Method | Handles High-Cardinality? | Adds Extra Columns? | Retains Ordinal Information? |
One-Hot Encoding | ❌ No | 🔴 Many | ❌ No |
Label Encoding | ✅ Yes | ✅ Single Column | 🔴 Introduces Order |
Binary Encoding | ✅ Yes | 🟢 Fewer Columns | ✅ Partial |
Frequency Encoding | ✅ Yes | ✅ Single Column | 🔴 May Introduce Bias |
When to Use Frequency Encoding?
- Best when category frequency correlates with the target variable
- Useful for high-cardinality categorical features
- Efficient for Tree-Based Models (e.g., Decision Trees, Random Forests, XGBoost)
- Category frequency does not impact the target variable (e.g., gender classification).
- The dataset is highly imbalanced, leading to biased encoding.
🚫 Avoid if:
frequency_encoding = df['color'].value_counts(normalize=True).to_dict()
df['color_frequency'] = df['color'].map(frequency_encoding)
print(df[['color', 'color_frequency']])
color | color_frequency |
red | 0.285714 |
blue | 0.285714 |
green | 0.428571 |
blue | 0.285714 |
green | 0.428571 |
red | 0.285714 |
green | 0.428571 |
Conclusion
Frequency Encoding is memory-efficient and useful for high-cardinality categorical data. However, it may introduce bias if the category distribution is uneven. Always analyze whether frequency correlates with the target variable before applying it. 🚀