📚

Frequency Encoding

Column 1

Replaces categories with their frequency count.

Column 2

Column 3

Column 4

category_encoders.CountEncoder

Frequency Encoding

What is Frequency Encoding?

Frequency Encoding (also called Count Encoding) replaces categorical values with their relative frequency (or count) in the dataset. This means categories that appear more frequently are assigned higher values, while less common categories receive lower values.

Why Use Frequency Encoding?

  1. Efficient for High-Cardinality Features (categories with many unique values)
  2. Reduces Dimensionality (compared to One-Hot Encoding)
  3. Useful when Category Frequency Affects Target Variable (e.g., common product categories may impact sales)
    1. 🚫 Caution:

    2. May introduce bias in imbalanced datasets (e.g., if one category dominates).
    3. Not useful for nominal (unordered) categories where frequency has no meaning.

How Frequency Encoding Works

  1. Count occurrences of each category.
  2. Convert counts to relative frequencies (percentage of dataset).
  3. Replace categorical values with their corresponding frequencies.

Example

Color Column

Red
Blue
Green
Blue
Red

Step 1: Compute Frequencies

  • Red appears 2 times → 2/5 = 0.4
  • Blue appears 2 times → 2/5 = 0.4
  • Green appears 1 time → 1/5 = 0.2

Step 2: Replace with Frequencies

Color
Frequency
Red
0.4
Blue
0.4
Green
0.2
Blue
0.4
Red
0.4

Implementing Frequency Encoding in Python

Using pandas


import pandas as pd

# Sample Data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Compute frequency
freq_encoding = df['Color'].value_counts(normalize=True)

# Apply frequency encoding
df['Color_Encoded'] = df['Color'].map(freq_encoding)

print(df)
mathematica
CopyEdit
   Color  Color_Encoded
0    Red            0.4
1   Blue            0.4
2  Green            0.2
3   Blue            0.4
4    Red            0.4

Comparison with Other Encoding Techniques

Encoding Method
Handles High-Cardinality?
Adds Extra Columns?
Retains Ordinal Information?
One-Hot Encoding
❌ No
🔴 Many
❌ No
Label Encoding
✅ Yes
✅ Single Column
🔴 Introduces Order
Binary Encoding
✅ Yes
🟢 Fewer Columns
✅ Partial
Frequency Encoding
✅ Yes
✅ Single Column
🔴 May Introduce Bias

When to Use Frequency Encoding?

  1. Best when category frequency correlates with the target variable
  2. Useful for high-cardinality categorical features
  3. Efficient for Tree-Based Models (e.g., Decision Trees, Random Forests, XGBoost)
    1. 🚫 Avoid if:

    2. Category frequency does not impact the target variable (e.g., gender classification).
    3. The dataset is highly imbalanced, leading to biased encoding.
frequency_encoding = df['color'].value_counts(normalize=True).to_dict()
df['color_frequency'] = df['color'].map(frequency_encoding)
print(df[['color', 'color_frequency']])
color
color_frequency
red
0.285714
blue
0.285714
green
0.428571
blue
0.285714
green
0.428571
red
0.285714
green
0.428571

Conclusion

Frequency Encoding is memory-efficient and useful for high-cardinality categorical data. However, it may introduce bias if the category distribution is uneven. Always analyze whether frequency correlates with the target variable before applying it. 🚀