Methods of Encoding Categorical Variables

In machine learning, converting categorical data into numerical representations is essential for preparing data for algorithms that require numerical input. Here are the main methods you can use:

Basic Encoding

  • One-Hot Encoding
  • Label Encoding
  • Dummy Encoding

Ordered Encoding

  • Ordinal Encoding
  • Thermometer Encoding

Target-Based Encoding

  • Target Encoding (Mean Encoding)
  • K-Fold Target Encoding
  • Leave-One-Out Encoding
  • Mean Encoding with Smoothing
  • M-estimate Encoding
  • James-Stein Encoding
  • CatBoost Encoding

Binary and Base-N Encodings

  • Binary Encoding
  • BaseN Encoding
  • Hash Encoding (Feature Hashing)

Frequency and Count Encoding

  • Frequency Encoding
  • Count Encoding

Contrast and Statistical Encodings

  • WoE (Weight of Evidence) Encoding
  • Helmert Encoding
  • Backwards Difference Encoding
  • Polynomial Encoding
  • Deviation (Sum) Encoding

Advanced / Deep Learning Encoding

  • Entity Embedding

Encoding Libraries

Not all of the encoding methods listed above have a dedicated class in scikit-learn. Scikit-learn supports a subset of these encodings directly; for the more specialized ones, third-party libraries such as category_encoders or feature_engine are used. Here's a breakdown of the libraries, followed by tables detailing each encoding method and the relevant class or implementation.

Details on Libraries for Unsupported Methods

  1. Scikit-learn:
    • Provides foundational encoders like LabelEncoder and OneHotEncoder.
    • Can be extended with transformers for custom encoding logic.
  2. Category Encoders:
    • A library focused on categorical encoding methods, including specialized ones like Target Encoding, Hash Encoding, and WoE Encoding.
    • Install with pip install category_encoders.
  3. Feature Engine:
    • Another library that provides prebuilt transformers for some encoding techniques.
    • Install with pip install feature-engine.

Encoding Methods for Categorical Features

1. Basic Encoding

| Encoding Method | Description | Relevant Class/Implementation |
| --- | --- | --- |
| One-Hot Encoding | Binary columns per category. | sklearn.preprocessing.OneHotEncoder |
| Label Encoding | Converts categories into integer labels. | sklearn.preprocessing.LabelEncoder, sklearn.preprocessing.OrdinalEncoder |
| Dummy Encoding | One-Hot with one column dropped. | OneHotEncoder(drop='first') |
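
A brief sketch of the scikit-learn classes named above, run on a small illustrative list of colors (the variable names are just for illustration):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = [['red'], ['blue'], ['green'], ['blue']]

# One-Hot: one binary column per distinct color
ohe = OneHotEncoder()
print(ohe.fit_transform(colors).toarray())

# Label encoding: each category mapped to an integer
# (intended for target labels; use OrdinalEncoder for feature columns)
le = LabelEncoder()
print(le.fit_transform(['red', 'blue', 'green', 'blue']))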

2. Ordered Encoding

| Encoding Method | Description | Relevant Class/Implementation |
| --- | --- | --- |
| Ordinal Encoding | Encodes categories based on a meaningful order. | sklearn.preprocessing.OrdinalEncoder |
| Thermometer Encoding | Binary thermometer-style encoding for ordinal data. | Custom implementation |
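
A sketch of OrdinalEncoder with an explicit category order, plus a hand-rolled thermometer encoding (the thermometer part is a custom helper, not a library class):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = [['low'], ['high'], ['medium'], ['low']]
order = ['low', 'medium', 'high']

# Ordinal: low -> 0, medium -> 1, high -> 2
enc = OrdinalEncoder(categories=[order])
ranks = enc.fit_transform(sizes).astype(int).ravel()

# Thermometer: a rank of r becomes r leading ones,
# e.g. 'medium' -> [1, 0] and 'high' -> [1, 1]
thermometer = (np.arange(len(order) - 1) < ranks[:, None]).astype(int)
print(thermometer)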

3. Target-Based Encoding

4. Binary and Base-N Encoding

5. Frequency and Count Encoding

6. Contrast and Statistical Encoding

7. Advanced / Deep Learning Encoding
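
The category_encoders library ships ready-made transformers for most of the families above; entity embeddings (7) are instead built with a neural-network embedding layer in a deep learning framework. The sketch below simply instantiates the relevant classes (all are real category_encoders classes; the column list is a placeholder):

import category_encoders as ce

cols = ['city']  # placeholder column list

encoders = {
    # 3. Target-based
    'target':        ce.TargetEncoder(cols=cols),
    'leave_one_out': ce.LeaveOneOutEncoder(cols=cols),
    'm_estimate':    ce.MEstimateEncoder(cols=cols),
    'james_stein':   ce.JamesSteinEncoder(cols=cols),
    'catboost':      ce.CatBoostEncoder(cols=cols),
    # 4. Binary and Base-N
    'binary':        ce.BinaryEncoder(cols=cols),
    'base_n':        ce.BaseNEncoder(cols=cols, base=3),
    'hashing':       ce.HashingEncoder(cols=cols),
    # 5. Frequency and count
    'count':         ce.CountEncoder(cols=cols),
    # 6. Contrast and statistical
    'woe':           ce.WOEEncoder(cols=cols),
    'helmert':       ce.HelmertEncoder(cols=cols),
    'backward_diff': ce.BackwardDifferenceEncoder(cols=cols),
    'polynomial':    ce.PolynomialEncoder(cols=cols),
    'sum':           ce.SumEncoder(cols=cols),
}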

Install Libraries

pip install category_encoders
pip install feature-engine

DATA EXAMPLE

import pandas as pd

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green'],
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1]
})

The initial DataFrame df, before any encoding is applied, looks like this:

   color      city  target
0    red    London       1
1   blue     Paris       0
2  green  New York       1
3   blue     Paris       0
4  green    London       1
5    red  New York       0
6  green    London       1

Leave-One-Out Encoding
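
A minimal sketch with category_encoders' LeaveOneOutEncoder, applied to the df defined above:

import category_encoders as ce

# Each row's 'city' is replaced by the mean target of that city,
# computed over all *other* rows, which limits target leakage.
loo = ce.LeaveOneOutEncoder(cols=['city'])
df_loo = loo.fit_transform(df[['city']], df['target'])
print(df_loo)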

Hash Encoding
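
A sketch with category_encoders' HashingEncoder; n_components (the number of output columns) is an illustrative choice:

import category_encoders as ce

# Hash each 'city' value into a fixed number of columns; the column count
# stays the same no matter how many distinct cities appear.
hasher = ce.HashingEncoder(cols=['city'], n_components=4)
df_hashed = hasher.fit_transform(df[['city']])
print(df_hashed)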

Dummy Encoding (One-Hot Encoding with Drop)
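
One straightforward way is pandas' get_dummies with drop_first=True (sklearn's OneHotEncoder(drop='first') does the same inside a pipeline):

# One-hot encode 'color', then drop the first level so that
# n categories yield n-1 columns and avoid multicollinearity.
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True)
print(df_dummies)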

WoE (Weight of Evidence) Encoding
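
A sketch with category_encoders' WOEEncoder, which expects a binary target such as the one in df:

import category_encoders as ce

# Weight of Evidence: the log ratio of the distribution of positive
# to negative targets within each 'city'.
woe = ce.WOEEncoder(cols=['city'])
df_woe = woe.fit_transform(df[['city']], df['target'])
print(df_woe)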

Contrast Coding (e.g., Helmert, Deviation, Difference)
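
Contrast schemes each have their own class in category_encoders; Helmert coding is shown here as one illustrative choice (BackwardDifferenceEncoder, SumEncoder, and PolynomialEncoder follow the same pattern):

import category_encoders as ce

# Helmert coding: each level is contrasted with the mean of the preceding levels.
helmert = ce.HelmertEncoder(cols=['color'])
df_helmert = helmert.fit_transform(df[['color']])
print(df_helmert)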

K-Fold Target Encoding
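
There is no single dedicated class for this; a common approach is to compute out-of-fold target means by hand, as in the sketch below (the function name and fold count are illustrative). Recent scikit-learn releases also provide sklearn.preprocessing.TargetEncoder, which performs a similar internal cross-fitting.

import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(data, col, target, n_splits=3, seed=42):
    """Encode col with target means computed only from the other folds."""
    encoded = pd.Series(index=data.index, dtype=float)
    global_mean = data[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(data):
        # Means learned on the training folds, applied to the held-out fold
        fold_means = data.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = data.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)

df['city_kfold_te'] = kfold_target_encode(df, 'city', 'target')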

SUMMARY

| Encoding Method | Description | Suitable For | Example |
| --- | --- | --- | --- |
| Label Encoding | Assigns each unique category a unique integer. | Ordinal data | "low" → 0, "medium" → 1, "high" → 2 |
| One-Hot Encoding | Creates a new binary column for each category. | Nominal data | "color" → color_red, color_blue, color_green; "red" → [1, 0, 0] |
| Binary Encoding | Converts categories into binary codes using fewer columns. | High-cardinality data | ["A", "B", "C", "D"] → ["00", "01", "10", "11"] |
| Frequency Encoding | Encodes each category by its frequency in the dataset. | Data where category frequency is informative | "red" → 0.5, "blue" → 0.3, "green" → 0.2 |
| Target Encoding | Replaces each category with the mean of the target variable for that category. | High-cardinality data | "city" → mean target for "London", "Paris", "New York" |
| Leave-One-Out | Like target encoding, but excludes the current row's target value to reduce leakage. | Small datasets | Mean target of "city" excluding the current row |
| Hash Encoding | Assigns categories to bins via hashing, reducing the number of columns. | High-cardinality data | "color" hashed to a binary vector, without one column per category |
| Dummy Encoding | One-hot encoding with one category dropped to avoid multicollinearity (n-1 columns for n categories). | Nominal data | "color" → color_blue, color_green (excluding color_red) |
| Weight of Evidence | Calculates the weight of evidence for each category from the good/bad outcome distribution, common in credit scoring. | Binary target | "city" → WoE based on the distribution of target values |
| Contrast Coding | Builds numeric contrasts that capture differences between categories, often used in experimental designs. | Groups with contrasts | Different vectors per contrast scheme, e.g., Helmert coding for group differences |

What is Cardinality in Machine Learning & Data Science?

Definition

Cardinality refers to the number of unique values in a categorical or numerical feature of a dataset.

  • Low Cardinality → A feature has a few unique values (e.g., "Gender" with only Male, Female).
  • High Cardinality → A feature has many unique values (e.g., "Customer ID" with thousands of unique values).

Examples of Cardinality

| Feature | Unique Values | Cardinality Type |
| --- | --- | --- |
| Gender | Male, Female | Low |
| Day of Week | Monday, Tuesday, ... Sunday | Low |
| Country | USA, UK, Canada, India, ... | Medium |
| Product ID | 10001, 10002, 10003, ... 99999 | High |
| User Email | (Each user has a unique email) | Very High |
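
Cardinality is easy to inspect with pandas; for the df from the data example above:

# Number of unique values per column
print(df.nunique())   # color: 3, city: 3, target: 2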

Why is Cardinality Important?

1️⃣ It Affects Encoding Methods

  • Low Cardinality → Use One-Hot Encoding
  • High Cardinality → Use Frequency Encoding, Binary Encoding, or Target Encoding

2️⃣ Impacts Model Performance

  • High-cardinality features can increase memory usage and slow down training.
  • Too many unique categories may lead to overfitting in models.

3️⃣ Impacts Feature Engineering

  • High-cardinality categorical features often contain useful information (e.g., product categories in sales prediction).
  • Proper handling ensures that the model generalizes well.

Handling High-Cardinality Features

✅ One-Hot Encoding (Only for low-cardinality features)

✅ Label Encoding (For ordinal categories)

✅ Binary Encoding (Efficient for mid-to-high cardinality)

✅ Frequency Encoding (When category frequency is meaningful; see the sketch after this list)

✅ Target Encoding (For supervised learning tasks)
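
For instance, frequency encoding needs nothing beyond pandas; a sketch using the df from the data example (the new column name is illustrative):

# Map each city to its relative frequency in the dataset
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)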

Conclusion

Cardinality is a crucial factor in data preprocessing. Identifying whether a feature has low, medium, or high cardinality helps in choosing the right encoding method to improve model performance! 🚀