Methods of Encoding Categorical Variables

In machine learning, converting categorical data into numerical representations is essential for preparing data for algorithms that require numerical input. Here are the main methods you can use:

Basic Encoding

  • One-Hot Encoding
  • Label Encoding
  • Dummy Encoding

Ordered Encoding

  • Ordinal Encoding
  • Thermometer Encoding

Target-Based Encoding

  • Target Encoding (Mean Encoding)
  • K-Fold Target Encoding
  • Leave-One-Out Encoding
  • Mean Encoding with Smoothing
  • M-estimate Encoding
  • James-Stein Encoding
  • CatBoost Encoding

Binary and Base-N Encodings

  • Binary Encoding
  • BaseN Encoding
  • Hash Encoding (Feature Hashing)

Frequency and Count Encoding

  • Frequency Encoding
  • Count Encoding

Contrast and Statistical Encodings

  • WoE (Weight of Evidence) Encoding
  • Helmert Encoding
  • Backwards Difference Encoding
  • Polynomial Encoding
  • Deviation (Sum) Encoding

Advanced / Deep Learning Encoding

  • Entity Embedding

Encoding Libraries

Not all of the encoding methods listed above have a dedicated class in scikit-learn. Scikit-learn supports a subset of these encodings directly; for the more specialized ones, third-party libraries such as category_encoders or feature_engine are used. Here's a breakdown of the libraries, followed by tables detailing each encoding method and the relevant class or implementation.

Details on Libraries for Unsupported Methods

  1. Scikit-learn:
    • Provides foundational encoders like LabelEncoder and OneHotEncoder.
    • Can be extended with transformers for custom encoding logic.
  2. Category Encoders:
    • A library focused on categorical encoding methods, including specialized ones like Target Encoding, Hash Encoding, and WoE Encoding.
    • Install with pip install category_encoders.
  3. Feature Engine:
    • Another library that provides prebuilt transformers for some encoding techniques.
    • Install with pip install feature-engine.

Encoding Methods for Categorical Features

1. Basic Encoding

| Encoding Method | Description | Relevant Class/Implementation |
| --- | --- | --- |
| One-Hot Encoding | Binary columns per category. | sklearn.preprocessing.OneHotEncoder |
| Label Encoding | Converts categories into integer labels. | sklearn.preprocessing.LabelEncoder, sklearn.preprocessing.OrdinalEncoder |
| Dummy Encoding | One-Hot with one column dropped. | OneHotEncoder(drop='first') |
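
A brief sketch of the scikit-learn classes named above, run on a small illustrative list of colors (the variable names are just for illustration):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = [['red'], ['blue'], ['green'], ['blue']]

# One-Hot: one binary column per distinct color
ohe = OneHotEncoder()
print(ohe.fit_transform(colors).toarray())

# Label encoding: each category mapped to an integer
# (intended for target labels; use OrdinalEncoder for feature columns)
le = LabelEncoder()
print(le.fit_transform(['red', 'blue', 'green', 'blue']))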

2. Ordered Encoding

| Encoding Method | Description | Relevant Class/Implementation |
| --- | --- | --- |
| Ordinal Encoding | Encodes categories based on a meaningful order. | sklearn.preprocessing.OrdinalEncoder |
| Thermometer Encoding | Binary thermometer-style encoding for ordinal data. | Custom implementation |
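
A sketch of OrdinalEncoder with an explicit category order, plus a hand-rolled thermometer encoding (the thermometer part is a custom helper, not a library class):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = [['low'], ['high'], ['medium'], ['low']]
order = ['low', 'medium', 'high']

# Ordinal: low -> 0, medium -> 1, high -> 2
enc = OrdinalEncoder(categories=[order])
ranks = enc.fit_transform(sizes).astype(int).ravel()

# Thermometer: a rank of r becomes r leading ones,
# e.g. 'medium' -> [1, 0] and 'high' -> [1, 1]
thermometer = (np.arange(len(order) - 1) < ranks[:, None]).astype(int)
print(thermometer)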

3. Target-Based Encoding

4. Binary and Base-N Encoding

5. Frequency and Count Encoding

6. Contrast and Statistical Encoding

7. Advanced / Deep Learning Encoding
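
The category_encoders library ships ready-made transformers for most of the families above; entity embeddings (7) are instead built with a neural-network embedding layer in a deep learning framework. The sketch below simply instantiates the relevant classes (all are real category_encoders classes; the column list is a placeholder):

import category_encoders as ce

cols = ['city']  # placeholder column list

encoders = {
    # 3. Target-based
    'target':        ce.TargetEncoder(cols=cols),
    'leave_one_out': ce.LeaveOneOutEncoder(cols=cols),
    'm_estimate':    ce.MEstimateEncoder(cols=cols),
    'james_stein':   ce.JamesSteinEncoder(cols=cols),
    'catboost':      ce.CatBoostEncoder(cols=cols),
    # 4. Binary and Base-N
    'binary':        ce.BinaryEncoder(cols=cols),
    'base_n':        ce.BaseNEncoder(cols=cols, base=3),
    'hashing':       ce.HashingEncoder(cols=cols),
    # 5. Frequency and count
    'count':         ce.CountEncoder(cols=cols),
    # 6. Contrast and statistical
    'woe':           ce.WOEEncoder(cols=cols),
    'helmert':       ce.HelmertEncoder(cols=cols),
    'backward_diff': ce.BackwardDifferenceEncoder(cols=cols),
    'polynomial':    ce.PolynomialEncoder(cols=cols),
    'sum':           ce.SumEncoder(cols=cols),
}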

Install Libraries

pip install category_encoders
pip install feature-engine

DATA EXAMPLE

import pandas as pd

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green'],
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1]
})

The initial DataFrame df, before any encoding is applied, looks like this:

   color      city  target
0    red    London       1
1   blue     Paris       0
2  green  New York       1
3   blue     Paris       0
4  green    London       1
5    red  New York       0
6  green    London       1

Leave-One-Out Encoding
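
A minimal sketch with category_encoders' LeaveOneOutEncoder, applied to the df defined above:

import category_encoders as ce

# Each row's 'city' is replaced by the mean target of that city,
# computed over all *other* rows, which limits target leakage.
loo = ce.LeaveOneOutEncoder(cols=['city'])
df_loo = loo.fit_transform(df[['city']], df['target'])
print(df_loo)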

Hash Encoding
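
A sketch with category_encoders' HashingEncoder; n_components (the number of output columns) is an illustrative choice:

import category_encoders as ce

# Hash each 'city' value into a fixed number of columns; the column count
# stays the same no matter how many distinct cities appear.
hasher = ce.HashingEncoder(cols=['city'], n_components=4)
df_hashed = hasher.fit_transform(df[['city']])
print(df_hashed)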

Dummy Encoding (One-Hot Encoding with Drop)
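
One straightforward way is pandas' get_dummies with drop_first=True (sklearn's OneHotEncoder(drop='first') does the same inside a pipeline):

# One-hot encode 'color', then drop the first level so that
# n categories yield n-1 columns and avoid multicollinearity.
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True)
print(df_dummies)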

WoE (Weight of Evidence) Encoding
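
A sketch with category_encoders' WOEEncoder, which expects a binary target such as the one in df:

import category_encoders as ce

# Weight of Evidence: the log ratio of the distribution of positive
# to negative targets within each 'city'.
woe = ce.WOEEncoder(cols=['city'])
df_woe = woe.fit_transform(df[['city']], df['target'])
print(df_woe)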

Contrast Coding (e.g., Helmert, Deviation, Difference)
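
Contrast schemes each have their own class in category_encoders; Helmert coding is shown here as one illustrative choice (BackwardDifferenceEncoder, SumEncoder, and PolynomialEncoder follow the same pattern):

import category_encoders as ce

# Helmert coding: each level is contrasted with the mean of the preceding levels.
helmert = ce.HelmertEncoder(cols=['color'])
df_helmert = helmert.fit_transform(df[['color']])
print(df_helmert)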

K-Fold Target Encoding
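
There is no single dedicated class for this; a common approach is to compute out-of-fold target means by hand, as in the sketch below (the function name and fold count are illustrative). Recent scikit-learn releases also provide sklearn.preprocessing.TargetEncoder, which performs a similar internal cross-fitting.

import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(data, col, target, n_splits=3, seed=42):
    """Encode col with target means computed only from the other folds."""
    encoded = pd.Series(index=data.index, dtype=float)
    global_mean = data[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(data):
        # Means learned on the training folds, applied to the held-out fold
        fold_means = data.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = data.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)

df['city_kfold_te'] = kfold_target_encode(df, 'city', 'target')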

SUMMARY

| Encoding Method | Description | Suitable For | Example |
| --- | --- | --- | --- |
| Label Encoding | Assigns each unique category a unique integer. | Ordinal data | "low" → 0, "medium" → 1, "high" → 2 |
| One-Hot Encoding | Creates a new binary column for each category. | Nominal data | "color" → color_red, color_blue, color_green; "red" → [1, 0, 0] |
| Binary Encoding | Converts categories into binary codes using fewer columns. | High-cardinality data | ["A", "B", "C", "D"] → ["00", "01", "10", "11"] |
| Frequency Encoding | Encodes each category by its frequency in the dataset. | Data where category frequency is informative | "red" → 0.5, "blue" → 0.3, "green" → 0.2 |
| Target Encoding | Replaces each category with the mean of the target variable for that category. | High-cardinality data | "city" → mean target for "London", "Paris", "New York" |
| Leave-One-Out | Like target encoding, but excludes the current row's target value to reduce leakage. | Small datasets | Mean target of "city" excluding the current row |
| Hash Encoding | Assigns categories to bins via hashing, reducing the number of columns. | High-cardinality data | "color" hashed to a binary vector, without one column per category |
| Dummy Encoding | One-hot encoding with one category dropped to avoid multicollinearity (n-1 columns for n categories). | Nominal data | "color" → color_blue, color_green (excluding color_red) |
| Weight of Evidence | Calculates the weight of evidence for each category from the good/bad outcome distribution, common in credit scoring. | Binary target | "city" → WoE based on the distribution of target values |
| Contrast Coding | Builds numeric contrasts that capture differences between categories, often used in experimental designs. | Groups with contrasts | Different vectors per contrast scheme, e.g., Helmert coding for group differences |

What is Cardinality in Machine Learning & Data Science?

Definition

Cardinality refers to the number of unique values in a categorical or numerical feature of a dataset.

  • Low Cardinality → A feature has a few unique values (e.g., "Gender" with only Male, Female).
  • High Cardinality → A feature has many unique values (e.g., "Customer ID" with thousands of unique values).

Examples of Cardinality

| Feature | Unique Values | Cardinality Type |
| --- | --- | --- |
| Gender | Male, Female | Low |
| Day of Week | Monday, Tuesday, ... Sunday | Low |
| Country | USA, UK, Canada, India, ... | Medium |
| Product ID | 10001, 10002, 10003, ... 99999 | High |
| User Email | (Each user has a unique email) | Very High |
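
Cardinality is easy to inspect with pandas; for the df from the data example above:

# Number of unique values per column
print(df.nunique())   # color: 3, city: 3, target: 2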

Why is Cardinality Important?

1️⃣ It Affects Encoding Methods

  • Low Cardinality → Use One-Hot Encoding
  • High Cardinality → Use Frequency Encoding, Binary Encoding, or Target Encoding

2️⃣ Impacts Model Performance

  • High-cardinality features can increase memory usage and slow down training.
  • Too many unique categories may lead to overfitting in models.

3️⃣ Impacts Feature Engineering

  • High-cardinality categorical features often contain useful information (e.g., product categories in sales prediction).
  • Proper handling ensures that the model generalizes well.

Handling High-Cardinality Features

✅ One-Hot Encoding (Only for low-cardinality features)

✅ Label Encoding (For ordinal categories)

✅ Binary Encoding (Efficient for mid-to-high cardinality)

✅ Frequency Encoding (When category frequency is meaningful; see the sketch after this list)

✅ Target Encoding (For supervised learning tasks)
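
For instance, frequency encoding needs nothing beyond pandas; a sketch using the df from the data example (the new column name is illustrative):

# Map each city to its relative frequency in the dataset
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)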

Conclusion

Cardinality is a crucial factor in data preprocessing. Identifying whether a feature has low, medium, or high cardinality helps in choosing the right encoding method to improve model performance! 🚀