In machine learning, converting categorical data into numerical representations is essential for preparing data for algorithms that require numerical input. Here are the main methods you can use:
Basic Encoding
- One-Hot Encoding
- Label Encoding
- Dummy Encoding
Ordered Encoding
- Ordinal Encoding
- Thermometer Encoding
Target-Based Encoding
- Target Encoding (Mean Encoding)
- K-Fold Target Encoding
- Leave-One-Out Encoding
- Mean Encoding with Smoothing
- M-estimate Encoding
- James-Stein Encoding
- CatBoost Encoding
Binary and Base-N Encodings
- Binary Encoding
- BaseN Encoding
- Hash Encoding (Feature Hashing)
Frequency and Count Encoding
- Frequency Encoding
- Count Encoding
Contrast and Statistical Encodings
- WoE (Weight of Evidence) Encoding
- Helmert Encoding
- Backward Difference Encoding
- Polynomial Encoding
- Deviation (Sum) Encoding
Advanced / Deep Learning Encoding
- Entity Embedding
ENCODING LIBRARIES
Not all encoding methods listed have a dedicated class in scikit-learn. Scikit-learn supports a subset of these encodings directly; for the more specialized ones, third-party libraries such as `category_encoders` or `feature_engine` are used. Here's a breakdown:
Here's a comprehensive table detailing the encoding methods, their availability in libraries, and relevant information:
Details on Libraries for Unsupported Methods
- Scikit-learn:
  - Provides foundational encoders like `LabelEncoder` and `OneHotEncoder`.
  - Can be extended with custom transformers for additional encoding logic.
- Category Encoders:
  - A library focused on categorical encoding methods, including specialized ones like Target Encoding, Hash Encoding, and WoE Encoding.
  - Install with `pip install category_encoders`.
- Feature Engine:
  - Another library that provides prebuilt transformers for some encoding techniques.
  - Install with `pip install feature-engine`.
Encoding Methods for Categorical Features
1. Basic Encoding
| Encoding Method | Description | Scikit-learn | Category Encoders | Relevant Class/Implementation |
|---|---|---|---|---|
| One-Hot Encoding | Binary columns per category. | ✅ | ✅ | `OneHotEncoder` / `ce.OneHotEncoder` |
| Label Encoding | Converts categories into integer labels. | ✅ | ✅ | `LabelEncoder` / `ce.OrdinalEncoder` |
| Dummy Encoding | One-Hot with one column dropped. | ✅ | ✅ | `OneHotEncoder(drop='first')` |
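A minimal sketch of these three basic encodings with scikit-learn. It assumes scikit-learn ≥ 1.2 (where `sparse_output` replaced the older `sparse` argument); the small `colors` frame is only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Label Encoding: each category becomes an integer (alphabetical order here)
labels = LabelEncoder().fit_transform(colors['color'])        # [2, 0, 1, 0]

# One-Hot Encoding: one binary column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors[['color']])

# Dummy Encoding: one-hot with the first category dropped (n-1 columns)
dummy = OneHotEncoder(drop='first', sparse_output=False).fit_transform(colors[['color']])
```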
2. Ordered Encoding
| Encoding Method | Description | Scikit-learn | Category Encoders | Relevant Class/Implementation |
|---|---|---|---|---|
| Ordinal Encoding | Encodes categories based on meaningful order. | ✅ | ✅ | `OrdinalEncoder` / `ce.OrdinalEncoder` |
| Thermometer Encoding | Binary thermometer-style encoding for ordinal data. | ❌ | ❌ | Custom implementation (see the sketch below) |
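Since neither library ships a thermometer encoder, a custom implementation might look like this minimal sketch (the `thermometer_encode` helper and the `size` example are illustrative, not part of either library):

```python
import pandas as pd

def thermometer_encode(series: pd.Series, order: list) -> pd.DataFrame:
    """Thermometer encoding: the category at rank k gets 1s in the first k columns."""
    ranks = series.map({cat: i for i, cat in enumerate(order)})
    cols = {f"{series.name}_ge_{cat}": (ranks >= i).astype(int)
            for i, cat in enumerate(order) if i > 0}
    return pd.DataFrame(cols, index=series.index)

sizes = pd.Series(['low', 'high', 'medium', 'low'], name='size')
print(thermometer_encode(sizes, order=['low', 'medium', 'high']))
# 'low' -> [0, 0], 'medium' -> [1, 0], 'high' -> [1, 1]
```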
3. Target-Based Encoding
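A minimal sketch of plain target (mean) encoding with `category_encoders`; the small `X`/`y` frames and the `smoothing` value are illustrative only:

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({'city': ['London', 'Paris', 'London', 'New York', 'Paris']})
y = pd.Series([1, 0, 1, 1, 0])

# Each city is replaced by a smoothed mean of the target for that city
encoder = ce.TargetEncoder(cols=['city'], smoothing=1.0)
X_encoded = encoder.fit_transform(X, y)
```

The smoothing blends the per-category mean with the global mean, which matters for rare categories.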
4. Binary and Base-N Encoding
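A minimal sketch of Binary and BaseN encoding with `category_encoders` (the toy `ids` frame is illustrative):

```python
import pandas as pd
import category_encoders as ce

ids = pd.DataFrame({'product': ['A', 'B', 'C', 'D', 'A']})

# Binary Encoding: categories -> ordinal integers -> binary digits spread over columns
binary = ce.BinaryEncoder(cols=['product']).fit_transform(ids)

# BaseN Encoding: same idea with an arbitrary base (base=2 is equivalent to BinaryEncoder)
base3 = ce.BaseNEncoder(cols=['product'], base=3).fit_transform(ids)
```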
5. Frequency and Count Encoding
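Frequency and count encoding can be done directly in pandas, as in this sketch (the `cities` frame is illustrative); `category_encoders` also ships a `CountEncoder` for the same purpose:

```python
import pandas as pd

cities = pd.DataFrame({'city': ['London', 'Paris', 'London', 'New York', 'London']})

# Count Encoding: replace each category with how often it appears
counts = cities['city'].map(cities['city'].value_counts())

# Frequency Encoding: the same, but as a relative frequency (count / number of rows)
freqs = cities['city'].map(cities['city'].value_counts(normalize=True))

cities = cities.assign(city_count=counts, city_freq=freqs)
```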
6. Contrast and Statistical Encoding
7. Advanced / Deep Learning Encoding
Install Libraries
```bash
pip install category_encoders
pip install feature-engine
```
DATA EXAMPLE
```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green'],
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1]
})
```
The initial DataFrame `df` created above, before any encoding, looks like this:
| Index | color | city | target |
|---|---|---|---|
| 0 | red | London | 1 |
| 1 | blue | Paris | 0 |
| 2 | green | New York | 1 |
| 3 | blue | Paris | 0 |
| 4 | green | London | 1 |
| 5 | red | New York | 0 |
| 6 | green | London | 1 |
Leave-One-Out Encoding
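A minimal sketch with `category_encoders`' `LeaveOneOutEncoder`, applied to the sample `df` above (the exact fitted values depend on the data):

```python
import category_encoders as ce

# Each row's 'city' becomes the mean target of that city computed over all
# *other* rows (the current row is left out to reduce target leakage).
loo = ce.LeaveOneOutEncoder(cols=['city'])
city_loo = loo.fit_transform(df[['city']], df['target'])   # DataFrame with numeric 'city'
df_loo = df.assign(city_loo=city_loo['city'])
print(df_loo)
```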
Hash Encoding
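A sketch with `HashingEncoder` from `category_encoders`, again on the sample `df`; `n_components` controls how many hashed columns are produced (8 is just an illustrative choice):

```python
import category_encoders as ce

# Hash each category into a fixed number of columns, independent of cardinality
hasher = ce.HashingEncoder(cols=['color'], n_components=8)
df_hashed = hasher.fit_transform(df[['color']])
print(df_hashed.head())   # 8 hashed feature columns
```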
Dummy Encoding (One-Hot Encoding with Drop)
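Dummy encoding is one-hot encoding with one column dropped; in pandas this is `pd.get_dummies(..., drop_first=True)` (scikit-learn's `OneHotEncoder(drop='first')` is the equivalent):

```python
import pandas as pd

# n categories -> n-1 binary columns; the dropped category is the implicit baseline
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True)
print(df_dummy.head())
```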
WoE (Weight of Evidence) Encoding
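A sketch with `category_encoders`' `WOEEncoder`, which requires a binary target (as in the sample `df`):

```python
import category_encoders as ce

# WoE per category is roughly ln(share of target=1 / share of target=0), with smoothing
woe = ce.WOEEncoder(cols=['city'])
city_woe = woe.fit_transform(df[['city']], df['target'])
df_woe = df.assign(city_woe=city_woe['city'])
```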
Contrast Coding (e.g., Helmert, Deviation, Difference)
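Contrast schemes such as Helmert, deviation (sum), and backward difference coding are available in `category_encoders`; a Helmert example as a sketch:

```python
import category_encoders as ce

# Contrast columns that encode differences between the 'color' levels
helmert = ce.HelmertEncoder(cols=['color'])
df_helmert = helmert.fit_transform(df[['color']])

# Related contrast encoders: ce.SumEncoder (deviation), ce.BackwardDifferenceEncoder
```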
K-Fold Target Encoding
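Neither library has a dedicated K-Fold target encoder, so a common pattern is to wrap target encoding in a cross-validation loop. A minimal sketch over the sample `df` (the helper name, fold count, and seed are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import KFold

# Out-of-fold target encoding: each row is encoded using statistics
# computed only from the *other* folds, which limits target leakage.
def kfold_target_encode(df, col, target, n_splits=3, seed=42):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded[val_idx] = df.iloc[val_idx][col].map(fold_means).fillna(global_mean)
    return encoded

df_kfold = df.assign(city_te=kfold_target_encode(df, 'city', 'target'))
```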
SUMMARY
| Encoding Method | Description | Suitable For | Example |
|---|---|---|---|
Label Encoding | Assigns each unique category a unique integer. | Ordinal data | "low" → 0 , "medium" → 1 , "high" → 2 |
One-Hot Encoding | Creates a new binary column for each category. | Nominal data | "color" → color_red , color_blue , color_green ; "red" → [1, 0, 0] |
Binary Encoding | Converts categories into binary values with fewer binary digits. | High-cardinality data | ["A", "B", "C", "D"] → ["00", "01", "10", "11"] |
Frequency Encoding | Encodes each category based on its frequency in the dataset. | Categorical data where frequency is informative | "red" → 0.5 , "blue" → 0.3 , "green" → 0.2 |
Target Encoding | Replaces each category with the mean of the target variable for that category. | High-cardinality data | "city" → mean target for "London" , "Paris" , "New York" |
Leave-One-Out | Similar to target encoding, but excludes the current row's target value to avoid leakage. | Small datasets | Mean of "city" excluding current row target |
Hash Encoding | Assigns categories to bins via hashing, reducing columns and useful for high-cardinality data. | High-cardinality data | color hashed to a binary vector, without creating one column for each category |
Dummy Encoding | Similar to one-hot encoding but omits one category to avoid multicollinearity (n-1 columns for n categories). | Nominal data | "color" → color_blue , color_green (excluding color_red ) |
Weight of Evidence | Calculates the weight of evidence for each category based on good/bad outcome distribution, commonly in credit scoring. | Binary targets | "city" → WoE based on distribution of target values |
Contrast Coding | Creates binary representations that capture differences between categories, often used in experimental designs. | Groups with contrasts | Different binary vectors based on contrasts, e.g., Helmert coding for group differences |
What is Cardinality in Machine Learning & Data Science?
Definition
Cardinality refers to the number of unique values in a categorical or numerical feature of a dataset.
- Low Cardinality → A feature has a few unique values (e.g., "Gender" with only Male, Female).
- High Cardinality → A feature has many unique values (e.g., "Customer ID" with thousands of unique values).
Examples of Cardinality
| Feature | Unique Values | Cardinality Type |
|---|---|---|
Gender | Male, Female | Low |
Day of Week | Monday, Tuesday, ... Sunday | Low |
Country | USA, UK, Canada, India, ... | Medium |
Product ID | 10001, 10002, 10003, ... 99999 | High |
User Email | (Each user has a unique email) | Very High |
Why is Cardinality Important?
1️⃣ It Affects Encoding Methods
- Low Cardinality → Use One-Hot Encoding
- High Cardinality → Use Frequency Encoding, Binary Encoding, or Target Encoding
2️⃣ Impacts Model Performance
- High-cardinality features can increase memory usage and slow down training.
- Too many unique categories may lead to overfitting in models.
3️⃣ Impacts Feature Engineering
- High-cardinality categorical features often contain useful information (e.g., product categories in sales prediction).
- Proper handling ensures that the model generalizes well.
Handling High-Cardinality Features
✅ One-Hot Encoding (Only for low-cardinality features)
✅ Label Encoding (For ordinal categories)
✅ Binary Encoding (Efficient for mid-to-high cardinality)
✅ Frequency Encoding (When category frequency is meaningful)
✅ Target Encoding (For supervised learning tasks)
Conclusion
Cardinality is a crucial factor in data preprocessing. Identifying whether a feature has low, medium, or high cardinality helps in choosing the right encoding method to improve model performance! 🚀