Encodes categories based on meaningful order.
sklearn.preprocessing.OrdinalEncoder
Ordinal Encoding
Ordinal Encoding is a Machine Learning technique that converts categorical variables with a meaningful order into numerical values. It assigns an integer label to each category according to its rank, so models can learn from the inherent hierarchy in the data.
Why Use Ordinal Encoding?
Some categorical features have a natural order—for example:
- Education Level: High School < Bachelor's < Master's < PhD
- Size: Small < Medium < Large
One-Hot Encoding treats every category as unrelated to the others, so the order information is lost, while Ordinal Encoding preserves it.
How Ordinal Encoding Works
Each ordered category is mapped to an integer, starting from 0 (or 1, depending on implementation). The higher the number, the greater the rank or level.
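The mapping described above can be sketched in plain Python. This is a minimal illustration rather than a library implementation; the names `education_order` and `rank` are chosen here for the example, and the category order is taken from the Education Level example.

```python
# Ordered categories, lowest rank first.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Map each category to its position in the ordered list, starting from 0.
rank = {category: index for index, category in enumerate(education_order)}

values = ["PhD", "High School", "Master's"]
encoded = [rank[v] for v in values]
print(encoded)  # [3, 0, 2]
```

Starting from 1 instead would only require `enumerate(education_order, start=1)`.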
Example 1
Example Dataset (Before Encoding)
Size
Small
Medium
Large
Medium
Small
Ordinal Encoded Representation
Size
0
1
2
1
0
- The order is defined: Small < Medium < Large
- Each row now has a numeric rank based on the size.
Implementing Ordinal Encoding in Python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Sample Data
data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
# Define order explicitly
size_order = [['Small', 'Medium', 'Large']]
# Initialize and apply OrdinalEncoder
encoder = OrdinalEncoder(categories=size_order)
data['Size_encoded'] = encoder.fit_transform(data[['Size']])
print(data)
Output:
Size Size_encoded
0 Small 0.0
1 Medium 1.0
2 Large 2.0
3 Medium 1.0
4 Small 0.0
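It may help to see why the order is passed explicitly. The sketch below assumes scikit-learn 0.24 or newer (for `handle_unknown='use_encoded_value'`). It shows that leaving out `categories` makes the encoder sort string labels alphabetically, how an unseen value such as the made-up `'XL'` can be mapped to a sentinel, and how `inverse_transform` recovers the labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Without an explicit order, categories are sorted alphabetically:
# Large -> 0, Medium -> 1, Small -> 2, which is not the intended ranking.
auto_encoder = OrdinalEncoder()
auto_codes = auto_encoder.fit_transform(data[['Size']]).ravel()
print(auto_codes)  # [2. 1. 0. 1. 2.]

# With an explicit order, unseen values can be mapped to a sentinel (-1 here).
encoder = OrdinalEncoder(
    categories=[['Small', 'Medium', 'Large']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoder.fit(data[['Size']])
new_codes = encoder.transform(pd.DataFrame({'Size': ['Large', 'XL']})).ravel()
print(new_codes)  # [ 2. -1.]

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform([[0.0], [2.0]]).ravel()
print(restored)
```

The sentinel value `-1` is an arbitrary choice for this sketch; any value outside the range of real codes would do.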
Using map() for Custom Encoding in pandas
data['Size_encoded'] = data['Size'].map({'Small': 0, 'Medium': 1, 'Large': 2})
This approach is quick and flexible, especially when you know the category order ahead of time.
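One caveat with `map()` is that any value missing from the dictionary silently becomes `NaN`. The sketch below uses made-up inputs (`'XL'` and a lowercase `'medium'`) to show one way to surface such values before training.

```python
import pandas as pd

size_map = {'Small': 0, 'Medium': 1, 'Large': 2}

# 'XL' and 'medium' are deliberately absent from the mapping.
sizes = pd.Series(['Small', 'Large', 'XL', 'medium'])

encoded = sizes.map(size_map)

# Values missing from the mapping become NaN; list them before going further.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)  # ['XL', 'medium']
```

Because of the `NaN` entries, the encoded column ends up as floats rather than integers until the unmapped values are handled.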
Example 2
df = pd.DataFrame({
    'Education': ['High School', 'PhD', 'Bachelor', 'Master', 'High School']
})
# Define custom mapping
edu_map = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['Education_encoded'] = df['Education'].map(edu_map)
print(df)
Education | Education_encoded |
High School | 0 |
PhD | 3 |
Bachelor | 1 |
Master | 2 |
High School | 0 |
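As an alternative to a hand-written dictionary, pandas can store the order in the column itself through an ordered categorical dtype. This is a sketch of that option, not the approach used above; `edu_levels` and `edu_dtype` are names introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Education': ['High School', 'PhD', 'Bachelor', 'Master', 'High School']
})

# Declare the levels once, lowest rank first.
edu_levels = ['High School', 'Bachelor', 'Master', 'PhD']
edu_dtype = pd.CategoricalDtype(categories=edu_levels, ordered=True)

df['Education'] = df['Education'].astype(edu_dtype)

# .cat.codes gives the integer position of each value in edu_levels.
df['Education_encoded'] = df['Education'].cat.codes
print(df['Education_encoded'].tolist())  # [0, 3, 1, 2, 0]

# An ordered categorical also supports rank comparisons on the labels.
at_least_master = df['Education'] >= 'Master'
print(at_least_master.tolist())  # [False, True, False, True, False]
```

Values that are not listed in `edu_levels` become missing during `astype` and receive the code `-1`.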
Ordinal Encoding vs One-Hot Encoding
Feature | Ordinal Encoding | One-Hot Encoding |
Output Format | Single column (integers) | Multiple binary columns |
Assumes Order? | ✅ Yes | ❌ No |
Best for | Ordered categories (e.g. size) | Nominal categories (e.g. color) |
Memory Usage | Low | High |
Risk of Misuse | Can mislead if no real order | Safe for unordered data |
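The difference in output format can be checked on the Size data from Example 1. This is a small sketch using `pd.get_dummies` for the one-hot side; the column prefix `Size` is a choice made for the example.

```python
import pandas as pd

data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Ordinal encoding: one integer column.
ordinal = data['Size'].map({'Small': 0, 'Medium': 1, 'Large': 2}).to_frame('Size_encoded')

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(data['Size'], prefix='Size')

print(ordinal.shape)          # (5, 1)
print(one_hot.shape)          # (5, 3)
print(list(one_hot.columns))  # ['Size_Large', 'Size_Medium', 'Size_Small']
```

With three categories the difference is small, but the number of one-hot columns grows with the number of distinct categories.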
Drawbacks of Ordinal Encoding
- False Interpretation of Distance
  - Models might interpret the gap between levels as uniform, which may not reflect reality.
- Dangerous for Nominal Data
  - Applying it to unordered categories (e.g., Red, Blue, Green) introduces false ordinal relationships.
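The distance problem can be illustrated with a small least-squares fit in plain Python. The prices below are made-up numbers, chosen so that the jump from Medium to Large is much bigger than the jump from Small to Medium; a straight line through the codes 0, 1, 2 cannot represent that.

```python
# Hypothetical prices per size level (illustrative values only).
codes = [0, 1, 2]            # Small, Medium, Large
prices = [10.0, 12.0, 50.0]

# Ordinary least-squares fit of price = intercept + slope * code.
n = len(codes)
mean_x = sum(codes) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, prices)) / sum(
    (x - mean_x) ** 2 for x in codes
)
intercept = mean_y - slope * mean_x

predictions = [intercept + slope * x for x in codes]
print(predictions)  # [4.0, 24.0, 44.0]
```

The fitted line assumes each step up in size adds the same amount (20 here), so it misses every level: 4 instead of 10, 24 instead of 12, and 44 instead of 50.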
When to Use Ordinal Encoding?
✅ When the feature has meaningful order or rank
✅ When using tree-based models, which split on thresholds and so depend on the order of the codes rather than the spacing between them
🚫 Avoid for nominal data (use One-Hot instead)
🚫 Be cautious with linear models unless the spacing between ranks makes sense
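The point about tree-based models can be checked with a small experiment. This sketch assumes scikit-learn's `DecisionTreeRegressor` and uses made-up target values; it compares predictions on the training rows only, so it illustrates the idea rather than proving it in general.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y = np.array([10.0, 12.0, 50.0, 10.0, 12.0, 50.0])  # made-up target values

# Same ordering of levels, different spacing between the codes.
evenly_spaced = np.array([[0], [1], [2], [0], [1], [2]])
stretched = np.array([[0], [10], [100], [0], [10], [100]])

pred_even = DecisionTreeRegressor(random_state=0).fit(evenly_spaced, y).predict(evenly_spaced)
pred_stretched = DecisionTreeRegressor(random_state=0).fit(stretched, y).predict(stretched)

# The tree only needs thresholds that separate the levels, so spacing is irrelevant.
print(np.allclose(pred_even, pred_stretched))  # True
```

A linear model fitted on the same two encodings would generally give different predictions, which is why the spacing between ranks matters there.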
Conclusion
Ordinal Encoding is ideal when working with ordered categorical data. It’s simple and efficient, but must be used only when order matters. For unordered categories, it can mislead your model, so choose wisely! 🧠📊