Ordinal Encoding

Category Encoders

Description: Encodes categories based on meaningful order.

Relevant Class/Implementation: sklearn.preprocessing.OrdinalEncoder (Scikit-learn)

Ordinal Encoding is a technique in Machine Learning used to convert categorical variables with a meaningful order into numerical values. It assigns an integer label to each category based on its rank, allowing models to learn from the inherent hierarchy in the data.

Why Use Ordinal Encoding?

Some categorical features have a natural order—for example:

  • Education Level: High School < Bachelor's < Master's < PhD
  • Size: Small < Medium < Large

One-Hot Encoding turns each category into its own independent column, so the order information is lost. Ordinal Encoding preserves it.

How Ordinal Encoding Works

Each ordered category is mapped to an integer, starting from 0 (or 1, depending on implementation). The higher the number, the greater the rank or level.

Example 1

Example Dataset (Before Encoding)

Size
------
Small
Medium
Large
Medium
Small

Ordinal Encoded Representation

Size
----
0
1
2
1
0

  • The order is defined: Small < Medium < Large
  • Each row now has a numeric rank based on the size.
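The mapping in this example can be sketched in plain Python, without any library. This is only a minimal illustration: the variable names (`size_order`, `size_to_rank`) are made up for the sketch, and ranks start at 0.

```python
# Explicit rank order, lowest to highest (taken from the example above).
size_order = ["Small", "Medium", "Large"]

# Build a category -> integer lookup; enumerate() starts counting at 0.
size_to_rank = {category: rank for rank, category in enumerate(size_order)}

sizes = ["Small", "Medium", "Large", "Medium", "Small"]
encoded = [size_to_rank[s] for s in sizes]

print(size_to_rank)  # {'Small': 0, 'Medium': 1, 'Large': 2}
print(encoded)       # [0, 1, 2, 1, 0]
```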

Implementing Ordinal Encoding in Python


import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample Data
data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Define order explicitly
size_order = [['Small', 'Medium', 'Large']]

# Initialize and apply OrdinalEncoder
encoder = OrdinalEncoder(categories=size_order)
data['Size_encoded'] = encoder.fit_transform(data[['Size']])

print(data)

Output:
     Size  Size_encoded
0   Small           0.0
1  Medium           1.0
2   Large           2.0
3  Medium           1.0
4   Small           0.0
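The explicit `categories` argument matters. As a hedged illustration, the sketch below fits an encoder without it (the name `auto_encoder` is only for this example). With the default setting, scikit-learn infers the categories from the data and sorts them, so string categories end up in alphabetical order rather than rank order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# No `categories` argument: the encoder infers the categories from the data
# and sorts them, which for strings means alphabetical order.
auto_encoder = OrdinalEncoder()
auto_codes = auto_encoder.fit_transform(data[['Size']]).ravel()

print(auto_encoder.categories_[0])  # ['Large' 'Medium' 'Small']
print(auto_codes)                   # [2. 1. 0. 1. 2.]
```

Here 'Large' is encoded as 0 and 'Small' as 2, which reverses the intended ranking. The default is therefore only safe when alphabetical order happens to match the real order.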

Using map() for Custom Encoding in pandas

data['Size_encoded'] = data['Size'].map({'Small': 0, 'Medium': 1, 'Large': 2})

This approach is quick and flexible, especially when you know the category order ahead of time.
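One caveat with `map()`: any value missing from the dictionary becomes `NaN` instead of raising an error. The sketch below shows this with a made-up category, 'XL', which is an assumption for illustration and not part of the dataset above.

```python
import pandas as pd

size_map = {'Small': 0, 'Medium': 1, 'Large': 2}

# 'XL' is a hypothetical value that is missing from the mapping.
sizes = pd.Series(['Small', 'XL', 'Large'])
encoded = sizes.map(size_map)

# Unmapped values silently turn into NaN.
print(encoded.isna().tolist())  # [False, True, False]

# List the values that had no mapping so they can be handled explicitly.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)  # ['XL']
```

Checking for `NaN` right after the mapping is a cheap way to catch typos or new categories before they reach the model.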

Example 2

import pandas as pd

df = pd.DataFrame({
    'Education': ['High School', 'PhD', 'Bachelor', 'Master', 'High School']
})

# Define custom mapping
edu_map = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}

df['Education_encoded'] = df['Education'].map(edu_map)
print(df)

Education      Education_encoded
-------------  -----------------
High School    0
PhD            3
Bachelor       1
Master         2
High School    0
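As an alternative to a hand-written dictionary, the same result can be obtained with an ordered pandas categorical. This is a sketch of one possible approach, not the method used above; `edu_order` and `edu_cat` are names chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Education': ['High School', 'PhD', 'Bachelor', 'Master', 'High School']
})

# Rank order, lowest to highest.
edu_order = ['High School', 'Bachelor', 'Master', 'PhD']

# An ordered categorical stores each value as an integer code that follows
# the position of its category in `edu_order`.
edu_cat = pd.Categorical(df['Education'], categories=edu_order, ordered=True)
df['Education_encoded'] = edu_cat.codes

print(df['Education_encoded'].tolist())  # [0, 3, 1, 2, 0]
```

Values that are not listed in `categories` receive the code -1, so they remain easy to spot.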

Ordinal Encoding vs One-Hot Encoding

Feature          Ordinal Encoding                 One-Hot Encoding
---------------  -------------------------------  -------------------------------
Output Format    Single column (integers)         Multiple binary columns
Assumes Order?   ✅ Yes                           ❌ No
Best for         Ordered categories (e.g. size)   Nominal categories (e.g. color)
Memory Usage     Low                              High
Risk of Misuse   Can mislead if no real order     Safe for unordered data
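The difference in output format can be seen by encoding the same column both ways. The sketch below uses `pd.get_dummies` for the one-hot side as a convenient stand-in; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Ordinal: one integer column.
ordinal = data['Size'].map({'Small': 0, 'Medium': 1, 'Large': 2})

# One-hot: one binary column per category.
one_hot = pd.get_dummies(data['Size'], prefix='Size')

print(ordinal.shape)          # (5,)
print(one_hot.shape)          # (5, 3)
print(list(one_hot.columns))  # ['Size_Large', 'Size_Medium', 'Size_Small']
```

With three categories the gap is small, but the number of one-hot columns grows with the number of distinct categories, while the ordinal encoding always stays at one column.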

Drawbacks of Ordinal Encoding

  1. False Interpretation of Distance
    • Models that use the numeric value directly (e.g. linear or distance-based models) treat the gap between consecutive levels as equal, which may not reflect reality.
  2. Dangerous for Nominal Data
    • Applying it to unordered categories (e.g., Red, Blue, Green) introduces false ordinal relationships.
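Both drawbacks can be illustrated with a small, library-free sketch. The alphabetical codes and the `gap` helper below are assumptions made purely for the demonstration.

```python
colors = ['Red', 'Blue', 'Green']

# Colors have no real order, so any integer assignment is arbitrary.
# Here the codes simply follow alphabetical order.
color_to_code = {color: code for code, color in enumerate(sorted(colors))}
print(color_to_code)  # {'Blue': 0, 'Green': 1, 'Red': 2}


def gap(a, b):
    """Numeric distance between two encoded categories."""
    return abs(color_to_code[a] - color_to_code[b])


# The encoding now claims Blue is "closer" to Green than to Red,
# a relationship that exists only in the codes, not in the data.
print(gap('Blue', 'Green'))  # 1
print(gap('Blue', 'Red'))    # 2
```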

When to Use Ordinal Encoding?

✅ When the feature has meaningful order or rank

✅ When using tree-based models, which split on thresholds and therefore rely on the order of the codes rather than the exact spacing between them

🚫 Avoid for nominal data (use One-Hot instead)

🚫 Be cautious with linear models unless the spacing between ranks makes sense

Conclusion

Ordinal Encoding is ideal when working with ordered categorical data. It’s simple and efficient, but must be used only when order matters. For unordered categories, it can mislead your model, so choose wisely! 🧠📊