Dummy Encoding 📚

Category Encoders

Description:                    One-Hot with one column dropped.
Relevant Class/Implementation:  OneHotEncoder(drop='first')
Library:                        Scikit-learn

Dummy Encoding

Dummy Encoding is a technique used in Machine Learning to transform categorical variables into a numeric format, just like One-Hot Encoding. However, it intentionally drops one of the dummy variables to avoid redundancy and multicollinearity, especially in linear models.

Why Use Dummy Encoding?

In models like linear regression, including all One-Hot Encoded columns introduces perfect multicollinearity—this is called the dummy variable trap.

Dummy Encoding solves this by dropping one category (the base category); the dropped category then serves as the reference level against which the remaining dummy variables are interpreted.
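To see the trap concretely, here is a minimal sketch (using NumPy, which is only needed for this illustration): with an intercept plus all three one-hot columns, the design matrix is rank-deficient because the dummy columns always sum to the intercept column; dropping one of them restores full rank.

python
import numpy as np

# Full one-hot encoding of the colors [Red, Blue, Green, Blue, Red]
# Columns: Red, Blue, Green
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])

intercept = np.ones((one_hot.shape[0], 1))

# With an intercept, the full one-hot design matrix is rank-deficient:
# Red + Blue + Green equals the intercept column in every row.
full_design = np.hstack([intercept, one_hot])
print(np.linalg.matrix_rank(full_design))   # 3, not 4 -> perfect multicollinearity

# Dropping one dummy column (Red) removes the redundancy.
dummy_design = np.hstack([intercept, one_hot[:, 1:]])
print(np.linalg.matrix_rank(dummy_design))  # 3 == number of columns -> full rank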

How Dummy Encoding Works:

Each category is represented by a binary column, except for one category, which is excluded and acts as the baseline.

Example 1

Example Dataset (Before Encoding)

Color
Red
Blue
Green
Blue
Red

Dummy Encoded Representation (drop one)

Color   Blue  Green
Red        0      0
Blue       1      0
Green      0      1
Blue       1      0
Red        0      0
  • Red is dropped in this hand-worked example and serves as the reference (baseline) category.
  • Rows where both columns are 0 imply the category is Red.

Implementing Dummy Encoding in Python

python
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Dummy Encoding with drop_first=True
# (dtype=int keeps 0/1 output; newer pandas otherwise returns True/False booleans)
encoded_df = pd.get_dummies(data, columns=['Color'], drop_first=True, dtype=int)

print(encoded_df)

Output:

   Color_Green  Color_Red
0            0          1
1            0          0
2            1          0
3            0          0
4            0          1

Note that drop_first=True drops the first category in sorted order, which is Blue here, so Blue is the baseline in this output, whereas the hand-worked table above used Red as the reference.
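For reference, the Scikit-learn class listed in the summary at the top, OneHotEncoder(drop='first'), produces the same kind of encoding. Below is a minimal sketch (assuming scikit-learn 1.2 or newer for the sparse_output parameter):

python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same sample data as above
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# drop='first' discards the first learned category ('Blue' alphabetically),
# matching pandas' drop_first=True behaviour
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype=int)
encoded = encoder.fit_transform(data[['Color']])

encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))
print(encoded_df)  # columns: Color_Green, Color_Red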

Example 2

python
import pandas as pd

df = pd.DataFrame({
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1],
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

# dtype=int keeps 0/1 output (newer pandas otherwise returns booleans)
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)

Output:

       city  target  color_green  color_red
0    London       1            0          1
1     Paris       0            0          0
2  New York       1            1          0
3     Paris       0            0          0
4    London       1            1          0
5  New York       0            0          1
6    London       1            1          0
  • The blue category (first in sorted order) is dropped by drop_first=True and serves as the baseline; the sketch after these bullets shows how to choose a different baseline.
  • This avoids multicollinearity when using regression-based models.
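
drop_first=True always removes whichever category comes first, so if a specific baseline is wanted (for example red, as in the hand-worked table earlier), one option is to convert the column to a pandas Categorical whose category order puts the desired baseline first. A minimal sketch:

python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

# Listing 'red' first means drop_first=True drops it, making red the baseline
df['color'] = pd.Categorical(df['color'], categories=['red', 'blue', 'green'])

df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)  # columns: color_blue, color_green; rows for red are all zeros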

Dummy Encoding vs One-Hot Encoding

Feature                     Dummy Encoding                        One-Hot Encoding
Output Format               Fewer columns (drops one category)    All categories have binary columns
Reference Handling          Yes, establishes a base category      No, treats all categories equally
Risk of Multicollinearity   Avoided                               Present if used without dropping
Use Case                    Linear regression, logistic models    Tree-based models, deep learning
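
As a quick check of the "Output Format" row above, the two encodings can be compared on the same column (a small sketch reusing the Color data from Example 1):

python
import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

one_hot = pd.get_dummies(data, columns=['Color'])                  # keeps every category
dummy = pd.get_dummies(data, columns=['Color'], drop_first=True)   # drops the first one

print(list(one_hot.columns))  # ['Color_Blue', 'Color_Green', 'Color_Red']
print(list(dummy.columns))    # ['Color_Green', 'Color_Red'] -> Blue is the baseline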

Drawbacks of Dummy Encoding

  1. Loss of Category Visibility
    • The dropped category becomes invisible in the encoded data, which might confuse interpretation.
  2. Misuse in Tree-Based Models
    • Tree-based models like Random Forests or XGBoost do not suffer from multicollinearity, so dropping a column may unnecessarily reduce information.

When to Use Dummy Encoding?

✅ When working with linear or logistic regression

✅ When you want to avoid the dummy variable trap

✅ When your categorical feature has few categories

🚫 Avoid for tree-based models or neural networks — One-Hot Encoding is better in those cases.

Conclusion

Dummy Encoding is a simplified version of One-Hot Encoding, tailored to avoid multicollinearity in linear models by dropping one dummy variable. While it helps in creating interpretable and efficient models, it's crucial to use it in the right context—especially with models that assume independent predictors. 🧠💡