Dummy Encoding 📚

Category Encoders

Description:                    One-Hot with one column dropped.
Relevant Class/Implementation:  OneHotEncoder(drop='first')
Library:                        Scikit-learn

Dummy Encoding

Dummy Encoding is a technique used in Machine Learning to transform categorical variables into a numeric format, just like One-Hot Encoding. However, it intentionally drops one of the dummy variables to avoid redundancy and multicollinearity, especially in linear models.

Why Use Dummy Encoding?

In models like linear regression, including all One-Hot Encoded columns introduces perfect multicollinearity—this is called the dummy variable trap.

Dummy Encoding solves this by dropping one category (the base category); the dropped category then serves as the reference level against which the remaining dummy variables are interpreted.
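To see the trap concretely, here is a minimal sketch (using NumPy, which is only needed for this illustration): with an intercept plus all three one-hot columns, the design matrix is rank-deficient because the dummy columns always sum to the intercept column; dropping one of them restores full rank.

python
import numpy as np

# Full one-hot encoding of the colors [Red, Blue, Green, Blue, Red]
# Columns: Red, Blue, Green
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])

intercept = np.ones((one_hot.shape[0], 1))

# With an intercept, the full one-hot design matrix is rank-deficient:
# Red + Blue + Green equals the intercept column in every row.
full_design = np.hstack([intercept, one_hot])
print(np.linalg.matrix_rank(full_design))   # 3, not 4 -> perfect multicollinearity

# Dropping one dummy column (Red) removes the redundancy.
dummy_design = np.hstack([intercept, one_hot[:, 1:]])
print(np.linalg.matrix_rank(dummy_design))  # 3 == number of columns -> full rank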

How Dummy Encoding Works:

Each category is represented by a binary column, except for one category, which is excluded and acts as the baseline.

Example 1

Example Dataset (Before Encoding)

Color
Red
Blue
Green
Blue
Red

Dummy Encoded Representation (drop one)

Color   Blue  Green
Red        0      0
Blue       1      0
Green      0      1
Blue       1      0
Red        0      0
  • Red is dropped in this hand-worked example and serves as the reference (baseline) category.
  • Rows where both columns are 0 imply the category is Red.

Implementing Dummy Encoding in Python

python
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Dummy Encoding with drop_first=True
# (dtype=int keeps 0/1 output; newer pandas otherwise returns True/False booleans)
encoded_df = pd.get_dummies(data, columns=['Color'], drop_first=True, dtype=int)

print(encoded_df)

Output:

   Color_Green  Color_Red
0            0          1
1            0          0
2            1          0
3            0          0
4            0          1

Note that drop_first=True drops the first category in sorted order, which is Blue here, so Blue is the baseline in this output, whereas the hand-worked table above used Red as the reference.
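For reference, the Scikit-learn class listed in the summary at the top, OneHotEncoder(drop='first'), produces the same kind of encoding. Below is a minimal sketch (assuming scikit-learn 1.2 or newer for the sparse_output parameter):

python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same sample data as above
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# drop='first' discards the first learned category ('Blue' alphabetically),
# matching pandas' drop_first=True behaviour
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype=int)
encoded = encoder.fit_transform(data[['Color']])

encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))
print(encoded_df)  # columns: Color_Green, Color_Red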

Example 2

python
import pandas as pd

df = pd.DataFrame({
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1],
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

# dtype=int keeps 0/1 output (newer pandas otherwise returns booleans)
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)

Output:

       city  target  color_green  color_red
0    London       1            0          1
1     Paris       0            0          0
2  New York       1            1          0
3     Paris       0            0          0
4    London       1            1          0
5  New York       0            0          1
6    London       1            1          0
  • The blue category (first in sorted order) is dropped by drop_first=True and serves as the baseline; the sketch after these bullets shows how to choose a different baseline.
  • This avoids multicollinearity when using regression-based models.
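
drop_first=True always removes whichever category comes first, so if a specific baseline is wanted (for example red, as in the hand-worked table earlier), one option is to convert the column to a pandas Categorical whose category order puts the desired baseline first. A minimal sketch:

python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

# Listing 'red' first means drop_first=True drops it, making red the baseline
df['color'] = pd.Categorical(df['color'], categories=['red', 'blue', 'green'])

df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)  # columns: color_blue, color_green; rows for red are all zeros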

Dummy Encoding vs One-Hot Encoding

Feature                     Dummy Encoding                        One-Hot Encoding
Output Format               Fewer columns (drops one category)    All categories have binary columns
Reference Handling          Yes, establishes a base category      No, treats all categories equally
Risk of Multicollinearity   Avoided                               Present if used without dropping
Use Case                    Linear regression, logistic models    Tree-based models, deep learning
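
As a quick check of the "Output Format" row above, the two encodings can be compared on the same column (a small sketch reusing the Color data from Example 1):

python
import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

one_hot = pd.get_dummies(data, columns=['Color'])                  # keeps every category
dummy = pd.get_dummies(data, columns=['Color'], drop_first=True)   # drops the first one

print(list(one_hot.columns))  # ['Color_Blue', 'Color_Green', 'Color_Red']
print(list(dummy.columns))    # ['Color_Green', 'Color_Red'] -> Blue is the baseline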

Drawbacks of Dummy Encoding

  1. Loss of Category Visibility
    • The dropped category becomes invisible in the encoded data, which might confuse interpretation.
  2. Misuse in Tree-Based Models
    • Tree-based models like Random Forests or XGBoost do not suffer from multicollinearity, so dropping a column may unnecessarily reduce information.

When to Use Dummy Encoding?

✅ When working with linear or logistic regression

✅ When you want to avoid the dummy variable trap

✅ When your categorical feature has few categories

🚫 Avoid for tree-based models or neural networks — One-Hot Encoding is better in those cases.

Conclusion

Dummy Encoding is a simplified version of One-Hot Encoding, tailored to avoid multicollinearity in linear models by dropping one dummy variable. While it helps in creating interpretable and efficient models, it's crucial to use it in the right context—especially with models that assume independent predictors. 🧠💡