✅ One-Hot with one column dropped: `OneHotEncoder(drop='first')`
Dummy Encoding
Dummy Encoding is a technique used in Machine Learning to transform categorical variables into a numeric format, just like One-Hot Encoding. However, it intentionally drops one of the dummy variables to avoid redundancy and multicollinearity, especially in linear models.
Why Use Dummy Encoding?
In models like linear regression, including all One-Hot Encoded columns introduces perfect multicollinearity—this is called the dummy variable trap.
Dummy Encoding solves this by dropping one category (the base category), allowing the model to use the dropped variable as a reference.
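The trap shows up directly in the rank of the design matrix: with an intercept, the full set of One-Hot columns always sums to the intercept column. A minimal sketch with NumPy and pandas (illustrative data):

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Blue', 'Red'])

# Full One-Hot: one column per category, plus an intercept column of ones
full = pd.get_dummies(colors, dtype=int)
X_full = np.column_stack([np.ones(len(colors)), full.to_numpy()])

# Dummy Encoding: one category dropped
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)
X_dummy = np.column_stack([np.ones(len(colors)), dummy.to_numpy()])

# The full matrix is rank-deficient: its dummy columns sum to the intercept
print(X_full.shape[1], np.linalg.matrix_rank(X_full))    # 4 columns, rank 3
print(X_dummy.shape[1], np.linalg.matrix_rank(X_dummy))  # 3 columns, rank 3
```

Because `X_full` has more columns than its rank, a linear model cannot find a unique coefficient vector; dropping one dummy restores full column rank.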
How Dummy Encoding Works:
Each category is represented by a binary column, except one category which is excluded to act as the baseline.
Example 1
Example Dataset (Before Encoding)

| Color |
|-------|
| Red |
| Blue |
| Green |
| Blue |
| Red |

Dummy Encoded Representation (one column dropped)

| Color | Blue | Green |
|-------|------|-------|
| Red | 0 | 0 |
| Blue | 1 | 0 |
| Green | 0 | 1 |
| Blue | 1 | 0 |
| Red | 0 | 0 |
- Red is dropped and serves as the reference category.
- Rows where both columns are 0 imply the category is Red.
Implementing Dummy Encoding in Python

```python
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Dummy Encoding with drop_first=True
# (dtype=int gives 0/1 output instead of True/False on recent pandas)
encoded_df = pd.get_dummies(data, columns=['Color'], drop_first=True, dtype=int)
print(encoded_df)
```

Output:

```
   Color_Green  Color_Red
0            0          1
1            0          0
2            1          0
3            0          0
4            0          1
```

Note that `pd.get_dummies` drops the *first* category in sorted order, so here Blue (not Red) becomes the baseline: rows with 0 in both columns represent Blue.
Example 2

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'target': [1, 0, 1, 0, 1, 0, 1],
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)
```

| city | target | color_green | color_red |
|------|--------|-------------|-----------|
| London | 1 | 0 | 1 |
| Paris | 0 | 0 | 0 |
| New York | 1 | 1 | 0 |
| Paris | 0 | 0 | 0 |
| London | 1 | 1 | 0 |
| New York | 0 | 0 | 1 |
| London | 1 | 1 | 0 |
- The blue category (first in alphabetical order) is dropped and serves as the baseline; rows with 0 in both dummy columns are blue.
- This avoids multicollinearity when using regression-based models.
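Since `pd.get_dummies` always drops the first category in sorted order, choosing a specific baseline (say, red) requires converting the column to a `Categorical` with the desired category listed first — a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green']
})

# Listing 'red' first makes it the category that drop_first removes
df['color'] = pd.Categorical(df['color'], categories=['red', 'blue', 'green'])
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)

print(df_dummy.columns.tolist())  # ['color_blue', 'color_green']
```

This matters for interpretation: every coefficient in a linear model is measured relative to the dropped baseline, so the baseline should be a meaningful reference group.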
Dummy Encoding vs One-Hot Encoding

| Feature | Dummy Encoding | One-Hot Encoding |
|---------|----------------|------------------|
| Output Format | Fewer columns (drops one category) | All categories have binary columns |
| Reference Handling | Yes, establishes a base category | No, treats all categories equally |
| Risk of Multicollinearity | Avoided | Present if used without dropping |
| Use Case | Linear regression, logistic models | Tree-based models, deep learning |
Drawbacks of Dummy Encoding
- Loss of category visibility: the dropped category becomes invisible in the encoded data, which can complicate interpretation.
- Misuse in tree-based models: models like Random Forests or XGBoost do not suffer from multicollinearity, so dropping a column unnecessarily discards information.
When to Use Dummy Encoding?
✅ When working with linear or logistic regression
✅ When you want to avoid the dummy variable trap
✅ When your categorical feature has few categories
🚫 Avoid for tree-based models or neural networks — One-Hot Encoding is better in those cases.
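Putting the pieces together, here is a minimal end-to-end sketch (illustrative data, scikit-learn assumed available) of dummy-encoded features feeding a logistic regression:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'city':  ['London', 'Paris', 'New York', 'Paris', 'London', 'New York', 'London'],
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red', 'green'],
    'target': [1, 0, 1, 0, 1, 0, 1],
})

# Dummy-encode both categorical features (one baseline dropped per feature)
X = pd.get_dummies(df[['city', 'color']], drop_first=True, dtype=int)
y = df['target']

model = LogisticRegression().fit(X, y)

# Each coefficient is a log-odds shift relative to that feature's dropped baseline
for name, coef in zip(X.columns, model.coef_[0]):
    print(f'{name}: {coef:.2f}')
```

With `drop_first=True` applied per feature, the 3-category `city` and 3-category `color` columns yield 2 + 2 = 4 predictors, each interpretable against its baseline.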
Conclusion
Dummy Encoding is a simplified version of One-Hot Encoding, tailored to avoid multicollinearity in linear models by dropping one dummy variable. While it helps in creating interpretable and efficient models, it's crucial to use it in the right context—especially with models that assume independent predictors. 🧠💡