One-Hot Encoding 📚

Category Encoders

Description: Binary columns per category.
Relevant Class/Implementation: sklearn.preprocessing.OneHotEncoder (Scikit-learn)

One-Hot Encoding

One-Hot Encoding (OHE) is a technique used in Machine Learning to convert categorical variables into a numerical format, ensuring that models correctly interpret categorical data without introducing ordinal relationships.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories='auto',         # Auto-detect or specify list of categories
    drop=None,                 # Drop one category (e.g. 'first') to avoid collinearity
    sparse_output=False,       # Output dense array (replaces sparse=False)
    dtype=int,                 # Data type of output (e.g. int, float)
    handle_unknown='ignore',   # Handle unseen labels gracefully
)

This creates an instance of OneHotEncoder with output as a dense NumPy array instead of a sparse matrix.

All Parameters of OneHotEncoder (as of scikit-learn v1.4)

  • categories ('auto' or a list of array-like; default 'auto'): 'auto' determines the categories from the training data; alternatively, provide a list of lists such as [['Red', 'Green', 'Blue']].
  • drop (None, 'first', 'if_binary', or an array-like with one category per feature; default None): Drops one category per feature to avoid multicollinearity (e.g. drop='first').
  • sparse_output (bool; default True): Replaces the deprecated sparse parameter. If False, returns a dense array (sparse_output=False).
  • dtype (NumPy dtype; default np.float64): Data type of the encoded output array.
  • handle_unknown ({'error', 'ignore', 'infrequent_if_exist'}; default 'error'): How to handle categories encountered during transform that were not seen during fit.
  • min_frequency (int, float, or None; default None): Categories with a frequency below this threshold are grouped into a single infrequent category.
  • max_categories (int or None; default None): Upper limit on the number of output features per input feature when infrequent categories are grouped.
  • feature_name_combiner ('concat' or callable; default 'concat'): Logic for combining a feature name and a category into an output feature name (used by get_feature_names_out).

⚠️ Important Notes:

  • sparse has been deprecated since scikit-learn v1.2, and you should use sparse_output instead.
  • encoder = OneHotEncoder(sparse_output=False)  # New syntax
  • handle_unknown='ignore' is useful during inference, when the test data may contain categories that were not seen during training.
  • After fitting, you can call encoder.get_feature_names_out() to see the generated feature names (both notes are illustrated in the sketch after this list).
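
A minimal sketch of both notes, using a single invented feature column of color values (not data from the original text):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['Red'], ['Green'], ['Blue']])
X_test = np.array([['Green'], ['Purple']])   # 'Purple' was never seen during fit

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(X_train)

print(encoder.get_feature_names_out())   # ['x0_Blue' 'x0_Green' 'x0_Red']
print(encoder.transform(X_test))
# [[0. 1. 0.]   <- 'Green'
#  [0. 0. 0.]]  <- 'Purple' is unknown, so every column stays 0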

Example with Full Parameters

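The snippet shown earlier only constructs the encoder. The end-to-end sketch below fits it on a small DataFrame (the 'Color' and 'Size' columns are invented for illustration) and shows the resulting output:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green'],
    'Size':  ['S',   'M',     'M',    'L'],
})

encoder = OneHotEncoder(
    categories='auto',
    drop=None,
    sparse_output=False,
    dtype=int,
    handle_unknown='ignore',
)

encoded = encoder.fit_transform(df)

print(encoder.get_feature_names_out())
# ['Color_Blue' 'Color_Green' 'Color_Red' 'Size_L' 'Size_M' 'Size_S']
print(encoded)
# [[0 0 1 0 0 1]
#  [0 1 0 0 1 0]
#  [1 0 0 0 1 0]
#  [0 1 0 1 0 0]]

Each input feature expands into one binary column per category, so two columns with three categories each become six output columns.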

Why Use One-Hot Encoding?

Some machine learning models misinterpret label-encoded data as having a ranking or order (e.g., Red = 0, Blue = 1, Green = 2).

To avoid this issue, One-Hot Encoding creates separate binary columns for each category, allowing models to treat them independently.

How One-Hot Encoding Works: Each unique category is assigned a separate binary column (0 or 1).

Example 1
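
The original example content did not survive extraction, so the following is an illustrative sketch with an invented single 'Color' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

encoder = OneHotEncoder(sparse_output=False, dtype=int)
encoded = encoder.fit_transform(df[['Color']])

print(pd.DataFrame(encoded, columns=encoder.get_feature_names_out()))
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0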

Example 2
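
This example also did not survive extraction; as a sketch of the drop parameter described above (using the same invented 'Color' column), drop='first' removes one category per feature, so k categories become k - 1 columns:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# The first category (alphabetically, 'Blue') is dropped to avoid collinearity
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype=int)
encoded = encoder.fit_transform(df[['Color']])

print(encoder.get_feature_names_out())  # ['Color_Green' 'Color_Red']
print(encoded)
# [[0 1]   <- Red
#  [1 0]   <- Green
#  [0 0]   <- Blue (the dropped category is encoded as all zeros)
#  [1 0]]  <- Green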

One-Hot Encoding vs Label Encoding
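
The comparison that originally followed this heading is missing; as a rough sketch of the difference, label encoding maps each category to a single integer (which implies an order), while one-hot encoding produces independent binary columns:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['Red', 'Green', 'Blue', 'Green'])

# Label encoding: one integer per category -- implicitly suggests Blue < Green < Red
print(LabelEncoder().fit_transform(colors))   # [2 1 0 1]

# One-hot encoding: one independent binary column per category, no implied order
print(OneHotEncoder(sparse_output=False, dtype=int).fit_transform(colors.reshape(-1, 1)))
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]]

(In scikit-learn, LabelEncoder is intended for target labels; OrdinalEncoder is the feature-oriented equivalent of label encoding.)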

Handling High Cardinality in One-Hot Encoding
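
The detail for this section is missing from the extracted text. One built-in mitigation worth noting (available in scikit-learn >= 1.1 via the min_frequency and max_categories parameters listed above) is to group rare categories into a single "infrequent" column. A minimal sketch with an invented 'City' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['Paris'] * 5 + ['London'] * 4 + ['Oslo', 'Lima', 'Quito']})

# Categories that appear fewer than 3 times are merged into one 'infrequent' bucket
encoder = OneHotEncoder(min_frequency=3, sparse_output=False, dtype=int)
encoder.fit(df[['City']])

print(encoder.get_feature_names_out())
# ['City_London' 'City_Paris' 'City_infrequent_sklearn']

This keeps the number of output columns bounded even when the raw column has many distinct values.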

When to Use One-Hot Encoding?

✅ When working with nominal data (no order)

✅ When the number of unique categories is small

✅ When using linear models, logistic regression, or deep learning

🚫 Avoid when the dataset has high-cardinality categorical variables (e.g., thousands of unique values like city names). Instead, consider target encoding or embedding layers.
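
For reference, a minimal sketch of the target-encoding alternative, assuming scikit-learn >= 1.3 (which provides sklearn.preprocessing.TargetEncoder); the city names and labels are invented:

import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([['Paris'], ['London'], ['Paris'], ['Tokyo'], ['London'], ['Paris']])
y = np.array([1, 0, 1, 0, 1, 0])

# Each category is replaced by a (cross-fitted, smoothed) mean of the target
encoder = TargetEncoder(random_state=0)
X_encoded = encoder.fit_transform(X, y)

print(X_encoded.shape)   # (6, 1) -- dimensionality does not grow with cardinality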

Conclusion

One-Hot Encoding is a powerful technique for converting categorical data into numeric form without introducing false ordinal relationships. However, it increases dimensionality, so it should be used carefully for datasets with many unique categories. 🚀