One-Hot Encoding 📚

Category Encoders

Description: Binary columns per category.
Relevant Class/Implementation: sklearn.preprocessing.OneHotEncoder (Scikit-learn)

One-Hot Encoding

One-Hot Encoding (OHE) is a technique used in Machine Learning to convert categorical variables into a numerical format, ensuring that models correctly interpret categorical data without introducing ordinal relationships.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories='auto',         # Auto-detect or specify list of categories
    drop=None,                 # Drop one category (e.g. 'first') to avoid collinearity
    sparse_output=False,       # Output dense array (replaces sparse=False)
    dtype=int,                 # Data type of output (e.g. int, float)
    handle_unknown='ignore',   # Handle unseen labels gracefully
)

This creates an instance of OneHotEncoder with output as a dense NumPy array instead of a sparse matrix.

All Parameters of OneHotEncoder (as of scikit-learn v1.4)

  • categories ('auto' or a list of array-like; default 'auto'): 'auto' determines the categories from the training data; alternatively, provide a list of lists such as [['Red', 'Green', 'Blue']].
  • drop (None, 'first', 'if_binary', or an array-like with one category per feature; default None): Drops one category per feature to avoid multicollinearity (e.g. drop='first').
  • sparse_output (bool; default True): Replaces the deprecated sparse parameter. If False, returns a dense array (sparse_output=False).
  • dtype (NumPy dtype; default np.float64): Data type of the encoded output array.
  • handle_unknown ({'error', 'ignore', 'infrequent_if_exist'}; default 'error'): How to handle categories encountered during transform that were not seen during fit.
  • min_frequency (int, float, or None; default None): Categories with a frequency below this threshold are grouped into a single infrequent category.
  • max_categories (int or None; default None): Upper limit on the number of output features per input feature when infrequent categories are grouped.
  • feature_name_combiner ('concat' or callable; default 'concat'): Logic for combining a feature name and a category into an output feature name (used by get_feature_names_out).

⚠️ Important Notes:

  • sparse has been deprecated since scikit-learn v1.2, and you should use sparse_output instead.
  • encoder = OneHotEncoder(sparse_output=False)  # New syntax
  • handle_unknown='ignore' is useful during inference, when the test data may contain categories that were not seen during training.
  • After fitting, you can call encoder.get_feature_names_out() to see the generated feature names (both notes are illustrated in the sketch after this list).
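
A minimal sketch of both notes, using a single invented feature column of color values (not data from the original text):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['Red'], ['Green'], ['Blue']])
X_test = np.array([['Green'], ['Purple']])   # 'Purple' was never seen during fit

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(X_train)

print(encoder.get_feature_names_out())   # ['x0_Blue' 'x0_Green' 'x0_Red']
print(encoder.transform(X_test))
# [[0. 1. 0.]   <- 'Green'
#  [0. 0. 0.]]  <- 'Purple' is unknown, so every column stays 0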

Example with Full Parameters

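The snippet shown earlier only constructs the encoder. The end-to-end sketch below fits it on a small DataFrame (the 'Color' and 'Size' columns are invented for illustration) and shows the resulting output:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green'],
    'Size':  ['S',   'M',     'M',    'L'],
})

encoder = OneHotEncoder(
    categories='auto',
    drop=None,
    sparse_output=False,
    dtype=int,
    handle_unknown='ignore',
)

encoded = encoder.fit_transform(df)

print(encoder.get_feature_names_out())
# ['Color_Blue' 'Color_Green' 'Color_Red' 'Size_L' 'Size_M' 'Size_S']
print(encoded)
# [[0 0 1 0 0 1]
#  [0 1 0 0 1 0]
#  [1 0 0 0 1 0]
#  [0 1 0 1 0 0]]

Each input feature expands into one binary column per category, so two columns with three categories each become six output columns.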

Why Use One-Hot Encoding?

Some machine learning models misinterpret label-encoded data as having a ranking or order (e.g., Red = 0, Blue = 1, Green = 2).

To avoid this issue, One-Hot Encoding creates separate binary columns for each category, allowing models to treat them independently.

How One-Hot Encoding Works: Each unique category is assigned a separate binary column (0 or 1).

Example 1
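
The original example content did not survive extraction, so the following is an illustrative sketch with an invented single 'Color' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

encoder = OneHotEncoder(sparse_output=False, dtype=int)
encoded = encoder.fit_transform(df[['Color']])

print(pd.DataFrame(encoded, columns=encoder.get_feature_names_out()))
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0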

Example 2
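
This example also did not survive extraction; as a sketch of the drop parameter described above (using the same invented 'Color' column), drop='first' removes one category per feature, so k categories become k - 1 columns:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# The first category (alphabetically, 'Blue') is dropped to avoid collinearity
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype=int)
encoded = encoder.fit_transform(df[['Color']])

print(encoder.get_feature_names_out())  # ['Color_Green' 'Color_Red']
print(encoded)
# [[0 1]   <- Red
#  [1 0]   <- Green
#  [0 0]   <- Blue (the dropped category is encoded as all zeros)
#  [1 0]]  <- Green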

One-Hot Encoding vs Label Encoding
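
The comparison that originally followed this heading is missing; as a rough sketch of the difference, label encoding maps each category to a single integer (which implies an order), while one-hot encoding produces independent binary columns:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['Red', 'Green', 'Blue', 'Green'])

# Label encoding: one integer per category -- implicitly suggests Blue < Green < Red
print(LabelEncoder().fit_transform(colors))   # [2 1 0 1]

# One-hot encoding: one independent binary column per category, no implied order
print(OneHotEncoder(sparse_output=False, dtype=int).fit_transform(colors.reshape(-1, 1)))
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]]

(In scikit-learn, LabelEncoder is intended for target labels; OrdinalEncoder is the feature-oriented equivalent of label encoding.)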

Handling High Cardinality in One-Hot Encoding
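
The detail for this section is missing from the extracted text. One built-in mitigation worth noting (available in scikit-learn >= 1.1 via the min_frequency and max_categories parameters listed above) is to group rare categories into a single "infrequent" column. A minimal sketch with an invented 'City' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['Paris'] * 5 + ['London'] * 4 + ['Oslo', 'Lima', 'Quito']})

# Categories that appear fewer than 3 times are merged into one 'infrequent' bucket
encoder = OneHotEncoder(min_frequency=3, sparse_output=False, dtype=int)
encoder.fit(df[['City']])

print(encoder.get_feature_names_out())
# ['City_London' 'City_Paris' 'City_infrequent_sklearn']

This keeps the number of output columns bounded even when the raw column has many distinct values.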

When to Use One-Hot Encoding?

✅ When working with nominal data (no order)

✅ When the number of unique categories is small

✅ When using linear models, logistic regression, or deep learning

🚫 Avoid when the dataset has high-cardinality categorical variables (e.g., thousands of unique values like city names). Instead, consider target encoding or embedding layers.
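
For reference, a minimal sketch of the target-encoding alternative, assuming scikit-learn >= 1.3 (which provides sklearn.preprocessing.TargetEncoder); the city names and labels are invented:

import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([['Paris'], ['London'], ['Paris'], ['Tokyo'], ['London'], ['Paris']])
y = np.array([1, 0, 1, 0, 1, 0])

# Each category is replaced by a (cross-fitted, smoothed) mean of the target
encoder = TargetEncoder(random_state=0)
X_encoded = encoder.fit_transform(X, y)

print(X_encoded.shape)   # (6, 1) -- dimensionality does not grow with cardinality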

Conclusion

One-Hot Encoding is a powerful technique for converting categorical data into numeric form without introducing false ordinal relationships. However, it increases dimensionality, so it should be used carefully for datasets with many unique categories. 🚀