One-Hot Encoding
One-Hot Encoding (OHE) is a technique used in Machine Learning to convert categorical variables into a numerical format, ensuring that models correctly interpret categorical data without introducing ordinal relationships.
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories='auto',        # Auto-detect or specify list of categories
    drop=None,                # Drop one category (e.g. 'first') to avoid collinearity
    sparse_output=False,      # Output dense array (replaces sparse=False)
    dtype=int,                # Data type of output (e.g. int, float)
    handle_unknown='ignore',  # Handle unseen labels gracefully
)
This creates a OneHotEncoder instance whose output is a dense NumPy array instead of a sparse matrix.
All Parameters of OneHotEncoder (as of scikit-learn v1.4)
Parameter | Type | Default | Description |
--- | --- | --- | --- |
categories | 'auto' or list of array-like | 'auto' | 'auto' determines categories from the training data; alternatively provide a list of lists like [['Red', 'Green', 'Blue']]. |
drop | None, 'first', 'if_binary', or array-like | None | Drops one category per feature to avoid multicollinearity (e.g. drop='first'). |
sparse_output | bool | True | (Replaces sparse.) If False, returns a dense NumPy array. |
dtype | dtype | np.float64 | Data type of the encoded output array. |
handle_unknown | {'error', 'ignore', 'infrequent_if_exist'} | 'error' | How to handle categories not seen during fit when transforming. |
min_frequency | int, float, or None | None | Categories rarer than this threshold are grouped into a single infrequent category; pairs well with handle_unknown='infrequent_if_exist'. |
max_categories | int or None | None | Upper limit on the number of output categories per feature (the infrequent group counts as one). |
feature_name_combiner | 'concat' or callable | 'concat' | How to combine feature name and category into an output name (for get_feature_names_out). |
⚠️ Important Notes:
- sparse has been deprecated since scikit-learn v1.2; use sparse_output instead:
  encoder = OneHotEncoder(sparse_output=False)  # New syntax
- handle_unknown='ignore' is useful during inference, when test data may contain categories not seen in training.
- You can call encoder.get_feature_names_out() after fitting to see the actual output feature names.
Why Use One-Hot Encoding?
Some machine learning models misinterpret label-encoded data as having a ranking or order (e.g., Red = 0, Blue = 1, Green = 2).
To avoid this issue, One-Hot Encoding creates separate binary columns for each category, allowing models to treat them independently.
How One-Hot Encoding Works: Each unique category is assigned a separate binary column (0 or 1).
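The mechanism can be sketched without any library at all (toy values assumed):

```python
# One binary column per unique category, computed by hand
colors = ['Red', 'Green', 'Blue', 'Green']
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']

one_hot = [[int(value == cat) for cat in categories] for value in colors]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

This is exactly what OneHotEncoder does under the hood, plus bookkeeping for unseen categories, sparse output, and feature names.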
One-Hot Encoding vs Label Encoding
Handling High Cardinality in One-Hot Encoding
When to Use One-Hot Encoding?
✅ When working with nominal data (no order)
✅ When the number of unique categories is small
✅ When using linear models, logistic regression, or deep learning
🚫 Avoid when the dataset has high-cardinality categorical variables (e.g., thousands of unique values like city names). Instead, consider target encoding or embedding layers.
Conclusion
One-Hot Encoding is a powerful technique for converting categorical data into numeric form without introducing false ordinal relationships. However, it increases dimensionality, so it should be used carefully for datasets with many unique categories. 🚀