One-Hot Encoding
One-Hot Encoding (OHE) is a technique used in Machine Learning to convert categorical variables into a numerical format, ensuring that models correctly interpret categorical data without introducing ordinal relationships.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(
    categories='auto',         # Auto-detect or specify list of categories
    drop=None,                 # Drop one category (e.g. 'first') to avoid collinearity
    sparse_output=False,       # Output dense array (replaces sparse=False)
    dtype=int,                 # Data type of output (e.g. int, float)
    handle_unknown='ignore',   # Handle unseen labels gracefully
)
This creates an instance of OneHotEncoder with output as a dense NumPy array instead of a sparse matrix.
All Parameters of OneHotEncoder (as of scikit-learn v1.4)
| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | 'auto' or list | 'auto' | 'auto' determines categories from the training data; or provide a list of lists, e.g. [['Red', 'Green', 'Blue']]. |
| drop | None, 'first', 'if_binary', or array | None | Drops one category per feature to avoid multicollinearity (e.g. drop='first'). |
| sparse_output | bool | True | Replaces the deprecated sparse. If False, returns a dense NumPy array. |
| dtype | dtype | np.float64 | Data type of the encoded output array. |
| handle_unknown | {'error', 'ignore', 'infrequent_if_exist'} | 'error' | How to handle categories not seen during fit. |
| min_frequency | int, float, or None | None | Categories with a frequency below this threshold are grouped into a single 'infrequent' category. |
| max_categories | int or None | None | Upper limit on the number of output categories per feature; rarer categories are grouped as infrequent. |
| feature_name_combiner | 'concat' or callable | 'concat' | Logic for combining a feature name and a category into an output name (used by get_feature_names_out). |
⚠️ Important Notes:
- sparse has been deprecated since scikit-learn v1.2; use sparse_output instead:
  encoder = OneHotEncoder(sparse_output=False)  # new syntax
- handle_unknown='ignore' is useful during inference, when test data might contain categories not seen in training.
- You can call encoder.get_feature_names_out() after fitting to see the actual output feature names.
Why Use One-Hot Encoding?
Some machine learning models misinterpret label-encoded data as having a ranking or order (e.g., Red = 0, Blue = 1, Green = 2).
To avoid this issue, One-Hot Encoding creates separate binary columns for each category, allowing models to treat them independently.
How One-Hot Encoding Works: Each unique category is assigned a separate binary column (0 or 1).
Example 1
Example 2
One-Hot Encoding vs Label Encoding
Handling High Cardinality in One-Hot Encoding
When to Use One-Hot Encoding?
✅ When working with nominal data (no order)
✅ When the number of unique categories is small
✅ When using linear models, logistic regression, or deep learning
🚫 Avoid when the dataset has high-cardinality categorical variables (e.g., thousands of unique values like city names). Instead, consider target encoding or embedding layers.
Conclusion
One-Hot Encoding is a powerful technique for converting categorical data into numeric form without introducing false ordinal relationships. However, it increases dimensionality, so it should be used carefully for datasets with many unique categories. 🚀