- Input: Pandas DataFrame or Series.
- Method: Directly applied on DataFrame columns.
- Best for: When your data is structured in a tabular format (e.g., CSV, SQL, or Excel data loaded into a DataFrame).
- Example Use Case: Quickly encode a column in a DataFrame for feature engineering in data analysis or machine learning.
- What It Does:
  - Converts categorical variables into one-hot encoded (dummy) columns.
  - Works on a whole DataFrame or a single Series; columns that are not encoded pass through unchanged.
  - Replaces each encoded column with one new column per unique value.
  - Example: If a column named color has the values ['red', 'blue', 'green'], it generates three new columns: color_red, color_blue, color_green.
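The color example above can be sketched as follows. This is a minimal illustration, not a canonical recipe: the DataFrame, the column names color and price, and the values are all made up.

```python
import pandas as pd

# Hypothetical toy data; "color" and "price" are assumed names for illustration.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "price": [10, 12, 9, 11],
})

# Encode only the "color" column; "price" passes through unchanged.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())
# ['price', 'color_blue', 'color_green', 'color_red']
```

The new columns are ordered by category value, not by order of appearance, and the original color column no longer exists in the result.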
Syntax
pd.get_dummies(data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
)
Parameters:
| Parameter | Description | Default | Recommended Values |
| --- | --- | --- | --- |
| data | DataFrame or Series to be encoded. | Required | Your DataFrame or Series containing categorical data. |
| columns | Column names in the DataFrame to encode. If None, all columns with object, string, or category dtype are encoded. | None | Specify columns explicitly for control (e.g., ['category', 'type']). |
| prefix | String, list of strings, or dict mapping column names to prefixes, prepended to the new column names. A list must have one entry per encoded column. If None, the original column names are used. | None | Use meaningful prefixes (e.g., ['col', 'type']) for clarity, especially when encoding multiple columns. |
| prefix_sep | Separator between the prefix and the category value. | '_' | Keep '_' unless your naming convention calls for another separator (e.g., '-'). |
| dummy_na | Whether to add an indicator column for missing values (NaN). | False | Set to True if missing values (NaN) are present and need explicit handling. |
| drop_first | Whether to drop the first category level, yielding k-1 columns for k categories. | False | Set to True for regression models with an intercept to avoid perfect multicollinearity. |
| dtype | Data type of the resulting one-hot encoded columns. | None (resolves to bool in pandas 2.0 and later, uint8 in earlier versions) | Keep bool or use dtype='uint8' for compact storage; use a wider type such as 'int64' only when a downstream model requires it, since it uses more memory. |
| sparse | Whether the encoded columns are backed by a SparseArray instead of a regular NumPy array. | False | Use True for large datasets with many unique values to save memory. |
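A short sketch of how several of these parameters change the output. The DataFrame and the column name size are hypothetical, and dtype='uint8' is chosen here only to make the 0/1 values explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value; "size" is an assumed column name.
df = pd.DataFrame({"size": ["S", "M", np.nan, "L"]})

# dummy_na=True adds an extra indicator column for the NaN row.
with_na = pd.get_dummies(df, columns=["size"], dummy_na=True, dtype="uint8")
print(with_na)

# drop_first=True drops the first level ("L" after sorting), leaving k-1 columns.
reduced = pd.get_dummies(df, columns=["size"], drop_first=True, dtype="uint8")
print(reduced.columns.tolist())

# prefix and prefix_sep control how the new columns are named.
renamed = pd.get_dummies(df, columns=["size"], prefix="sz", prefix_sep="-")
print(renamed.columns.tolist())

# sparse=True stores the encoded columns as sparse arrays.
sparse_enc = pd.get_dummies(df, columns=["size"], sparse=True)
print(sparse_enc.dtypes)
```

Without dummy_na=True, the row containing NaN is encoded as all zeros, which is easy to overlook.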
Example 1 pd.get_dummies()
Example 2 pd.get_dummies()
When to Use:
- When working with pandas DataFrames directly.
- When you want a quick and simple transformation of categorical variables to one-hot encoded columns without much preprocessing.
- Ideal for exploratory data analysis or pipelines that stay within pandas.
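- Pros:
  - Simple and easy to use for small to medium-sized datasets.
  - Directly integrates with pandas DataFrames.
- Cons:
  - Not designed for workflows involving transformations beyond pandas: it is stateless, so the categories seen in one DataFrame are not remembered when encoding another.
  - Only object, string, and category columns are encoded by default; categorical data stored as numbers (e.g., integer codes) must be listed explicitly in columns.
  - Cannot take dictionaries as input directly (convert them to a DataFrame first); sparse output is available only through sparse=True.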
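One practical consequence of staying within pandas is that get_dummies keeps no record of the categories it has seen, so two DataFrames encoded separately can end up with different columns. Below is a minimal sketch of one common workaround, aligning the second result to the first with reindex. The DataFrames train and new and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical "training" data and later data with an unseen category.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "purple"]})

train_enc = pd.get_dummies(train, columns=["color"], dtype="uint8")
new_enc = pd.get_dummies(new, columns=["color"], dtype="uint8")

# Encoded separately, the two frames do not share the same columns.
print(train_enc.columns.tolist())
print(new_enc.columns.tolist())

# Align to the training columns: categories missing from `new` are filled
# with 0, and the unseen category ("purple") is dropped.
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_aligned)
```

Dropping the unseen category silently is a design choice of this sketch; depending on the task, raising an error or mapping it to an "other" bucket may be more appropriate.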