- Input: Pandas DataFrame or Series.
- Method: Directly applied on DataFrame columns.
- Best for: When your data is structured in a tabular format (e.g., CSV, SQL, or Excel data loaded into a DataFrame).
- Example Use Case: Quickly encode a column in a DataFrame for feature engineering in data analysis or machine learning.
- What It Does:
- Converts categorical variables into one-hot encoded columns.
- Automatically handles the transformation for pandas DataFrames or Series.
- Adds a new column for each unique value in the specified categorical columns.
- Example: If a column has values ['red', 'blue', 'green'], it generates three new columns: color_red, color_blue, color_green.
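The behaviour described in the list above can be sketched as follows. The DataFrame, its `color` column, and the variable names are assumptions made up for illustration. `dtype=int` is passed so the output is 0/1 integers regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical data: one categorical column with three distinct values.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One new column is created per unique value, named <column>_<value>.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
print(encoded)
```

Note that the new columns come out ordered by sorted category value, not by order of first appearance, and that the original `color` column is replaced by the encoded columns.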
Syntax
pd.get_dummies(data,
columns=None,
prefix=None,
prefix_sep='_',
dummy_na=False,
sparse=False,
drop_first=False,
dtype=None
)
Parameters:
Parameter | Description | Default | Recommended Values |
data | DataFrame or Series to be encoded. | Required | Your DataFrame or Series containing categorical data. |
columns | Column names in the DataFrame to encode. If None, all columns with object, string, or category dtype are encoded. | None | Specify columns explicitly for control (e.g., ['category', 'type']). |
prefix | String, list of strings, or dict mapping column names to prefixes, prepended to the new column names. A list must have one entry per encoded column. | None (the original column name is used) | Use meaningful prefixes (e.g., ['col', 'type']) for clarity, especially when encoding multiple columns. |
prefix_sep | Separator between the prefix and the category value. | '_' | Keep '_' unless your naming convention calls for another separator such as '-'. |
dummy_na | Whether to add a column for missing values (NaN). | False | Set to True if missing values (NaN) are present and need explicit handling. |
drop_first | Whether to drop the first category level, giving k-1 columns for k categories (avoids multicollinearity in regression models). | False | Set to True in regression models to avoid multicollinearity. |
dtype | Data type of the resulting one-hot encoded columns. | None (resolves to bool in pandas 2.0+, uint8 in earlier versions) | Use dtype='uint8' for compact 0/1 integer columns; dtype='int64' takes eight times the memory and is only worth it if downstream code needs that type. |
sparse | Whether the encoded columns are backed by a sparse array (SparseArray) instead of a regular NumPy array. | False | Use True for large datasets with many unique values to save memory. |
Example 1: pd.get_dummies()
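A minimal sketch of how `columns`, `prefix`, `prefix_sep`, `drop_first`, and `dtype` combine. The DataFrame, its column names, and the chosen prefixes are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["red", "blue", "green"],
    "size": ["S", "M", "S"],
    "price": [10.0, 12.5, 9.0],
})

encoded = pd.get_dummies(
    df,
    columns=["color", "size"],  # encode only these two columns
    prefix=["col", "sz"],       # one prefix per encoded column
    prefix_sep="-",             # names become <prefix>-<value>
    drop_first=True,            # drop the first (sorted) level of each column
    dtype="uint8",              # compact 0/1 integers
)

print(encoded.columns.tolist())
# ['price', 'col-green', 'col-red', 'sz-S']
```

Columns that are not encoded (`price` here) are kept and placed first. With `drop_first=True` the dropped level is the first one in sorted order (`blue` and `M`), so a row with all zeros in the `col-` columns stands for `blue`.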
Example 2: pd.get_dummies()
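A second sketch, this time for `dummy_na` and `sparse`. The `grade` column and its values are assumed purely for illustration; the point is how missing values are treated with and without `dummy_na=True`, and what `sparse=True` changes about the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the categorical column.
df = pd.DataFrame({"grade": ["a", "b", np.nan, "a"]})

# Default: the NaN row gets 0 in every encoded column.
without_na = pd.get_dummies(df, columns=["grade"], dtype=int)

# dummy_na=True: an extra column is added that flags missing values.
with_na = pd.get_dummies(df, columns=["grade"], dummy_na=True, dtype=int)

# sparse=True: same values, but each column is backed by a SparseArray.
sparse_encoded = pd.get_dummies(df, columns=["grade"], sparse=True, dtype=int)

print(without_na.shape, with_na.shape)  # (4, 2) (4, 3)
print(with_na)
print(sparse_encoded.dtypes)
```

Without `dummy_na`, a missing value is indistinguishable from "none of the listed categories", which may or may not be what a model should see. The sparse result is still a DataFrame, so it can be used like the dense one, although some operations will densify it.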
When to Use:
- When working with pandas DataFrames directly.
- When you want a quick and simple transformation of categorical variables to one-hot encoded columns without much preprocessing.
- Ideal for exploratory data analysis or pipelines that stay within pandas.
- Pros:
- Simple and easy to use for small to medium-sized datasets.
- Directly integrates with pandas DataFrames.
- Cons:
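- Not designed for workflows that go beyond pandas: it is a stateless function, so it does not remember the categories it saw and cannot re-apply the same encoding to new data.
- Only object, string, and category columns are encoded automatically; numeric columns that hold categories must be listed explicitly in columns.
- Does not encode feature dictionaries (e.g., a list of dicts) directly, and sparse output is limited to pandas sparse columns (sparse=True) rather than a SciPy sparse matrix.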