API Reference : SimpleImputer
1. Introduction to SimpleImputer
What is SimpleImputer?
SimpleImputer is a class from scikit-learn's sklearn.impute
module that handles missing values in datasets using simple, univariate strategies. It's designed for straightforward imputation tasks where you want to replace missing values with basic statistical measures.
Why Use SimpleImputer?
- Quick and efficient: Fast processing for large datasets
- Simple to implement: Minimal code required
- Versatile: Works with both numerical and categorical data
- Consistent: Part of scikit-learn's preprocessing pipeline
- Memory efficient: Can handle sparse matrices
2. When to Use SimpleImputer
Good Use Cases:
- Datasets with relatively uniform distributions
- Missing values are random (MCAR - Missing Completely at Random)
- Quick preprocessing for baseline models
- Features don't have complex relationships
- Simple imputation strategies are sufficient
When NOT to Use:
- Missing values have patterns or dependencies
- Complex relationships between features exist
- Need sophisticated imputation (use KNNImputer or IterativeImputer instead)
- Time series data with temporal patterns
Real-World Examples:
- Education: Filling missing exam scores with class average
- Sales: Replacing missing product ratings with most frequent rating
- Healthcare: Imputing missing patient weight with median weight
- Finance: Filling missing income data with mean income
3. Import and Basic Syntax
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
# Basic instantiation
imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='mean',
    fill_value=None,
    add_indicator=False
)
# Fit and transform
imputed_data = imputer.fit_transform(data)
4. Parameters Explained
| Parameter | Description | Default | Type |
|---|---|---|---|
| missing_values | Placeholder for missing values. | np.nan | int, float, str, np.nan |
| strategy | Imputation strategy: 'mean', 'median', 'most_frequent', or 'constant'. | 'mean' | str |
| fill_value | Value to replace missing values when strategy='constant'. | None | str, numerical |
| add_indicator | If True, adds a MissingIndicator to the output. | False | bool |
Strategy Options:
For Numerical Features:
- 'mean': Replace with the arithmetic mean of the column
- 'median': Replace with the median value (robust to outliers)
- 'constant': Replace with a specified constant value
For Categorical Features:
- 'most_frequent': Replace with the mode (most common value)
- 'constant': Replace with a specified constant value
5. Handling Numerical Features
Example Dataset Creation:
import pandas as pd
import numpy as np
# Create sample data with missing values
data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
Original Data:
Age Income Score
0 25.0 50000.0 85.0
1 30.0 60000.0 NaN
2 NaN 55000.0 90.0
3 35.0 NaN 88.0
4 40.0 80000.0 NaN
5 NaN 75000.0 92.0
6 45.0 NaN 87.0
Mean Strategy (Default):
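A minimal sketch using the df defined above (the names mean_imputer and df_mean are illustrative). Each NaN is replaced by the arithmetic mean of its column:

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
})

# Replace each NaN with its column's arithmetic mean
mean_imputer = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
print(df_mean)
# Age NaNs become 35.0 (mean of 25, 30, 35, 40, 45)
```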
Median Strategy (Robust to Outliers):
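The same dataset with strategy='median' (a sketch; median_imputer and df_median are illustrative names). Unlike the mean, the median is unaffected by extreme values, which makes it the safer default for skewed columns such as income:

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
})

# Replace each NaN with its column's median, which is robust to outliers
median_imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
print(df_median)
# Income NaNs become 60000.0 (median of 50000, 55000, 60000, 75000, 80000)
```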
Constant Strategy:
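With strategy='constant' every NaN becomes the value passed as fill_value. A sketch using 0 as the fill value (constant_imputer and df_const are illustrative names; in practice, pick a value that cannot be mistaken for real data):

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
})

# Replace every NaN with the fixed value supplied via fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
df_const = pd.DataFrame(constant_imputer.fit_transform(df), columns=df.columns)
print(df_const)
```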
6. Handling Categorical Features
Example with Categorical Data:
# Create data with categorical features
cat_data = {
    'Category': ['A', 'B', np.nan, 'A', 'C', np.nan, 'A'],
    'Grade': ['Good', 'Excellent', 'Good', np.nan, 'Fair', 'Good', np.nan],
    'Status': [np.nan, 'Active', 'Inactive', 'Active', np.nan, 'Active', 'Active']
}
df_cat = pd.DataFrame(cat_data)
print("Original Categorical Data:")
print(df_cat)
Original Categorical Data:
Category Grade Status
0 A Good NaN
1 B Excellent Active
2 NaN Good Inactive
3 A NaN Active
4 C Fair NaN
5 NaN Good Active
6 A NaN Active
Most Frequent Strategy:
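A sketch applying 'most_frequent' to the categorical dataframe above (mode_imputer and df_mode are illustrative names). Each NaN is replaced by its column's mode: 'A' for Category, 'Good' for Grade, 'Active' for Status:

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df_cat = pd.DataFrame({
    'Category': ['A', 'B', np.nan, 'A', 'C', np.nan, 'A'],
    'Grade': ['Good', 'Excellent', 'Good', np.nan, 'Fair', 'Good', np.nan],
    'Status': [np.nan, 'Active', 'Inactive', 'Active', np.nan, 'Active', 'Active']
})

# Replace each NaN with the most common value in its column
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode = pd.DataFrame(mode_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_mode)
```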
Constant Strategy for Categories:
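For categorical data, 'constant' with a string fill_value marks imputed cells explicitly instead of guessing a category. A sketch using the placeholder 'Unknown' (the name unknown_imputer is illustrative):

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df_cat = pd.DataFrame({
    'Category': ['A', 'B', np.nan, 'A', 'C', np.nan, 'A'],
    'Grade': ['Good', 'Excellent', 'Good', np.nan, 'Fair', 'Good', np.nan],
    'Status': [np.nan, 'Active', 'Inactive', 'Active', np.nan, 'Active', 'Active']
})

# Replace every NaN with an explicit 'Unknown' category
unknown_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df_unknown = pd.DataFrame(unknown_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_unknown)
```

Keeping imputed cells visibly distinct can be useful downstream, e.g. when a model should learn that missingness itself is informative.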
7. Advanced Features
Adding Missing Value Indicators:
# Define the numerical columns, then add binary indicators for originally missing values
num_cols = ['Age', 'Income', 'Score']
indicator_imputer = SimpleImputer(strategy='mean', add_indicator=True)
# This returns the imputed data with indicator columns appended
imputed_with_indicator = indicator_imputer.fit_transform(df[num_cols])
# The result is a numpy array: the original columns plus one 0/1 indicator
# column for each feature that contained missing values during fit
print("Shape with indicators:", imputed_with_indicator.shape)
print("Original shape:", df[num_cols].shape)
Shape with indicators: (7, 6)
Original shape: (7, 3)
Handling Different Missing Value Representations:
# Handle different missing value representations
df_custom = df.copy()
df_custom.loc[0, 'Age'] = -999 # Custom missing value
# Specify custom missing value
custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')
df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])
This code demonstrates how to handle custom missing-value representations using scikit-learn's SimpleImputer class. Let's break down what's happening step by step.
The first part creates a copy of the original dataframe and artificially introduces a custom missing value indicator:
df_custom = df.copy()
df_custom.loc[0, 'Age'] = -999 # Custom missing value
(Note: the output below assumes df had already been mean-imputed as in section 5, so every value is filled except row 0's Age, which now holds the -999 placeholder.)
Age Income Score
0 -999.0 50000.0 85.0
1 30.0 60000.0 88.4
2 35.0 55000.0 90.0
3 35.0 64000.0 88.0
4 40.0 80000.0 88.4
5 35.0 75000.0 92.0
6 45.0 64000.0 87.0
The df.copy() method creates a copy of the dataframe (a deep copy by default), ensuring the original data remains unchanged. The code then sets the first row's 'Age' value to -999, which serves as a custom placeholder for missing data. This is common in real-world datasets, where missing values are often encoded as specific numbers like -999, -1, or 9999 instead of standard NaN values.
Configuring the SimpleImputer for Custom Missing Values
The next part configures a SimpleImputer
to recognize and handle the custom missing value:
custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')
Here's what each parameter does:
- missing_values=-999: Tells the imputer to treat -999 as a missing value rather than a legitimate data point
- strategy='mean': Specifies that missing values should be replaced with the mean of the non-missing values in that column
Applying the Imputation
Finally, the imputation is applied:
df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])
The fit_transform() method is a convenient combination that:
- Fits the imputer by calculating the mean of all non-missing values (excluding -999)
- Transforms the data by replacing -999 with the calculated mean
Note the double-bracket notation [['Age']]: selecting with double brackets keeps the data two-dimensional (a DataFrame rather than a Series), which SimpleImputer requires as input.
Key Considerations
Gotcha Alert: If your dataset legitimately contains -999 as a valid value, this approach would incorrectly treat it as missing data. Always verify that your chosen missing value indicator doesn't conflict with actual data values.
This pattern is particularly useful when working with legacy datasets or data from systems that use specific numeric codes to represent missing information, making your data preprocessing more flexible and robust.
8. Pipeline Integration
Using with Scikit-learn Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Create preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# Complete ML pipeline
ml_pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline),
    ('classifier', RandomForestClassifier())
])
Column Transformer for Mixed Data Types:
from sklearn.compose import ColumnTransformer
# Define preprocessing for different column types
# (num_cols and cat_cols are lists of numerical and categorical column names)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_cols),
        ('cat', SimpleImputer(strategy='most_frequent'), cat_cols)
    ]
)
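A self-contained sketch of how the ColumnTransformer above is applied; df_mixed and its columns are hypothetical data invented for illustration. Note that output columns follow transformer order (num_cols first, then cat_cols), not the original column order:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Hypothetical mixed-type data
df_mixed = pd.DataFrame({
    'Age': [25, 30, np.nan, 35],
    'Income': [50000, np.nan, 55000, 60000],
    'Category': ['A', np.nan, 'B', 'A']
})
num_cols = ['Age', 'Income']
cat_cols = ['Category']

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), num_cols),
    ('cat', SimpleImputer(strategy='most_frequent'), cat_cols)
])

# Output column order: Age, Income (median-imputed), then Category (mode-imputed)
result = preprocessor.fit_transform(df_mixed)
print(result)
```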
9. Best Practices and Tips
1. Strategy Selection Guidelines: use 'mean' for roughly symmetric numerical features, 'median' when outliers are present, 'most_frequent' for categorical features, and 'constant' when a domain-specific placeholder makes sense.
2. Preserve Original Data: impute into a copy or a new DataFrame so the raw data stays available for auditing.
3. Validate Imputation Results: confirm that no missing values remain and that column statistics haven't shifted unexpectedly.
4. Handle Edge Cases: watch for columns that are entirely missing, constant fill values that collide with real data, and categories at transform time that were never seen during fit.
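One way to act on the "preserve" and "validate" points, sketched on the numerical data from section 5 (variable names are illustrative). Mean imputation leaves each column's mean exactly where it was, which gives a cheap sanity check:

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan]
})

# Preserve the original: impute into a new DataFrame, never in place
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# Validate: no NaN should remain, and column means should be unchanged
assert not df_imputed.isna().any().any()
assert np.allclose(df_imputed.mean(), df.mean())
print("Imputation validated")
```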
10. Common Pitfalls and Solutions
❌ Pitfall 1: Data Leakage. Fitting the imputer on the full dataset before splitting leaks test-set statistics into training. Fit on the training split only, then reuse that fitted imputer to transform the test split.
❌ Pitfall 2: Ignoring Data Types. Applying 'mean' or 'median' to string columns raises an error; use 'most_frequent' or 'constant' for categorical features, and impute numerical and categorical columns separately (e.g. via ColumnTransformer).
❌ Pitfall 3: Not Handling Sparse Matrices. With sparse input, check that missing_values matches how missingness is actually encoded in your sparse format; imputing the implicit zeros of a large sparse matrix can densify it and exhaust memory.
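A sketch of the leakage-safe pattern for Pitfall 1, on a small hypothetical dataset. The key is that transform on the test split reuses the mean learned from the training split:

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Hypothetical single-feature dataset with missing values
X = pd.DataFrame({'feature': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0]})
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Wrong: SimpleImputer(strategy='mean').fit(X) would leak test-set
# statistics into training.

# Right: fit on the training split only, then reuse it for the test split
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # uses the training mean, not the test mean
```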
11. Performance Considerations
Memory Efficiency:
# For large datasets, process in chunks
def impute_in_chunks(df, chunk_size=10000):
    imputer = SimpleImputer(strategy='mean')
    imputer.fit(df)  # Fit on the entire dataset once
    imputed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        imputed_chunk = pd.DataFrame(
            imputer.transform(chunk),
            columns=chunk.columns,
            index=chunk.index
        )
        imputed_chunks.append(imputed_chunk)
    return pd.concat(imputed_chunks)
12. Comparison with Other Imputation Methods
| Method | Best For | Pros | Cons |
|---|---|---|---|
| SimpleImputer | Quick, simple imputation | Fast, easy to use | Ignores feature relationships |
| KNNImputer | Feature relationships matter | Considers correlations | Computationally expensive |
| IterativeImputer | Complex dependencies | Most sophisticated | Slowest, can be unstable |
13. Summary Checklist
Before using SimpleImputer, ensure you:
- Know which columns are numerical and which are categorical
- Have chosen a strategy appropriate to each column's distribution
- Understand how missing values are encoded (NaN or a custom code like -999)
- Fit the imputer on training data only, to avoid leakage
- Plan to validate that no missing values remain after transformation
14. Practice Exercise
Try implementing SimpleImputer on this sample dataset:
# Create practice dataset
practice_data = {
    'temperature': [20.1, 22.3, np.nan, 21.8, np.nan, 23.1, 19.8],
    'humidity': [45, 50, 48, np.nan, 52, np.nan, 47],
    'weather': ['Sunny', np.nan, 'Rainy', 'Cloudy', 'Sunny', 'Rainy', np.nan],
    'season': ['Summer', 'Summer', np.nan, 'Spring', 'Summer', 'Spring', 'Spring']
}
practice_df = pd.DataFrame(practice_data)
# Your task:
# 1. Identify numerical and categorical columns
# 2. Apply appropriate imputation strategies
# 3. Validate the results
# 4. Create a pipeline for future use
This completes your comprehensive guide to SimpleImputer!