API Reference : SimpleImputer

1. Introduction to SimpleImputer

What is SimpleImputer?

SimpleImputer is a class from scikit-learn's sklearn.impute module that handles missing values in datasets using simple, univariate strategies. It's designed for straightforward imputation tasks where you want to replace missing values with basic statistical measures.

Why Use SimpleImputer?

  • Quick and efficient: Fast processing for large datasets
  • Simple to implement: Minimal code required
  • Versatile: Works with both numerical and categorical data
  • Consistent: Part of scikit-learn's preprocessing pipeline
  • Memory efficient: Can handle sparse matrices

2. When to Use SimpleImputer

Good Use Cases:

  • Features with fairly regular distributions (no extreme skew)
  • Missing values are random (MCAR - Missing Completely at Random)
  • Quick preprocessing for baseline models
  • Features don't have complex relationships
  • Simple imputation strategies are sufficient

When NOT to Use:

  • Missing values have patterns or dependencies
  • Complex relationships between features exist
  • Need sophisticated imputation (use KNNImputer or IterativeImputer instead)
  • Time series data with temporal patterns

Real-World Examples:

  • Education: Filling missing exam scores with class average
  • Sales: Replacing missing product ratings with most frequent rating
  • Healthcare: Imputing missing patient weight with median weight
  • Finance: Filling missing income data with mean income

3. Import and Basic Syntax

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Basic instantiation
imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean',
                        fill_value=None,
                        add_indicator=False)

# Fit and transform (data is any 2-D array or DataFrame with missing values)
imputed_data = imputer.fit_transform(data)

4. Parameters Explained

| Parameter | Description | Default | Type |
|---|---|---|---|
| missing_values | Placeholder for missing values. | np.nan | int, float, str, or np.nan |
| strategy | Imputation strategy: 'mean', 'median', 'most_frequent', or 'constant'. | 'mean' | str |
| fill_value | Value to replace missing values when strategy='constant'. | None | str or numerical |
| add_indicator | If True, adds a MissingIndicator to the output. | False | bool |

Strategy Options:

For Numerical Features:

  • 'mean': Replace with arithmetic mean of column
  • 'median': Replace with median value (robust to outliers)
  • 'constant': Replace with a specified constant value

For Categorical Features:

  • 'most_frequent': Replace with mode (most common value)
  • 'constant': Replace with a specified constant value

5. Handling Numerical Features

Example Dataset Creation:

import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
Original Data:
    Age   Income  Score
0  25.0  50000.0   85.0
1  30.0  60000.0    NaN
2   NaN  55000.0   90.0
3  35.0      NaN   88.0
4  40.0  80000.0    NaN
5   NaN  75000.0   92.0
6  45.0      NaN   87.0

Mean Strategy (Default):
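A minimal sketch of the default strategy applied to the df defined above (variable names like df_mean are ours; df_mean is reused in section 7):

mean_imputer = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
print(df_mean)
# Age    -> NaN becomes 35.0    (mean of 25, 30, 35, 40, 45)
# Income -> NaN becomes 64000.0
# Score  -> NaN becomes 88.4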

Median Strategy (Robust to Outliers):
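A sketch of the same idea with the median, which is less sensitive to extreme values:

median_imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
print(df_median)
# Age    -> NaN becomes 35.0
# Income -> NaN becomes 60000.0 (median of the five observed incomes)
# Score  -> NaN becomes 88.0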

Constant Strategy:
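A sketch using a fixed fill value (0 here is an arbitrary choice for illustration):

constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
df_constant = pd.DataFrame(constant_imputer.fit_transform(df), columns=df.columns)
print(df_constant)  # every NaN becomes 0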

6. Handling Categorical Features

Example with Categorical Data:

# Create data with categorical features
cat_data = {
    'Category': ['A', 'B', np.nan, 'A', 'C', np.nan, 'A'],
    'Grade': ['Good', 'Excellent', 'Good', np.nan, 'Fair', 'Good', np.nan],
    'Status': [np.nan, 'Active', 'Inactive', 'Active', np.nan, 'Active', 'Active']
}
df_cat = pd.DataFrame(cat_data)
print("Original Categorical Data:")
print(df_cat)
Original Categorical Data:
  Category      Grade    Status
0        A       Good       NaN
1        B  Excellent    Active
2      NaN       Good  Inactive
3        A        NaN    Active
4        C       Fair       NaN
5      NaN       Good    Active
6        A        NaN    Active

Most Frequent Strategy:
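A sketch on the df_cat frame above, replacing each missing category with its column's mode:

freq_imputer = SimpleImputer(strategy='most_frequent')
df_cat_freq = pd.DataFrame(freq_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_cat_freq)
# Category -> NaN becomes 'A'; Grade -> 'Good'; Status -> 'Active'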

Constant Strategy for Categories:
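A sketch filling missing categories with an explicit label ('Unknown' is an arbitrary choice):

unknown_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df_cat_const = pd.DataFrame(unknown_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_cat_const)  # every NaN becomes 'Unknown'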

7. Advanced Features

Adding Missing Value Indicators:

# Add binary indicators for originally missing values
num_cols = ['Age', 'Income', 'Score']
indicator_imputer = SimpleImputer(strategy='mean', add_indicator=True)

# Returns the imputed columns plus one 0/1 indicator column for each
# feature that had missing values when the imputer was fitted
imputed_with_indicator = indicator_imputer.fit_transform(df[num_cols])

print("Shape with indicators:", imputed_with_indicator.shape)
print("Original shape:", df[num_cols].shape)
Shape with indicators: (7, 6)
Original shape: (7, 3)

Handling Different Missing Value Representations:

# Handle different missing value representations
df_custom = df_mean.copy()      # start from the mean-imputed frame from section 5 (no NaN left)
df_custom.loc[0, 'Age'] = -999  # custom missing-value code

# Specify the custom missing value
custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')
df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])

This code demonstrates how to handle custom missing-value representations with SimpleImputer. Here is what happens, step by step.

The first part copies the mean-imputed dataframe from section 5 and artificially introduces a custom missing-value indicator:

df_custom = df_mean.copy()
df_custom.loc[0, 'Age'] = -999  # Custom missing value
     Age   Income  Score
0 -999.0  50000.0   85.0
1   30.0  60000.0   88.4
2   35.0  55000.0   90.0
3   35.0  64000.0   88.0
4   40.0  80000.0   88.4
5   35.0  75000.0   92.0
6   45.0  64000.0   87.0

The .copy() method creates an independent copy of the dataframe (a deep copy by default), ensuring the earlier data remains unchanged. The code then sets the first row's 'Age' value to -999, which serves as a custom placeholder for missing data. This is common in real-world datasets, where missing values might be encoded as specific numbers like -999, -1, or 9999 instead of standard NaN values.

Configuring the SimpleImputer for Custom Missing Values

The next part configures a SimpleImputer to recognize and handle the custom missing value:

custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')

Here's what each parameter does:

  • missing_values=-999: Tells the imputer to treat -999 as a missing value rather than a legitimate data point
  • strategy='mean': Specifies that missing values should be replaced with the mean of the non-missing values in that column

Applying the Imputation

Finally, the imputation is applied:

df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])

The fit_transform() method is a convenient combination that:

  1. Fits the imputer by calculating the mean of all non-missing values (excluding -999)
  2. Transforms the data by replacing -999 with the calculated mean

Note the double-bracket notation [['Age']] - it keeps the selection two-dimensional (a one-column DataFrame), which SimpleImputer requires; a single bracket would return a 1-D Series and raise an error.

Key Considerations

Gotcha Alert: If your dataset legitimately contains -999 as a valid value, this approach would incorrectly treat it as missing data. Always verify that your chosen missing-value indicator doesn't conflict with actual data values. Also note that each SimpleImputer instance recognizes a single missing_values marker, so handling both NaN and -999 in one column requires two imputation passes.

This pattern is particularly useful when working with legacy datasets or data from systems that use specific numeric codes to represent missing information, making your data preprocessing more flexible and robust.

8. Pipeline Integration

Using with Scikit-learn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Complete ML pipeline
ml_pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline),
    ('classifier', RandomForestClassifier())
])
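A hypothetical usage sketch of the pipeline above (the labels y below are fabricated purely for illustration, using the df from section 5 as features):

from sklearn.model_selection import train_test_split

# Toy feature matrix and labels
X = df[['Age', 'Income', 'Score']]
y = pd.Series([0, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Imputation, scaling, and the classifier are fitted together,
# so the imputer's statistics come from the training data only
ml_pipeline.fit(X_train, y_train)
print(ml_pipeline.score(X_test, y_test))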

Column Transformer for Mixed Data Types:

from sklearn.compose import ColumnTransformer

# Column lists from the examples in sections 5 and 6
num_cols = ['Age', 'Income', 'Score']
cat_cols = ['Category', 'Grade', 'Status']

# Define preprocessing for different column types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_cols),
        ('cat', SimpleImputer(strategy='most_frequent'), cat_cols)
    ]
)

9. Best Practices and Tips

1. Strategy Selection Guidelines:
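As rough rules of thumb (not hard rules):

  • 'mean' for roughly symmetric numerical distributions
  • 'median' for skewed numerical data or data with outliers
  • 'most_frequent' for categorical features
  • 'constant' when the domain defines a natural default (e.g., 0 or 'Unknown')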

2. Preserve Original Data:
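One simple pattern, sketched on the df from section 5: impute into a new frame instead of overwriting the raw data.

# Keep df untouched; write the imputed values to a separate frame
df_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df),
    columns=df.columns,
    index=df.index
)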

3. Validate Imputation Results:
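A quick sanity check on the df_imputed frame from the previous snippet: confirm no missing values remain and that summary statistics still look plausible.

print(df_imputed.isna().sum())   # expect all zeros
print(df.describe())             # compare distributions before...
print(df_imputed.describe())     # ...and after (mean imputation leaves column means unchanged)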

4. Handle Edge Cases:
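One edge case worth testing for: a column with no observed values has no statistic to compute. By default SimpleImputer drops such columns at transform time with a warning (recent scikit-learn versions add a keep_empty_features parameter to retain them). A small sketch:

# A column of all NaNs has no mean, so it is dropped from the output
all_nan = pd.DataFrame({'a': [1.0, 2.0], 'b': [np.nan, np.nan]})
out = SimpleImputer(strategy='mean').fit_transform(all_nan)
print(out.shape)  # (2, 1) - only column 'a' survives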

10. Common Pitfalls and Solutions

❌ Pitfall 1: Data Leakage
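The classic mistake is fitting the imputer on the full dataset before splitting, which leaks test-set statistics into training. A sketch of the safe pattern:

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)  # learn statistics from training data only
X_test_imp = imputer.transform(X_test)        # reuse those statistics on the test set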

❌ Pitfall 2: Ignoring Data Types
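Numerical strategies cannot be applied to text columns. A sketch using the df_cat frame from section 6:

# SimpleImputer(strategy='mean').fit_transform(df_cat)         # fails: can't average strings
SimpleImputer(strategy='most_frequent').fit_transform(df_cat)  # works for object columns
# For mixed frames, route columns by dtype with the ColumnTransformer from section 8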

❌ Pitfall 3: Not Handling Sparse Matrices
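If missing entries are encoded as 0 in a sparse matrix, SimpleImputer cannot impute them because zeros are stored implicitly; to our knowledge it raises a ValueError in that case. One workaround is to densify first:

from scipy import sparse

X_sp = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0]])
# SimpleImputer(missing_values=0, strategy='mean').fit(X_sp)  # ValueError on sparse input
X_dense = X_sp.toarray()  # densify, then treat 0 as the missing marker
print(SimpleImputer(missing_values=0, strategy='mean').fit_transform(X_dense))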

11. Performance Considerations

Memory Efficiency:

# For large datasets, process in chunks
def impute_in_chunks(df, chunk_size=10000):
    imputer = SimpleImputer(strategy='mean')
    imputer.fit(df)  # Fit on entire dataset once

    imputed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        imputed_chunk = pd.DataFrame(
            imputer.transform(chunk),
            columns=chunk.columns,
            index=chunk.index
        )
        imputed_chunks.append(imputed_chunk)

    return pd.concat(imputed_chunks)
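Usage is then a single call; fitting once up front keeps the imputation statistics consistent across all chunks.

df_imputed_chunks = impute_in_chunks(df)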

12. Comparison with Other Imputation Methods

| Method | Best For | Pros | Cons |
|---|---|---|---|
| SimpleImputer | Quick, simple imputation | Fast, easy to use | Ignores feature relationships |
| KNNImputer | Feature relationships matter | Considers correlations | Computationally expensive |
| IterativeImputer | Complex dependencies | Most sophisticated | Slowest, can be unstable |

13. Summary Checklist

Before using SimpleImputer, ensure you:

  • Understand your missing data pattern
  • Choose an appropriate strategy for the data type and distribution
  • Separate numerical and categorical features
  • Fit only on training data to avoid leakage
  • Validate imputation results
  • Consider whether simple imputation is sufficient for your use case

14. Practice Exercise

Try implementing SimpleImputer on this sample dataset:

# Create practice dataset
practice_data = {
    'temperature': [20.1, 22.3, np.nan, 21.8, np.nan, 23.1, 19.8],
    'humidity': [45, 50, 48, np.nan, 52, np.nan, 47],
    'weather': ['Sunny', np.nan, 'Rainy', 'Cloudy', 'Sunny', 'Rainy', np.nan],
    'season': ['Summer', 'Summer', np.nan, 'Spring', 'Summer', 'Spring', 'Spring']
}

practice_df = pd.DataFrame(practice_data)

# Your task:
# 1. Identify numerical and categorical columns
# 2. Apply appropriate imputation strategies
# 3. Validate the results
# 4. Create a pipeline for future use

This completes your comprehensive guide to SimpleImputer!