API Reference : SimpleImputer

1. Introduction to SimpleImputer

What is SimpleImputer?

SimpleImputer is a class from scikit-learn's sklearn.impute module that handles missing values in datasets using simple, univariate strategies. It's designed for straightforward imputation tasks where you want to replace missing values with basic statistical measures.

Why Use SimpleImputer?

  • Quick and efficient: Fast processing for large datasets
  • Simple to implement: Minimal code required
  • Versatile: Works with both numerical and categorical data
  • Consistent: Part of scikit-learn's preprocessing pipeline
  • Memory efficient: Can handle sparse matrices

2. When to Use SimpleImputer

Good Use Cases:

  • Features with fairly regular distributions (no extreme skew)
  • Missing values are random (MCAR - Missing Completely at Random)
  • Quick preprocessing for baseline models
  • Features don't have complex relationships
  • Simple imputation strategies are sufficient

When NOT to Use:

  • Missing values have patterns or dependencies
  • Complex relationships between features exist
  • Need sophisticated imputation (use KNNImputer or IterativeImputer instead)
  • Time series data with temporal patterns

Real-World Examples:

  • Education: Filling missing exam scores with class average
  • Sales: Replacing missing product ratings with most frequent rating
  • Healthcare: Imputing missing patient weight with median weight
  • Finance: Filling missing income data with mean income

3. Import and Basic Syntax

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Basic instantiation
imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean',
                        fill_value=None,
                        add_indicator=False)

# Fit and transform (data is any 2-D array or DataFrame with missing values)
imputed_data = imputer.fit_transform(data)

4. Parameters Explained

| Parameter | Description | Default | Type |
|---|---|---|---|
| missing_values | Placeholder for missing values. | np.nan | int, float, str, or np.nan |
| strategy | Imputation strategy: 'mean', 'median', 'most_frequent', or 'constant'. | 'mean' | str |
| fill_value | Value to replace missing values when strategy='constant'. | None | str or numerical |
| add_indicator | If True, adds a MissingIndicator to the output. | False | bool |

Strategy Options:

For Numerical Features:

  • 'mean': Replace with arithmetic mean of column
  • 'median': Replace with median value (robust to outliers)
  • 'constant': Replace with a specified constant value

For Categorical Features:

  • 'most_frequent': Replace with mode (most common value)
  • 'constant': Replace with a specified constant value

5. Handling Numerical Features

Example Dataset Creation:

import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 45],
    'Income': [50000, 60000, 55000, np.nan, 80000, 75000, np.nan],
    'Score': [85, np.nan, 90, 88, np.nan, 92, 87]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
Original Data:
    Age   Income  Score
0  25.0  50000.0   85.0
1  30.0  60000.0    NaN
2   NaN  55000.0   90.0
3  35.0      NaN   88.0
4  40.0  80000.0    NaN
5   NaN  75000.0   92.0
6  45.0      NaN   87.0

Mean Strategy (Default):
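A minimal sketch of the default strategy applied to the df defined above (variable names like df_mean are ours; df_mean is reused in section 7):

mean_imputer = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
print(df_mean)
# Age    -> NaN becomes 35.0    (mean of 25, 30, 35, 40, 45)
# Income -> NaN becomes 64000.0
# Score  -> NaN becomes 88.4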

Median Strategy (Robust to Outliers):
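A sketch of the same idea with the median, which is less sensitive to extreme values:

median_imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
print(df_median)
# Age    -> NaN becomes 35.0
# Income -> NaN becomes 60000.0 (median of the five observed incomes)
# Score  -> NaN becomes 88.0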

Constant Strategy:
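A sketch using a fixed fill value (0 here is an arbitrary choice for illustration):

constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
df_constant = pd.DataFrame(constant_imputer.fit_transform(df), columns=df.columns)
print(df_constant)  # every NaN becomes 0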

6. Handling Categorical Features

Example with Categorical Data:

# Create data with categorical features
cat_data = {
    'Category': ['A', 'B', np.nan, 'A', 'C', np.nan, 'A'],
    'Grade': ['Good', 'Excellent', 'Good', np.nan, 'Fair', 'Good', np.nan],
    'Status': [np.nan, 'Active', 'Inactive', 'Active', np.nan, 'Active', 'Active']
}
df_cat = pd.DataFrame(cat_data)
print("Original Categorical Data:")
print(df_cat)
Original Categorical Data:
  Category      Grade    Status
0        A       Good       NaN
1        B  Excellent    Active
2      NaN       Good  Inactive
3        A        NaN    Active
4        C       Fair       NaN
5      NaN       Good    Active
6        A        NaN    Active

Most Frequent Strategy:
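A sketch on the df_cat frame above, replacing each missing category with its column's mode:

freq_imputer = SimpleImputer(strategy='most_frequent')
df_cat_freq = pd.DataFrame(freq_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_cat_freq)
# Category -> NaN becomes 'A'; Grade -> 'Good'; Status -> 'Active'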

Constant Strategy for Categories:
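A sketch filling missing categories with an explicit label ('Unknown' is an arbitrary choice):

unknown_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df_cat_const = pd.DataFrame(unknown_imputer.fit_transform(df_cat), columns=df_cat.columns)
print(df_cat_const)  # every NaN becomes 'Unknown'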

7. Advanced Features

Adding Missing Value Indicators:

# Add binary indicators for originally missing values
num_cols = ['Age', 'Income', 'Score']
indicator_imputer = SimpleImputer(strategy='mean', add_indicator=True)

# Returns the imputed columns plus one 0/1 indicator column for each
# feature that had missing values when the imputer was fitted
imputed_with_indicator = indicator_imputer.fit_transform(df[num_cols])

print("Shape with indicators:", imputed_with_indicator.shape)
print("Original shape:", df[num_cols].shape)
Shape with indicators: (7, 6)
Original shape: (7, 3)

Handling Different Missing Value Representations:

# Handle different missing value representations
df_custom = df_mean.copy()      # start from the mean-imputed frame from section 5 (no NaN left)
df_custom.loc[0, 'Age'] = -999  # custom missing-value code

# Specify the custom missing value
custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')
df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])

This code demonstrates how to handle custom missing-value representations with SimpleImputer. Here is what happens, step by step.

The first part copies the mean-imputed dataframe from section 5 and artificially introduces a custom missing-value indicator:

df_custom = df_mean.copy()
df_custom.loc[0, 'Age'] = -999  # Custom missing value
     Age   Income  Score
0 -999.0  50000.0   85.0
1   30.0  60000.0   88.4
2   35.0  55000.0   90.0
3   35.0  64000.0   88.0
4   40.0  80000.0   88.4
5   35.0  75000.0   92.0
6   45.0  64000.0   87.0

The .copy() method creates an independent copy of the dataframe (a deep copy by default), ensuring the earlier data remains unchanged. The code then sets the first row's 'Age' value to -999, which serves as a custom placeholder for missing data. This is common in real-world datasets, where missing values might be encoded as specific numbers like -999, -1, or 9999 instead of standard NaN values.

Configuring the SimpleImputer for Custom Missing Values

The next part configures a SimpleImputer to recognize and handle the custom missing value:

custom_imputer = SimpleImputer(missing_values=-999, strategy='mean')

Here's what each parameter does:

  • missing_values=-999: Tells the imputer to treat -999 as a missing value rather than a legitimate data point
  • strategy='mean': Specifies that missing values should be replaced with the mean of the non-missing values in that column

Applying the Imputation

Finally, the imputation is applied:

df_custom[['Age']] = custom_imputer.fit_transform(df_custom[['Age']])

The fit_transform() method is a convenient combination that:

  1. Fits the imputer by calculating the mean of all non-missing values (excluding -999)
  2. Transforms the data by replacing -999 with the calculated mean

Note the double-bracket notation [['Age']] - it keeps the selection two-dimensional (a one-column DataFrame), which SimpleImputer requires; a single bracket would return a 1-D Series and raise an error.

Key Considerations

Gotcha Alert: If your dataset legitimately contains -999 as a valid value, this approach would incorrectly treat it as missing data. Always verify that your chosen missing-value indicator doesn't conflict with actual data values. Also note that each SimpleImputer instance recognizes a single missing_values marker, so handling both NaN and -999 in one column requires two imputation passes.

This pattern is particularly useful when working with legacy datasets or data from systems that use specific numeric codes to represent missing information, making your data preprocessing more flexible and robust.

8. Pipeline Integration

Using with Scikit-learn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Complete ML pipeline
ml_pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline),
    ('classifier', RandomForestClassifier())
])
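A hypothetical usage sketch of the pipeline above (the labels y below are fabricated purely for illustration, using the df from section 5 as features):

from sklearn.model_selection import train_test_split

# Toy feature matrix and labels
X = df[['Age', 'Income', 'Score']]
y = pd.Series([0, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Imputation, scaling, and the classifier are fitted together,
# so the imputer's statistics come from the training data only
ml_pipeline.fit(X_train, y_train)
print(ml_pipeline.score(X_test, y_test))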

Column Transformer for Mixed Data Types:

from sklearn.compose import ColumnTransformer

# Column lists from the examples in sections 5 and 6
num_cols = ['Age', 'Income', 'Score']
cat_cols = ['Category', 'Grade', 'Status']

# Define preprocessing for different column types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_cols),
        ('cat', SimpleImputer(strategy='most_frequent'), cat_cols)
    ]
)

9. Best Practices and Tips

1. Strategy Selection Guidelines:
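As rough rules of thumb (not hard rules):

  • 'mean' for roughly symmetric numerical distributions
  • 'median' for skewed numerical data or data with outliers
  • 'most_frequent' for categorical features
  • 'constant' when the domain defines a natural default (e.g., 0 or 'Unknown')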

2. Preserve Original Data:
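One simple pattern, sketched on the df from section 5: impute into a new frame instead of overwriting the raw data.

# Keep df untouched; write the imputed values to a separate frame
df_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df),
    columns=df.columns,
    index=df.index
)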

3. Validate Imputation Results:
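A quick sanity check on the df_imputed frame from the previous snippet: confirm no missing values remain and that summary statistics still look plausible.

print(df_imputed.isna().sum())   # expect all zeros
print(df.describe())             # compare distributions before...
print(df_imputed.describe())     # ...and after (mean imputation leaves column means unchanged)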

4. Handle Edge Cases:
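One edge case worth testing for: a column with no observed values has no statistic to compute. By default SimpleImputer drops such columns at transform time with a warning (recent scikit-learn versions add a keep_empty_features parameter to retain them). A small sketch:

# A column of all NaNs has no mean, so it is dropped from the output
all_nan = pd.DataFrame({'a': [1.0, 2.0], 'b': [np.nan, np.nan]})
out = SimpleImputer(strategy='mean').fit_transform(all_nan)
print(out.shape)  # (2, 1) - only column 'a' survives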

10. Common Pitfalls and Solutions

❌ Pitfall 1: Data Leakage
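The classic mistake is fitting the imputer on the full dataset before splitting, which leaks test-set statistics into training. A sketch of the safe pattern:

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)  # learn statistics from training data only
X_test_imp = imputer.transform(X_test)        # reuse those statistics on the test set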

❌ Pitfall 2: Ignoring Data Types
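Numerical strategies cannot be applied to text columns. A sketch using the df_cat frame from section 6:

# SimpleImputer(strategy='mean').fit_transform(df_cat)         # fails: can't average strings
SimpleImputer(strategy='most_frequent').fit_transform(df_cat)  # works for object columns
# For mixed frames, route columns by dtype with the ColumnTransformer from section 8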

❌ Pitfall 3: Not Handling Sparse Matrices
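If missing entries are encoded as 0 in a sparse matrix, SimpleImputer cannot impute them because zeros are stored implicitly; to our knowledge it raises a ValueError in that case. One workaround is to densify first:

from scipy import sparse

X_sp = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0]])
# SimpleImputer(missing_values=0, strategy='mean').fit(X_sp)  # ValueError on sparse input
X_dense = X_sp.toarray()  # densify, then treat 0 as the missing marker
print(SimpleImputer(missing_values=0, strategy='mean').fit_transform(X_dense))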

11. Performance Considerations

Memory Efficiency:

# For large datasets, process in chunks
def impute_in_chunks(df, chunk_size=10000):
    imputer = SimpleImputer(strategy='mean')
    imputer.fit(df)  # Fit on entire dataset once

    imputed_chunks = []
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i+chunk_size]
        imputed_chunk = pd.DataFrame(
            imputer.transform(chunk),
            columns=chunk.columns,
            index=chunk.index
        )
        imputed_chunks.append(imputed_chunk)

    return pd.concat(imputed_chunks)
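Usage is then a single call; fitting once up front keeps the imputation statistics consistent across all chunks.

df_imputed_chunks = impute_in_chunks(df)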

12. Comparison with Other Imputation Methods

| Method | Best For | Pros | Cons |
|---|---|---|---|
| SimpleImputer | Quick, simple imputation | Fast, easy to use | Ignores feature relationships |
| KNNImputer | Feature relationships matter | Considers correlations | Computationally expensive |
| IterativeImputer | Complex dependencies | Most sophisticated | Slowest, can be unstable |

13. Summary Checklist

Before using SimpleImputer, ensure you:

  • Understand your missing data pattern
  • Choose an appropriate strategy for the data type and distribution
  • Separate numerical and categorical features
  • Fit only on training data to avoid leakage
  • Validate imputation results
  • Consider whether simple imputation is sufficient for your use case

14. Practice Exercise

Try implementing SimpleImputer on this sample dataset:

# Create practice dataset
practice_data = {
    'temperature': [20.1, 22.3, np.nan, 21.8, np.nan, 23.1, 19.8],
    'humidity': [45, 50, 48, np.nan, 52, np.nan, 47],
    'weather': ['Sunny', np.nan, 'Rainy', 'Cloudy', 'Sunny', 'Rainy', np.nan],
    'season': ['Summer', 'Summer', np.nan, 'Spring', 'Summer', 'Spring', 'Spring']
}

practice_df = pd.DataFrame(practice_data)

# Your task:
# 1. Identify numerical and categorical columns
# 2. Apply appropriate imputation strategies
# 3. Validate the results
# 4. Create a pipeline for future use

This completes your comprehensive guide to SimpleImputer!