🧹 Data Cleaning with pandas - Complete Portfolio Guide
Author: Teslim Uthman Adeyanju
Purpose: Professional data cleaning techniques for pandas DataFrames
Status: Study Material
📦 Learning Resources Overview
This comprehensive guide contains everything you need to master data cleaning in pandas:
1️⃣ pandas_data_cleaning.py Module
Your main 400-line production-ready module with 8 organized sections:
- ✅ Column Header Cleaning
- ✅ Numeric Data Cleaning
- ✅ Duplicate Management
- ✅ Missing Values Handling
- ✅ Outlier Detection & Removal
- ✅ Text Data Cleaning
- ✅ Data Summary & Inspection
- ✅ Quick Clean & Convenience Aliases
2️⃣ Complete Reference Guide (70+ pages)
In-depth tutorial with Titanic dataset case study
3️⃣ Interactive Jupyter Notebook
Hands-on exercises with immediate feedback
4️⃣ Sample Dataset
Titanic data with intentional quality issues for practice
🎯 Quick Start
For Beginners:
- Start with the Jupyter notebook (interactive)
- Complete all exercises step-by-step
- Read the guide sections alongside
- Practice with your own datasets
For Intermediate:
- Review the complete workflow example
- Study domain-specific strategies
- Build your own cleaning pipelines
- Optimize for performance
For Reference:
- Use the quick reference card
- Search for specific topics
- Review pro tips periodically
📚 Learning Path (3.5 hours total)
Topic 1: Column Header Cleaning (15 min)
What you'll learn:
- Standardize column names to lowercase
- Remove spaces and special characters
- Handle duplicate column names
- Create Python-friendly variable names
Key Function:
```python
df = clean_column_headers(df)
```
Why it matters: Clean headers prevent coding errors and make automation easier.
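For intuition, here is a minimal sketch of what header cleaning can look like in plain pandas (illustrative only; the actual implementation in pandas_data_cleaning.py may differ):
```python
import re
import pandas as pd

def clean_headers_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative: lowercase, strip, and snake_case all column names."""
    df = df.copy()
    df.columns = [
        re.sub(r'[^0-9a-z]+', '_', str(col).strip().lower()).strip('_')
        for col in df.columns
    ]
    return df

# Example: ['Passenger ID', 'Fare ($)'] -> ['passenger_id', 'fare']
```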
Topic 2: Numeric Data Cleaning (30 min)
What you'll learn:
- Handle non-numeric strings (#REF!, NA, nil)
- Fill missing values (median, mean, mode)
- Remove negative values when inappropriate
- Round to appropriate decimal places
Key Function:
```python
df = clean_numeric_column(
    df,
    column='age',
    fill_method='median',
    remove_negatives=False,
    round_decimals=0
)
```
Decision Framework:
- Use median for data with outliers (most cases) ✅
- Use mean for normally distributed data
- Use mode for discrete/categorical numeric data
- Use zero when zero is meaningful
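To see why median is the usual default, a quick illustration in plain pandas (hypothetical values):
```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 1000, None])  # one extreme outlier, one missing value
print(s.mean())    # 220.0 -- dragged upward by the outlier
print(s.median())  # 30.0  -- robust to the outlier
s_filled = s.fillna(s.median())  # fills the gap with 30.0, not 220.0
```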
Topic 3: Duplicate Management (20 min)
What you'll learn:
- Detect duplicate rows
- Remove duplicates based on all or specific columns
- Choose which occurrence to keep (first/last)
- Investigate before removing
Key Function:
```python
df = remove_duplicates(
    df,
    subset=['name'],   # Only consider these columns
    keep='first'       # Keep first occurrence
)
```
Important: Always investigate duplicates before removing - some may be legitimate!
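A quick way to investigate in plain pandas before dropping anything (using the name column from the Titanic example):
```python
# keep=False marks every row in a duplicate group, not just the extra copies
dupes = df[df.duplicated(subset=['name'], keep=False)]
print(dupes.sort_values('name'))  # inspect the groups side by side before deciding
```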
Topic 4: Missing Values Handling (45 min)
What you'll learn:
- Understand why data is missing
- Choose appropriate fill strategies
- Use forward/backward fill for time series
- Apply custom values with domain knowledge
Key Function:
```python
df = fill_missing_values(
    df,
    column='cabin',
    method='custom',
    custom_value='Unknown'
)
```
Strategy Guide:

| Data Type | Method | Example |
| --- | --- | --- |
| Numeric with outliers | median | Age, Income |
| Normally distributed | mean | Test scores |
| Categorical | mode | City, Status |
| Time series | forward/backward | Stock prices |
| Domain knowledge | custom | 0 for counts |
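For the time-series row of the table above, the plain pandas equivalents are ffill and bfill (shown here on a hypothetical stock_price column):
```python
# Forward fill: carry the last observed price forward in time
df['stock_price'] = df['stock_price'].ffill()
# Backward fill: pull the next observed value back (e.g., for leading gaps)
df['stock_price'] = df['stock_price'].bfill()
```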
Topic 5: Outlier Detection (40 min)
What you'll learn:
- Detect outliers using IQR method
- Decide when to remove vs. keep outliers
- Cap outliers instead of removing
- Use appropriate factor values
Key Functions:
```python
# Detect only
outliers = detect_outliers_iqr(df, 'fare', factor=1.5)

# Remove outliers
df = remove_outliers_iqr(df, 'fare', factor=1.5)
```
Decision Framework:
- ✅ REMOVE if data entry errors
- ✅ REMOVE if measurement errors
- ❌ KEEP if true extreme values
- ❌ KEEP if natural variation
- Factor 1.5 = standard outliers
- Factor 3.0 = extreme outliers only
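The IQR rule itself is only a few lines; a sketch of how detection can be implemented (illustrative, not necessarily the module's exact code):
```python
import pandas as pd

def detect_outliers_iqr_sketch(df: pd.DataFrame, column: str,
                               factor: float = 1.5) -> pd.Series:
    """Boolean mask: True for rows outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return (df[column] < q1 - factor * iqr) | (df[column] > q3 + factor * iqr)
```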
Topic 6: Text Data Cleaning (25 min)
What you'll learn:
- Standardize capitalization
- Remove extra whitespace
- Handle multiple missing representations
- When to preserve vs. lowercase
Key Function:
```python
df = clean_text_column(
    df,
    column='name',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)
```
When to lowercase:
- ✅ YES: Cities, countries, categories
- ❌ NO: Names, addresses, codes (C85 cabin)
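Under the hood these are ordinary pandas string operations; roughly (illustrative, with 'n/a' and 'nil' as example missing markers):
```python
import pandas as pd

col = df['name'].astype('string')
col = col.str.strip()                            # trim leading/trailing spaces
col = col.str.replace(r'\s+', ' ', regex=True)   # collapse internal whitespace
col = col.str.lower()                            # only when case carries no meaning!
col = col.replace({'n/a': pd.NA, 'nil': pd.NA, '': pd.NA})  # unify missing markers
df['name'] = col
```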
Topic 7: Data Summary & Quality Assessment (20 min)
What you'll learn:
- Generate comprehensive data summaries
- Calculate data quality scores
- Compare before/after cleaning
- Create cleaning reports
Key Functions:
```python
# Pretty printed summary
print_data_summary(df)

# Programmatic access
summary = get_data_summary(df)
```
Quality Metrics:
- Total rows and columns
- Missing value counts and percentages
- Duplicate counts
- Data types
- Memory usage
- Statistical summaries for numeric columns
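A minimal version of such a summary, built from standard pandas calls (a sketch, not the module's exact output):
```python
def data_summary_sketch(df):
    """Collect the quality metrics listed above into one dict."""
    return {
        'rows': len(df),
        'columns': df.shape[1],
        'missing_per_column': df.isna().sum().to_dict(),
        'missing_pct': (df.isna().mean() * 100).round(2).to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        'dtypes': df.dtypes.astype(str).to_dict(),
        'memory_mb': round(df.memory_usage(deep=True).sum() / 1_048_576, 2),
    }
```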
Topic 8: Quick Clean Pipeline (15 min)
What you'll learn:
- Apply all cleaning steps at once
- When to use vs. manual steps
- Customize for your needs
- Build reusable pipelines
Key Function:
```python
df_clean = quick_clean(
    df,
    clean_headers=True,
    remove_dupes=True,
    numeric_columns=['age', 'fare'],
    text_columns=['name', 'cabin']
)
```
Use quick_clean for:
- ✅ Initial exploration
- ✅ Prototyping
- ✅ Standard datasets
Use manual steps for:
- ✅ Production pipelines
- ✅ Complex requirements
- ✅ Need detailed validation
🎓 Complete Workflow Example
Professional Data Cleaning Pipeline:
```python
import pandas as pd
from pandas_data_cleaning import *

# 1. LOAD AND INSPECT
df = pd.read_csv('titanic_dirty.csv')
print_data_summary(df)

# 2. CLEAN HEADERS
df = clean_column_headers(df)

# 3. REMOVE DUPLICATES
df = remove_duplicates(df)

# 4. CLEAN NUMERIC COLUMNS
df = clean_numeric_column(df, 'age', fill_method='median')
df = clean_numeric_column(df, 'fare', fill_method='median', remove_negatives=True)

# 5. HANDLE OUTLIERS (investigate first!)
outliers = detect_outliers_iqr(df, 'fare', factor=1.5)
print(f"Outliers detected: {outliers.sum()}")
# Decision: Keep them (they're real values)

# 6. CLEAN TEXT COLUMNS
df = clean_text_column(df, 'name', lowercase=True)
df = clean_text_column(df, 'cabin', lowercase=False)  # Keep case
df = clean_text_column(df, 'embarked', lowercase=True)

# 7. FILL REMAINING MISSING VALUES
df = fill_missing_values(df, 'cabin', method='custom', custom_value='Unknown')
df = fill_missing_values(df, 'survived', method='mode')

# 8. FINAL VALIDATION
print_data_summary(df)

# 9. SAVE CLEANED DATA
df.to_csv('titanic_cleaned.csv', index=False)
print("✅ Data cleaning complete!")
```
💡 Pro Tips & Best Practices
Golden Rules:
- 👀 Look Before You Leap
  - Always inspect data before cleaning
  - Use print_data_summary(df) first
  - Understand what you're cleaning
- 💾 Make Copies
```python
df_original = df.copy()  # Always keep original
df_clean = clean_column_headers(df.copy())
```
- 📝 Document Everything
```python
cleaning_log = {
    'age': {
        'action': 'filled_missing',
        'method': 'median',
        'value': 28.0,
        'reason': 'Robust to outliers',
        'affected_rows': 177
    }
}
```
- ✅ Validate Each Step
  - Check results after each operation
  - Compare before/after
  - Ensure changes make sense
- 🎯 Choose Methods Wisely
  - Use median for skewed data
  - Use mean for normal distributions
  - Use mode for categorical data
  - Use custom values with domain knowledge
🏢 Domain-Specific Strategies
For Financial Data:
```python
def clean_financial_data(df):
    # Preserve precision (4 decimals)
    df = clean_numeric_column(df, 'price', round_decimals=4)
    # Don't remove negatives (profits can be negative!)
    df = clean_numeric_column(df, 'profit', remove_negatives=False)
    # Use forward fill for time series
    df = fill_missing_values(df, 'stock_price', method='forward')
    return df
```
For Healthcare Data:
```python
def clean_medical_data(df):
    # Keep case for medical codes
    df = clean_text_column(df, 'icd_code', lowercase=False)
    # Be conservative with outliers (might be real conditions)
    df = remove_outliers_iqr(df, 'test_result', factor=3.0)
    # Standardize dates
    df['admission_date'] = pd.to_datetime(df['admission_date'])
    return df
```
For E-commerce Data:
```python
def clean_ecommerce_data(df):
    # Standardize product names
    df = clean_text_column(df, 'product_name', lowercase=True)
    # Price must be positive
    df = clean_numeric_column(df, 'price', remove_negatives=True)
    # Quantity must be whole numbers
    df = clean_numeric_column(df, 'quantity', fill_method='zero', round_decimals=0)
    return df
```
⚠️ Common Pitfalls to Avoid
1. Cleaning Without Understanding
❌ Bad:
```python
df = quick_clean(df)  # What am I cleaning?
```
✅ Good:
```python
print_data_summary(df)  # Inspect first!
df = quick_clean(df)    # Then clean
```
2. Using Mean with Outliers
❌ Bad:
```python
# Data: [10, 20, 30, 40, 1000]
df = clean_numeric_column(df, 'income', fill_method='mean')  # Skewed!
```
✅ Good:
```python
df = clean_numeric_column(df, 'income', fill_method='median')  # Robust!
```
3. Removing Outliers Blindly
❌ Bad:
```python
df = remove_outliers_iqr(df, 'fare')  # Are they errors or real?
```
✅ Good:
```python
outliers = detect_outliers_iqr(df, 'fare')
print(df[outliers])  # Investigate first!
# Then decide to keep/remove/cap
```
4. Lowercasing Everything
❌ Bad:
```python
df = clean_text_column(df, 'cabin', lowercase=True)
# Result: C85 → c85 (loses readability)
```
✅ Good:
```python
df = clean_text_column(df, 'cabin', lowercase=False)
# Result: C85 → C85 (keeps meaning)
```
5. Not Documenting Decisions
❌ Bad:
```python
df = clean_numeric_column(df, 'age', fill_method='median')
# Why median? How many filled? No record!
```
✅ Good:
```python
missing_before = df['age'].isna().sum()
df = clean_numeric_column(df, 'age', fill_method='median')
log = {'age': {'method': 'median', 'filled': missing_before, 'reason': 'Robust'}}
```
📊 Quick Reference Card
Column Headers
```python
clean_column_headers(df)
# Lowercase, remove spaces, remove special chars
```
Numeric Data
```python
clean_numeric_column(df, 'col', fill_method='median')
# Methods: median, mean, mode, zero
```
Duplicates
```python
remove_duplicates(df, subset=['col'], keep='first')
# Keep: first, last, False
```
Missing Values
```python
fill_missing_values(df, 'col', method='median')
# Methods: mean, median, mode, forward, backward, custom
```
Outliers
```python
detect_outliers_iqr(df, 'col', factor=1.5)
remove_outliers_iqr(df, 'col', factor=1.5)
# Factor: 1.5 (standard), 3.0 (extreme)
```
Text Data
```python
clean_text_column(df, 'col', lowercase=True)
# Options: lowercase, remove_extra_spaces, standardize_missing
```
Summary
```python
print_data_summary(df)
get_data_summary(df)  # Returns dict
```
Quick Clean
```python
quick_clean(df, clean_headers=True, remove_dupes=True)
```
🎯 Study Schedule
Intensive (1 Week):
- Day 1-2: Topics 1-3 (Headers, Numeric, Duplicates)
- Day 3-4: Topics 4-5 (Missing Values, Outliers)
- Day 5-6: Topics 6-8 (Text, Summary, Quick Clean)
- Day 7: Complete workflow + Practice with own data
Relaxed (3 Weeks):
- Week 1: Topics 1-3 + Notebook exercises
- Week 2: Topics 4-6 + Real dataset practice
- Week 3: Topics 7-8 + Build portfolio project
✅ Milestones & Goals
Beginner:
- ✅ Can clean column headers
- ✅ Can handle missing numeric values
- ✅ Can remove duplicates
- ✅ Can generate data summary
- ✅ Understand median vs mean
Intermediate:
- ✅ Can detect and handle outliers
- ✅ Can clean text data effectively
- ✅ Can build complete cleaning pipeline
- ✅ Can choose appropriate methods
- ✅ Can validate cleaning results
Advanced:
- ✅ Can create domain-specific functions
- ✅ Can optimize for large datasets
- ✅ Can create reusable pipelines
- ✅ Can document all decisions
- ✅ Can teach data cleaning to others
📁 Files & Resources
All files are available in your outputs folder:
- pandas_data_cleaning.py - Main module
- DATA_CLEANING_GUIDE.md - 70+ page comprehensive guide
- data_cleaning_tutorial.ipynb - Interactive Jupyter notebook
- titanic_dirty.csv - Sample dataset for practice
- README.md - Quick start guide
🚀 Next Steps
- ✅ Read this overview (you just did!)
- 📓 Open the Jupyter notebook and start exercises
- 📖 Reference the full guide as you progress
- 💻 Practice with real data from Kaggle
- 🎯 Build 3-5 portfolio projects to showcase
💪 Final Thoughts
Data cleaning is foundational. It's not glamorous, but it's essential:
- By many practitioners' estimates, ~80% of data science time goes to cleaning and preparing data
- Clean data = Better insights
- Clean data = Accurate models
- Clean data = Confident decisions
Master these techniques and you'll stand out as a professional data analyst!
🎉 Ready to become a data cleaning pro? Start with the Jupyter notebook now!
Last updated: November 8, 2025