🧹 Data Cleaning with pandas - Complete Portfolio Guide
Author: Teslim Uthman Adeyanju
Purpose: Professional data cleaning techniques for pandas DataFrames
Status: Study Material
📦 Learning Resources Overview
This comprehensive guide contains everything you need to master data cleaning in pandas:
1️⃣ pandas_data_cleaning.py Module
Your main 400-line production-ready module with 8 organized sections:
- ✅ Column Header Cleaning
- ✅ Numeric Data Cleaning
- ✅ Duplicate Management
- ✅ Missing Values Handling
- ✅ Outlier Detection & Removal
- ✅ Text Data Cleaning
- ✅ Data Summary & Inspection
- ✅ Quick Clean & Convenience Aliases
2️⃣ Complete Reference Guide (70+ pages)
In-depth tutorial with Titanic dataset case study
3️⃣ Interactive Jupyter Notebook
Hands-on exercises with immediate feedback
4️⃣ Sample Dataset
Titanic data with intentional quality issues for practice
🎯 Quick Start
For Beginners:
- Start with the Jupyter notebook (interactive)
- Complete all exercises step-by-step
- Read the guide sections alongside
- Practice with your own datasets
For Intermediate:
- Review the complete workflow example
- Study domain-specific strategies
- Build your own cleaning pipelines
- Optimize for performance
For Reference:
- Use the quick reference card
- Search for specific topics
- Review pro tips periodically
📚 Learning Path (3.5 hours total)
Topic 1: Column Header Cleaning (15 min)
What you'll learn:
- Standardize column names to lowercase
- Remove spaces and special characters
- Handle duplicate column names
- Create Python-friendly variable names
Key Function:
```python
df = clean_column_headers(df)
```
Why it matters: Clean headers prevent coding errors and make automation easier.
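For intuition, here is a minimal sketch of what header cleaning can look like in plain pandas (illustrative only; the actual implementation in pandas_data_cleaning.py may differ):
```python
import re
import pandas as pd

def clean_headers_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative: lowercase, strip, and snake_case all column names."""
    df = df.copy()
    df.columns = [
        re.sub(r'[^0-9a-z]+', '_', str(col).strip().lower()).strip('_')
        for col in df.columns
    ]
    return df

# Example: ['Passenger ID', 'Fare ($)'] -> ['passenger_id', 'fare']
```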
Topic 2: Numeric Data Cleaning (30 min)
What you'll learn:
- Handle non-numeric strings (#REF!, NA, nil)
- Fill missing values (median, mean, mode)
- Remove negative values when inappropriate
- Round to appropriate decimal places
Key Function:
```python
df = clean_numeric_column(
    df,
    column='age',
    fill_method='median',
    remove_negatives=False,
    round_decimals=0
)
```
Decision Framework:
- Use median for data with outliers (most cases) ✅
- Use mean for normally distributed data
- Use mode for discrete/categorical numeric data
- Use zero when zero is meaningful
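To see why median is the usual default, a quick illustration in plain pandas (hypothetical values):
```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 1000, None])  # one extreme outlier, one missing value
print(s.mean())    # 220.0 -- dragged upward by the outlier
print(s.median())  # 30.0  -- robust to the outlier
s_filled = s.fillna(s.median())  # fills the gap with 30.0, not 220.0
```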
Topic 3: Duplicate Management (20 min)
What you'll learn:
- Detect duplicate rows
- Remove duplicates based on all or specific columns
- Choose which occurrence to keep (first/last)
- Investigate before removing
Key Function:
```python
df = remove_duplicates(
    df,
    subset=['name'],   # Only consider these columns
    keep='first'       # Keep first occurrence
)
```
Important: Always investigate duplicates before removing - some may be legitimate!
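A quick way to investigate in plain pandas before dropping anything (using the name column from the Titanic example):
```python
# keep=False marks every row in a duplicate group, not just the extra copies
dupes = df[df.duplicated(subset=['name'], keep=False)]
print(dupes.sort_values('name'))  # inspect the groups side by side before deciding
```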
Topic 4: Missing Values Handling (45 min)
What you'll learn:
- Understand why data is missing
- Choose appropriate fill strategies
- Use forward/backward fill for time series
- Apply custom values with domain knowledge
Key Function:
```python
df = fill_missing_values(
    df,
    column='cabin',
    method='custom',
    custom_value='Unknown'
)
```
Strategy Guide:

| Data Type | Method | Example |
| --- | --- | --- |
| Numeric with outliers | median | Age, Income |
| Normally distributed | mean | Test scores |
| Categorical | mode | City, Status |
| Time series | forward/backward | Stock prices |
| Domain knowledge | custom | 0 for counts |
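For the time-series row of the table above, the plain pandas equivalents are ffill and bfill (shown here on a hypothetical stock_price column):
```python
# Forward fill: carry the last observed price forward in time
df['stock_price'] = df['stock_price'].ffill()
# Backward fill: pull the next observed value back (e.g., for leading gaps)
df['stock_price'] = df['stock_price'].bfill()
```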
Topic 5: Outlier Detection (40 min)
What you'll learn:
- Detect outliers using IQR method
- Decide when to remove vs. keep outliers
- Cap outliers instead of removing
- Use appropriate factor values
Key Functions:
```python
# Detect only
outliers = detect_outliers_iqr(df, 'fare', factor=1.5)

# Remove outliers
df = remove_outliers_iqr(df, 'fare', factor=1.5)
```
Decision Framework:
- ✅ REMOVE if data entry errors
- ✅ REMOVE if measurement errors
- ❌ KEEP if true extreme values
- ❌ KEEP if natural variation
- Factor 1.5 = standard outliers
- Factor 3.0 = extreme outliers only
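The IQR rule itself is only a few lines; a sketch of how detection can be implemented (illustrative, not necessarily the module's exact code):
```python
import pandas as pd

def detect_outliers_iqr_sketch(df: pd.DataFrame, column: str,
                               factor: float = 1.5) -> pd.Series:
    """Boolean mask: True for rows outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return (df[column] < q1 - factor * iqr) | (df[column] > q3 + factor * iqr)
```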
Topic 6: Text Data Cleaning (25 min)
What you'll learn:
- Standardize capitalization
- Remove extra whitespace
- Handle multiple missing representations
- When to preserve vs. lowercase
Key Function:
```python
df = clean_text_column(
    df,
    column='name',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)
```
When to lowercase:
- ✅ YES: Cities, countries, categories
- ❌ NO: Names, addresses, codes (C85 cabin)
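Under the hood these are ordinary pandas string operations; roughly (illustrative, with 'n/a' and 'nil' as example missing markers):
```python
import pandas as pd

col = df['name'].astype('string')
col = col.str.strip()                            # trim leading/trailing spaces
col = col.str.replace(r'\s+', ' ', regex=True)   # collapse internal whitespace
col = col.str.lower()                            # only when case carries no meaning!
col = col.replace({'n/a': pd.NA, 'nil': pd.NA, '': pd.NA})  # unify missing markers
df['name'] = col
```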
Topic 7: Data Summary & Quality Assessment (20 min)
What you'll learn:
- Generate comprehensive data summaries
- Calculate data quality scores
- Compare before/after cleaning
- Create cleaning reports
Key Functions:
```python
# Pretty printed summary
print_data_summary(df)

# Programmatic access
summary = get_data_summary(df)
```
Quality Metrics:
- Total rows and columns
- Missing value counts and percentages
- Duplicate counts
- Data types
- Memory usage
- Statistical summaries for numeric columns
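A minimal version of such a summary, built from standard pandas calls (a sketch, not the module's exact output):
```python
def data_summary_sketch(df):
    """Collect the quality metrics listed above into one dict."""
    return {
        'rows': len(df),
        'columns': df.shape[1],
        'missing_per_column': df.isna().sum().to_dict(),
        'missing_pct': (df.isna().mean() * 100).round(2).to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        'dtypes': df.dtypes.astype(str).to_dict(),
        'memory_mb': round(df.memory_usage(deep=True).sum() / 1_048_576, 2),
    }
```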
Topic 8: Quick Clean Pipeline (15 min)
What you'll learn:
- Apply all cleaning steps at once
- When to use vs. manual steps
- Customize for your needs
- Build reusable pipelines
Key Function:
```python
df_clean = quick_clean(
    df,
    clean_headers=True,
    remove_dupes=True,
    numeric_columns=['age', 'fare'],
    text_columns=['name', 'cabin']
)
```
Use quick_clean for:
- ✅ Initial exploration
- ✅ Prototyping
- ✅ Standard datasets
Use manual steps for:
- ✅ Production pipelines
- ✅ Complex requirements
- ✅ Need detailed validation
🎓 Complete Workflow Example
Professional Data Cleaning Pipeline:
```python
import pandas as pd
from pandas_data_cleaning import *

# 1. LOAD AND INSPECT
df = pd.read_csv('titanic_dirty.csv')
print_data_summary(df)

# 2. CLEAN HEADERS
df = clean_column_headers(df)

# 3. REMOVE DUPLICATES
df = remove_duplicates(df)

# 4. CLEAN NUMERIC COLUMNS
df = clean_numeric_column(df, 'age', fill_method='median')
df = clean_numeric_column(df, 'fare', fill_method='median', remove_negatives=True)

# 5. HANDLE OUTLIERS (investigate first!)
outliers = detect_outliers_iqr(df, 'fare', factor=1.5)
print(f"Outliers detected: {outliers.sum()}")
# Decision: Keep them (they're real values)

# 6. CLEAN TEXT COLUMNS
df = clean_text_column(df, 'name', lowercase=True)
df = clean_text_column(df, 'cabin', lowercase=False)  # Keep case
df = clean_text_column(df, 'embarked', lowercase=True)

# 7. FILL REMAINING MISSING VALUES
df = fill_missing_values(df, 'cabin', method='custom', custom_value='Unknown')
df = fill_missing_values(df, 'survived', method='mode')

# 8. FINAL VALIDATION
print_data_summary(df)

# 9. SAVE CLEANED DATA
df.to_csv('titanic_cleaned.csv', index=False)
print("✅ Data cleaning complete!")
```
💡 Pro Tips & Best Practices
Golden Rules:
- 👀 Look Before You Leap
  - Always inspect data before cleaning
  - Use print_data_summary(df) first
  - Understand what you're cleaning
- 💾 Make Copies
```python
df_original = df.copy()  # Always keep original
df_clean = clean_column_headers(df.copy())
```
- 📝 Document Everything
```python
cleaning_log = {
    'age': {
        'action': 'filled_missing',
        'method': 'median',
        'value': 28.0,
        'reason': 'Robust to outliers',
        'affected_rows': 177
    }
}
```
- ✅ Validate Each Step
  - Check results after each operation
  - Compare before/after
  - Ensure changes make sense
- 🎯 Choose Methods Wisely
  - Use median for skewed data
  - Use mean for normal distributions
  - Use mode for categorical data
  - Use custom values with domain knowledge
🏢 Domain-Specific Strategies
For Financial Data:
```python
def clean_financial_data(df):
    # Preserve precision (4 decimals)
    df = clean_numeric_column(df, 'price', round_decimals=4)
    # Don't remove negatives (profits can be negative!)
    df = clean_numeric_column(df, 'profit', remove_negatives=False)
    # Use forward fill for time series
    df = fill_missing_values(df, 'stock_price', method='forward')
    return df
```
For Healthcare Data:
```python
def clean_medical_data(df):
    # Keep case for medical codes
    df = clean_text_column(df, 'icd_code', lowercase=False)
    # Be conservative with outliers (might be real conditions)
    df = remove_outliers_iqr(df, 'test_result', factor=3.0)
    # Standardize dates
    df['admission_date'] = pd.to_datetime(df['admission_date'])
    return df
```
For E-commerce Data:
```python
def clean_ecommerce_data(df):
    # Standardize product names
    df = clean_text_column(df, 'product_name', lowercase=True)
    # Price must be positive
    df = clean_numeric_column(df, 'price', remove_negatives=True)
    # Quantity must be whole numbers
    df = clean_numeric_column(df, 'quantity', fill_method='zero', round_decimals=0)
    return df
```
⚠️ Common Pitfalls to Avoid
1. Cleaning Without Understanding
❌ Bad:
```python
df = quick_clean(df)  # What am I cleaning?
```
✅ Good:
```python
print_data_summary(df)  # Inspect first!
df = quick_clean(df)    # Then clean
```
2. Using Mean with Outliers
❌ Bad:
```python
# Data: [10, 20, 30, 40, 1000]
df = clean_numeric_column(df, 'income', fill_method='mean')  # Skewed!
```
✅ Good:
```python
df = clean_numeric_column(df, 'income', fill_method='median')  # Robust!
```
3. Removing Outliers Blindly
❌ Bad:
```python
df = remove_outliers_iqr(df, 'fare')  # Are they errors or real?
```
✅ Good:
```python
outliers = detect_outliers_iqr(df, 'fare')
print(df[outliers])  # Investigate first!
# Then decide to keep/remove/cap
```
4. Lowercasing Everything
❌ Bad:
```python
df = clean_text_column(df, 'cabin', lowercase=True)
# Result: C85 → c85 (loses readability)
```
✅ Good:
```python
df = clean_text_column(df, 'cabin', lowercase=False)
# Result: C85 → C85 (keeps meaning)
```
5. Not Documenting Decisions
❌ Bad:
```python
df = clean_numeric_column(df, 'age', fill_method='median')
# Why median? How many filled? No record!
```
✅ Good:
```python
missing_before = df['age'].isna().sum()
df = clean_numeric_column(df, 'age', fill_method='median')
log = {'age': {'method': 'median', 'filled': missing_before, 'reason': 'Robust'}}
```
📊 Quick Reference Card
Column Headers
```python
clean_column_headers(df)
# Lowercase, remove spaces, remove special chars
```
Numeric Data
```python
clean_numeric_column(df, 'col', fill_method='median')
# Methods: median, mean, mode, zero
```
Duplicates
```python
remove_duplicates(df, subset=['col'], keep='first')
# Keep: first, last, False
```
Missing Values
```python
fill_missing_values(df, 'col', method='median')
# Methods: mean, median, mode, forward, backward, custom
```
Outliers
```python
detect_outliers_iqr(df, 'col', factor=1.5)
remove_outliers_iqr(df, 'col', factor=1.5)
# Factor: 1.5 (standard), 3.0 (extreme)
```
Text Data
```python
clean_text_column(df, 'col', lowercase=True)
# Options: lowercase, remove_extra_spaces, standardize_missing
```
Summary
```python
print_data_summary(df)
get_data_summary(df)  # Returns dict
```
Quick Clean
```python
quick_clean(df, clean_headers=True, remove_dupes=True)
```
🎯 Study Schedule
Intensive (1 Week):
- Day 1-2: Topics 1-3 (Headers, Numeric, Duplicates)
- Day 3-4: Topics 4-5 (Missing Values, Outliers)
- Day 5-6: Topics 6-8 (Text, Summary, Quick Clean)
- Day 7: Complete workflow + Practice with own data
Relaxed (3 Weeks):
- Week 1: Topics 1-3 + Notebook exercises
- Week 2: Topics 4-6 + Real dataset practice
- Week 3: Topics 7-8 + Build portfolio project
✅ Milestones & Goals
Beginner:
- ✅ Can clean column headers
- ✅ Can handle missing numeric values
- ✅ Can remove duplicates
- ✅ Can generate data summary
- ✅ Understand median vs mean
Intermediate:
- ✅ Can detect and handle outliers
- ✅ Can clean text data effectively
- ✅ Can build complete cleaning pipeline
- ✅ Can choose appropriate methods
- ✅ Can validate cleaning results
Advanced:
- ✅ Can create domain-specific functions
- ✅ Can optimize for large datasets
- ✅ Can create reusable pipelines
- ✅ Can document all decisions
- ✅ Can teach data cleaning to others
📁 Files & Resources
All files are available in your outputs folder:
- pandas_data_cleaning.py - Main module
- DATA_CLEANING_GUIDE.md - 70+ page comprehensive guide
- data_cleaning_tutorial.ipynb - Interactive Jupyter notebook
- titanic_dirty.csv - Sample dataset for practice
- README.md - Quick start guide
🚀 Next Steps
- ✅ Read this overview (you just did!)
- 📓 Open the Jupyter notebook and start exercises
- 📖 Reference the full guide as you progress
- 💻 Practice with real data from Kaggle
- 🎯 Build 3-5 portfolio projects to showcase
💪 Final Thoughts
Data cleaning is foundational. It's not glamorous, but it's essential:
- By many practitioners' estimates, ~80% of data science time goes to cleaning and preparing data
- Clean data = Better insights
- Clean data = Accurate models
- Clean data = Confident decisions
Master these techniques and you'll stand out as a professional data analyst!
🎉 Ready to become a data cleaning pro? Start with the Jupyter notebook now!
Last updated: November 8, 2025