Complete Data Cleaning Guide Using the Titanic Dataset
A Step-by-Step Tutorial to Master pandas_data_cleaning.py
Author: Teslim Uthman Adeyanju
Purpose: Learn professional data cleaning techniques from beginner to pro
Case Study: Titanic Passenger Data
Table of Contents
Introduction & Setup
Section 1: Column Header Cleaning
Section 2: Numeric Data Cleaning
Section 3: Duplicate Management
Section 4: Missing Values Handling
Section 5: Outlier Detection & Removal
Section 6: Text Data Cleaning
Section 7: Data Summary & Inspection
Section 8: Quick Clean Pipeline
Complete Workflow Example
Pro Tips & Best Practices
1. Introduction & Setup
What You'll Learn
By the end of this guide, you'll be able to:
- Identify and fix common data quality issues
- Apply appropriate cleaning techniques for different data types
- Build reproducible data cleaning pipelines
- Make informed decisions about data transformations
- Document your cleaning process professionally
Setup Your Environment
# Import required libraries
import pandas as pd
import numpy as np
from pandas_data_cleaning import *

# Load the dirty Titanic dataset
df = pd.read_csv('titanic_dirty.csv')

# Always look at your data first!
print("Raw Data:")
print(df.head())
print(f"\nDataset Shape: {df.shape}")
print(f"\nColumn Names: {df.columns.tolist()}")
Common Data Quality Issues in Real Datasets
The Titanic dataset we're using has these intentional problems:
- Inconsistent column names (spaces, mixed case)
- Duplicate rows
- Missing values (empty strings, 'NA', 'Nil', None)
- Inconsistent text formatting (mixed case, extra spaces)
- Non-numeric values in numeric columns ('#REF!', 'NA')
- Outliers in fare prices
- Inconsistent categorical values
These are the EXACT problems you'll face in real-world data!
2. Section 1: Column Header Cleaning
Why This Matters
Column names are your interface to the data. Bad column names lead to:
- Coding errors and typos
- Difficulty in automation
- Problems when exporting to databases
- Confusion in team collaboration
What It Does
The clean_column_headers() function:
1. Converts all column names to lowercase
2. Removes leading/trailing whitespace
3. Replaces spaces with underscores
4. Removes special characters
5. Handles duplicate column names
6. Creates Python-friendly variable names
Practical Example
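A minimal sketch of the basic call, assuming clean_column_headers() takes a DataFrame and returns it with standardized column names, as it is used throughout the rest of this guide (the example header values are illustrative):

# Messy headers straight from the CSV
print("Before:", df.columns.tolist())
# e.g. ['Passenger ID', ' Name ', 'AGE', 'Fare ($)']

# One call standardizes every header
df = clean_column_headers(df)

print("After:", df.columns.tolist())
# e.g. ['passenger_id', 'name', 'age', 'fare']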
When to Use
- ALWAYS as the first step in your cleaning pipeline
- Before any column-specific operations
- When receiving data from external sources
- When merging datasets from different sources
Common Pitfalls
- DON'T clean headers multiple times (they're already clean after the first pass)
- DON'T manually rename columns before using this function
- DO save a copy of original column names if needed for documentation
Pro Tips
# Save original column mapping for reference
original_cols = df.columns.tolist()
df = clean_column_headers(df)
cleaned_cols = df.columns.tolist()

# Create a mapping dictionary
col_mapping = dict(zip(original_cols, cleaned_cols))
print("Column Name Changes:")
for old, new in col_mapping.items():
    if old != new:
        print(f"  '{old}' → '{new}'")
Before & After Comparison
# Show the transformation
comparison = pd.DataFrame({
    'Original': original_cols,
    'Cleaned': cleaned_cols,
    'Changed': [old != new for old, new in zip(original_cols, cleaned_cols)]
})
print(comparison)
3. Section 2: Numeric Data Cleaning
Why This Matters
Numeric columns are used for calculations, statistics, and machine learning. Even one bad value can ruin your entire analysis!
What It Does
The clean_numeric_column() function:
1. Replaces problematic strings ('#REF!', 'Nil', 'NA', etc.) with NaN
2. Converts column to numeric type
3. Fills missing values using statistical methods
4. Optionally removes negative values
5. Rounds to specified decimal places
Practical Example: Cleaning Age Column
Practical Example: Cleaning Fare Column
Choosing the Right Fill Method

| Method | When to Use | Example |
|--------|-------------|---------|
| 'median' | BEST for most cases; robust to outliers | Age, Fare, Income |
| 'mean' | When data is normally distributed, no outliers | Test scores, Height |
| 'mode' | For discrete/categorical numeric data | Number of siblings, Pclass |
| 'zero' | When zero is meaningful | Number of purchases, Complaints |
Common Pitfalls
Pro Tips
Visualizing Numeric Cleaning Impact
4. Section 3: Duplicate Management
Why This Matters
Duplicate rows can:
- Inflate your dataset size unnecessarily
- Bias statistical analyses
- Create incorrect counts and aggregations
- Cause issues in machine learning (data leakage)
What It Does
The remove_duplicates() function:
1. Identifies duplicate rows based on all or specific columns
2. Keeps first, last, or removes all duplicates
3. Resets the index for clean row numbering
4. Reports how many duplicates were removed
Practical Example: Finding Duplicates
# First, let's see if we have duplicates
print("Checking for duplicates...")
print(f"Total rows: {len(df)}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Show the duplicates
if df.duplicated().sum() > 0:
    print("\nDuplicate entries:")
    print(df[df.duplicated(keep=False)].sort_values('name'))
Example 1: Remove Complete Duplicates
# Remove rows where ALL columns are identical
df_clean = remove_duplicates(df)
# Output: Removed 2 duplicate rows

print(f"Rows before: {len(df)}")
print(f"Rows after: {len(df_clean)}")
Example 2: Remove Duplicates Based on Specific Columns
# Maybe we only care about passenger name being unique
df_unique_names = remove_duplicates(
    df,
    subset=['name'],   # Only consider 'name' column
    keep='first'       # Keep the first occurrence
)
print(f"Unique passengers: {len(df_unique_names)}")
Example 3: Keep Last Occurrence
# Keep the most recent entry (useful for updated records)
df_keep_last = remove_duplicates(
    df,
    subset=['passenger_id'],
    keep='last'   # Keep the last occurrence
)
Understanding the 'keep' Parameter

| Value | Behavior | Use Case |
|-------|----------|----------|
| 'first' | Keeps the first occurrence | DEFAULT - most common |
| 'last' | Keeps the last occurrence | When you want the most recent record |
| False | Removes ALL duplicates | When duplicates indicate data errors |
Common Pitfalls
Pro Tips
Duplicate Patterns Analysis
5. Section 4: Missing Values Handling
Why This Matters
Missing data is inevitable in real-world datasets. How you handle it can make or break your analysis:
- Deleting rows loses information
- Improper imputation introduces bias
- Different columns need different strategies
- Many ML algorithms can't handle missing values
What It Does
The fill_missing_values() function:
1. Detects missing values (NaN, None, null)
2. Fills them using various strategies
3. Supports forward/backward fill for time series
4. Allows custom values
5. Reports how many values were filled
Practical Example: Understanding Missing Data Patterns
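A quick way to see the pattern before choosing a strategy; this sketch uses only plain pandas, so nothing here depends on the helper module:

# Count and rank missing values per column
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({'missing': missing, 'percent': missing_pct}))

# Do rows tend to miss several fields at once?
print(f"\nRows with at least one missing value: {df.isnull().any(axis=1).sum()}")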
Example 1: Fill Numeric Column with Statistical Method
# Fill age with median (already done in numeric cleaning)
df = fill_missing_values(
    df,
    column='age',
    method='median'
)
# Output: Filled 2 missing values in 'age' using median
Example 2: Fill Categorical Column with Mode
# Fill cabin with most common value
df = fill_missing_values(
    df,
    column='cabin',
    method='mode'
)
# Output: Filled 4 missing values in 'cabin' using mode
Example 3: Fill with Custom Value
# Fill embarked location with 'Unknown' for clarity
df = fill_missing_values(
    df,
    column='embarked',
    method='custom',
    custom_value='Unknown'
)
Example 4: Forward Fill for Time Series
# Imagine we have a time series of stock prices
# Forward fill propagates the last valid observation forward
df_timeseries = fill_missing_values(
    df_timeseries,
    column='stock_price',
    method='forward'
)
Choosing the Right Filling Strategy

| Method | Best For | Example | Pros | Cons |
|--------|----------|---------|------|------|
| 'median' | Numeric with outliers | Age, Income, Prices | Robust to outliers | May not reflect distribution |
| 'mean' | Normally distributed | Test scores, Height | Preserves mean | Sensitive to outliers |
| 'mode' | Categorical data | City, Category, Status | Preserves most common | Increases frequency of mode |
| 'forward' | Time series | Stock prices, Temperatures | Maintains trends | Propagates errors |
| 'backward' | Time series (reverse) | Future projections | Useful for forecasting | Requires future data |
| 'custom' | Domain knowledge | 0 for counts, 'Unknown' | Meaningful | Requires expertise |
Common Pitfalls
Pro Tips
Missing Data Impact Analysis
Advanced: Multiple Imputation Strategies
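One way to combine several strategies is to drive fill_missing_values() from a per-column plan, so each column gets the method that suits it. A sketch, assuming the keyword arguments used elsewhere in this guide; the column-to-method mapping below is purely illustrative:

# Per-column imputation plan: different columns, different strategies
fill_plan = {
    'age': {'method': 'median'},
    'cabin': {'method': 'custom', 'custom_value': 'Unknown'},
    'embarked': {'method': 'mode'},
}

for col, plan in fill_plan.items():
    if col in df.columns:
        df = fill_missing_values(df, column=col, **plan)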
6. Section 5: Outlier Detection & Removal
Why This Matters
Outliers can:
- Distort statistical analyses (mean, standard deviation)
- Reduce model accuracy
- Hide patterns in your data
- Sometimes indicate data entry errors
- Sometimes be the most interesting data points!
KEY QUESTION: Are outliers errors or valuable extreme values?
What It Does
The module provides two functions:
1. detect_outliers_iqr() - Finds outliers (doesn't remove them)
2. remove_outliers_iqr() - Removes outliers based on the IQR method
Practical Example: Detecting Outliers in Fare
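A sketch of detection only (no removal), assuming detect_outliers_iqr() returns a boolean mask, as it does in the complete workflow later in this guide:

# Flag outliers without touching the data
outlier_mask = detect_outliers_iqr(df, 'fare', factor=1.5)

print(f"Outliers found: {outlier_mask.sum()} of {len(df)} rows")
print(df.loc[outlier_mask, ['name', 'fare', 'pclass']].sort_values('fare', ascending=False))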
Example 1: Conservative Outlier Removal
# Use factor=3.0 for extreme outliers only
df_conservative = remove_outliers_iqr(
    df,
    column='fare',
    factor=3.0,      # Only remove EXTREME outliers
    verbose=True
)
# Output: Removed 1 outliers from column 'fare'
Example 2: Standard Outlier Removal
# Use factor=1.5 for standard outliers
df_standard = remove_outliers_iqr(
    df,
    column='fare',
    factor=1.5,      # Standard outlier detection
    verbose=True
)
# Output: Removed 3 outliers from column 'fare'
Example 3: Multiple Column Outlier Detection
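A simple loop covers several columns at once; a sketch assuming the same boolean-mask behaviour of detect_outliers_iqr():

# Check every numeric column of interest
for col in ['age', 'fare']:
    mask = detect_outliers_iqr(df, col, factor=1.5)
    print(f"{col}: {mask.sum()} potential outliers "
          f"({mask.sum() / len(df) * 100:.1f}% of rows)")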
When to Remove Outliers (Decision Framework)

| Situation | Decision | Reasoning |
|-----------|----------|-----------|
| Data entry errors | REMOVE | Age = 999, Fare = -100 (impossible) |
| Measurement errors | REMOVE | Height = 10 feet (instrument error) |
| True extreme values | KEEP | CEO salary, rare disease case |
| Natural variation | KEEP | Athletic performance, stock prices |
| Before modeling | DEPENDS | Some models handle outliers, others don't |
Common Pitfalls
Pro Tips
Visual Comparison: Before and After
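A minimal matplotlib sketch comparing the fare distribution before and after outlier removal; the output filename is illustrative, and remove_outliers_iqr() is assumed to return a filtered copy as in the examples above:

import matplotlib.pyplot as plt

df_no_outliers = remove_outliers_iqr(df, column='fare', factor=1.5)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].boxplot(df['fare'].dropna())
axes[0].set_title('Fare - Before Outlier Removal')
axes[1].boxplot(df_no_outliers['fare'].dropna())
axes[1].set_title('Fare - After Outlier Removal')
plt.tight_layout()
plt.savefig('outlier_comparison.png')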
7. Section 6: Text Data Cleaning
Why This Matters
Text data is messy by nature:
- Inconsistent capitalization
- Extra whitespace
- Typos and variations
- Multiple representations of 'missing'
- Makes grouping and searching difficult
Clean text data enables:
- Accurate categorization
- Reliable searching and filtering
- Consistent analysis
- Better data visualization
What It Does
The clean_text_column() function:
1. Converts text to lowercase
2. Removes extra whitespace
3. Standardizes missing values
4. Handles NaN values properly
5. Preserves actual null values
Practical Example: Cleaning Name Column
# First, see what we're dealing with
print("Name column issues:")
print(df['name'].unique())
# Output: Shows mixed case, extra spaces, etc.

# Clean the name column
df = clean_text_column(
    df,
    column='name',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)

print("\nCleaned names:")
print(df['name'].unique())
Example 1: Cleaning Embarkation Port
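A sketch mirroring the workflow step later in this guide; the idea is that variants such as ' Southampton', 'SOUTHAMPTON' and 'southampton ' (illustrative values) should collapse to a single category:

print("Before:", df['embarked'].unique())

df = clean_text_column(
    df,
    column='embarked',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)

print("After:", df['embarked'].unique())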
Example 2: Preserving Case Sensitivity
# Sometimes you want to keep case (like names)
df = clean_text_column(
    df,
    column='cabin',
    lowercase=False,          # Keep original case
    remove_extra_spaces=True,
    standardize_missing=True
)
Example 3: Cleaning Multiple Text Columns
# Clean all text columns at once
text_columns = ['name', 'cabin', 'embarked']
for col in text_columns:
    if col in df.columns:
        df = clean_text_column(df, col)
        print(f"Cleaned text column: {col}")
Text Cleaning Parameters Guide

| Parameter | Effect | When to Use |
|-----------|--------|-------------|
| lowercase=True | All text → lowercase | Categories, cities, countries |
| lowercase=False | Keep original case | Names, addresses, codes |
| remove_extra_spaces=True | Trim & collapse spaces | Always recommended |
| standardize_missing=True | 'na', 'nil', etc. → NaN | When you want uniform nulls |
Common Pitfalls
Pro Tips
Text Data Quality Assessment
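A small plain-pandas audit that flags the usual text problems per column; a sketch, and nothing here depends on the helper module:

def assess_text_quality(df, column):
    """Report common text quality problems for one column"""
    s = df[column].dropna().astype(str)
    print(f"Column: {column}")
    print(f"  Leading/trailing spaces: {(s != s.str.strip()).sum()}")
    print(f"  Case-only duplicates:    {s.nunique() - s.str.lower().nunique()}")
    print(f"  Placeholder 'missing':   {s.str.lower().isin(['na', 'n/a', 'nil', 'none', '']).sum()}")

for col in ['name', 'cabin', 'embarked']:
    if col in df.columns:
        assess_text_quality(df, col)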
Advanced: Text Normalization Pipeline
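A sketch of chaining the steps into one reusable pipeline; the extra title-casing step for 'name' is an illustrative addition on top of the module, not part of it:

def normalize_text_columns(df, columns):
    """Run the standard text cleaning over several columns in one pass"""
    for col in columns:
        if col in df.columns:
            df = clean_text_column(
                df,
                column=col,
                lowercase=True,
                remove_extra_spaces=True,
                standardize_missing=True
            )
    return df

df = normalize_text_columns(df, ['name', 'cabin', 'embarked'])
# Optional illustrative extra step: re-title-case names for readability
df['name'] = df['name'].str.title()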
8. Section 7: Data Summary & Inspection
Why This Matters
You can't clean what you don't understand! Data inspection is crucial for:
- Understanding data structure and types
- Identifying cleaning priorities
- Tracking cleaning progress
- Documenting dataset characteristics
- Communicating data quality to stakeholders
"Look at your data first, clean it second!"
What It Does
Two powerful functions:
1. get_data_summary() - Returns dictionary with comprehensive statistics
2. print_data_summary() - Pretty-printed formatted summary
Practical Example: Initial Data Inspection
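A sketch of a first look, combining the module's formatted summary with standard pandas inspection:

# One-call formatted overview
print_data_summary(df)

# Standard pandas checks alongside it
print(df.head())
print(df.dtypes)
print(df.describe(include='all'))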
Example 1: Programmatic Access to Summary
Example 2: Before and After Comparison
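A sketch comparing summaries taken before and after cleaning, assuming get_data_summary() exposes the keys used elsewhere in this guide (total_rows, duplicates, missing_values) and that df_original is the saved copy of the raw data:

summary_before = get_data_summary(df_original)
summary_after = get_data_summary(df)

print(f"Rows:       {summary_before['total_rows']} -> {summary_after['total_rows']}")
print(f"Duplicates: {summary_before['duplicates']} -> {summary_after['duplicates']}")
print(f"Missing:    {sum(summary_before['missing_values'].values())} -> "
      f"{sum(summary_after['missing_values'].values())}")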
Example 3: Missing Data Dashboard
Understanding the Summary Statistics
Pro Tips
Creating a Cleaning Progress Tracker
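The complete workflow later in this guide uses a DataCleaningTracker helper with add_checkpoint() and print_progress(); its implementation is not shown here, so below is a minimal sketch of what such a class could look like, recording only row, missing-value, and duplicate counts at each step:

class DataCleaningTracker:
    """Minimal progress tracker (illustrative sketch)"""

    def __init__(self, df, name="dataset"):
        self.name = name
        self.checkpoints = []
        self.add_checkpoint(df, "initial state")

    def add_checkpoint(self, df, label):
        # Record a snapshot of basic quality metrics after a cleaning step
        self.checkpoints.append({
            'step': label,
            'rows': len(df),
            'missing': int(df.isnull().sum().sum()),
            'duplicates': int(df.duplicated().sum()),
        })

    def print_progress(self):
        print(f"\nCleaning progress for {self.name}:")
        for cp in self.checkpoints:
            print(f"  {cp['step']}: {cp['rows']} rows, "
                  f"{cp['missing']} missing, {cp['duplicates']} duplicates")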
9. Section 8: Quick Clean Pipeline
Why This Matters
Sometimes you need to clean data FAST! The quick_clean() function:
- Applies multiple cleaning steps at once
- Uses sensible defaults
- Automates common workflows
- Saves time on routine cleaning tasks
Perfect for: Initial exploration, prototypes, standard datasets
Not recommended for: Production pipelines (use individual functions for control)
What It Does
The quick_clean() function:
1. Cleans column headers
2. Removes duplicate rows
3. Auto-detects and cleans numeric columns
4. Auto-detects and cleans text columns
5. Prints progress messages
# Clean only specific columns
df_clean = quick_clean(
    df,
    clean_headers=True,
    remove_dupes=True,
    numeric_columns=['age', 'fare'],   # Only these
    text_columns=['name', 'cabin']     # Only these
)
Example 3: Before and After Comparison
When to Use Quick Clean

| Situation | Use Quick Clean? | Why? |
|-----------|------------------|------|
| Exploring new dataset | YES | Fast initial cleanup |
| Kaggle competition start | YES | Quick baseline |
| Production pipeline | NO | Need explicit control |
| Critical data | NO | Need validation at each step |
| Documentation needed | NO | Need detailed logging |
| Learning data cleaning | MAYBE | Good for overview, but use individual functions to learn |
Common Pitfalls
Pro Tips
10. Complete Workflow Example
Putting It All Together
Here's a complete, professional data cleaning workflow for the Titanic dataset:
Expected Output
11. Pro Tips & Best Practices
General Best Practices
1. Always Look at Your Data First
# DON'T start cleaning blindly
# df = quick_clean(df)  # What am I cleaning?

# DO inspect first
print_data_summary(df)
print(df.head())
print(df.dtypes)
print(df.describe())
2. Make Copies, Don't Modify Originals
# DON'T modify the original
# clean_column_headers(df)  # Oops, can't go back!

# DO create copies
df_original = df.copy()
df_clean = clean_column_headers(df.copy())
# Now you can compare!
3. Document Your Decisions
4. Validate After Each Step
5. Use Type Hints and Documentation
Domain-Specific Tips
For Financial Data
For Medical/Healthcare Data
For E-commerce Data
Performance Tips
For Large Datasets
Creating Reusable Cleaning Functions
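A sketch of wrapping a project's standard steps into one documented, reusable function; the column names and parameter choices below simply reuse those from earlier sections and are illustrative:

def clean_titanic(df):
    """Standard cleaning recipe for Titanic-style passenger data"""
    df = clean_column_headers(df)
    df = remove_duplicates(df)
    df = clean_numeric_column(df, 'age', fill_method='median', round_decimals=0)
    df = clean_numeric_column(df, 'fare', fill_method='median',
                              remove_negatives=True, round_decimals=2)
    for col in ['name', 'cabin', 'embarked']:
        if col in df.columns:
            df = clean_text_column(df, col)
    return df

df_clean = clean_titanic(pd.read_csv('titanic_dirty.csv'))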
Learning Resources
Next Steps to Become a Pro:
1. Practice with Real Datasets
   - Kaggle datasets (messy data)
   - UCI Machine Learning Repository
   - Government open data portals
2. Learn Advanced Pandas
   - Multi-indexing
   - GroupBy operations
   - Apply and transform functions
3. Study Data Quality Frameworks
   - Data profiling
   - Data lineage
   - Data governance
4. Explore Related Tools
   - Great Expectations (data testing)
   - Pandas Profiling (automated EDA)
   - Dask (large datasets)
5. Read Industry Best Practices
   - "Data Cleaning" by Ihab Ilyas
   - Pandas documentation
   - Real-world case studies
Congratulations!
You've completed the comprehensive guide to data cleaning with pandas!
You now know how to:
- Identify common data quality issues
- Apply appropriate cleaning techniques
- Build reproducible cleaning pipelines
- Document your decisions
- Validate your results
- Handle domain-specific requirements
Remember the golden rules:
1. Look before you leap - Always inspect data first
2. Document everything - Your future self will thank you
3. Validate at each step - Catch problems early
4. Think about the domain - Context matters!
5. Keep the original - Always work on copies
# Inspect the problem
print("Age column before cleaning:")
print(df['age'].head(10))
print(f"\nData type: {df['age'].dtype}")
print(f"Unique values: {df['age'].unique()}")

# Clean the age column
df = clean_numeric_column(
    df,
    column='age',
    fill_method='median',      # Use median (robust to outliers)
    remove_negatives=False,    # Ages shouldn't be negative anyway
    round_decimals=0           # Ages are whole numbers
)

print("\nAge column after cleaning:")
print(df['age'].head(10))
print(f"\nData type: {df['age'].dtype}")
print(f"Statistics: Mean={df['age'].mean():.1f}, Median={df['age'].median():.1f}")
# Fare has different requirements than age
print("Fare column before cleaning:")
print(df['fare'].unique())

df = clean_numeric_column(
    df,
    column='fare',
    fill_method='median',
    remove_negatives=True,   # Negative fares don't make sense
    round_decimals=2         # Money typically has 2 decimal places
)

print("\nFare column after cleaning:")
print(df['fare'].describe())
# DON'T clean non-numeric columns as numeric
# df = clean_numeric_column(df, 'name')  # This will fail!

# DON'T use mean for data with outliers
# Outliers will skew the mean and create unrealistic imputations

# DO check the data type first
print(f"Is 'fare' numeric? {pd.api.types.is_numeric_dtype(df['fare'])}")

# DO inspect unique values before cleaning
print(f"Problematic values in age: {df[df['age'] == '#REF!']['age'].unique()}")
# Tip 1: Always check BEFORE and AFTER
def clean_with_validation(df, column, **kwargs):
    """Clean numeric column with before/after validation"""
    print(f"Cleaning '{column}'...")
    print(f"Before: {df[column].isna().sum()} missing, {df[column].dtype} dtype")
    df = clean_numeric_column(df, column, **kwargs)
    print(f"After: {df[column].isna().sum()} missing, {df[column].dtype} dtype")
    print(f"Range: {df[column].min():.2f} to {df[column].max():.2f}\n")
    return df

# Tip 2: Handle multiple numeric columns efficiently
numeric_columns = ['age', 'fare', 'pclass']
for col in numeric_columns:
    df = clean_with_validation(df, col)

# Tip 3: Create a cleaning report
cleaning_report = {
    'age': {'method': 'median', 'filled': 2, 'reason': 'Robust to outliers'},
    'fare': {'method': 'median', 'filled': 1, 'reason': 'Contains outliers'},
}
print("Cleaning Report:", cleaning_report)
# Compare distributions before and after
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Before (load original data)
df_original = pd.read_csv('titanic_dirty.csv')
df_original['Age'] = pd.to_numeric(df_original['Age'], errors='coerce')

# Plot
axes[0].hist(df_original['Age'].dropna(), bins=20, edgecolor='black')
axes[0].set_title('Age Distribution - Before Cleaning')
axes[0].set_xlabel('Age')
axes[1].hist(df['age'], bins=20, edgecolor='black')
axes[1].set_title('Age Distribution - After Cleaning')
axes[1].set_xlabel('Age')
plt.tight_layout()
plt.savefig('numeric_cleaning_comparison.png')
print("Saved comparison plot")
# DON'T remove duplicates without investigating first
# Some "duplicates" might be legitimate (e.g., same name, different person)

# DON'T forget to specify subset when needed
# remove_duplicates(df)                   # Removes ONLY if ALL columns match
# remove_duplicates(df, subset=['name'])  # More flexible

# DO investigate duplicates before removing
duplicates = df[df.duplicated(subset=['name'], keep=False)]
print(f"Found {len(duplicates)} duplicate names")
print(duplicates[['name', 'age', 'fare']])  # Compare the records

# DO document your duplicate removal strategy
duplicate_removal_notes = """
Duplicate Removal Strategy:
- Remove complete duplicate rows (all columns identical)
- Keep first occurrence when passenger_id matches
- Investigate name duplicates manually (might be family members)
"""
# Tip 1: Create a duplicate analysis function
def analyze_duplicates(df, subset_columns=None):
    """Analyze duplicate patterns in the dataset"""
    if subset_columns:
        dups = df[df.duplicated(subset=subset_columns, keep=False)]
        print(f"Duplicates based on {subset_columns}:")
    else:
        dups = df[df.duplicated(keep=False)]
        print("Complete duplicates:")
    print(f"  Total duplicate rows: {len(dups)}")
    print(f"  Unique duplicate sets: {dups.duplicated(subset=subset_columns).sum()}")
    if len(dups) > 0:
        print("\nExample duplicates:")
        print(dups.head().to_string())
    return dups

# Tip 2: Use duplicates for data quality assessment
dup_analysis = analyze_duplicates(df, subset_columns=['name'])

# Tip 3: Create a duplicate removal report
def remove_duplicates_with_report(df, **kwargs):
    """Remove duplicates and generate detailed report"""
    before_count = len(df)
    before_duplicates = df.duplicated().sum()
    df_clean = remove_duplicates(df, **kwargs)
    after_count = len(df_clean)
    removed = before_count - after_count
    report = {
        'rows_before': before_count,
        'rows_after': after_count,
        'duplicates_found': before_duplicates,
        'rows_removed': removed,
        'percentage_removed': (removed / before_count * 100) if before_count > 0 else 0
    }
    print("\nDuplicate Removal Report:")
    for key, value in report.items():
        print(f"  {key}: {value}")
    return df_clean, report

# Advanced: Find columns with high duplicate rates
def find_high_duplicate_columns(df, threshold=0.5):
    """Find columns where >threshold values are duplicates"""
    results = {}
    for col in df.columns:
        unique_ratio = df[col].nunique() / len(df)
        if unique_ratio < threshold:
            results[col] = {
                'unique_values': df[col].nunique(),
                'total_values': len(df),
                'duplicate_rate': 1 - unique_ratio
            }
    return pd.DataFrame(results).T

print("Columns with high duplicate rates:")
print(find_high_duplicate_columns(df))
# DON'T fill all columns with the same method
# for col in df.columns:
#     df = fill_missing_values(df, col, method='mean')  # BAD!

# DON'T fill missing values without understanding WHY they're missing
# Missing at random? Missing not at random? Systematic bias?

# DON'T use mean for skewed data
skewed_data = df['fare']
print(f"Fare - Mean: {skewed_data.mean():.2f}, Median: {skewed_data.median():.2f}")
# If these differ significantly, use median!

# DO analyze missing data patterns first
def missing_data_report(df):
    """Generate comprehensive missing data report"""
    report = []
    for col in df.columns:
        missing_count = df[col].isnull().sum()
        if missing_count > 0:
            missing_pct = (missing_count / len(df)) * 100
            dtype = df[col].dtype
            # Recommended strategy
            if pd.api.types.is_numeric_dtype(df[col]):
                recommended = 'median (or mean if normally distributed)'
            else:
                recommended = 'mode (or custom value with domain knowledge)'
            report.append({
                'Column': col,
                'Missing': missing_count,
                'Percent': f'{missing_pct:.1f}%',
                'Type': dtype,
                'Recommended': recommended
            })
    return pd.DataFrame(report)

print("\nMissing Data Strategy Guide:")
print(missing_data_report(df).to_string(index=False))

# DO consider dropping columns with >50% missing data
high_missing = [col for col in df.columns if df[col].isnull().sum() / len(df) > 0.5]
if high_missing:
    print(f"\nColumns with >50% missing data (consider dropping): {high_missing}")
# Tip 1: Create a flexible filling function
def smart_fill_missing(df, column, strategy='auto'):
    """Intelligently fill missing values based on data type and distribution"""
    if strategy == 'auto':
        if pd.api.types.is_numeric_dtype(df[column]):
            # Check for skewness
            skewness = df[column].skew()
            if abs(skewness) > 1:  # Highly skewed
                method = 'median'
                print(f"  {column}: Using median (skewed data)")
            else:
                method = 'mean'
                print(f"  {column}: Using mean (normally distributed)")
        else:
            method = 'mode'
            print(f"  {column}: Using mode (categorical)")
    else:
        method = strategy
    return fill_missing_values(df, column, method=method)

# Tip 2: Track all imputation decisions
imputation_log = {
    'age': {'method': 'median', 'value': df['age'].median(), 'reason': 'Robust to outliers'},
    'cabin': {'method': 'mode', 'value': df['cabin'].mode()[0], 'reason': 'Most common cabin'},
    'embarked': {'method': 'custom', 'value': 'Unknown', 'reason': 'Domain knowledge'}
}
print("\nImputation Log:")
for col, info in imputation_log.items():
    print(f"  {col}: {info['method']} = {info['value']} ({info['reason']})")

# Tip 3: Create indicator variables for missing data
# Sometimes "missingness" is informative!
df['cabin_was_missing'] = df['cabin'].isnull().astype(int)
df['age_was_missing'] = df['age'].isnull().astype(int)
print("\nMissing data indicators created - useful for ML models!")
# Compare statistics before and after imputation
def compare_imputation_impact(df_before, df_after, column):
    """Compare statistics before and after imputation"""
    before_stats = df_before[column].describe()
    after_stats = df_after[column].describe()
    comparison = pd.DataFrame({
        'Before': before_stats,
        'After': after_stats,
        'Change': after_stats - before_stats
    })
    print(f"\nImputation Impact on '{column}':")
    print(comparison)
    return comparison

# Example usage:
# compare_imputation_impact(df_original, df, 'age')
# Get summary as a dictionary for programmatic use
summary = get_data_summary(df)

# Access specific information
print(f"Dataset has {summary['total_rows']} rows")
print(f"Memory usage: {summary['memory_usage_mb']:.3f} MB")

# Find columns with high missing data
high_missing = {
    col: pct
    for col, pct in summary['missing_percentage'].items()
    if pct > 20
}
print(f"\nColumns with >20% missing data: {high_missing}")
def missing_data_dashboard(df):
    """Create a comprehensive missing data visualization"""
    import matplotlib.pyplot as plt

    summary = get_data_summary(df)
    missing_pct = summary['missing_percentage']

    # Filter columns with missing data
    cols_with_missing = {k: v for k, v in missing_pct.items() if v > 0}
    if not cols_with_missing:
        print("No missing data found!")
        return

    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Bar chart
    cols = list(cols_with_missing.keys())
    values = list(cols_with_missing.values())
    axes[0].barh(cols, values, color='coral')
    axes[0].set_xlabel('Missing Data (%)')
    axes[0].set_title('Missing Data by Column')
    axes[0].axvline(50, color='red', linestyle='--', label='50% threshold')
    axes[0].legend()

    # Heatmap-style visualization
    missing_matrix = df[cols].isnull().astype(int)
    axes[1].imshow(missing_matrix.T, cmap='RdYlGn_r', aspect='auto')
    axes[1].set_yticks(range(len(cols)))
    axes[1].set_yticklabels(cols)
    axes[1].set_xlabel('Row Index')
    axes[1].set_title('Missing Data Pattern')

    plt.tight_layout()
    plt.savefig('missing_data_dashboard.png', dpi=300, bbox_inches='tight')
    print("Saved missing data dashboard")

    # Print summary
    print("\nMissing Data Summary:")
    print(f"  Columns with missing data: {len(cols_with_missing)}")
    print(f"  Highest missing %: {max(values):.1f}% ({cols[values.index(max(values))]})")
    print(f"  Average missing %: {np.mean(values):.1f}%")

# Usage:
# missing_data_dashboard(df)
# Numeric summary contains these statistics:
summary = get_data_summary(df)
if 'numeric_summary' in summary:
    age_stats = summary['numeric_summary']['age']
    print("\nUnderstanding Numeric Statistics:")
    print(f"  count: Number of non-missing values ({age_stats['count']})")
    print(f"  mean:  Average value ({age_stats['mean']:.2f})")
    print(f"  std:   Standard deviation ({age_stats['std']:.2f})")
    print(f"  min:   Minimum value ({age_stats['min']:.2f})")
    print(f"  25%:   First quartile (Q1) ({age_stats['25%']:.2f})")
    print(f"  50%:   Median (Q2) ({age_stats['50%']:.2f})")
    print(f"  75%:   Third quartile (Q3) ({age_stats['75%']:.2f})")
    print(f"  max:   Maximum value ({age_stats['max']:.2f})")
# Tip 1: Create a data quality score
def calculate_data_quality_score(df):
    """Calculate overall data quality score (0-100)"""
    summary = get_data_summary(df)

    # Factors affecting quality
    total_cells = summary['total_rows'] * summary['total_columns']
    missing_cells = sum(summary['missing_values'].values())
    missing_ratio = missing_cells / total_cells if total_cells > 0 else 0
    duplicate_ratio = summary['duplicates'] / summary['total_rows'] if summary['total_rows'] > 0 else 0

    # Calculate score (100 = perfect)
    quality_score = 100 * (1 - missing_ratio) * (1 - duplicate_ratio)

    print(f"\nData Quality Score: {quality_score:.1f}/100")
    print(f"  Missing data penalty: {missing_ratio * 100:.1f}%")
    print(f"  Duplicate penalty: {duplicate_ratio * 100:.1f}%")

    if quality_score >= 90:
        grade = "A - Excellent"
    elif quality_score >= 80:
        grade = "B - Good"
    elif quality_score >= 70:
        grade = "C - Fair"
    elif quality_score >= 60:
        grade = "D - Poor"
    else:
        grade = "F - Needs Major Cleaning"
    print(f"  Grade: {grade}")
    return quality_score

# Tip 2: Export summary to file
def export_summary_report(df, filename='data_summary_report.txt'):
    """Export data summary to text file"""
    import sys
    from io import StringIO

    # Capture printed output
    old_stdout = sys.stdout
    sys.stdout = captured_output = StringIO()
    print_data_summary(df)
    summary_text = captured_output.getvalue()
    sys.stdout = old_stdout

    # Write to file
    with open(filename, 'w') as f:
        f.write(summary_text)
    print(f"Summary exported to {filename}")

# Tip 3: Column-specific deep dive
def deep_dive_column(df, column):
    """Detailed analysis of a specific column"""
    print(f"\nDeep Dive: '{column}'")
    print("=" * 60)

    # Basic info
    print(f"Data type: {df[column].dtype}")
    print(f"Total values: {len(df[column])}")
    print(f"Missing values: {df[column].isna().sum()} ({df[column].isna().sum()/len(df)*100:.1f}%)")
    print(f"Unique values: {df[column].nunique()}")

    # Type-specific analysis
    if pd.api.types.is_numeric_dtype(df[column]):
        print("\nNumeric Statistics:")
        print(df[column].describe())
        # Check for outliers
        outliers = detect_outliers_iqr(df, column, factor=1.5)
        print(f"\nPotential outliers: {outliers.sum()}")
    else:  # Categorical/Text
        print("\nValue Counts:")
        print(df[column].value_counts().head(10))
        if df[column].nunique() < 20:
            print("\nAll unique values:")
            print(df[column].unique())

# Tip 4: Generate HTML report
def generate_html_report(df, filename='data_report.html'):
    """Generate an HTML report with styling"""
    summary = get_data_summary(df)
    html = f"""
    <html>
    <head>
      <title>Data Cleaning Report</title>
      <style>
        body {{ font-family: Arial, sans-serif; margin: 20px; }}
        h1 {{ color: #2c3e50; }}
        h2 {{ color: #34495e; }}
        table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
        th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
        th {{ background-color: #3498db; color: white; }}
        tr:nth-child(even) {{ background-color: #f2f2f2; }}
        .metric {{ display: inline-block; margin: 10px; padding: 15px;
                   background: #ecf0f1; border-radius: 5px; }}
      </style>
    </head>
    <body>
      <h1>Data Cleaning Report</h1>
      <h2>Overview</h2>
      <div class="metric"><strong>Rows:</strong> {summary['total_rows']}</div>
      <div class="metric"><strong>Columns:</strong> {summary['total_columns']}</div>
      <div class="metric"><strong>Duplicates:</strong> {summary['duplicates']}</div>
      <div class="metric"><strong>Memory:</strong> {summary['memory_usage_mb']:.2f} MB</div>
      <h2>Column Details</h2>
      <table>
        <tr>
          <th>Column</th>
          <th>Data Type</th>
          <th>Missing Count</th>
          <th>Missing %</th>
        </tr>
    """
    for col in summary['columns']:
        html += f"""
        <tr>
          <td>{col}</td>
          <td>{summary['dtypes'][col]}</td>
          <td>{summary['missing_values'][col]}</td>
          <td>{summary['missing_percentage'][col]:.1f}%</td>
        </tr>
        """
    html += """
      </table>
    </body>
    </html>
    """
    with open(filename, 'w') as f:
        f.write(html)
    print(f"HTML report saved to {filename}")

# Usage examples:
# calculate_data_quality_score(df)
# export_summary_report(df)
# deep_dive_column(df, 'age')
# generate_html_report(df)
# DON'T use quick_clean for production without testing
# df = quick_clean(df)  # What if something goes wrong?

# DON'T use quick_clean when you need custom parameters
# quick_clean() uses default median/mode/lowercase
# But maybe you need mean, or want to preserve case!

# DON'T use quick_clean on data with complex requirements
# Complex business logic? Use individual functions!

# DO use quick_clean for initial exploration
print("Original data:")
print_data_summary(df)
df_quick = quick_clean(df.copy())
print("\nAfter quick clean:")
print_data_summary(df_quick)

# DO create a custom quick clean for your domain
def quick_clean_medical(df):
    """Custom quick clean for medical data"""
    print("Starting medical data quick clean...")
    df = clean_column_headers(df)
    df = remove_duplicates(df)
    # Medical-specific cleaning
    for col in ['patient_id', 'age', 'weight', 'height']:
        if col in df.columns:
            df = clean_numeric_column(df, col, fill_method='median')
    for col in ['diagnosis', 'medication', 'doctor']:
        if col in df.columns:
            df = clean_text_column(df, col, lowercase=False)  # Keep case!
    print("Medical quick clean completed!")
    return df

# DO validate results after quick clean
def validate_quick_clean(df_before, df_after):
    """Validate quick clean results"""
    issues = []
    # Check for unexpected data loss
    rows_lost = len(df_before) - len(df_after)
    if rows_lost > len(df_before) * 0.1:  # Lost >10% of rows
        issues.append(f"Lost {rows_lost} rows ({rows_lost/len(df_before)*100:.1f}%)")
    # Check for unexpected column removal
    cols_lost = set(df_before.columns) - set(df_after.columns)
    if cols_lost:
        issues.append(f"Lost columns: {cols_lost}")
    # Check for data type changes
    for col in df_before.columns:
        if col in df_after.columns:
            if df_before[col].dtype != df_after[col].dtype:
                issues.append(f"{col}: {df_before[col].dtype} → {df_after[col].dtype}")
    if issues:
        print("\nValidation Issues:")
        for issue in issues:
            print(f"  {issue}")
    else:
        print("\nValidation passed!")
    return len(issues) == 0
"""Complete Titanic Data Cleaning Workflow========================================This example demonstrates a full professional data cleaning pipeline."""import pandas as pd
import numpy as np
from pandas_data_cleaning import *import matplotlib.pyplot as plt
# ============================================================================# STEP 1: LOAD AND INSPECT DATA# ============================================================================print("="*80)
print("STEP 1: LOAD AND INSPECT DATA")
print("="*80)
# Load the dirty datadf = pd.read_csv('titanic_dirty.csv')
# Initial inspectionprint("\nπ Initial Data Inspection:")
print_data_summary(df)
# Calculate initial data quality scoreinitial_quality = calculate_data_quality_score(df)
# Save original for comparisondf_original = df.copy()
# Initialize progress trackertracker = DataCleaningTracker(df, "Titanic Dataset")
# ============================================================================# STEP 2: CLEAN COLUMN HEADERS# ============================================================================print("\n" + "="*80)
print("STEP 2: CLEAN COLUMN HEADERS")
print("="*80)
# Show original column namesprint("\nOriginal columns:", df.columns.tolist())
# Clean headersdf = clean_column_headers(df)
# Show cleaned column namesprint("Cleaned columns:", df.columns.tolist())
tracker.add_checkpoint(df, "Cleaned column headers")
# ============================================================================# STEP 3: HANDLE DUPLICATES# ============================================================================print("\n" + "="*80)
print("STEP 3: HANDLE DUPLICATES")
print("="*80)
# Analyze duplicates before removingprint("\nπ Analyzing duplicates...")
duplicates = df[df.duplicated(keep=False)]
if len(duplicates) > 0:
print(f"Found {len(duplicates)} duplicate entries")
print(duplicates[['passenger_id', 'name', 'age']].sort_values('name'))
# Remove duplicatesdf = remove_duplicates(df)
tracker.add_checkpoint(df, "Removed duplicates")
# ============================================================================
# STEP 4: CLEAN NUMERIC COLUMNS
# ============================================================================
print("\n" + "="*80)
print("STEP 4: CLEAN NUMERIC COLUMNS")
print("="*80)

# Clean age column
print("\nCleaning 'age' column...")
print(f"Before: {df['age'].dtype}, Missing: {df['age'].isna().sum()}")
df = clean_numeric_column(
    df,
    column='age',
    fill_method='median',
    remove_negatives=False,
    round_decimals=0
)
print(f"After: {df['age'].dtype}, Missing: {df['age'].isna().sum()}")
print(f"Age range: {df['age'].min():.0f} to {df['age'].max():.0f}")

# Clean fare column
print("\nCleaning 'fare' column...")
print(f"Before: {df['fare'].dtype}, Missing: {df['fare'].isna().sum()}")
df = clean_numeric_column(
    df,
    column='fare',
    fill_method='median',
    remove_negatives=True,
    round_decimals=2
)
print(f"After: {df['fare'].dtype}, Missing: {df['fare'].isna().sum()}")
print(f"Fare range: £{df['fare'].min():.2f} to £{df['fare'].max():.2f}")

tracker.add_checkpoint(df, "Cleaned numeric columns")

# ============================================================================
# STEP 5: HANDLE OUTLIERS
# ============================================================================
print("\n" + "="*80)
print("STEP 5: HANDLE OUTLIERS")
print("="*80)

# Detect outliers in fare
print("\nAnalyzing outliers in 'fare'...")
outliers = detect_outliers_iqr(df, 'fare', factor=1.5)
print(f"Outliers detected: {outliers.sum()} ({outliers.sum()/len(df)*100:.1f}%)")

if outliers.sum() > 0:
    print("\nOutlier fares:")
    print(df[outliers][['name', 'fare', 'pclass']].sort_values('fare', ascending=False))

    # For this example, we'll cap outliers instead of removing
    # (preserving passenger records is important!)
    Q1 = df['fare'].quantile(0.25)
    Q3 = df['fare'].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    df['fare'] = df['fare'].clip(upper=upper_bound)
    print(f"\nCapped {outliers.sum()} outlier fares at £{upper_bound:.2f}")

tracker.add_checkpoint(df, "Handled outliers")
# ============================================================================
# STEP 6: CLEAN TEXT COLUMNS
# ============================================================================
print("\n" + "="*80)
print("STEP 6: CLEAN TEXT COLUMNS")
print("="*80)

# Clean name column
print("\nCleaning 'name' column...")
print(f"Sample before: {df['name'].head(3).tolist()}")
df = clean_text_column(
    df,
    column='name',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)
print(f"Sample after: {df['name'].head(3).tolist()}")

# Clean cabin column
print("\nCleaning 'cabin' column...")
print(f"Unique values before: {df['cabin'].nunique()}")
df = clean_text_column(
    df,
    column='cabin',
    lowercase=False,   # Keep cabin codes in original case
    remove_extra_spaces=True,
    standardize_missing=True
)
print(f"Unique values after: {df['cabin'].nunique()}")

# Clean embarked column
print("\nCleaning 'embarked' column...")
print(f"Values before: {df['embarked'].unique()}")
df = clean_text_column(
    df,
    column='embarked',
    lowercase=True,
    remove_extra_spaces=True,
    standardize_missing=True
)
print(f"Values after: {df['embarked'].unique()}")

tracker.add_checkpoint(df, "Cleaned text columns")

# ============================================================================
# STEP 7: HANDLE REMAINING MISSING VALUES
# ============================================================================
print("\n" + "="*80)
print("STEP 7: HANDLE REMAINING MISSING VALUES")
print("="*80)

# Check remaining missing values
missing_summary = df.isnull().sum()
missing_summary = missing_summary[missing_summary > 0]

if len(missing_summary) > 0:
    print("\nRemaining missing values:")
    for col, count in missing_summary.items():
        pct = count / len(df) * 100
        print(f"  {col}: {count} ({pct:.1f}%)")

    # Fill cabin with 'Unknown' (many passengers didn't have cabin assignments)
    if 'cabin' in missing_summary.index:
        df = fill_missing_values(df, 'cabin', method='custom', custom_value='Unknown')

    # Fill embarked with mode
    if 'embarked' in missing_summary.index:
        df = fill_missing_values(df, 'embarked', method='mode')

    # Fill survived with mode (if missing)
    if 'survived' in missing_summary.index:
        df = fill_missing_values(df, 'survived', method='mode')
else:
    print("\nNo missing values remaining!")

tracker.add_checkpoint(df, "Filled remaining missing values")
# ============================================================================
# STEP 8: FINAL VALIDATION AND REPORT
# ============================================================================
print("\n" + "="*80)
print("STEP 8: FINAL VALIDATION AND REPORT")
print("="*80)

# Final data summary
print("\nFinal Data Summary:")
print_data_summary(df)

# Calculate final data quality score
final_quality = calculate_data_quality_score(df)

# Show improvement
print("\nQuality Improvement:")
print(f"  Initial score: {initial_quality:.1f}/100")
print(f"  Final score: {final_quality:.1f}/100")
print(f"  Improvement: +{final_quality - initial_quality:.1f} points")

# Show cleaning progress
tracker.print_progress()

# Compare before and after
print("\n" + "="*80)
print("BEFORE vs AFTER COMPARISON")
print("="*80)
compare_datasets(df_original, df, "Original", "Cleaned")

# ============================================================================
# STEP 9: SAVE CLEANED DATA
# ============================================================================
print("\n" + "="*80)
print("STEP 9: SAVE CLEANED DATA")
print("="*80)

# Save to CSV
df.to_csv('titanic_cleaned.csv', index=False)
print("\nSaved cleaned data to 'titanic_cleaned.csv'")

# Save cleaning report
with open('cleaning_report.txt', 'w') as f:
    f.write("TITANIC DATASET CLEANING REPORT\n")
    f.write("="*80 + "\n\n")
    f.write(f"Initial rows: {len(df_original)}\n")
    f.write(f"Final rows: {len(df)}\n")
    f.write(f"Rows removed: {len(df_original) - len(df)}\n\n")
    f.write(f"Initial missing values: {df_original.isnull().sum().sum()}\n")
    f.write(f"Final missing values: {df.isnull().sum().sum()}\n\n")
    f.write(f"Initial duplicates: {df_original.duplicated().sum()}\n")
    f.write(f"Final duplicates: {df.duplicated().sum()}\n\n")
    f.write(f"Quality score: {initial_quality:.1f} → {final_quality:.1f}\n")

print("Saved cleaning report to 'cleaning_report.txt'")
print("\nDATA CLEANING COMPLETE!")
print("="*80)
================================================================================
TITANIC DATASET CLEANING COMPLETE!
================================================================================
Summary of Changes:
• Column headers: Standardized to lowercase with underscores
• Duplicates: Removed 2 duplicate rows
• Numeric columns: Cleaned age, fare (filled missing, removed problematic values)
• Outliers: Capped extreme fare values
• Text columns: Standardized name, cabin, embarked
• Missing values: Filled using appropriate methods
• Quality score: Improved from 67.3 to 94.8
Files Created:
- titanic_cleaned.csv - Clean dataset ready for analysis
- cleaning_report.txt - Detailed cleaning documentation
- Multiple visualization files showing before/after comparisons
Next Steps:
1. Review the cleaned data: pd.read_csv('titanic_cleaned.csv')
2. Proceed with exploratory data analysis (EDA)
3. Build predictive models
4. Generate insights and visualizations
# Create a cleaning log
cleaning_log = {
    'age': {
        'action': 'filled_missing',
        'method': 'median',
        'value': 28.0,
        'reason': 'Median is robust to outliers in age distribution',
        'affected_rows': 177
    },
    'cabin': {
        'action': 'filled_missing',
        'method': 'custom',
        'value': 'Unknown',
        'reason': 'Many passengers did not have cabin assignments',
        'affected_rows': 687
    }
}

# Save to file
import json
with open('cleaning_decisions.json', 'w') as f:
    json.dump(cleaning_log, f, indent=2)
def validate_cleaning_step(df, step_name):
    """Validate data after each cleaning step"""
    issues = []
    # Check for NaNs in unexpected places
    if df.isnull().all().any():
        cols = df.columns[df.isnull().all()].tolist()
        issues.append(f"Columns with all NaNs: {cols}")
    # Check for empty dataframe
    if len(df) == 0:
        issues.append("DataFrame is empty!")
    # Check for duplicate columns
    if len(df.columns) != len(set(df.columns)):
        issues.append("Duplicate column names detected")
    if issues:
        print(f"\nIssues after '{step_name}':")
        for issue in issues:
            print(f"  - {issue}")
        return False
    else:
        print(f"Validation passed for '{step_name}'")
        return True

# Usage:
df = clean_column_headers(df)
validate_cleaning_step(df, "Header cleaning")
from typing import Dict, List, Optional

def custom_cleaning_function(
    df: pd.DataFrame,
    numeric_cols: List[str],
    text_cols: List[str],
    config: Optional[Dict] = None
) -> pd.DataFrame:
    """
    Custom cleaning function with clear documentation.

    Args:
        df: Input DataFrame to clean
        numeric_cols: List of numeric column names
        text_cols: List of text column names
        config: Optional configuration dictionary

    Returns:
        Cleaned DataFrame

    Example:
        >>> df_clean = custom_cleaning_function(
        ...     df,
        ...     numeric_cols=['age', 'fare'],
        ...     text_cols=['name', 'cabin']
        ... )
    """
    # Implementation here
    pass
def clean_financial_data(df):
    """Best practices for cleaning financial data"""
    # 1. Preserve precision
    for col in ['price', 'revenue', 'profit']:
        if col in df.columns:
            df = clean_numeric_column(df, col, round_decimals=4)

    # 2. Handle negative values carefully (they might be valid!)
    # Don't use remove_negatives=True for profit/loss columns

    # 3. Check for currency consistency
    # Make sure all values are in the same currency

    # 4. Handle missing trading days (use forward fill)
    if 'date' in df.columns:
        df = df.sort_values('date')
        for col in df.select_dtypes(include=[np.number]).columns:
            df[col] = df[col].ffill()
    return df
def clean_medical_data(df):
    """Best practices for cleaning medical data"""
    # 1. Preserve case in medical codes (ICD, CPT)
    for col in ['icd_code', 'cpt_code', 'diagnosis']:
        if col in df.columns:
            df = clean_text_column(df, col, lowercase=False)

    # 2. Handle sensitive data carefully
    # Don't log or display patient identifiable information

    # 3. Be conservative with outliers
    # Extreme values might be real medical conditions, not errors
    # Use factor=3.0 instead of 1.5 for outlier detection

    # 4. Standardize date formats
    # Medical data often has multiple date columns
    date_cols = ['admission_date', 'discharge_date', 'birth_date']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
    return df
def clean_ecommerce_data(df):
    """Best practices for cleaning e-commerce data"""
    # 1. Standardize product names
    if 'product_name' in df.columns:
        df = clean_text_column(df, 'product_name', lowercase=True)

    # 2. Clean price data
    if 'price' in df.columns:
        df = clean_numeric_column(df, 'price', fill_method='median',
                                  remove_negatives=True, round_decimals=2)

    # 3. Handle quantity (must be positive integer)
    if 'quantity' in df.columns:
        df = clean_numeric_column(df, 'quantity', fill_method='zero',
                                  remove_negatives=True, round_decimals=0)

    # 4. Standardize categories
    if 'category' in df.columns:
        df['category'] = df['category'].str.lower().str.strip()
        df['category'] = df['category'].str.replace('_', ' ')
    return df
# 1. Use appropriate data types
def optimize_dtypes(df):
    """Reduce memory usage by optimizing data types"""
    # Convert integer columns to smallest possible int type
    for col in df.select_dtypes(include=['int64']).columns:
        if df[col].min() >= 0:
            df[col] = df[col].astype('uint32')  # Unsigned if no negatives
        else:
            df[col] = df[col].astype('int32')

    # Convert float columns to float32
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype('float32')

    # Convert object columns with few unique values to category
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')
    return df
# 2. Process in chunks for very large files
def clean_large_csv(filename, chunksize=10000):
    """Clean large CSV file in chunks"""
    cleaned_chunks = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        chunk = clean_column_headers(chunk)
        chunk = remove_duplicates(chunk)
        # Add other cleaning steps...
        cleaned_chunks.append(chunk)
    df_clean = pd.concat(cleaned_chunks, ignore_index=True)
    return df_clean