Understanding clean_column_headers()
This function standardizes messy column names in a pandas DataFrame, making them clean, consistent, and code-friendly. It's perfect for real-world data where columns often have inconsistent formatting.
What It Does (Step-by-Step)
1. Converts everything to strings - Handles any non-string column names
2. Strips whitespace - Removes leading/trailing spaces
3. Converts to lowercase - "Name" → "name"
4. Replaces spaces with underscores - "User Name" → "user_name"
5. Removes special characters - "Age (years)" → "age_years"
6. Collapses multiple underscores - "first___name" → "first_name"
7. Removes trailing underscores - "price_" → "price"
8. Handles duplicates - If two columns end up with the same name, adds numeric suffixes: name, name_1, name_2
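The function body itself is not shown here, so the following is only a minimal sketch of how the eight steps could be implemented, not the original code. The helper name clean_names, the ASCII-only character filter in step 5, and the choice to return a copy rather than modify the DataFrame in place are all assumptions made for illustration.

```python
import re


def clean_names(names):
    """Return cleaned, de-duplicated versions of the given column names.

    Hypothetical helper: the name and exact rules are assumptions.
    """
    cleaned = []
    for name in names:
        text = str(name).strip().lower()         # steps 1-3: str, strip, lowercase
        text = text.replace(" ", "_")            # step 4: spaces -> underscores
        text = re.sub(r"[^a-z0-9_]", "", text)   # step 5: keep ASCII letters, digits, "_" (assumption)
        text = re.sub(r"_+", "_", text)          # step 6: collapse repeated underscores
        text = text.rstrip("_")                  # step 7: drop trailing underscores
        cleaned.append(text)

    # Step 8: append _1, _2, ... until each name is unique.
    used = set()
    result = []
    for text in cleaned:
        candidate, n = text, 0
        while candidate in used:
            n += 1
            candidate = f"{text}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


def clean_column_headers(df):
    """Return a copy of df with cleaned column names (copying is an assumption)."""
    out = df.copy()
    out.columns = clean_names(out.columns)
    return out
```

With this sketch, clean_names(['Name ', 'Age (years)']) gives ['name', 'age_years']. A name made up entirely of special characters would become an empty string, which this sketch does not handle.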
How to Use It
Basic Usage:
import pandas as pd
# Messy column names (common in real data)
df = pd.DataFrame({
    'Name ': ['Alice', 'Bob'],
    'Age (years)': [25, 30],
    'E-mail Address!!': ['alice@email.com', 'bob@email.com'],
    ' Salary ': [50000, 60000]
})
print("Before:", df.columns.tolist())
# Before: ['Name ', 'Age (years)', 'E-mail Address!!', ' Salary ']
df_clean = clean_column_headers(df)
print("After:", df_clean.columns.tolist())
# After: ['name', 'age_years', 'email_address', 'salary']
Handling Duplicates:
# A dict cannot hold the same key twice, so pass the column names separately
df = pd.DataFrame(
    [['Alice', 'Bob', 'Charlie']],
    columns=['Name', 'Name', 'Name ']  # two exact duplicates, plus one that also becomes 'name'
)
df_clean = clean_column_headers(df)
print(df_clean.columns.tolist())
# Output: ['name', 'name_1', 'name_2']
Why This Matters
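✅ Code-friendly names - Can access columns with df.column_name instead of df['Column Name!!'], as long as the cleaned name is a valid Python identifier and does not clash with a DataFrame method such as count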
✅ Consistency - All columns follow same naming convention
✅ SQL-friendly - Lowercase names made of letters, digits, and underscores rarely need quoting when exported to a database
✅ No surprises - Removes hidden spaces and special characters that cause bugs
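Attribute-style access only works when a column name is a valid Python identifier, so cleaned names that start with a digit or match a reserved word still need bracket access. The snippet below is a small sketch of that check using only the standard library. The helper name is_attribute_safe is hypothetical, and it does not detect clashes with DataFrame method names.

```python
import keyword


def is_attribute_safe(name):
    """True if `name` is a valid Python identifier and not a reserved keyword.

    Hypothetical helper; it does not check for clashes with DataFrame methods.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


for name in ["user_name", "Age (years)", "2023_sales", "class"]:
    print(name, is_attribute_safe(name))
# user_name True
# Age (years) False
# 2023_sales False
# class False
```

Names that fail the check are still usable with df['...'], so this only affects the shorthand attribute syntax.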
Common Use Case
This is typically your first step in any data cleaning pipeline:
# Start of your cleaning workflow
df = pd.read_csv('messy_data.csv')
df = clean_column_headers(df) # ← Clean headers first
# Now continue with other cleaning...
A good fit for a portfolio because it solves a real, common problem that data scientists run into routinely: inconsistent column names in raw data.