clean_column_headers()

Understanding `clean_column_headers()`

This function standardizes messy column names in a pandas DataFrame so that they are clean, consistent, and code-friendly. It is useful for real-world data, where column names often have inconsistent casing, spacing, and punctuation.

What It Does (Step-by-Step)

1. Converts everything to strings - Handles any non-string column names
2. Strips whitespace - Removes leading/trailing spaces
3. Converts to lowercase - "Name" → "name"
4. Replaces spaces with underscores - "User Name" → "user_name"
5. Removes special characters - "Age (years)" → "age_years"
6. Collapses multiple underscores - "first___name" → "first_name"
7. Removes trailing underscores - "price_" → "price"
8. Handles duplicates - If two or more columns end up with the same name, adds numeric suffixes: name, name_1, name_2
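The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual body of `clean_column_headers()`: the helper name `clean_header_names` is hypothetical, it works on a list of names rather than on a DataFrame, and the exact regular expressions are a guess at behavior that matches the examples in this document.

```python
import re


def clean_header_names(names):
    """Hypothetical helper: return cleaned, de-duplicated copies of `names`."""
    cleaned = []
    for name in names:
        s = str(name).strip().lower()      # steps 1-3: string, strip, lowercase
        s = re.sub(r"\s+", "_", s)         # step 4: whitespace runs -> underscore
        s = re.sub(r"[^a-z0-9_]", "", s)   # step 5: drop other characters (assumed ASCII-only)
        s = re.sub(r"_+", "_", s)          # step 6: collapse repeated underscores
        s = s.rstrip("_")                  # step 7: remove trailing underscores
        cleaned.append(s)

    # Step 8: the first occurrence keeps its name, later ones get _1, _2, ...
    seen = {}
    result = []
    for s in cleaned:
        if s in seen:
            seen[s] += 1
            result.append(f"{s}_{seen[s]}")
        else:
            seen[s] = 0
            result.append(s)
    return result


print(clean_header_names(["Name ", "Age (years)", "E-mail Address!!", "  Salary  "]))
print(clean_header_names(["Name", "Name", "Name "]))
```

Assigning the result back with `df.columns = clean_header_names(df.columns)` would apply the same cleaning to a DataFrame. One gap in this sketch: a generated suffix such as `name_1` can still collide with a column that was already called `name_1`, so a production version would need an extra check.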

How to Use It

Basic Usage:

import pandas as pd

# Messy column names (common in real data)
df = pd.DataFrame({
    'Name ': ['Alice', 'Bob'],
    'Age (years)': [25, 30],
    'E-mail Address!!': ['alice@email.com', 'bob@email.com'],
    '  Salary  ': [50000, 60000]
})

print("Before:", df.columns.tolist())
# Before: ['Name ', 'Age (years)', 'E-mail Address!!', '  Salary  ']

df_clean = clean_column_headers(df)

print("After:", df_clean.columns.tolist())
# After: ['name', 'age_years', 'email_address', 'salary']

Handling Duplicates:

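# A dict literal cannot hold duplicate keys (a second 'Name' would silently
# overwrite the first), so pass the column names explicitly instead.
df = pd.DataFrame(
    [['Alice', 'Bob', 'Charlie']],
    columns=['Name', 'Name', 'Name ']  # all three become 'name' after cleaning
)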

df_clean = clean_column_headers(df)
print(df_clean.columns.tolist())
# Output: ['name', 'name_1', 'name_2']

Why This Matters

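✅ Code-friendly names - Can access columns with df.column_name instead of df['Column Name!!'], as long as the cleaned name is a valid Python identifier and does not clash with an existing DataFrame attribute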

✅ Consistency - All columns follow the same naming convention

✅ SQL-compatible - Clean names work well when exporting to databases

✅ No surprises - Removes hidden spaces and special characters that cause bugs, such as a KeyError from df['Name'] when the header is actually 'Name '

Common Use Case

This is typically your first step in any data cleaning pipeline:

# Start of your cleaning workflow
df = pd.read_csv('messy_data.csv')
df = clean_column_headers(df)  # ← Clean headers first
# Now continue with other cleaning...