Chi-Square Test (χ²) for Feature Selection - Complete Guide

| Best Use Cases | Feature Types | Target Type |
| --- | --- | --- |
| Fraud Detection, Customer Segmentation | Categorical | Categorical |
📋 Table of Contents
- What is Chi-Square Test?
- When to Use It
- Mathematical Intuition
- Sklearn Implementation
- Step-by-Step Workflow
- Practical Examples
- Common Pitfalls
What is Chi-Square Test?
The Chi-Square test checks whether two categorical variables are independent. In feature selection, it tests whether a categorical feature and the target variable are independent of each other or related.
- High χ² score → Strong dependency → Feature is important
- Low χ² score → Weak/no dependency → Feature is less important
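To see both cases at once, here is a minimal sketch on a made-up toy dataset (my own addition, not from the guide): a feature whose counts track the target gets a large score, while an unrelated one scores near zero.
import numpy as np
from sklearn.feature_selection import chi2

# Toy counts: column 0 tracks the target, column 1 is unrelated noise
X = np.array([[3, 1],
              [4, 0],
              [3, 1],
              [0, 0],
              [1, 1],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

scores, p_values = chi2(X, y)
print(scores)    # column 0 scores far higher (~7.4 vs ~0.3)
print(p_values)  # and has a much smaller p-value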
When to Use It
✅ Use Chi-Square When:
- Both feature AND target are CATEGORICAL (or binned numerical data)
- You have discrete/nominal variables (e.g., color, gender, category)
- You want to rank features by their relationship with the target
- Working with classification problems
❌ Don't Use Chi-Square When:
- Features are continuous numerical (use ANOVA F-test instead)
- Target is continuous (use correlation or mutual information)
- You have negative values (the sklearn chi2 scorer requires non-negative values such as counts or frequencies)
- Sample size is very small (an expected count below 5 in some contingency-table cell)
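For the continuous cases ruled out above, sklearn ships drop-in alternatives with the same calling convention; a quick sketch, using the Iris data purely for illustration:
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)  # continuous features, categorical target

# ANOVA F-test: continuous features vs. categorical target
f_scores, f_pvalues = f_classif(X, y)

# Mutual information: also captures nonlinear relationships
mi_scores = mutual_info_classif(X, y, random_state=0)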
Mathematical Intuition
Formula:
χ² = Σ (Observed − Expected)² / Expected  (summed over all cells of the contingency table)
- Observed: actual frequency count in your data
- Expected: the frequency you would expect if feature and target were independent
P-value:
- p-value < 0.05: Feature is significantly related to the target (reject the null hypothesis of independence)
- p-value ≥ 0.05: No significant relationship detected (feature may be irrelevant)
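To make Observed vs. Expected concrete, here is a hand computation on a small made-up 2×2 contingency table, cross-checked against scipy.stats.chi2_contingency:
import numpy as np
from scipy.stats import chi2_contingency

# Made-up table: rows = feature categories, columns = target classes
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: row_total * col_total / grand_total
expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0) / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(f"Chi-square by hand: {chi2_stat:.2f}")  # 16.67

# correction=False disables Yates' continuity correction so the values match
stat, p, dof, exp = chi2_contingency(observed, correction=False)
print(f"scipy: stat={stat:.2f}, p={p:.5f}")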
Sklearn Implementation
Main Functions
1. chi2() - Calculate Chi-Square Statistics
from sklearn.feature_selection import chi2
chi_scores, p_values = chi2(X, y)
Parameters:
- `X`: Feature matrix (2D array, non-negative values)
- `y`: Target vector (1D array, categorical)
Returns:
- `chi_scores`: Chi-square statistic for each feature
- `p_values`: P-values for each feature
2. SelectKBest() - Select Top K Features
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)
Parameters:
- `score_func`: Scoring function (use `chi2`)
- `k`: Number of top features to select (integer or `'all'`)
Methods and Attributes:
- `.fit_transform(X, y)`: Fit and transform data
- `.get_support()`: Get boolean mask of selected features
- `.scores_`: Chi-square scores (attribute)
- `.pvalues_`: P-values (attribute)
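A short sketch of those methods and attributes in action, on made-up non-negative count data:
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 8))  # toy non-negative count features
y = rng.integers(0, 2, size=100)        # toy binary target

selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
print(selector.get_support())  # boolean mask, True = feature kept
print(selector.scores_)        # chi-square score per original feature
print(selector.pvalues_)       # p-value per original feature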
3. SelectPercentile() - Select Top X% Features
from sklearn.feature_selection import SelectPercentile, chi2
selector = SelectPercentile(score_func=chi2, percentile=50)
X_selected = selector.fit_transform(X, y)
Parameters:
percentile: Percentage of features to keep (0-100)
Step-by-Step Workflow
Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
Step 2: Load and Prepare Data
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# For Chi-Square: Convert continuous to categorical (binning)
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_categorical = discretizer.fit_transform(X)
Step 3: Calculate Chi-Square Scores
# Calculate chi-square statistics
chi_scores, p_values = chi2(X_categorical, y)
# Create results dataframe
results = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi_scores,
    'P_Value': p_values
}).sort_values('Chi2_Score', ascending=False)
print(results)
Step 4: Visualize Results
# Plot top 10 features
top_features = results.head(10)
plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Chi2_Score'])
plt.xlabel('Chi-Square Score')
plt.title('Top 10 Features by Chi-Square Test')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Step 5: Select Features
# Method 1: Select top K features
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X_categorical, y)
# Get selected feature names
selected_features = X.columns[selector.get_support()].tolist()
print(f"Selected features: {selected_features}")
Step 6: Evaluate with ML Model
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Compare models
model = RandomForestClassifier(random_state=42)
# Before feature selection
score_before = cross_val_score(model, X_categorical, y, cv=5).mean()
# After feature selection
score_after = cross_val_score(model, X_selected, y, cv=5).mean()
print(f"Accuracy before selection: {score_before:.4f}")
print(f"Accuracy after selection: {score_after:.4f}")
Practical Examples
Example 1: Binary Classification (Breast Cancer)
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd
# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Discretize features (required for chi-square)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
X_discrete = discretizer.fit_transform(X)
# Apply chi-square test
chi_scores, p_values = chi2(X_discrete, y)
# Create results
results = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi_scores,
    'P_Value': p_values,
    'Significant': p_values < 0.05
}).sort_values('Chi2_Score', ascending=False)
print(results.head(10))
# Select top 5 features
selector = SelectKBest(chi2, k=5)
X_new = selector.fit_transform(X_discrete, y)
print(f"\nOriginal shape: {X_discrete.shape}")
print(f"Selected shape: {X_new.shape}")
Example 2: Multi-class Classification (Iris)
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# Discretize
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_discrete = discretizer.fit_transform(X)
# Chi-square test
chi_scores, p_values = chi2(X_discrete, y)
results = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi_scores,
    'P_Value': p_values
}).sort_values('Chi2_Score', ascending=False)
print(results)
Example 3: Real Categorical Data
# Create sample categorical dataset
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['S', 'M', 'L', 'M', 'L', 'S', 'L', 'M'],
    'Material': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Purchased': [1, 0, 1, 1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
# Encode categorical variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X = df[['Color', 'Size', 'Material']].apply(le.fit_transform)
y = df['Purchased']
# Chi-square test
chi_scores, p_values = chi2(X, y)
print(pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi_scores,
    'P_Value': p_values
}))
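A side note on the encoding above: LabelEncoder is documented for target labels, and .apply(le.fit_transform) refits it separately per column. For feature matrices, OrdinalEncoder does the same job in one fitted object; a minimal sketch on the same df:
from sklearn.preprocessing import OrdinalEncoder

# One encoder for all feature columns; remembers the categories it saw
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(df[['Color', 'Size', 'Material']])
chi_scores, p_values = chi2(X_encoded, df['Purchased'])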
Common Pitfalls
1. Using Negative Values
# ❌ WRONG - Chi-square requires non-negative values
X_with_negatives = np.array([[-1, 2], [3, -4]])
# This will raise an error!
# ✅ CORRECT - Ensure all values are non-negative
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_positive = scaler.fit_transform(X_with_negatives)
2. Not Discretizing Continuous Features
# ❌ WRONG - Continuous features directly
chi2(X_continuous, y) # Results may be misleading
# ✅ CORRECT - Discretize first
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal')
X_discrete = discretizer.fit_transform(X_continuous)
chi2(X_discrete, y)
3. Ignoring P-values
# Don't just look at chi-square scores
# Also check p-values for statistical significance
results = pd.DataFrame({
    'Feature': feature_names,
    'Chi2': chi_scores,
    'P_Value': p_values,
    'Significant': p_values < 0.05  # ← Important!
})
4. Small Sample Sizes
# The chi-square approximation assumes an expected count of at least 5 per cell
# Check your contingency tables
from scipy.stats.contingency import expected_freq
for col in X.columns:
    table = pd.crosstab(X[col], y)
    print(f"\n{col}:")
    print(table)
    # The rule applies to EXPECTED counts under independence, not raw counts
    print("Min expected count:", expected_freq(table.values).min())
Quick Reference Card
| Task | Code |
| --- | --- |
| Basic chi-square | `chi2(X, y)` |
| Select top K | `SelectKBest(chi2, k=10).fit_transform(X, y)` |
| Select top 20% | `SelectPercentile(chi2, percentile=20).fit_transform(X, y)` |
| Get selected indices | `selector.get_support(indices=True)` |
| Get scores | `selector.scores_` |
| Get p-values | `selector.pvalues_` |
🎯 Key Takeaways
- Chi-square is for categorical data only
- All values must be non-negative
- Always check p-values (< 0.05 = significant)
- Discretize continuous features before applying
- Interpret results carefully: a high score indicates association, not causation
- Validate with ML models to confirm feature importance
Next Steps: Try ANOVA F-test for continuous features, or Mutual Information for mixed data types!