Backup Note and Guide for DictVectorizer

When to Use fit vs transform

  • fit:
    • Learns the structure of the data (e.g., unique categories of features and their encodings).
    • Used only on the training data.
    • Creates the rules for encoding, such as feature names and mappings.
  • transform:
    • Applies the encoding rules (learned during fit) to new data.
    • Used on validation or test data to ensure consistency with the training data.
    • Does not re-learn anything; it only reuses the existing rules. Features or categories that were not seen during fit are silently ignored.

Why You Don't Use fit on Validation/Test Data

  • Re-fitting on validation/test data overwrites the encoding rules learned from the training data.
  • The columns of the validation/test matrix would then no longer match the columns the model was trained on, so the same column index could mean a different feature and predictions become unreliable.
  • Always reuse the encoding rules learned during training for validation and testing.

Syntax and Example

Here’s how you handle DictVectorizer for training and validation data step by step.

1. Import the Library

from sklearn.feature_extraction import DictVectorizer

2. Create Training and Validation Data

Use dictionaries where keys are feature names, and values are feature values.

# Training Data
dict_train = [
    {'color': 'red', 'size': 'M', 'price': 10.5},
    {'color': 'blue', 'size': 'L', 'price': 20.0}
]

# Validation Data
dict_val = [
    {'color': 'green', 'size': 'M', 'price': 15.0},
    {'color': 'red', 'size': 'S', 'price': 10.0}
]

3. Initialize DictVectorizer

dv = DictVectorizer(sparse=False)  # Use sparse=False for dense output
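
For context on the sparse flag, the sketch below compares the default sparse output with the dense output. The two records are made up for illustration; which form is preferable depends on dataset size and the downstream model.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, made up for illustration.
records = [
    {'color': 'red', 'price': 10.5},
    {'color': 'blue', 'price': 20.0},
]

# Default (sparse=True) returns a SciPy sparse matrix.
X_sparse = DictVectorizer().fit_transform(records)

# sparse=False returns a regular NumPy array, which is easier to print.
X_dense = DictVectorizer(sparse=False).fit_transform(records)

print(type(X_sparse).__name__, type(X_dense).__name__)
```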

4. Fit and Transform the Training Data

  • Use fit_transform to:
    • Learn the mapping (feature names and categories).
    • Transform the training data into a numeric feature matrix.
X_train = dv.fit_transform(dict_train)

5. Transform the Validation Data

  • Use transform to apply the same encoding rules (learned from training) to the validation data.
X_val = dv.transform(dict_val)

6. Inspect Results

You can check the transformed feature matrices and feature names for better understanding.

# Get Feature Names
print("Feature Names:", dv.get_feature_names_out())

# Print Transformed Data
print("X_train:", X_train)
print("X_val:", X_val)

Output

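Here’s what you get after running the above code. DictVectorizer sorts feature names alphabetically and keeps only the features seen during fit, so the validation-only categories color=green and size=S get no column.

  • Feature Names:
['color=blue' 'color=red' 'price' 'size=L' 'size=M']

  • Transformed Training Data (X_train):
[[ 0.   1.  10.5  0.   1. ]
 [ 1.   0.  20.   1.   0. ]]

  • Transformed Validation Data (X_val):
[[ 0.  0. 15.  0.  1.]
 [ 0.  1. 10.  0.  0.]]

In X_val, the first row has no color column set (green was not seen in training) and the second row has no size column set (S was not seen in training).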

Step-by-Step Reminder

  1. Use fit_transform on training data to:
    • Learn encoding rules.
    • Transform the training data into a numeric matrix.
  2. Use transform on validation/test data to:
    • Apply the same encoding rules (no re-fitting).
    • Keep feature mappings consistent.

Pro Tip

If you’re using DictVectorizer in machine learning pipelines:

  • Always fit on the training set.
  • Apply the same transformation to validation/test sets to prevent data leakage or inconsistencies.

You’re good to go! 😊