Preprocessing tools like scaling, normalization, and encoding.

MODULE

sklearn.preprocessing

TABLE OF CONTENTS

| Section | Methods / Tools | Short Description |
| --- | --- | --- |
| 1. Scaling and Standardization | StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, scale, minmax_scale, robust_scale | Adjust numerical features to a standard range or distribution to improve model performance. |
| 2. Encoding Categorical Features | LabelEncoder, OneHotEncoder, OrdinalEncoder, LabelBinarizer, MultiLabelBinarizer, label_binarize | Convert categorical or label data into numeric format for models. |
| 3. Normalization and Vector Scaling | Normalizer, normalize | Scale feature vectors individually to unit norm (length = 1), useful for distance-based models. |
| 4. Data Transformation | PowerTransformer, QuantileTransformer, FunctionTransformer, SplineTransformer | Transform data to reduce skewness or make it resemble a normal distribution. |
| 5. Binarization and Discretization | Binarizer, KBinsDiscretizer, add_dummy_feature, binarize | Convert data into binary (0/1) or create bucketed intervals from continuous values. |
| 6. Utility Functions for Feature Engineering | PolynomialFeatures, KernelCenterer, TargetEncoder | Create new features (polynomials/interactions), center kernel matrices, or encode categories based on target statistics. |
| 7. Integrated Workflow Tools | Pipeline, ColumnTransformer | Combine preprocessing steps with modeling in a single, reusable pipeline. |

EXPLANATION OF TABLE OF CONTENTS

1. Scaling and Standardization

Used to transform numerical features so they have a consistent scale or distribution. This helps many machine learning models converge faster and perform better, especially those that rely on distances or assume normality.

  • StandardScaler: Centers data to mean 0 and scales to unit variance.
  • MinMaxScaler: Scales data to a fixed range, usually [0, 1].
  • MaxAbsScaler: Scales data to [-1, 1] by dividing by the max absolute value (preserves sparsity).
  • RobustScaler: Uses median and IQR instead of mean and standard deviation; useful when data has outliers.
  • scale, minmax_scale, robust_scale: Function versions of the scalers for quick, one-off use.
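
A minimal sketch of both the class and function forms (the array values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler, scale

# Toy feature matrix; the second column contains an outlier.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10_000.0]])

X_std = StandardScaler().fit_transform(X)  # mean 0, unit variance per column
X_01 = MinMaxScaler().fit_transform(X)     # each column squeezed into [0, 1]
X_rob = RobustScaler().fit_transform(X)    # median/IQR based, softer on outliers

X_fn = scale(X)                            # function form of StandardScaler
```

The class form remembers the statistics learned in fit, so the same scaling can be reapplied to test data; the function form is a stateless convenience.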

2. Encoding Categorical Features

Converts categorical values into numerical format so they can be used in machine learning models.

  • LabelEncoder: Encodes class labels (target variable) as integers (e.g., A → 0, B → 1).
  • OneHotEncoder: Converts categorical variables into binary vectors (one-hot format).
  • OrdinalEncoder: Assigns integers to categories in a meaningful order (e.g., 'Low' → 0, 'Medium' → 1, 'High' → 2).
  • LabelBinarizer: Converts labels into one-vs-all binary indicator columns (0/1); useful for classification.
  • MultiLabelBinarizer: Binarizes lists of labels (multilabel data).
  • label_binarize: Converts multiclass labels to binary format for one-vs-rest classification.
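
A short sketch of the main encoders (assumes scikit-learn ≥ 1.2 for sparse_output; older versions use sparse=False; all data is made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Feature encoders expect 2-D input (samples x features).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(sparse_output=False)       # dense array for readability
print(onehot.fit_transform(colors))               # one 0/1 column per category

sizes = np.array([["low"], ["high"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])  # explicit order
print(ordinal.fit_transform(sizes))               # low=0, medium=1, high=2

# LabelEncoder is meant for the 1-D target, not for features.
y = ["cat", "dog", "cat"]
print(LabelEncoder().fit_transform(y))            # -> [0 1 0]
```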

3. Normalization and Vector Scaling

Applies vector-based normalization — each row (sample) is scaled to have a unit norm (length of 1). Often used for KNN, SVM, or cosine similarity.

  • Normalizer: Normalizes each row to unit norm (l1, l2, or max).
  • normalize: Functional version to apply normalization directly to arrays.
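
A small sketch showing that rows, not columns, end up with length 1:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

X_l2 = Normalizer(norm="l2").fit_transform(X)  # each ROW scaled to unit length
print(np.linalg.norm(X_l2, axis=1))            # -> [1. 1.]

X_l1 = normalize(X, norm="l1")                 # functional one-off equivalent
```

Note this is per-row scaling; per-column scaling is what the Section 1 scalers do.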

4. Data Transformation

Transforms feature distributions to be more Gaussian-like or uniform. Helps models that assume normality (like linear regression) or are sensitive to skew.

  • PowerTransformer: Applies Box-Cox or Yeo-Johnson transform to make data more normal.
  • QuantileTransformer: Maps data to a uniform or normal distribution using quantile information.
  • FunctionTransformer: Wraps any custom Python function into a scikit-learn transformer.
  • SplineTransformer: Generates B-spline features for modeling smooth non-linear relationships.
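
A sketch on synthetic right-skewed data (the lognormal sample is only an illustration):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))            # strongly right-skewed sample

pt = PowerTransformer(method="yeo-johnson")  # also handles zeros/negatives
X_gauss = pt.fit_transform(X)                # roughly Gaussian output

qt = QuantileTransformer(output_distribution="normal", n_quantiles=500)
X_rank = qt.fit_transform(X)                 # rank-based mapping to a normal
```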

5. Binarization and Discretization

Used to split continuous data into categories or simple binary values.

  • Binarizer: Converts numeric features to 0 or 1 based on a threshold.
  • KBinsDiscretizer: Converts continuous features into discrete bins using uniform, quantile, or k-means strategy.
  • add_dummy_feature: Prepends a constant feature (1.0 by default) to the data, e.g., to represent an intercept term.
  • binarize: Functional version of Binarizer.
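
A quick sketch (the threshold and bin count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

X = np.array([[0.2], [1.5], [3.7], [8.0], [12.0], [20.0]])

print(Binarizer(threshold=1.0).fit_transform(X))   # 0 at/below threshold, 1 above

kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(kbins.fit_transform(X))                      # bin index (0, 1, 2) per value
```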

6. Utility Functions for Feature Engineering

Used to generate new features or modify data for better modeling.

  • PolynomialFeatures: Creates interaction and polynomial terms (e.g., x², x*y).
  • KernelCenterer: Centers precomputed kernel matrices for kernel-based models like SVM.
  • TargetEncoder: Encodes each category with a (shrunk) estimate of the target mean for that category; keeps dimensionality low compared to one-hot encoding.
⚠️ TargetEncoder was added to sklearn.preprocessing in scikit-learn 1.3; on older versions, a comparable encoder is available from the category_encoders package.
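
A minimal sketch of the feature-construction side (the outputs follow directly from x1 = 2, x2 = 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))         # -> [[2. 3. 4. 6. 9.]]  (x1, x2, x1^2, x1*x2, x2^2)
print(poly.get_feature_names_out())  # -> ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
```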

7. Integrated Workflow Tools

These live in sklearn.pipeline and sklearn.compose rather than in sklearn.preprocessing, but they are essential for combining preprocessing steps with a model:

  • Pipeline: Chains preprocessing and modeling steps together.
  • ColumnTransformer: Applies different transformations to different columns (e.g., scale numerical, encode categorical).
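
A compact sketch of both tools together, on a hypothetical DataFrame (the column names age, city, and bought are invented for this example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numeric and one categorical feature plus a target.
df = pd.DataFrame({"age":    [22, 35, 58, 41],
                   "city":   ["ams", "nyc", "nyc", "ams"],
                   "bought": [0, 1, 1, 0]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),                         # scale numeric column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode categorical column
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression())])
model.fit(df[["age", "city"]], df["bought"])  # preprocessing + model fit in one step
```

Because the fitted scalers and encoders travel inside the pipeline, the same preprocessing is applied automatically at predict time, which avoids train/test leakage.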