sklearn.preprocessing

TABLE OF CONTENTS

| Section | Methods / Tools | Short Description |
|---|---|---|
| 1. Scaling and Standardization | StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, scale, minmax_scale, robust_scale | Adjust numerical features to a standard range or distribution to improve model performance. |
| 2. Encoding Categorical Features | LabelEncoder, OneHotEncoder, OrdinalEncoder, LabelBinarizer, MultiLabelBinarizer, label_binarize | Convert categorical or label data into numeric format for models. |
| 3. Normalization and Vector Scaling | Normalizer, normalize | Scale feature vectors individually to unit norm (length = 1), useful for distance-based models. |
| 4. Data Transformation | PowerTransformer, QuantileTransformer, FunctionTransformer, SplineTransformer | Transform data to reduce skewness or make it resemble a normal distribution. |
| 5. Binarization and Discretization | Binarizer, KBinsDiscretizer, add_dummy_feature, binarize | Convert data into binary (0/1) or create bucketed intervals from continuous values. |
| 6. Utility Functions | PolynomialFeatures, KernelCenterer, TargetEncoder | Create new features (polynomials/interactions), center kernel matrices, or encode categories based on target statistics. |
| 7. Integrated Workflow | Pipeline, ColumnTransformer | Combine preprocessing steps with modeling in a single, reusable pipeline. |
EXPLANATION OF TABLE OF CONTENTS
1. Scaling and Standardization
Used to transform numerical features so they have a consistent scale or distribution. This helps many machine learning models converge faster and perform better, especially those that rely on distances or assume normality.
- StandardScaler: Centers data to mean 0 and scales to unit variance.
- MinMaxScaler: Scales data to a fixed range, usually [0, 1].
- MaxAbsScaler: Scales data to [-1, 1] by dividing by the max absolute value (preserves sparsity).
- RobustScaler: Uses median and IQR instead of mean and standard deviation; useful when data has outliers.
- scale, minmax_scale, robust_scale: Function versions of the scalers for quick use.
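A minimal sketch comparing three of these scalers on a toy column (the values, including the deliberate outlier, are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 acts as an outlier

std = StandardScaler().fit_transform(X)  # mean 0, unit variance
mm = MinMaxScaler().fit_transform(X)     # mapped into [0, 1]
rb = RobustScaler().fit_transform(X)     # centered on median, scaled by IQR
```

Because RobustScaler uses the median and IQR, the outlier barely shifts the first three scaled values, whereas it dominates the StandardScaler and MinMaxScaler outputs.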
2. Encoding Categorical Features
Converts categorical values into numerical format so they can be used in machine learning models.
- LabelEncoder: Encodes class labels (target variable) as integers (e.g., A → 0, B → 1).
- OneHotEncoder: Converts categorical variables into binary vectors (one-hot format).
- OrdinalEncoder: Assigns integers to categories in order (e.g., 'Low' = 0, 'High' = 2).
- LabelBinarizer: Binarizes labels (0 or 1); useful for classification.
- MultiLabelBinarizer: Binarizes lists of labels (multilabel data).
- label_binarize: Converts multiclass labels to binary format for one-vs-rest classification.
3. Normalization and Vector Scaling
Applies vector-based normalization — each row (sample) is scaled to have a unit norm (length of 1). Often used for KNN, SVM, or cosine similarity.
- Normalizer: Normalizes each row to unit norm (l1, l2, or max).
- normalize: Functional version to apply normalization directly to arrays.
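A small sketch showing both the class and function forms (toy values chosen so the l2 case is easy to check by hand):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Class form: each row scaled to unit L2 norm ([3, 4] -> [0.6, 0.8])
X_l2 = Normalizer(norm="l2").fit_transform(X)

# Function form with L1 norm: absolute row values sum to 1
X_l1 = normalize(X, norm="l1")
```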
4. Data Transformation
Transforms feature distributions to be more Gaussian-like or uniform. Helps models that assume normality (like linear regression) or are sensitive to skew.
- PowerTransformer: Applies Box-Cox or Yeo-Johnson transform to make data more normal.
- QuantileTransformer: Maps data to a uniform or normal distribution using quantile information.
- FunctionTransformer: Wraps any custom Python function into a scikit-learn transformer.
- SplineTransformer: Generates B-spline features for modeling smooth non-linear relationships.
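A sketch of the two distribution-reshaping transformers on synthetic right-skewed data (the exponential sample is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 1))  # right-skewed toy data

# Yeo-Johnson handles zero/negative values; Box-Cox requires strictly positive data
X_gauss = PowerTransformer(method="yeo-johnson").fit_transform(X)

# Maps through the empirical CDF to a uniform (or normal) distribution
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_unif = qt.fit_transform(X)
```

PowerTransformer standardizes its output by default (`standardize=True`), so the transformed column has zero mean and unit variance.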
5. Binarization and Discretization
Used to split continuous data into categories or simple binary values.
- Binarizer: Converts numeric features to 0 or 1 based on a threshold.
- KBinsDiscretizer: Converts continuous features into discrete bins using uniform, quantile, or k-means strategy.
- add_dummy_feature: Adds a constant feature (often 1) to the dataset.
- binarize: Functional version of Binarizer.
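A sketch of thresholding versus binning on a toy column (threshold and bin count are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

X = np.array([[0.2], [1.5], [3.7], [8.0]])

# Values above the threshold become 1, the rest 0
X_bin = Binarizer(threshold=1.0).fit_transform(X)

# Three equal-width bins over [min, max], encoded as ordinal integers
kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = kb.fit_transform(X)
```

With `strategy="uniform"` the bin edges here fall at 0.2, 2.8, 5.4, and 8.0, so the four values land in bins 0, 0, 1, and 2.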
6. Utility Functions for Feature Engineering
Used to generate new features or modify data for better modeling.
- PolynomialFeatures: Creates interaction and polynomial terms (e.g., x², x*y).
- KernelCenterer: Centers precomputed kernel matrices for kernel-based models like SVM.
- TargetEncoder: Encodes categorical variables with the mean (or another statistic) of the target; reduces dimensionality vs. one-hot.
⚠️ TargetEncoder was added to sklearn.preprocessing in scikit-learn 1.3; on older versions, use the category_encoders package instead.
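A sketch of the polynomial expansion, which is easy to verify by hand on a single two-feature row:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# Degree-2 expansion produces: 1, x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
# -> [[1., 2., 3., 4., 6., 9.]]
```

Pass `include_bias=False` to drop the leading constant column, or `interaction_only=True` to keep only cross terms like x1*x2.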
7. Integrated Workflow Tools
These live outside sklearn.preprocessing (in sklearn.pipeline and sklearn.compose), but are essential for combining preprocessing steps:
- Pipeline: Chains preprocessing and modeling steps together.
- ColumnTransformer: Applies different transformations to different columns (e.g., scale numerical, encode categorical).
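A minimal end-to-end sketch combining both tools, assuming pandas is available; the DataFrame, column names, and classifier are illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "city": ["NY", "LA", "NY", "SF"]})
y = [0, 1, 0, 1]

# Scale the numeric column, one-hot encode the categorical one
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing and the model into one fit/predict object
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df, y)
preds = model.predict(df)
```

Because the preprocessing is fitted inside the pipeline, the same scaling and encoding learned on the training data are automatically reapplied at prediction time, avoiding train/test leakage.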