sklearn.preprocessing

TABLE OF CONTENTS

| Section | Methods / Tools | Short Description |
|---|---|---|
| 1. Scaling and Standardization | StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, scale, minmax_scale, robust_scale | Adjust numerical features to a standard range or distribution to improve model performance. |
| 2. Encoding Categorical Features | LabelEncoder, OneHotEncoder, OrdinalEncoder, LabelBinarizer, MultiLabelBinarizer, label_binarize | Convert categorical or label data into numeric format for models. |
| 3. Normalization and Vector Scaling | Normalizer, normalize | Scale feature vectors individually to unit norm (length = 1), useful for distance-based models. |
| 4. Data Transformation | PowerTransformer, QuantileTransformer, FunctionTransformer, SplineTransformer | Transform data to reduce skewness or make it resemble a normal distribution. |
| 5. Binarization and Discretization | Binarizer, KBinsDiscretizer, add_dummy_feature, binarize | Convert data into binary (0/1) or create bucketed intervals from continuous values. |
| 6. Utility Functions | PolynomialFeatures, KernelCenterer, TargetEncoder | Create new features (polynomials/interactions), center kernel matrices, or encode categories based on target statistics. |
| 7. Integrated Workflow | Pipeline, ColumnTransformer | Combine preprocessing steps with modeling in a single, reusable pipeline. |
EXPLANATION OF TABLE OF CONTENTS
1. Scaling and Standardization
Used to transform numerical features so they have a consistent scale or distribution. This helps many machine learning models converge faster and perform better, especially those that rely on distances or assume normality.
- StandardScaler: Centers data to mean 0 and scales to unit variance.
- MinMaxScaler: Scales data to a fixed range, usually [0, 1].
- MaxAbsScaler: Scales data to [-1, 1] by dividing by the max absolute value (preserves sparsity).
- RobustScaler: Uses median and IQR instead of mean and standard deviation; useful when data has outliers.
- scale, minmax_scale, robust_scale: Function versions of the scalers for quick use.
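A minimal sketch comparing three of these scalers on a toy column (the values, including the deliberate outlier, are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 acts as an outlier

std = StandardScaler().fit_transform(X)  # mean 0, unit variance
mm = MinMaxScaler().fit_transform(X)     # mapped into [0, 1]
rb = RobustScaler().fit_transform(X)     # centered on median, scaled by IQR
```

Because RobustScaler uses the median and IQR, the outlier barely shifts the first three scaled values, whereas it dominates the StandardScaler and MinMaxScaler outputs.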
2. Encoding Categorical Features
Converts categorical values into numerical format so they can be used in machine learning models.
- LabelEncoder: Encodes class labels (target variable) as integers (e.g., A → 0, B → 1).
- OneHotEncoder: Converts categorical variables into binary vectors (one-hot format).
- OrdinalEncoder: Assigns integers to categories in order (e.g., 'Low' = 0, 'High' = 2).
- LabelBinarizer: Binarizes labels (0 or 1); useful for classification.
- MultiLabelBinarizer: Binarizes lists of labels (multilabel data).
- label_binarize: Converts multiclass labels to binary format for one-vs-rest classification.
3. Normalization and Vector Scaling
Applies vector-based normalization — each row (sample) is scaled to have a unit norm (length of 1). Often used for KNN, SVM, or cosine similarity.
- Normalizer: Normalizes each row to unit norm (l1, l2, or max).
- normalize: Functional version to apply normalization directly to arrays.
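A small sketch showing both the class and function forms (toy values chosen so the l2 case is easy to check by hand):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Class form: each row scaled to unit L2 norm ([3, 4] -> [0.6, 0.8])
X_l2 = Normalizer(norm="l2").fit_transform(X)

# Function form with L1 norm: absolute row values sum to 1
X_l1 = normalize(X, norm="l1")
```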
4. Data Transformation
Transforms feature distributions to be more Gaussian-like or uniform. Helps models that assume normality (like linear regression) or are sensitive to skew.
- PowerTransformer: Applies Box-Cox or Yeo-Johnson transform to make data more normal.
- QuantileTransformer: Maps data to a uniform or normal distribution using quantile information.
- FunctionTransformer: Wraps any custom Python function into a scikit-learn transformer.
- SplineTransformer: Generates B-spline features for modeling smooth non-linear relationships.
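A sketch of the two distribution-reshaping transformers on synthetic right-skewed data (the exponential sample is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 1))  # right-skewed toy data

# Yeo-Johnson handles zero/negative values; Box-Cox requires strictly positive data
X_gauss = PowerTransformer(method="yeo-johnson").fit_transform(X)

# Maps through the empirical CDF to a uniform (or normal) distribution
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_unif = qt.fit_transform(X)
```

PowerTransformer standardizes its output by default (`standardize=True`), so the transformed column has zero mean and unit variance.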
5. Binarization and Discretization
Used to split continuous data into categories or simple binary values.
- Binarizer: Converts numeric features to 0 or 1 based on a threshold.
- KBinsDiscretizer: Converts continuous features into discrete bins using uniform, quantile, or k-means strategy.
- add_dummy_feature: Adds a constant feature (often 1) to the dataset.
- binarize: Functional version of Binarizer.
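A sketch of thresholding versus binning on a toy column (threshold and bin count are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

X = np.array([[0.2], [1.5], [3.7], [8.0]])

# Values above the threshold become 1, the rest 0
X_bin = Binarizer(threshold=1.0).fit_transform(X)

# Three equal-width bins over [min, max], encoded as ordinal integers
kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = kb.fit_transform(X)
```

With `strategy="uniform"` the bin edges here fall at 0.2, 2.8, 5.4, and 8.0, so the four values land in bins 0, 0, 1, and 2.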
6. Utility Functions for Feature Engineering
Used to generate new features or modify data for better modeling.
- PolynomialFeatures: Creates interaction and polynomial terms (e.g., x², x*y).
- KernelCenterer: Centers precomputed kernel matrices for kernel-based models like SVM.
- TargetEncoder: Encodes categorical variables with the mean (or another statistic) of the target; reduces dimensionality vs. one-hot.
⚠️ TargetEncoder was added to sklearn.preprocessing in scikit-learn 1.3; on older versions, use the category_encoders package instead.
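A sketch of the polynomial expansion, which is easy to verify by hand on a single two-feature row:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# Degree-2 expansion produces: 1, x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
# -> [[1., 2., 3., 4., 6., 9.]]
```

Pass `include_bias=False` to drop the leading constant column, or `interaction_only=True` to keep only cross terms like x1*x2.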
7. Integrated Workflow Tools
These live outside sklearn.preprocessing (in sklearn.pipeline and sklearn.compose), but are essential for combining preprocessing steps:
- Pipeline: Chains preprocessing and modeling steps together.
- ColumnTransformer: Applies different transformations to different columns (e.g., scale numerical, encode categorical).
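A minimal end-to-end sketch combining both tools, assuming pandas is available; the DataFrame, column names, and classifier are illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "city": ["NY", "LA", "NY", "SF"]})
y = [0, 1, 0, 1]

# Scale the numeric column, one-hot encode the categorical one
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing and the model into one fit/predict object
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df, y)
preds = model.predict(df)
```

Because the preprocessing is fitted inside the pipeline, the same scaling and encoding learned on the training data are automatically reapplied at prediction time, avoiding train/test leakage.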