CatBoost Regressor

Classification
Boosting Algorithms
Import Statement

import catboost as cb

Model Name

CatBoost


CatBoost

CatBoost is an open-source gradient boosting library designed to handle categorical features natively. It supports both classification and regression on structured (tabular) data and combines strong out-of-the-box performance with a simple API and efficient training. The sections below cover its core models, key parameters, and reusable code templates for study and day-to-day data science work.

Core Models in CatBoost

CatBoost's two most commonly used model classes are:

| Model | Description |
| --- | --- |
| CatBoostClassifier | For classification tasks (predicting discrete classes). |
| CatBoostRegressor | For regression tasks (predicting continuous values). |

Key Features

  1. Native Handling of Categorical Features:
    • Automatically processes categorical variables without requiring one-hot encoding.
    • Simplifies preprocessing while improving performance.
  2. Handles Missing Values:
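    • Supports missing values in numerical features directly (see the nan_mode parameter). Missing values in categorical features are not handled automatically and should be converted to a string category (for example "missing") before training.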
  3. Fast Training:
    • Optimized for speed on both CPU and GPU.
  4. Overfitting Protection:
    • Includes early_stopping_rounds, l2_leaf_reg, and other regularization techniques.
  5. Robust Cross-Validation:
    • Provides direct support for cross-validation to evaluate models reliably.
  6. Explainability:
    • Feature importance and SHAP values enable interpretability.
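
To give a feel for how categorical features can be turned into numbers without one-hot encoding, here is a simplified sketch of ordered target statistics, the idea behind CatBoost's categorical encoding. The function name `ordered_target_stats`, the `prior` and `prior_weight` parameters, and the fixed row order are assumptions made for illustration; the library itself uses random permutations and further refinements, so this is not its actual implementation.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category
        value = (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        encoded.append(value)
        # Update the running statistics only after encoding,
        # so a row never sees its own target (avoids target leakage)
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
encoded = ordered_target_stats(cats, ys)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the category's observed target mean.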

Key Parameters

| Parameter | Description |
| --- | --- |
| iterations | Number of boosting iterations (trees). |
| learning_rate | Step size that scales each tree's contribution; smaller values usually need more iterations. |
| depth | Depth of each tree. |
| l2_leaf_reg | L2 regularization coefficient to reduce overfitting. |
| loss_function | Loss function to minimize (e.g., Logloss for classification, RMSE for regression). |
| eval_metric | Metric used to evaluate the model on the eval set and to drive early stopping. |
| cat_features | List of indices or names of categorical features. |
| task_type | Whether to train on CPU or GPU. |
| early_stopping_rounds | Stops training if the eval metric has not improved for the specified number of rounds. |
| random_seed | Random seed for reproducibility. |
| verbose | Controls logging during training (True, False, or an integer logging period). |

Templates for Practice

1. Data Preparation

Standardize the dataset preparation using the following syntax:

from sklearn.model_selection import train_test_split

# Split dataset into training, testing, and validation sets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_train, test_size=0.2, random_state=42)

# Separate features and target
X_train, y_train = df_train.drop('target', axis=1), df_train['target']
X_val, y_val = df_val.drop('target', axis=1), df_val['target']
X_test, y_test = df_test.drop('target', axis=1), df_test['target']
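
Because the second split is applied to the remaining 80% of the data, the validation set is 16% of the full dataset, not 20%. The arithmetic can be checked with a small helper; `split_fractions` is a hypothetical name used only for this illustration.

```python
def split_fractions(test_size=0.2, val_size=0.2):
    """Return the (train, val, test) fractions of the full dataset after two successive splits."""
    remaining = 1.0 - test_size   # share left after holding out the test set
    val = remaining * val_size    # the validation share is taken from the remainder
    train = remaining - val
    return train, val, test_size

train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

With the values used above this gives roughly a 64/16/20 split. Actual row counts from train_test_split can differ by a row because sizes are rounded to whole samples.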

2. CatBoostClassifier Syntax

from catboost import CatBoostClassifier

# Initialize the CatBoostClassifier
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=['categorical_feature_1', 'categorical_feature_2'],
    loss_function='Logloss',  # Binary classification
    eval_metric='AUC',        # Evaluation metric
    random_seed=42,
    verbose=100
)

# Train the model
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=50
)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

# Predictions
y_pred = model.predict(X_test)
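
The early_stopping_rounds behaviour used in fit can be pictured with a plain-Python sketch: training stops once the validation metric has gone a given number of rounds without improving on its best value. The function name and the toy AUC history below are assumptions for illustration, not CatBoost internals.

```python
def best_iteration_with_early_stopping(val_scores, early_stopping_rounds, higher_is_better=True):
    """Return (best_iteration, stopped_at) for a sequence of per-iteration validation scores."""
    best_iter, best_score = 0, val_scores[0]
    for i, score in enumerate(val_scores):
        improved = score > best_score if higher_is_better else score < best_score
        if improved:
            best_iter, best_score = i, score
        elif i - best_iter >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` iterations: stop here
            return best_iter, i
    # Never triggered: training ran through all iterations
    return best_iter, len(val_scores) - 1

auc_history = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
best, stopped = best_iteration_with_early_stopping(auc_history, early_stopping_rounds=3)
print(f"best iteration: {best}, stopped at: {stopped}")
```

On a fitted CatBoost model, the best iteration found on the eval set can be read with model.get_best_iteration().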

3. CatBoostRegressor Syntax


from catboost import CatBoostRegressor

# Initialize the CatBoostRegressor
model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=['categorical_feature_1', 'categorical_feature_2'],
    loss_function='RMSE',     # Regression
    eval_metric='RMSE',       # Evaluation metric
    random_seed=42,
    verbose=100
)

# Train the model
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=50
)

# Evaluate the model
rmse = model.get_best_score()['validation']['RMSE']
print(f"Validation RMSE: {rmse}")

# Predictions
y_pred = model.predict(X_test)
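
The template above reports the validation RMSE only. To score the test-set predictions as well, RMSE can be computed directly from the true and predicted values. The helper below is a minimal sketch with made-up numbers; `rmse`, `example_true`, and `example_pred` are illustrative names.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

example_true = [3.0, 5.0, 2.0, 7.0]
example_pred = [2.0, 5.0, 4.0, 8.0]
test_rmse = rmse(example_true, example_pred)
print(f"Test RMSE: {test_rmse:.4f}")
```

With the fitted model, the same function could be called as rmse(list(y_test), list(y_pred)).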

4. Feature Importance

import matplotlib.pyplot as plt

# Get feature importance
feature_importances = model.get_feature_importance()
feature_names = X_train.columns

# Plot feature importance
plt.barh(feature_names, feature_importances)
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance')
plt.show()
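
With many features, the bar chart is easier to read when it shows only the most important ones in sorted order. A small sketch of that ranking step follows; `top_features` and the sample names and scores are hypothetical.

```python
def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the largest importance, sorted descending."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

names = ["age", "city", "income", "clicks"]
scores = [12.5, 40.0, 30.5, 17.0]
ranked = top_features(names, scores, k=3)
print(ranked)  # [('city', 40.0), ('income', 30.5), ('clicks', 17.0)]
```

Since plt.barh draws the first item at the bottom, passing the ranked list in reverse order puts the most important feature at the top of the plot.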

5. Hyperparameter Tuning

Automate hyper-parameter optimization with GridSearchCV:


from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1],
    'iterations': [200, 500],
}

# Initialize CatBoostClassifier
model = CatBoostClassifier(cat_features=['categorical_feature_1', 'categorical_feature_2'])

# GridSearchCV
grid = GridSearchCV(model, param_grid, scoring='accuracy', cv=3)
grid.fit(X_train, y_train)

print(f"Best Parameters: {grid.best_params_}")
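
Grid search cost grows multiplicatively with each added parameter value, so it helps to count the fits before launching a search. The sketch below enumerates the same grid with itertools; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.1],
    "iterations": [200, 500],
}

# Every combination of parameter values, as a list of dicts
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

n_folds = 3
total_fits = len(combinations) * n_folds
print(f"{len(combinations)} combinations x {n_folds} folds = {total_fits} fits")
```

For the grid above that is 12 combinations and 36 model fits, plus one final refit on the full training set when GridSearchCV is left at its default refit=True.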

6. Cross-Validation

CatBoost supports direct cross-validation for model evaluation:

from catboost import Pool, cv

# Create a data pool
train_pool = Pool(X_train, y_train, cat_features=['categorical_feature_1', 'categorical_feature_2'])

# Perform cross-validation
cv_results = cv(
    train_pool,
    params={
        'iterations': 500,
        'learning_rate': 0.1,
        'depth': 6,
        'loss_function': 'Logloss'
    },
    fold_count=5,
    early_stopping_rounds=50,
    plot=True
)

print(cv_results)
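
To make fold_count concrete, the following sketch shows how a dataset can be partitioned into folds so that each sample is used for testing exactly once. It is a simplified, unshuffled version with the hypothetical name `k_fold_indices`; the library's cv function additionally offers shuffling and stratification.

```python
def k_fold_indices(n_samples, fold_count):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, fold_count)
    start = 0
    for fold in range(fold_count):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, fold_count=5))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```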

Visualization Tools

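  1. plot_tree():
    • Renders the structure of a single tree as a Graphviz graph (requires the graphviz package). It is a method of the fitted model, not a function imported from catboost.
    • Example:
      # Plot the first tree of the fitted model
      model.plot_tree(tree_idx=0, pool=train_pool)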
      
      
  2. SHAP Explainability:
    • Use SHAP values for interpreting predictions.
    • Example:
      import shap

      explainer = shap.Explainer(model)
      shap_values = explainer(X_test)
      shap.summary_plot(shap_values, X_test)
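
A useful property when reading SHAP output is additivity: the base value plus the sum of a row's SHAP values equals the model's prediction for that row. The sketch below demonstrates this on a toy linear model, where (assuming independent features) the SHAP value of feature i is its weight times the feature's deviation from its mean. The model, weights, and data are made up for illustration and do not involve CatBoost.

```python
def predict(weights, bias, x):
    """Prediction of a simple linear model."""
    return bias + sum(w * xj for w, xj in zip(weights, x))

def linear_shap_values(weights, x, background):
    """SHAP values of a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(weights))]
    return [w * (xj - m) for w, xj, m in zip(weights, x, means)]

weights, bias = [2.0, -1.0], 0.5
background = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
x = [3.0, 1.0]

phi = linear_shap_values(weights, x, background)
# Base value: the average prediction over the background data
base_value = sum(predict(weights, bias, row) for row in background) / len(background)

print(f"SHAP values: {phi}")
print(f"base + sum(phi) = {base_value + sum(phi)}, prediction = {predict(weights, bias, x)}")
```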
      
      

Tips for Academic and Professional Use

  1. Document Experiments:
    • Record parameters, metrics, and insights for reproducibility.
  2. Set Random Seeds:
    • Use random_seed to ensure consistent results.
  3. Feature Engineering:
    • Leverage cat_features for handling categorical data efficiently.
  4. Model Evaluation:
    • Use cross-validation for robust evaluation.
  5. Explainability:
    • Include feature importance and SHAP plots in your reports.
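
One practical detail behind the feature-engineering tip: CatBoost expects the values in cat_features columns to be integers or strings, so missing or float values in those columns need cleaning first. The following is a minimal sketch of that step on a plain list; `clean_categorical` and the `"missing"` label are assumptions chosen for illustration.

```python
import math

def clean_categorical(values, missing_label="missing"):
    """Convert a column of raw values to strings, replacing None/NaN with a label."""
    cleaned = []
    for v in values:
        is_missing = v is None or (isinstance(v, float) and math.isnan(v))
        cleaned.append(missing_label if is_missing else str(v))
    return cleaned

raw_city = ["Paris", None, "Rome", float("nan"), 42]
clean_city = clean_categorical(raw_city)
print(clean_city)  # ['Paris', 'missing', 'Rome', 'missing', '42']
```

With a pandas DataFrame, the same effect can be had with df[col].fillna("missing").astype(str) on each categorical column.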

Summary

| Component | Purpose |
| --- | --- |
| CatBoostClassifier | For classification tasks. |
| CatBoostRegressor | For regression tasks. |
| cat_features | Declares which columns are categorical so they are handled natively. |
| iterations | Number of boosting iterations. |
| early_stopping_rounds | Stops training if no improvement is observed. |
| get_feature_importance() | Identifies important features. |
| cv | Performs cross-validation. |