Random Forest Regressor

Import Statement

from sklearn.ensemble import RandomForestRegressor

RandomForestRegressor, from the bagging family of ensemble methods, is Scikit-learn's implementation of a random forest for regression tasks. It builds a collection (forest) of decision trees and averages their predictions to produce the final output, which improves predictive accuracy and reduces overfitting compared with a single tree.

How it works:

  • Creates multiple decision trees: Each tree is trained on a bootstrap sample of the dataset (a random sample drawn with replacement).
  • Averages results: The predictions of all trees are averaged to produce the final prediction (see the sketch below).
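
To make the averaging step concrete, the sketch below fits a small forest on synthetic data (make_regression is used purely for illustration) and shows that manually averaging the per-tree predictions from reg.estimators_ matches the forest's own output.

python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=4, random_state=0)

reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X, y)

# Manually average the per-tree predictions ...
tree_preds = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
manual_average = tree_preds.mean(axis=0)

# ... and compare with the forest's own prediction
print(np.allclose(manual_average, reg.predict(X[:5])))  # True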

Syntax Example

Basic Usage

python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=4, random_state=0, shuffle=False)

# Train RandomForestRegressor
reg = RandomForestRegressor(max_depth=2, random_state=0)

# Fitting the model
reg.fit(X, y)

# Make a prediction
print(reg.predict([[0, 0, 0, 0]]))
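
In practice you would usually evaluate on held-out data rather than predict a single point. Below is a minimal sketch (using the same synthetic data setup) that splits the data with train_test_split and reports the R² score on the test set.

python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=4, random_state=0)

# Hold out 25% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

reg = RandomForestRegressor(max_depth=2, random_state=0)
reg.fit(X_train, y_train)

# score() returns R^2 for regressors
print(f"Test R^2: {reg.score(X_test, y_test):.3f}")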

Common Parameters

python
reg = RandomForestRegressor(
    n_estimators=200,         # 200 trees
    max_depth=10,             # Limit tree depth to 10
    min_samples_split=5,      # Require at least 5 samples to split a node
    max_features='sqrt',      # Consider sqrt(n_features) features at each split
    random_state=42           # Reproducibility
)

Key Parameters and Their Functions

  • n_estimators (default: 100): Number of trees. Options: any positive integer. Best practice: start with 100 and increase (300-500) if needed for better results.
  • criterion (default: "squared_error"): Metric used to measure the quality of a split. Options: "squared_error" (MSE), "absolute_error" (MAE), "friedman_mse", "poisson". Best practice: "squared_error" for most cases; "absolute_error" for robustness to outliers.
  • max_depth (default: None): Maximum depth of each tree. Options: positive integer or None. Best practice: limit depth (5-20) to prevent overfitting.
  • min_samples_split (default: 2): Minimum samples required to split an internal node. Options: integer or float (fraction). Best practice: increase (e.g., 5-10) for noisy datasets.
  • min_samples_leaf (default: 1): Minimum samples required at a leaf node. Options: integer or float. Best practice: use higher values (2-5) to smooth predictions.
  • max_features (default: 1.0): Number of features considered at each split. Options: "sqrt", "log2", None, int, float. Best practice: "sqrt" or "log2" recommended for high-dimensional data.
  • bootstrap (default: True): Whether sampling is done with replacement. Options: True or False. Best practice: True in most cases.
  • oob_score (default: False): Whether to use out-of-bag samples to estimate the generalization score. Options: True or False. Best practice: set True when no separate validation set is available.
  • random_state (default: None): Controls randomness for reproducibility. Options: integer or None. Best practice: use a fixed integer (e.g., 42).
  • n_jobs (default: None): Number of parallel jobs. Options: None, -1, or any positive integer. Best practice: -1 to use all cores.
  • warm_start (default: False): Reuse the solution of the previous call to add more estimators. Options: True or False. Best practice: True for incremental learning.
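
The best-practice values above are only starting points; the right settings depend on the data. A minimal sketch of tuning a few of these parameters with GridSearchCV is shown below (the synthetic data and the grid are arbitrary examples, not recommendations).

python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Small, illustrative grid over a few key parameters
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,        # 5-fold cross-validation
    n_jobs=-1,   # use all CPU cores
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV R^2: {search.best_score_:.3f}")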

Features of RandomForestRegressor

  • Parallel Computation: Speed up training using multiple CPU cores. Key parameter: n_jobs.
  • Feature Importances: Evaluate the importance of each feature. Key attribute: feature_importances_.
  • Out-of-Bag Estimates: Validate the model without separate validation data. Key parameter: oob_score.
  • Tree Growth Control: Prevent overfitting by limiting tree growth. Key parameters: max_depth, min_samples_leaf, max_leaf_nodes.
  • Warm Start: Expand the model incrementally without retraining. Key parameter: warm_start.

Expanded Features of RandomForestRegressor with Examples

1. Parallel Computation (n_jobs)

Use multiple CPU cores to speed up model training; this is especially useful for large datasets.

  • Usage:
    • n_jobs=-1: Use all available CPU cores.
    • n_jobs=1: Single core.
    • n_jobs=2: Two cores.
  • Example:
python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, n_features=20, random_state=42)

reg = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
reg.fit(X, y)

print("Model trained using all available cores.")

2. Feature Importances

Random forests rank features by how much they contribute to the splits across all trees; the scores are exposed through the feature_importances_ attribute.

  • Example:
python
# Access feature importances
importances = reg.feature_importances_

# Print feature importances
for idx, importance in enumerate(importances):
    print(f"Feature {idx}: Importance = {importance:.4f}")

  • Output Example:
python
Feature 0: Importance = 0.3945
Feature 1: Importance = 0.2661
Feature 2: Importance = 0.0701
Feature 3: Importance = 0.2693

Use this for feature selection or dimensionality reduction.
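
For example, here is a minimal sketch of automated feature selection with SelectFromModel, reusing the fitted reg and the data X from the examples above; the "median" threshold keeps only the features whose importance exceeds the median importance.

python
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance exceeds the median importance
selector = SelectFromModel(reg, prefit=True, threshold="median")
X_selected = selector.transform(X)

print(f"Original shape: {X.shape}, selected shape: {X_selected.shape}")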

3. Out-of-Bag (OOB) Estimates

Out-of-bag (OOB) samples are the rows left out of each tree's bootstrap sample. Scoring on them gives a performance estimate without a separate validation set; for regressors, oob_score_ reports the R² on these samples by default.

  • Usage:
python
reg = RandomForestRegressor(n_estimators=100, oob_score=True, bootstrap=True, random_state=42)
reg.fit(X, y)

# Print OOB score
print(f"OOB Score: {reg.oob_score_:.4f}")

  • Example Output:
python
OOB Score: 0.8973

4. Tree Growth Control

Control complexity to prevent overfitting.

  • Key Parameters:
    • max_depth
    • min_samples_leaf
    • max_leaf_nodes
  • Example:
python
reg = RandomForestRegressor(
    n_estimators=50,
    max_depth=5,
    min_samples_leaf=10,
    max_leaf_nodes=20,
    random_state=42
)
reg.fit(X, y)
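
To check whether these restrictions actually help, one common approach (a sketch on synthetic data; results depend on the dataset) is to compare cross-validated scores of a restricted and an unrestricted forest.

python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=42)

unrestricted = RandomForestRegressor(n_estimators=50, random_state=42)
restricted = RandomForestRegressor(
    n_estimators=50,
    max_depth=5,
    min_samples_leaf=10,
    max_leaf_nodes=20,
    random_state=42,
)

print(f"Unrestricted CV R^2: {cross_val_score(unrestricted, X, y, cv=5).mean():.3f}")
print(f"Restricted CV R^2:   {cross_val_score(restricted, X, y, cv=5).mean():.3f}")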

5. Warm Start

Add more trees incrementally without retraining from scratch.

  • Example:
python
# Initial training with 50 trees
reg = RandomForestRegressor(n_estimators=50, warm_start=True, random_state=42)
reg.fit(X, y)

# Increase the total to 100 trees; with warm_start the next fit() trains only the 50 new trees
reg.n_estimators = 100
reg.fit(X, y)

print(f"Total trees after warm start: {len(reg.estimators_)}")

Practical Notes

  • Overfitting Prevention: Use max_depth, min_samples_leaf, and max_features to constrain tree growth.
  • High-Dimensional Data: Handles many features well, but training time grows with the number of features.
  • Outliers: If the data has many outliers, consider criterion="absolute_error" for better robustness.
  • Missing Values: Random forests do not handle missing values automatically; preprocessing is required (see the sketch after this list).
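
For the missing-value point, below is a minimal preprocessing sketch that fills NaNs with SimpleImputer inside a Pipeline before fitting the forest (the synthetic data and median imputation are illustrative choices only).

python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X[::10, 0] = np.nan  # introduce some missing values for illustration

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill NaNs with the column median
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model.fit(X, y)

print(f"Training R^2: {model.score(X, y):.3f}")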

When to Use

RandomForestRegressor is ideal for:

  • Medium to large datasets.
  • Datasets with complex feature interactions.
  • Cases where model interpretability is less important than prediction accuracy.
  • Data with non-linear relationships.