Decision tree models for classification and regression.


🔍 Study Note: `sklearn.tree`

Source: scikit-learn.org (sklearn.tree API reference). See the Decision Trees section of the user guide for further details.

The sklearn.tree module in scikit-learn provides decision tree estimators for classification and regression, together with utilities for exporting and plotting fitted trees.

The sklearn.tree module contains two groups of models:

1. Decision Tree Models
	`DecisionTreeClassifier`	A decision tree classifier.
	`DecisionTreeRegressor`	A decision tree regressor.
2. Extra Tree Models
	`ExtraTreeClassifier`	An extremely randomized tree classifier.
	`ExtraTreeRegressor`	An extremely randomized tree regressor.
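As a minimal sketch of how the four estimators are used, the example below fits each one on the iris dataset bundled with scikit-learn. The dataset, the `max_depth=3` setting, and the choice of petal width as a regression target are assumptions made for illustration only. All four share the usual `fit` / `predict` interface; the extra-tree variants draw candidate splits at random instead of searching for the best threshold.

```python
from sklearn.datasets import load_iris
from sklearn.tree import (
    DecisionTreeClassifier,
    DecisionTreeRegressor,
    ExtraTreeClassifier,
    ExtraTreeRegressor,
)

X, y = load_iris(return_X_y=True)

# Classifiers: predict the species label from all four measurements
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
et_clf = ExtraTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Regressors: predict petal width (column 3) from the other three columns
X_reg, y_reg = X[:, :3], X[:, 3]
dt_reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)
et_reg = ExtraTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)

print(dt_clf.predict(X[:2]), et_clf.predict(X[:2]))
print(dt_reg.predict(X_reg[:2]), et_reg.predict(X_reg[:2]))
```

The single extra-tree estimators are mostly intended as building blocks for ensembles, so on their own they will often score lower than the plain decision tree.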

Quick Summary

Explanation of Key Parameters

export_graphviz Parameters:

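clf: The fitted decision tree model, passed as the first argument (named decision_tree in the function signature).
out_file=None: Returns the DOT data as a string instead of saving it to a file.
feature_names: Names of the features used in training the model, in column order.
class_names: Names of the classes for classification, in ascending order of the class labels.
filled=True: Colors each node by its majority class (classification) or by the extremity of its value (regression).
rounded=True: Draws the node boxes with rounded corners.
special_characters=True: Keeps special characters (e.g., ≤) in the labels instead of dropping them for PostScript compatibility.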
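The parameters above can be tried on a small tree without any Graphviz installation, because `out_file=None` only returns text. This is a sketch that assumes the iris dataset and a depth-2 tree purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

dot_data = export_graphviz(
    clf,
    out_file=None,                           # return the DOT source as a string
    feature_names=list(iris.feature_names),  # plain lists work across versions
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
    special_characters=True,
)

# Show the beginning of the DOT source
print(dot_data[:300])
```

The returned string is ordinary DOT text, so it can also be written to a `.dot` file and rendered later with the Graphviz command-line tools.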

Graphviz Object:

graphviz.Source: Wraps the DOT string in a graph object that Jupyter displays inline and that can be saved to a file.

Visualization Output

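Colored Nodes: With filled=True, each node is colored by its majority class, and the shade deepens as the node becomes purer (lower Gini or entropy).
Feature Names: Each split is annotated with the feature and threshold it tests.
Class Names: Every node shows its majority class; at a leaf, this is the predicted class.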
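The quantities behind that picture can be read directly from the fitted estimator's `tree_` attribute. The sketch below assumes the iris dataset and a depth-2 tree for illustration; it prints, for each node, the impurity and majority class that drive the coloring, and the feature used by each split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
for node in range(t.node_count):
    # Leaves have no children; scikit-learn marks this with -1
    is_leaf = t.children_left[node] == -1
    # value[node][0] holds the per-class distribution for this node
    majority = iris.target_names[np.argmax(t.value[node][0])]
    kind = "leaf" if is_leaf else f"split on {iris.feature_names[t.feature[node]]}"
    print(f"node {node}: gini={t.impurity[node]:.3f}, majority={majority}, {kind}")
```

A node with impurity 0 contains a single class, which is where the rendered tree shows its most saturated color.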

Notes

Graphviz Installation:

You must have Graphviz installed on your system for graphviz.Source to work.
Install via package manager:


# On Ubuntu/Debian:
sudo apt-get install graphviz

# On macOS:
brew install graphviz

Python Library Installation:

Install the Python wrapper for Graphviz:

pip install graphviz
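Both pieces are needed: the system package supplies the `dot` executable, and the pip package supplies the Python wrapper. One possible way to check for them from Python, using only the standard library, is sketched below; the variable names are illustrative.

```python
import importlib.util
import shutil

# `dot` is the Graphviz layout executable; which() returns None if it is not on PATH
dot_path = shutil.which("dot")
has_system_graphviz = dot_path is not None

# find_spec() returns None when the `graphviz` Python package is not installed
has_python_wrapper = importlib.util.find_spec("graphviz") is not None

print(f"Graphviz executable found: {has_system_graphviz} ({dot_path})")
print(f"Python graphviz package found: {has_python_wrapper}")
```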

Alternative:

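Use tree.plot_tree() for a Matplotlib-based visualization that does not require Graphviz:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Assumes `clf` was fitted on the iris dataset loaded as `iris = load_iris()`.
plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
plt.show()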
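When no plotting backend is available at all, `sklearn.tree.export_text` gives a plain-text view of the same tree. This is a sketch that assumes the iris dataset and a depth-2 tree for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One indented line per split, and one "class:" line per leaf
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```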

Full Example

# Step 1: Import Necessary Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn import tree
import graphviz

# Step 2: Load and Prepare Dataset
# Assume `df` is your dataframe and 'status' is the target variable.
# Modify the dataset loading as per your specific project.
# Example: df = pd.read_csv('your_dataset.csv')

# Split the dataset into train and test datasets
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)

# Further split the train dataset into train and validation datasets
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

# Reset indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Extract target variables
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values

# Drop the target variable from features
df_train = df_train.drop(columns='status')
df_val = df_val.drop(columns='status')
df_test = df_test.drop(columns='status')

# Convert the datasets to dictionaries
dict_train = df_train.to_dict(orient='records')
dict_val = df_val.to_dict(orient='records')

# Step 3: Vectorize Features
dv = DictVectorizer(sparse=False)

# Fit on training data and transform both train and validation datasets
X_train = dv.fit_transform(dict_train)
X_val = dv.transform(dict_val)

# Extract feature names as a plain list (avoids array-handling differences between scikit-learn versions)
feature_names = list(dv.get_feature_names_out())

# Step 4: Train the Decision Tree Model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate Model Performance
# Compute ROC AUC Score
y_val_pred = clf.predict_proba(X_val)[:, 1]  # Predict probabilities for positive class
roc_score = roc_auc_score(y_val, y_val_pred)
print(f"Validation ROC AUC Score: {roc_score:.2f}")

# Step 6: Export the Decision Tree for Visualization
dot_data = tree.export_graphviz(
    clf,
    out_file=None,                      # No file output, return as string
    feature_names=feature_names,       # Use DictVectorizer feature names
    class_names=["not default", "default"],  # Modify according to your project
    filled=True,                       # Color nodes
    rounded=True,                      # Rounded edges
    special_characters=True            # Allow special characters
)

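# Visualize the Decision Tree
graph = graphviz.Source(dot_data)
graph  # As the last expression in a Jupyter cell, this displays the tree inline
# Outside a notebook, save it instead (requires the Graphviz system package):
# graph.render("decision_tree", format="png")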