Decision tree models for classification and regression.

FUNCTION

sklearn.tree:

  • Decision tree based models for classification and regression.
  • Exporting
  • Plotting

Study Note: sklearn.tree

The sklearn.tree module in scikit-learn provides decision tree algorithms and associated utilities. Below is a list of the classes, functions, and attributes available under sklearn.tree:

The sklearn.tree module contains two major model families:

1. Decision Tree Models
  • DecisionTreeClassifier: A decision tree classifier.
  • DecisionTreeRegressor: A decision tree regressor.
2. Extra Tree Models
  • ExtraTreeClassifier: An extremely randomized tree classifier.
  • ExtraTreeRegressor: An extremely randomized tree regressor.
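As a quick orientation, the four estimators can be fitted side by side. This is a minimal sketch, assuming the bundled iris and diabetes datasets as stand-ins for real data; the max_depth=3 limit and the random_state values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import (
    DecisionTreeClassifier,
    DecisionTreeRegressor,
    ExtraTreeClassifier,
    ExtraTreeRegressor,
)

# Toy datasets shipped with scikit-learn (assumed here only for illustration)
X_cls, y_cls = load_iris(return_X_y=True)
X_reg, y_reg = load_diabetes(return_X_y=True)

models = {
    "DecisionTreeClassifier": (DecisionTreeClassifier(max_depth=3, random_state=0), X_cls, y_cls),
    "ExtraTreeClassifier": (ExtraTreeClassifier(max_depth=3, random_state=0), X_cls, y_cls),
    "DecisionTreeRegressor": (DecisionTreeRegressor(max_depth=3, random_state=0), X_reg, y_reg),
    "ExtraTreeRegressor": (ExtraTreeRegressor(max_depth=3, random_state=0), X_reg, y_reg),
}

scores = {}
for name, (model, X, y) in models.items():
    model.fit(X, y)
    # score() is accuracy for classifiers and R^2 for regressors (training data here)
    scores[name] = model.score(X, y)
    print(f"{name}: depth={model.get_depth()}, train score={scores[name]:.3f}")
```

The extra-tree variants pick split thresholds at random, so they are usually used inside ensembles rather than on their own.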

Classes


1. Decision Tree Models


2. Extra Tree Models


Functions


Attributes


Modules for Integration


Quick Summary

Explanation of Key Parameters

  1. export_graphviz Parameters:
    • clf: The trained decision tree model, passed as the first argument (named decision_tree in the function signature).
    • out_file=None: Returns the DOT data as a string instead of saving it to a file.
    • feature_names: Names of the features used in training the model.
    • class_names: Names of the classes for classification, in ascending order of the class labels.
    • filled=True: Colors each node by its majority class (for classification), which makes the tree easier to read.
    • rounded=True: Draws node boxes with rounded corners.
    • special_characters=True: Keeps special characters in node labels; set it to False only when PostScript compatibility is needed.
  2. Graphviz Object:
    • graphviz.Source: Converts DOT data into a graph object that can be displayed or saved.
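The parameters above can be tried without any dataset of your own. The following is a minimal sketch, assuming the bundled iris dataset and a shallow tree (max_depth=2 is an arbitrary choice); the graphviz import is optional, so the DOT string is still produced when the Python wrapper is not installed.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Small, well-known dataset used purely for illustration
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

dot_data = export_graphviz(
    clf,                                   # the fitted tree
    out_file=None,                         # return the DOT source as a string
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),   # ascending order of class labels
    filled=True,
    rounded=True,
    special_characters=True,
)
print(dot_data[:200])  # peek at the start of the DOT source

# Optional: wrap the DOT source in a graph object if the wrapper is available
try:
    import graphviz
    graph = graphviz.Source(dot_data)
except ImportError:
    graph = None
```

The DOT string is plain text, so it can also be written to a file and rendered later with the dot command-line tool.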

Visualization Output

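  • Color Nodes: With filled=True, each node is colored by its majority class, and the shade deepens as the node becomes purer (lower Gini or entropy).
  • Feature Names: Each split is annotated with the corresponding feature and threshold.
  • Class Names: Every node shows its majority class; at a leaf node this is the predicted class.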
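The information drawn in each node can also be read directly from the fitted estimator's tree_ attribute. This is a rough sketch, assuming the iris dataset and a depth-2 tree; the rows list is a hypothetical helper used only to collect the printed lines.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
t = clf.tree_  # low-level array representation of the fitted tree

rows = []
for node in range(t.node_count):
    is_leaf = t.children_left[node] == -1          # -1 marks a leaf
    majority = iris.target_names[np.argmax(t.value[node][0])]
    if is_leaf:
        desc = f"leaf, gini={t.impurity[node]:.3f}, class={majority}"
    else:
        feat = iris.feature_names[t.feature[node]]
        desc = (f"{feat} <= {t.threshold[node]:.2f}, "
                f"gini={t.impurity[node]:.3f}, class={majority}")
    rows.append(desc)
    print(f"node {node}: {desc}")
```

Lower impurity values correspond to the more deeply shaded nodes in the rendered graph.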

Notes

  1. Graphviz Installation:
    • You must have Graphviz installed on your system for graphviz.Source to render the graph.
    • Install via package manager:
    • bash
      # On Ubuntu/Debian:
      sudo apt-get install graphviz
      
      # On macOS:
      brew install graphviz
      
      
  2. Python Library Installation:
    • Install the Python wrapper for Graphviz:
    • bash
      pip install graphviz
      
      
  3. Alternative:
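    • Use tree.plot_tree() for a Matplotlib-based visualization (this snippet assumes clf was fit on the iris dataset loaded with sklearn.datasets.load_iris()):
    • import matplotlib.pyplot as plt
      from sklearn.tree import plot_tree
      plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
      plt.show()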
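For a version of the Matplotlib alternative that runs on its own, the sketch below fits a small tree on iris and saves the figure to disk. The Agg backend, the figure size, and the output name iris_tree.png are assumptions made for illustration; in a notebook the backend line can be dropped and plt.show() used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(8, 5))
annotations = plot_tree(
    clf,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
fig.savefig("iris_tree.png", dpi=150)  # hypothetical output file name
plt.close(fig)
```

Unlike the Graphviz route, this needs no system-level install beyond Matplotlib itself.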
      
      
# Step 1: Import Necessary Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn import tree
import graphviz

# Step 2: Load and Prepare Dataset
# Assume `df` is your dataframe and 'status' is the target variable.
# Modify the dataset loading as per your specific project.
# Example: df = pd.read_csv('your_dataset.csv')

# Split the dataset into train and test datasets
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)

# Further split the train dataset into train and validation datasets
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

# Reset indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Extract target variables
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values

# Drop the target variable from features
df_train = df_train.drop(columns='status')
df_val = df_val.drop(columns='status')
df_test = df_test.drop(columns='status')

# Convert the datasets to dictionaries
dict_train = df_train.to_dict(orient='records')
dict_val = df_val.to_dict(orient='records')

# Step 3: Vectorize Features
dv = DictVectorizer(sparse=False)

# Fit on training data and transform both train and validation datasets
X_train = dv.fit_transform(dict_train)
X_val = dv.transform(dict_val)

# Extract feature names as a plain list (some scikit-learn releases expect a list rather than a NumPy array)
feature_names = list(dv.get_feature_names_out())

# Step 4: Train the Decision Tree Model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate Model Performance
# Compute ROC AUC Score
y_val_pred = clf.predict_proba(X_val)[:, 1]  # Predict probabilities for positive class
roc_score = roc_auc_score(y_val, y_val_pred)
print(f"Validation ROC AUC Score: {roc_score:.2f}")

# Step 6: Export the Decision Tree for Visualization
dot_data = tree.export_graphviz(
    clf,
    out_file=None,                      # No file output, return as string
    feature_names=feature_names,       # Use DictVectorizer feature names
    class_names=["not default", "default"],  # Modify according to your project
    filled=True,                       # Color nodes
    rounded=True,                      # Rounded edges
    special_characters=True            # Allow special characters
)

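# Visualize the Decision Tree
graph = graphviz.Source(dot_data)
# In a Jupyter notebook, leaving `graph` as the last expression renders it inline.
# In a plain script, call graph.render("tree", format="png") to write tree.png instead.
graph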