FUNCTION
sklearn.tree
- Decision tree based models for classification and regression.
- Exporting
- Plotting
Study Note: sklearn.tree
The sklearn.tree module in scikit-learn provides decision tree algorithms and associated utilities. Below is a list of the classes, functions, and attributes available under sklearn.tree:
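The sklearn.tree module contains the following major model families:
| Category | Class | Description |
|---|---|---|
| 1. Decision Tree Models | DecisionTreeClassifier | A decision tree classifier. |
| | DecisionTreeRegressor | A decision tree regressor. |
| 2. Extra Tree Models | ExtraTreeClassifier | An extremely randomized tree classifier. |
| | ExtraTreeRegressor | An extremely randomized tree regressor. |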
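To make the model families above concrete, here is a minimal sketch that fits all four estimators on small synthetic datasets. The datasets, sizes, and random_state values are assumptions chosen only for illustration; they are not part of the original note.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import (
    DecisionTreeClassifier,
    DecisionTreeRegressor,
    ExtraTreeClassifier,
    ExtraTreeRegressor,
)

# Synthetic data (illustrative assumption, not from the note)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)

# Pair each estimator with a dataset that matches its task
models = {
    "DecisionTreeClassifier": (DecisionTreeClassifier(random_state=0), X_clf, y_clf),
    "DecisionTreeRegressor": (DecisionTreeRegressor(random_state=0), X_reg, y_reg),
    "ExtraTreeClassifier": (ExtraTreeClassifier(random_state=0), X_clf, y_clf),
    "ExtraTreeRegressor": (ExtraTreeRegressor(random_state=0), X_reg, y_reg),
}

depths = {}
for name, (model, X, y) in models.items():
    model.fit(X, y)
    depths[name] = model.get_depth()
    # Report the size of each fitted tree
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```

All four classes share the same fit/predict interface, so they can be swapped for one another within the same task type.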
Classes
1. Decision Tree Models
2. Extra Tree Models
Functions
Attributes
Modules for Integration
Quick Summary
Explanation of Key Parameters
export_graphviz parameters:
- clf: The trained decision tree model, passed as the first argument (named decision_tree in the function signature).
- out_file=None: Returns the DOT data as a string instead of saving it to a file.
- feature_names: Names of the features used in training the model.
- class_names: Names of the classes for classification, in ascending numerical order of the target labels.
- filled=True: Colors the nodes for better readability.
- rounded=True: Draws node boxes with rounded corners.
- special_characters=True: Keeps special characters in labels (e.g., in feature names); when False, they are ignored for PostScript compatibility.
- Graphviz Object:
  - graphviz.Source: Converts DOT data into a graph object that can be displayed or saved.
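The parameters above can be tried without installing Graphviz, because export_graphviz with out_file=None only returns a DOT string. The following is a minimal sketch assuming the iris dataset and a shallow tree (max_depth=2); both choices are illustrative and not taken from the note.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Illustrative assumption: a shallow tree on iris keeps the DOT output short
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

dot_data = export_graphviz(
    clf,
    out_file=None,                          # return the DOT source as a string
    feature_names=iris.feature_names,       # label splits with feature names
    class_names=list(iris.target_names),    # label nodes with class names
    filled=True,
    rounded=True,
    special_characters=True,
)

# Inspect the beginning of the DOT source
print(dot_data[:300])
```

The resulting string can then be passed to graphviz.Source, or written to a .dot file and rendered with the Graphviz command-line tools.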
Visualization Output
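- Color Nodes: Nodes are colored by majority class, and the color intensity reflects node purity (the impurity value, e.g., Gini or entropy, is printed in each node).
- Feature Names: Each split is annotated with the corresponding feature.
- Class Names: Each node shows its majority class; for leaf nodes, this is the predicted class.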
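Since node purity is measured by Gini impurity or entropy, a small worked example may help when reading the values printed in each node. This is a plain-Python sketch of the standard formulas; the helper names gini and entropy are hypothetical and are not scikit-learn functions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


mixed = ["default", "default", "ok", "ok"]  # evenly mixed node
pure = ["ok", "ok", "ok", "ok"]             # perfectly pure node

print("mixed:", gini(mixed), entropy(mixed))
print("pure: ", gini(pure), abs(entropy(pure)))
```

A pure node scores zero under both measures, and an evenly split two-class node reaches the two-class maximum (0.5 for Gini, 1 bit for entropy), which corresponds to the palest nodes in the plot.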
Notes
- Graphviz Installation:
  - You must have Graphviz installed on your system for graphviz.Source to work.
  - Install via package manager:
    # On Ubuntu/Debian:
    sudo apt-get install graphviz
    # On macOS:
    brew install graphviz
- Python Library Installation:
  - Install the Python wrapper for Graphviz:
    pip install graphviz
- Alternative:
  - Use tree.plot_tree() for a Matplotlib-based visualization:
    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree
    # Assumes clf was trained on the iris dataset loaded with sklearn.datasets.load_iris()
    plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
    plt.show()
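If neither Graphviz nor Matplotlib is available, sklearn.tree.export_text offers a text-only view of the same tree. It is not mentioned in the notes above and is shown here only as an additional option; the iris data and max_depth=2 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative assumption: small tree on iris
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented text rules
text_report = export_text(clf, feature_names=list(iris.feature_names))
print(text_report)
```

The output lists one rule per line, which makes it convenient for logs or quick inspection in a terminal.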
# Step 1: Import Necessary Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn import tree
import graphviz
# Step 2: Load and Prepare Dataset
# Assume `df` is your dataframe and 'status' is the target variable.
# Modify the dataset loading as per your specific project.
# Example: df = pd.read_csv('your_dataset.csv')
# Split the dataset into train and test datasets
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
# Further split the train dataset into train and validation datasets
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)
# Reset indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
# Extract target variables
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values
# Drop the target variable from features
df_train = df_train.drop(columns='status')
df_val = df_val.drop(columns='status')
df_test = df_test.drop(columns='status')
# Convert the datasets to dictionaries
dict_train = df_train.to_dict(orient='records')
dict_val = df_val.to_dict(orient='records')
# Step 3: Vectorize Features
dv = DictVectorizer(sparse=False)
# Fit on training data and transform both train and validation datasets
X_train = dv.fit_transform(dict_train)
X_val = dv.transform(dict_val)
# Extract feature names
feature_names = dv.get_feature_names_out()
# Step 4: Train the Decision Tree Model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Step 5: Evaluate Model Performance
# Compute ROC AUC Score
y_val_pred = clf.predict_proba(X_val)[:, 1] # Predict probabilities for positive class
roc_score = roc_auc_score(y_val, y_val_pred)
print(f"Validation ROC AUC Score: {roc_score:.2f}")
# Step 6: Export the Decision Tree for Visualization
dot_data = tree.export_graphviz(
    clf,
    out_file=None,  # No file output, return as string
    feature_names=feature_names,  # Use DictVectorizer feature names
    class_names=["not default", "default"],  # Modify according to your project
    filled=True,  # Color nodes
    rounded=True,  # Rounded corners
    special_characters=True,  # Allow special characters
)
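# Visualize the Decision Tree
graph = graphviz.Source(dot_data)
# In a Jupyter notebook, a bare `graph` as the last line of a cell displays the tree.
# In a plain script, save it instead, e.g. graph.render("decision_tree", format="png")
graph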