Decision Trees: An Overview and Practical Guide

Decision trees are a powerful tool for supervised learning and can be used to solve a wide range of problems, including classification and regression. A decision tree is a tree-like model that makes decisions by mapping input data to output labels or numerical values based on a set of rules learned from the training data.

Decision Tree Family in Machine Learning

In machine learning, decision trees belong to the family of algorithms known as “Tree-based models” or “Tree-based methods.” This family includes different variations and enhancements of the basic decision tree model. Some key members of the decision tree family are:

  • Decision Tree (Basic): The fundamental decision tree model we previously described, where the tree is built by recursively splitting the data based on feature values to make predictions for classification or regression tasks.
  • Random Forest: Random Forest is a machine learning algorithm that builds multiple decision trees and aggregates their predictions to improve accuracy and avoid overfitting. It creates a “forest” of decision trees, where each tree is trained on a random subset of the data and a random subset of features. The final prediction is a consensus of the predictions of all the individual trees.
  • Gradient Boosting Trees: Gradient Boosting is another ensemble method that builds decision trees sequentially, where each new tree corrects the errors of the previous ones. It combines weak learners (shallow decision trees) to form a strong predictive model. Popular implementations include XGBoost, LightGBM, and CatBoost (a minimal instantiation sketch of these libraries follows this list).
    1. Extreme Gradient Boosting (XGBoost): XGBoost is an optimized implementation of gradient boosting that includes additional features like regularization, handling missing values, and parallel processing, making it a powerful and widely used algorithm for structured/tabular data.
    2. LightGBM: LightGBM is another gradient-boosting framework that aims to improve training speed and memory efficiency compared to traditional gradient-boosting implementations. It achieves this through novel techniques such as histogram-based splitting and leaf-wise tree growth.
    3. CatBoost: CatBoost is a gradient boosting library specifically designed to handle categorical features efficiently without the need for extensive preprocessing. It is robust against overfitting and typically requires less hyperparameter tuning.
  • CART (Classification and Regression Trees): CART is the classic algorithm for building decision trees that handle both classification and regression. It is the underlying algorithm for most decision tree implementations.
  • MART (Multiple Additive Regression Trees): MART is a variant of gradient boosting specifically tailored for regression tasks. It combines the outputs of multiple regression trees to improve prediction accuracy.
  • Conditional Inference Trees: These are decision trees constructed based on statistical tests to ensure that splits are only performed when they are statistically significant.
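
As a rough illustration, the three boosting libraries mentioned above all expose a scikit-learn-style interface. The following is a minimal instantiation sketch; the hyperparameter values are illustrative placeholders rather than tuned recommendations, and the libraries are assumed to be installed.

# Illustrative instantiation of the three gradient-boosting libraries.
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb_clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
lgbm_clf = LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.1)
cat_clf = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.1, verbose=0)

# All three follow the familiar fit/predict pattern, e.g.:
# xgb_clf.fit(X_train, y_train); predictions = xgb_clf.predict(X_test)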

Components of Decision Trees

Decision trees consist of several components that define their structure and predictive capabilities:

  • Root Node: The starting point of the tree, representing the entire dataset. The first split is made here, on the feature that best separates the training data.
  • Internal Nodes: These nodes represent attribute tests and result in further splits based on the values of specific features.
  • Leaf Nodes: Terminal nodes that do not get split further. They contain the final predictions or values for the target variable.
  • Branches: Edges connecting the nodes represent the decision paths based on feature values.

Making Predictions with Decision Trees

To make predictions using a decision tree, input data follows the decision paths from the root node to the appropriate leaf node. The decision-making process is straightforward and interpretable, making it easy to understand the reasoning behind predictions.
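
To make the traversal concrete, here is a tiny, hand-rolled sketch of a hypothetical two-level tree for a loan decision; the feature names and thresholds are invented purely for illustration (real libraries such as scikit-learn perform this traversal internally via predict()).

# A hypothetical two-level decision tree, written out by hand for illustration.
def predict_one(sample):
    # Root node: test the first feature.
    if sample["age"] <= 30:
        # Internal node: test a second feature.
        if sample["income"] <= 40_000:
            return "reject"    # leaf node
        return "approve"       # leaf node
    return "approve"           # leaf node

print(predict_one({"age": 25, "income": 55_000}))  # root -> internal node -> leaf: "approve"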

Advantages of Decision Trees

  • Interpretability: Decision trees offer a transparent and human-readable representation, making them ideal for scenarios where explainability is crucial.
  • Handle Nonlinearity: Decision trees can capture complex relationships between features and the target variable without requiring extensive preprocessing.
  • Can Handle Mixed Data Types: Decision trees can handle both numerical and categorical features without requiring extensive feature engineering.
  • No Data Normalization Required: Decision trees are unaffected by data scaling, making them robust to variations in the feature scales.
  • Applicability to Various Tasks: Decision trees are versatile and can handle classification, regression, and anomaly detection tasks.

Limitations of Decision Trees

  • Overfitting: Decision trees can easily overfit the training data if not pruned or limited in depth, leading to poor generalization to unseen data.
  • Instability: Small changes in the training data can lead to significantly different decision trees, affecting the model’s robustness.
  • Lack of Smoothness: Decision trees create piecewise constant predictions, which can result in discontinuities and jaggedness.
  • Bias towards Features with Many Levels: Decision trees with information gain or Gini impurity may favor features with many levels over those with fewer levels.
  • Limited Expressiveness: A single decision tree might not capture complex relationships present in the data, requiring ensemble methods for better performance.

Ensemble Methods with Decision Trees

Ensemble methods combine multiple decision trees to create more robust and accurate models. Two popular ensemble methods are Random Forest and Gradient Boosting.

  • Random Forest: Random Forest creates an ensemble of decision trees, where each tree is trained on a random subset of the data and features. The final prediction is obtained by aggregating the predictions of all individual trees. This technique reduces overfitting and improves generalization, making it suitable for high-dimensional datasets. Moreover, Random Forest can handle missing data and assess feature importance, aiding in better feature selection.
  • Gradient Boosting: Gradient Boosting builds decision trees sequentially, with each tree correcting the errors of the previous ones. It combines weak learners to form a strong predictive model. This method often outperforms single decision trees and other machine learning algorithms. Variants like XGBoost, LightGBM, and CatBoost enhance traditional Gradient Boosting with faster computation and better memory efficiency (see the sketch below).
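
A minimal sketch of both ensembles using scikit-learn's built-in implementations; the synthetic dataset and hyperparameters below are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X_demo, y_demo = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Bagging-style ensemble: many trees on bootstrapped samples and random feature subsets.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Boosting-style ensemble: shallow trees added sequentially to correct earlier errors.
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0).fit(X_tr, y_tr)

print("Random Forest accuracy:    ", rf.score(X_te, y_te))
print("Gradient Boosting accuracy:", gb.score(X_te, y_te))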

Pruning Decision Trees

Decision trees can easily overfit the training data, leading to poor generalization. To mitigate this, pruning techniques are applied to remove branches that do not significantly improve performance on held-out validation data. One such method is cost-complexity pruning, which adds a penalty term proportional to the tree's size (its number of leaves), encouraging simpler trees that generalize better.
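
A minimal sketch of cost-complexity pruning using scikit-learn's ccp_alpha parameter; the dataset and the alpha value chosen here are purely illustrative (in practice, alpha is tuned on validation data).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pruning path lists the effective alpha values for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
print("candidate alphas:", path.ccp_alphas[:5])

# Unpruned tree vs. a tree pruned with an (arbitrarily chosen) alpha.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

print("unpruned leaves:", unpruned.get_n_leaves(), "test accuracy:", unpruned.score(X_te, y_te))
print("pruned leaves:  ", pruned.get_n_leaves(), "test accuracy:", pruned.score(X_te, y_te))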

Handling Imbalanced Data

In classification tasks with imbalanced class distributions, decision trees may be biased towards the majority class. To address this issue, techniques like class weighting, resampling, or using different splitting criteria can be employed to ensure fair representation of all classes.
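
As a small sketch of the class-weighting option in scikit-learn (the synthetic 95/5 imbalanced dataset below is purely illustrative):

from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, heavily imbalanced data, purely for illustration.
X_demo, y_demo = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# class_weight='balanced' re-weights samples inversely to class frequency.
weighted = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

print("plain tree   :", balanced_accuracy_score(y_te, plain.predict(X_te)))
print("weighted tree:", balanced_accuracy_score(y_te, weighted.predict(X_te)))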

Feature Selection and Importance

Decision trees provide a measure of feature importance, indicating how much each feature contributes to the model’s predictions. This information can guide feature selection and aid in understanding the key drivers behind the target variable. Selecting the most informative features can improve model performance and reduce computation time.
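
A short sketch of how impurity-based importances can be read off a fitted scikit-learn tree (the dataset here is illustrative):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# feature_importances_ measures how much each feature reduces impurity across all its splits.
importances = pd.Series(clf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))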

Interpretability vs. Model Complexity

The simplicity and interpretability of decision trees come at the cost of limited expressiveness compared to more complex models like deep neural networks. While decision trees are excellent at handling tabular data and providing transparent decisions, they might not capture highly intricate relationships present in the data. In such cases, a trade-off between interpretability and model complexity should be considered based on the problem requirements.

Let’s now move on to the code.

Dataset being used

Name: Pokerhand dataset

Size: 829,201 rows, 11 columns (including the index column)

Feature names: ['s1', 'r1', 's2', 'r2', 's3', 'r3', 's4', 'r4', 's5', 'r5']

Target name: 'Class'

Importing necessary libraries

import os
import warnings
from time import time

import numpy as np
import pandas as pd
import xgboost as xgb
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

from dataheroes import CoresetTreeServiceDTC

Data Loading and Pre-processing

X, y = fetch_openml("pokerhand", return_X_y=True)
  • This code loads the PokerHand dataset from the OpenML dataset repository. The dataset contains information about the cards in a poker hand and the resulting hand. The code stores the data in the X and y variables.
one_hot = OneHotEncoder(sparse=False)
one_hot.fit(X)
X = one_hot.transform(X)
X = X.astype(np.int8)
  • The sparse=False argument tells the OneHotEncoder class to return a dense array rather than a sparse matrix. (In scikit-learn 1.2 and later this argument is named sparse_output.)
  • The X = X.astype(np.int8) line casts the one-hot-encoded matrix to the np.int8 data type. Because the encoded values are only 0s and 1s, this keeps the dense matrix compact in memory.
label_enc = LabelEncoder()
y = label_enc.fit_transform(y)
y[y > 3] = 4
  • This code label-encodes the target variable. This means that the target variable is converted into a series of integers, where each integer represents a different class. The LabelEncoder class is used to perform this task.
  • Any class labels greater than 3 are set to 4, which groups the rare higher-ranking poker hands into a single class and reduces the number of target classes.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree_query_level = 1
  • This code applies standard scaling to the X dataset, so that each feature has a mean of 0 and a standard deviation of 1. The StandardScaler class is used to perform this task.
  • The train_test_split() function splits the X and y datasets into train and test sets. The test_size=0.2 argument reserves 20% of the data for the test set, and random_state=42 fixes the random seed so the split is reproducible.
  • tree_query_level = 1 sets the coreset tree level that will be passed to get_coreset() later.

Building Coreset

t = time()
service_obj = CoresetTreeServiceDTC(optimized_for='training',
                                    n_classes=7,
                                    n_instances=663_360,
                                    coreset_size=120_000)
service_obj.build(X_train, y_train)
coreset_build_secs = time() - t
print(f"Coreset build time (sec): {coreset_build_secs:.2f}")
  • The code snippet builds a coreset for a decision tree classifier using the CoresetTreeServiceDTC class. The coreset is built using the training data, and the time it takes to build the coreset is printed to the console.

Training model on coreset

t = time()
# Get the coreset from the tree.
coreset = service_obj.get_coreset(level=tree_query_level)
indices_coreset_, X_coreset, y_coreset = coreset['data']
w_coreset = coreset['w']
decisiontreeclassifier_coreset_model = DecisionTreeClassifier()
decisiontreeclassifier_coreset_model.fit(X_coreset, y_coreset, sample_weight=w_coreset)
n_samples_coreset = len(y_coreset)
decisiontreeclassifier_coreset_secs = time() - t
  • The code snippet gets a coreset from the coreset tree, trains a decision tree classifier on it using the coreset sample weights, and records the time it took to train the model.

Training model on Full Dataset

t = time()
decisiontreeclassifier_full_model = DecisionTreeClassifier()
decisiontreeclassifier_full_model.fit(X_train, y_train)
decisiontree_classifier_full_secs = time() - t
  • The code snippet trains a decision tree classifier on the full training set and records the time it took to train the model.

Model Evaluation

coreset_score = balanced_accuracy_score(y_test, decisiontreeclassifier_coreset_model.predict(X_test))

decisiontreeclassifier_full_score = balanced_accuracy_score(y_test, decisiontreeclassifier_full_model.predict(X_test))
  • Calculating the balanced accuracy for the model trained on the coreset and the model trained on the full dataset (a comparison printout sketch follows).
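
A short comparison printout, reusing the timing, sample-count, and accuracy variables computed above, could look like the following sketch:

# Compare the coreset-trained model with the model trained on the full dataset.
print(f"Full dataset : {len(y_train):>7} samples | "
      f"train time {decisiontree_classifier_full_secs:.2f}s | "
      f"balanced accuracy {decisiontreeclassifier_full_score:.4f}")
print(f"Coreset      : {n_samples_coreset:>7} samples | "
      f"train time {decisiontreeclassifier_coreset_secs:.2f}s | "
      f"balanced accuracy {coreset_score:.4f}")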

Conclusion

In conclusion, decision trees offer a powerful and interpretable solution for both classification and regression tasks in machine learning. With their ability to handle mixed data types and minimal data preprocessing requirements, decision trees are versatile and applicable to various domains. However, careful consideration of their limitations, such as overfitting and lack of expressiveness, is essential. By combining pruning, ensemble techniques, and effective data handling, decision trees become valuable tools for data-driven decision-making and contribute significantly to the success of machine learning projects.
