The first step in building any machine learning model is to load and prepare the data. Here, we use datasets from sklearn.datasets, which provides several built-in datasets commonly used for testing and evaluation.
The script supports any dataset available in sklearn.datasets with a load_ prefix. For example:
- wine: Classifies different types of wines based on chemical properties.
- iris: Classifies iris flowers into species based on sepal and petal measurements.
- digits: Classifies images of handwritten digits (0-9).
- breast_cancer: Classifies whether breast cancer is malignant or benign.
- diabetes: Regression dataset predicting disease progression. (The tree built here is a classifier, so the classification datasets above are the natural fit.)
To specify the dataset, use the --dataset argument. For example:
python decision_tree.py --dataset wine
Once the dataset is loaded, we split it into training and testing sets using train_test_split from sklearn.model_selection. This ensures that the model is evaluated on unseen data for a fair assessment of its performance.
Code Explanation:
- data, target: The features (data) and labels (target) are extracted from the dataset.
- train_test_split: The data is split into:
  - Training set: Used to train the decision tree.
  - Testing set: Used to evaluate its accuracy.
- test_size: The proportion of the dataset allocated to testing. For example, test_size=0.25 reserves 25% of the data for testing.
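The full script parses --test_size with a small converter, so the value can be either a fraction or a whole number; train_test_split treats a float as a proportion and an int as an absolute number of test samples. A minimal sketch of that parsing in isolation, where parse_test_size is a hypothetical helper name introduced here for illustration:

```python
import argparse


def parse_test_size(text):
    # Hypothetical helper mirroring the script's lambda:
    # all-digit strings become an int (absolute number of test samples),
    # anything else is parsed as a float (fraction of the dataset).
    return int(text) if text.isdigit() else float(text)


parser = argparse.ArgumentParser()
parser.add_argument("--test_size", default=0.25, type=parse_test_size)

print(parser.parse_args(["--test_size", "0.25"]).test_size)  # float fraction
print(parser.parse_args(["--test_size", "50"]).test_size)    # int sample count
```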
Code:
# Load the dataset
data, target = getattr(sklearn.datasets, "load_{}".format(args.dataset))(return_X_y=True)

# Split data into training and testing sets
train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
    data, target, test_size=args.test_size, random_state=args.seed)
The core of the implementation is the DecisionTree class. This class builds the tree, splits nodes, and predicts targets.
Each node of the tree is represented as an instance of the Node class. A node can either be a leaf or a decision node.
class Node:
    def __init__(self, instances, prediction):
        self.is_leaf = True  # Start as a leaf
        self.instances = instances  # Indices of data points in the node
        self.prediction = prediction  # Most frequent class label

    def split(self, feature, value, left, right):
        self.is_leaf = False  # Becomes a decision node
        self.feature = feature  # Splitting feature
        self.value = value  # Threshold value for splitting
        self.left = left  # Left child node
        self.right = right  # Right child node
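To see how a node changes state, the following sketch re-declares the Node class from above and turns a leaf into a decision node by hand. The instance indices, feature number, and threshold are made-up values for illustration.

```python
class Node:
    def __init__(self, instances, prediction):
        self.is_leaf = True
        self.instances = instances
        self.prediction = prediction

    def split(self, feature, value, left, right):
        self.is_leaf = False
        self.feature = feature
        self.value = value
        self.left = left
        self.right = right


# Made-up example: six training rows, majority class 0 at the root.
root = Node(instances=[0, 1, 2, 3, 4, 5], prediction=0)
print(root.is_leaf)  # True: every node starts as a leaf

# Split on feature 2 at threshold 1.5; rows 0-2 go left, rows 3-5 go right.
root.split(feature=2, value=1.5,
           left=Node([0, 1, 2], prediction=0),
           right=Node([3, 4, 5], prediction=1))
print(root.is_leaf, root.left.is_leaf, root.right.prediction)  # False True 1
```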
The DecisionTree class handles tree construction through recursive or adaptive splitting.
Criteria for Splitting
Two common criteria for splitting are:
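1. Gini Impurity: $G = \sum_i p_i (1 - p_i)$, where $p_i$ is the proportion of class $i$ in the node.
2. Entropy: $H = -\sum_i p_i \log p_i$, with the same $p_i$.
The methods below return these values multiplied by the number of instances in the node, so the criteria of two child nodes can be added directly.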
Code:
def _criterion_gini(self, instances):
    # Class counts in the node; the result is node size times Gini impurity
    bins = np.bincount(self._targets[instances])
    return np.sum(bins * (1 - bins / len(instances)))

def _criterion_entropy(self, instances):
    # Drop empty classes to avoid log(0); the result is node size times entropy
    bins = np.bincount(self._targets[instances])
    bins = bins[np.nonzero(bins)]
    return -np.sum(bins * np.log(bins / len(instances)))
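As a quick check of the two criteria, the sketch below computes them in plain Python for a small list of labels. It also prints the value multiplied by the node size, which is what the NumPy methods above return. The function names and the example labels are assumptions made for illustration.

```python
import math
from collections import Counter


def gini(labels):
    # Sum over classes of p_i * (1 - p_i).
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def entropy(labels):
    # Negative sum over classes of p_i * ln(p_i); absent classes never appear.
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


labels = [0, 0, 1, 1]               # two classes, perfectly mixed
print(gini(labels))                 # 0.5
print(entropy(labels))              # ln(2), about 0.693
print(len(labels) * gini(labels))   # size-weighted value, 2.0
print(gini([1, 1, 1]))              # 0.0 for a pure node
```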
Recursive Splitting
Recursive splitting divides nodes based on the criterion that reduces impurity the most.
- Find the best feature and value to split the data.
- Split the data into left and right subsets.
- Recursively repeat for each child node.
Code:
def _split_recursively(self, node, depth):
    # Stop when the node may not be split any further
    if not self._can_split(node, depth):
        return

    _, feature, value, left, right = self._best_split(node)
    node.split(feature, value, self._leaf(left), self._leaf(right))
    self._split_recursively(node.left, depth + 1)
    self._split_recursively(node.right, depth + 1)
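The recursion stops when _can_split returns False. That check appears only in the complete listing; a standalone sketch of the same three conditions, written here as a free function with explicit arguments (the signature is an assumption for illustration), might look like this:

```python
def can_split(labels, depth, max_depth=None, min_to_split=2):
    # 1. Depth limit not reached (None means unlimited).
    depth_ok = max_depth is None or depth < max_depth
    # 2. Enough instances in the node.
    size_ok = len(labels) >= min_to_split
    # 3. Node is not already pure (more than one distinct label).
    impure = len(set(labels)) > 1
    return depth_ok and size_ok and impure


print(can_split([0, 1, 1], depth=0))               # True
print(can_split([1, 1, 1], depth=0))               # False: pure node
print(can_split([0, 1], depth=3, max_depth=3))     # False: depth limit
print(can_split([0, 1], depth=0, min_to_split=5))  # False: too few instances
```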
Adaptive Splitting
Adaptive splitting controls the maximum number of leaves. It uses a priority queue to ensure the most beneficial splits are applied first.
Code:
def _split_adaptively(self):
    def split_value(node, index, depth):
        # Heap entry: criterion change first, then a unique index as tie-breaker
        best_split = self._best_split(node)
        return (best_split[0], index, depth, node, *best_split[1:])

    heap = [split_value(self._root, 0, 0)]
    for i in range(self._max_leaves - 1):
        _, _, depth, node, feature, value, left, right = heapq.heappop(heap)
        node.split(feature, value, self._leaf(left), self._leaf(right))
        if self._can_split(node.left, depth + 1):
            heapq.heappush(heap, split_value(node.left, 2 * i + 1, depth + 1))
        if self._can_split(node.right, depth + 1):
            heapq.heappush(heap, split_value(node.right, 2 * i + 2, depth + 1))
        if not heap:
            break
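The ordering of the heap entries is what makes this work: heapq pops the smallest tuple, and the first element is the change in criterion (negative when a split reduces impurity), so the split with the largest reduction comes out first. The second element is a unique index that settles ties before Python would try to compare Node objects. A small sketch with made-up numbers and string placeholders instead of nodes:

```python
import heapq

# (criterion change, unique index, payload); all values are made up.
heap = []
heapq.heappush(heap, (-0.8, 1, "split A"))
heapq.heappush(heap, (-2.5, 2, "split B"))
heapq.heappush(heap, (-0.8, 3, "split C"))  # ties with A on the first element

order = [heapq.heappop(heap)[2] for _ in range(3)]
print(order)  # ['split B', 'split A', 'split C']
```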
The best split is determined by evaluating all possible thresholds for all features and selecting the one that minimizes impurity.
Code:
def _best_split(self, node):
    best_criterion = None
    for feature in range(self._data.shape[1]):
        # Order the node's instances by the value of the current feature
        sorted_indices = node.instances[np.argsort(self._data[node.instances, feature])]
        for i in range(len(sorted_indices) - 1):
            # No threshold can separate two equal values
            if self._data[sorted_indices[i], feature] == self._data[sorted_indices[i + 1], feature]:
                continue
            # Candidate threshold: midpoint between neighbouring values
            value = (self._data[sorted_indices[i], feature] + self._data[sorted_indices[i + 1], feature]) / 2
            left, right = sorted_indices[:i + 1], sorted_indices[i + 1:]
            criterion = self._criterion(left) + self._criterion(right)
            if best_criterion is None or criterion < best_criterion:
                best_criterion, best_feature, best_value, best_left, best_right = \
                    criterion, feature, value, left, right

    return best_criterion - self._criterion(node.instances), best_feature, best_value, best_left, best_right
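For a single feature, the search can be traced by hand. The sketch below is a plain-Python simplification, not the script's NumPy code: it sorts one made-up feature column, tries the midpoint between each pair of distinct neighbouring values, and keeps the threshold with the lowest summed size-weighted Gini of the two sides. The helper names are assumptions made for illustration.

```python
from collections import Counter


def weighted_gini(labels):
    # Node size times Gini impurity, so two children can simply be added.
    n = len(labels)
    return sum(c * (1 - c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [label for _, label in pairs[:i + 1]]
        right = [label for _, label in pairs[i + 1:]]
        score = weighted_gini(left) + weighted_gini(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


# Made-up data: class 0 has small values, class 1 large ones.
print(best_threshold([1.0, 2.0, 2.0, 6.0, 7.0], [0, 0, 0, 1, 1]))  # (0.0, 4.0)
```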
To predict, each data point traverses the tree until it reaches a leaf node. The prediction is the most frequent class label in that leaf.
Code:
def predict(self, data):
    results = np.zeros(len(data), dtype=np.int32)
    for i in range(len(data)):
        # Walk down from the root until a leaf is reached
        node = self._root
        while not node.is_leaf:
            node = node.left if data[i][node.feature] <= node.value else node.right
        results[i] = node.prediction
    return results
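The traversal can be tried on a hand-built tree. In this sketch the tree is written as nested dictionaries rather than Node objects, purely to keep the example short; the features, thresholds, and class labels are made up.

```python
# A tiny hand-built tree: test feature 0 first, then feature 1 on the right.
tree = {
    "feature": 0, "value": 2.5,
    "left": {"prediction": 0},
    "right": {
        "feature": 1, "value": 1.0,
        "left": {"prediction": 1},
        "right": {"prediction": 2},
    },
}


def predict_one(node, row):
    # Descend until a leaf (a node carrying a prediction) is reached.
    while "prediction" not in node:
        branch = "left" if row[node["feature"]] <= node["value"] else "right"
        node = node[branch]
    return node["prediction"]


rows = [[1.0, 5.0], [3.0, 0.5], [3.0, 4.0]]
print([predict_one(tree, row) for row in rows])  # [0, 1, 2]
```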
We evaluate the model's accuracy on both the training and testing datasets.
Code:
train_accuracy = sklearn.metrics.accuracy_score(train_target, decision_tree.predict(train_data))
test_accuracy = sklearn.metrics.accuracy_score(test_target, decision_tree.predict(test_data))

print("Train accuracy: {:.1f}%".format(100 * train_accuracy))
print("Test accuracy: {:.1f}%".format(100 * test_accuracy))
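Accuracy is the fraction of predictions that match the true labels. A plain-Python equivalent of that calculation for single-label classification, using made-up labels and a hypothetical accuracy helper:

```python
def accuracy(targets, predictions):
    # Fraction of positions where the prediction equals the true label.
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)


targets = [0, 1, 1, 2]
predictions = [0, 1, 2, 2]
print("Accuracy: {:.1f}%".format(100 * accuracy(targets, predictions)))  # Accuracy: 75.0%
```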
Here's the complete implementation:
#!/usr/bin/env python3
import argparse
import heapq
import subprocess

import numpy as np
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

parser = argparse.ArgumentParser()
parser.add_argument("--criterion", default="gini", type=str, help="Criterion to use; either `gini` or `entropy`")
parser.add_argument("--dataset", default="wine", type=str, help="Dataset to use")
parser.add_argument("--max_depth", default=None, type=int, help="Maximum decision tree depth")
parser.add_argument("--max_leaves", default=None, type=int, help="Maximum number of leaf nodes")
parser.add_argument("--min_to_split", default=2, type=int, help="Minimum examples required to split")
parser.add_argument("--recodex", default=False, action="store_true", help="Running in ReCodEx")
parser.add_argument("--seed", default=42, type=int, help="Random seed")
parser.add_argument("--test_size", default=0.25, type=lambda x: int(x) if x.isdigit() else float(x), help="Test size")
parser.add_argument("--plot", default=False, const=True, nargs="?", type=str, help="Plot the predictions")

class DecisionTree:
    class Node:
        def __init__(self, instances, prediction):
            self.is_leaf = True
            self.instances = instances
            self.prediction = prediction

        def split(self, feature, value, left, right):
            self.is_leaf = False
            self.feature = feature
            self.value = value
            self.left = left
            self.right = right

    def __init__(self, criterion, max_depth, min_to_split, max_leaves):
        self._criterion = getattr(self, "_criterion_" + criterion)
        self._max_depth = max_depth
        self._min_to_split = min_to_split
        self._max_leaves = max_leaves

    def fit(self, data, targets):
        self._data = data
        self._targets = targets

        self._root = self._leaf(np.arange(len(self._data)))
        if self._max_leaves is None:
            self._split_recursively(self._root, 0)
        else:
            self._split_adaptively()

    def predict(self, data):
        results = np.zeros(len(data), dtype=np.int32)
        for i in range(len(data)):
            node = self._root
            while not node.is_leaf:
                node = node.left if data[i][node.feature] <= node.value else node.right
            results[i] = node.prediction
        return results

    def _split_recursively(self, node, depth):
        if not self._can_split(node, depth):
            return

        _, feature, value, left, right = self._best_split(node)
        node.split(feature, value, self._leaf(left), self._leaf(right))
        self._split_recursively(node.left, depth + 1)
        self._split_recursively(node.right, depth + 1)

    def _split_adaptively(self):
        def split_value(node, index, depth):
            best_split = self._best_split(node)
            return (best_split[0], index, depth, node, *best_split[1:])

        heap = [split_value(self._root, 0, 0)]
        for i in range(self._max_leaves - 1):
            _, _, depth, node, feature, value, left, right = heapq.heappop(heap)
            node.split(feature, value, self._leaf(left), self._leaf(right))
            if self._can_split(node.left, depth + 1):
                heapq.heappush(heap, split_value(node.left, 2 * i + 1, depth + 1))
            if self._can_split(node.right, depth + 1):
                heapq.heappush(heap, split_value(node.right, 2 * i + 2, depth + 1))
            if not heap:
                break

    def _can_split(self, node, depth):
        return (
            (self._max_depth is None or depth < self._max_depth) and
            len(node.instances) >= self._min_to_split and
            not np.array_equiv(self._targets[node.instances], node.prediction)
        )

    def _best_split(self, node):
        best_criterion = None
        for feature in range(self._data.shape[1]):
            sorted_indices = node.instances[np.argsort(self._data[node.instances, feature])]
            for i in range(len(sorted_indices) - 1):
                if self._data[sorted_indices[i], feature] == self._data[sorted_indices[i + 1], feature]:
                    continue
                value = (self._data[sorted_indices[i], feature] + self._data[sorted_indices[i + 1], feature]) / 2
                left, right = sorted_indices[:i + 1], sorted_indices[i + 1:]
                criterion = self._criterion(left) + self._criterion(right)
                if best_criterion is None or criterion < best_criterion:
                    best_criterion, best_feature, best_value, best_left, best_right = \
                        criterion, feature, value, left, right

        return best_criterion - self._criterion(node.instances), best_feature, best_value, best_left, best_right

    def _leaf(self, instances):
        return self.Node(instances, np.argmax(np.bincount(self._targets[instances])))

    def _criterion_gini(self, instances):
        bins = np.bincount(self._targets[instances])
        return np.sum(bins * (1 - bins / len(instances)))

    def _criterion_entropy(self, instances):
        bins = np.bincount(self._targets[instances])
        bins = bins[np.nonzero(bins)]
        return -np.sum(bins * np.log(bins / len(instances)))

def main(args: argparse.Namespace) -> tuple[float, float]:
    # Load the dataset and split it into training and testing sets
    data, target = getattr(sklearn.datasets, "load_{}".format(args.dataset))(return_X_y=True)
    train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
        data, target, test_size=args.test_size, random_state=args.seed)

    # Build the tree and measure its accuracy
    decision_tree = DecisionTree(args.criterion, args.max_depth, args.min_to_split, args.max_leaves)
    decision_tree.fit(train_data, train_target)
    train_accuracy = sklearn.metrics.accuracy_score(train_target, decision_tree.predict(train_data))
    test_accuracy = sklearn.metrics.accuracy_score(test_target, decision_tree.predict(test_data))

    if args.plot:
        # Render the tree with Graphviz `dot`
        classes = np.max(target) + 1
        feature_names = getattr(sklearn.datasets, "load_{}".format(args.dataset))().feature_names
        dot = ["digraph Tree {node [shape=box]; bgcolor=invis;"]

        def plot(index, node, parent):
            if parent is not None: dot.append("{} -> {}".format(parent, index))
            dot.append("{} [fontname=\"serif\"; label=\"{}c_{} = {:.2f}\\ninstances = {}\\ncounts = [{}]\"];".format(
                index, "f. {} ({}) <= {:.3f}\\n".format(node.feature, feature_names[node.feature], node.value) if not node.is_leaf else "",
                args.criterion, decision_tree._criterion(node.instances), len(node.instances),
                ", ".join(map(str, np.bincount(decision_tree._targets[node.instances], minlength=classes)))))
            if not node.is_leaf:
                index = plot(plot(index + 1, node.left, index), node.right, index)
            return index + 1

        plot(0, decision_tree._root, None)
        dot.append("}")
        subprocess.run(["dot", "-Txlib"] if args.plot is True else ["dot", "-Tsvg", "-o{}".format(args.plot)],
                       input="\n".join(dot), encoding="utf-8")

    return 100 * train_accuracy, 100 * test_accuracy


if __name__ == "__main__":
    args = parser.parse_args([] if "__file__" not in globals() else None)
    train_accuracy, test_accuracy = main(args)
    print("Train accuracy: {:.1f}%".format(train_accuracy))
    print("Test accuracy: {:.1f}%".format(test_accuracy))