    Implementing a Custom Decision Tree Classifier from Scratch | by Diellorhoxhaj | Jan, 2025



    The first step in building any machine learning model is to load and prepare the data. Here, we use datasets from sklearn.datasets, which provides a number of built-in datasets commonly used for testing and evaluation.

    The script supports any dataset available in sklearn.datasets with a load_ prefix. For example:

    • wine: Classifies different types of wine based on chemical properties.
    • iris: Classifies iris flowers into species based on sepal and petal measurements.
    • digits: Classifies images of handwritten digits (0-9).
    • breast_cancer: Classifies whether breast cancer is malignant or benign.
    • diabetes: Regression dataset predicting disease progression.

    To specify the dataset, use the --dataset argument. For example:

    python decision_tree.py --dataset wine
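
    The other command-line options defined in the full script (such as --criterion, --max_depth, --max_leaves and --min_to_split) can be combined with it. For example, a run on the iris dataset with the entropy criterion and a depth limit of 3 might look like:

    python decision_tree.py --dataset iris --criterion entropy --max_depth 3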

    Once the dataset is loaded, we split it into training and testing sets using train_test_split from sklearn.model_selection. This ensures that the model is evaluated on unseen data for a fair assessment of its performance.

    Code Explanation:

    1. data, target: The features (data) and labels (target) are extracted from the dataset.
    2. train_test_split: The data is split into:
    • Training set: Used to train the decision tree.
    • Testing set: Used to evaluate its accuracy.
    3. test_size: The proportion of the dataset allocated to testing. For example, test_size=0.25 reserves 25% of the data for testing.

    Code:

    # Load the dataset
    data, target = getattr(sklearn.datasets, "load_{}".format(args.dataset))(return_X_y=True)

    # Split data into training and testing sets
    train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
        data, target, test_size=args.test_size, random_state=args.seed)

    The core of the implementation is the DecisionTree class. This class builds the tree, splits nodes, and predicts targets.

    Each node of the tree is represented as an instance of the Node class. A node can either be a leaf or a decision node.

    class Node:
        def __init__(self, instances, prediction):
            self.is_leaf = True            # Start as a leaf
            self.instances = instances     # Indices of data points in the node
            self.prediction = prediction   # Most frequent class label

        def split(self, feature, value, left, right):
            self.is_leaf = False    # Becomes a decision node
            self.feature = feature  # Splitting feature
            self.value = value      # Threshold value for splitting
            self.left = left        # Left child node
            self.right = right      # Right child node

    The DecisionTree class handles tree building via recursive or adaptive splitting.

    Criteria for Splitting

    Two common splitting criteria are:

    1. Gini Impurity:

    Gini = Σᵢ pᵢ (1 − pᵢ) = 1 − Σᵢ pᵢ²

    where pᵢ is the proportion of class i in the node.

    2. Entropy:

    H = −Σᵢ pᵢ log pᵢ
    Code:

    def _criterion_gini(self, instances):
        bins = np.bincount(self._targets[instances])
        return np.sum(bins * (1 - bins / len(instances)))

    def _criterion_entropy(self, instances):
        bins = np.bincount(self._targets[instances])
        bins = bins[np.nonzero(bins)]
        return -np.sum(bins * np.log(bins / len(instances)))
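
    Note that both functions return the criterion weighted by the node size — for Gini, |node| · Σᵢ pᵢ(1 − pᵢ) rather than the per-instance impurity — so the values of the two children can simply be summed and compared against the parent. As a quick sanity check, here is a standalone sketch (using a hypothetical toy label array, not part of the original script) of what these functions compute:

    import numpy as np

    # Hypothetical toy node whose class counts are [2, 3, 1] over 6 instances
    targets = np.array([0, 0, 1, 1, 1, 2])
    bins = np.bincount(targets)          # [2, 3, 1]
    p = bins / len(targets)              # class proportions p_i

    gini = len(targets) * np.sum(p * (1 - p))          # matches _criterion_gini: ≈ 3.667
    entropy = -len(targets) * np.sum(p * np.log(p))    # matches _criterion_entropy: ≈ 6.068
    print(gini, entropy)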

    Recursive Splitting

    Recursive splitting divides nodes based on the criterion that reduces impurity the most.

    1. Find the best feature and value to split the data.
    2. Split the data into left and right subsets.
    3. Recursively repeat for each child node.

    Code:

    def _split_recursively(self, node, depth):
        if not self._can_split(node, depth):
            return

        _, feature, value, left, right = self._best_split(node)
        node.split(feature, value, self._leaf(left), self._leaf(right))
        self._split_recursively(node.left, depth + 1)
        self._split_recursively(node.right, depth + 1)

    Adaptive Splitting

    Adaptive splitting controls the maximum number of leaves. It uses a priority queue to ensure the most beneficial splits are applied first.

    Code:

    def _split_adaptively(self):
        def split_value(node, index, depth):
            best_split = self._best_split(node)
            return (best_split[0], index, depth, node, *best_split[1:])

        heap = [split_value(self._root, 0, 0)]
        for i in range(self._max_leaves - 1):
            _, _, depth, node, feature, value, left, right = heapq.heappop(heap)
            node.split(feature, value, self._leaf(left), self._leaf(right))
            if self._can_split(node.left, depth + 1):
                heapq.heappush(heap, split_value(node.left, 2 * i + 1, depth + 1))
            if self._can_split(node.right, depth + 1):
                heapq.heappush(heap, split_value(node.right, 2 * i + 2, depth + 1))
            if not heap:
                break
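
    Python's heapq module implements a min-heap and compares tuples element by element, so heappop always returns the entry with the smallest first element. Because split_value puts the criterion change returned by _best_split first (a negative number whenever the split reduces impurity), the most beneficial remaining split is always applied next. A minimal standalone sketch of this ordering, with made-up criterion deltas:

    import heapq

    # Hypothetical (criterion_change, node_name) pairs; more negative = larger impurity reduction
    heap = [(-0.8, "node A"), (-2.5, "node B"), (-1.1, "node C")]
    heapq.heapify(heap)
    print(heapq.heappop(heap))  # (-2.5, 'node B') is popped first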

    The best split is determined by evaluating all possible thresholds for all features and selecting the one that minimizes impurity.

    Code:

    def _best_split(self, node):
        best_criterion = None
        for feature in range(self._data.shape[1]):
            sorted_indices = node.instances[np.argsort(self._data[node.instances, feature])]
            for i in range(len(sorted_indices) - 1):
                if self._data[sorted_indices[i], feature] == self._data[sorted_indices[i + 1], feature]:
                    continue
                value = (self._data[sorted_indices[i], feature] + self._data[sorted_indices[i + 1], feature]) / 2
                left, right = sorted_indices[:i + 1], sorted_indices[i + 1:]
                criterion = self._criterion(left) + self._criterion(right)
                if best_criterion is None or criterion < best_criterion:
                    best_criterion, best_feature, best_value, best_left, best_right = \
                        criterion, feature, value, left, right

        return best_criterion - self._criterion(node.instances), best_feature, best_value, best_left, best_right
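
    The candidate thresholds are the midpoints between consecutive distinct values of each feature once the node's instances are sorted by that feature; equal neighbouring values are skipped because a midpoint between them would not separate anything. A standalone sketch of the threshold generation, using a hypothetical feature column:

    import numpy as np

    feature_values = np.array([1.0, 3.0, 3.0, 7.0])   # hypothetical feature column of one node
    sorted_values = np.sort(feature_values)
    # Midpoints between consecutive values, skipping equal neighbours
    thresholds = [(a + b) / 2 for a, b in zip(sorted_values, sorted_values[1:]) if a != b]
    print(thresholds)  # [2.0, 5.0]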

    To predict, each data point traverses the tree until it reaches a leaf node. The prediction is the most frequent class label in that leaf.

    Code:

    def predict(self, data):
        results = np.zeros(len(data), dtype=np.int32)
        for i in range(len(data)):
            node = self._root
            while not node.is_leaf:
                node = node.left if data[i][node.feature] <= node.value else node.right
            results[i] = node.prediction
        return results
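
    Putting the pieces together, the tree is used much like a scikit-learn estimator: construct it, fit it on the training split, and call predict. A minimal sketch, assuming the DecisionTree class assembled from the snippets above (its constructor takes criterion, max_depth, min_to_split and max_leaves, as in the complete implementation below):

    import sklearn.datasets
    import sklearn.model_selection

    data, target = sklearn.datasets.load_wine(return_X_y=True)
    train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
        data, target, test_size=0.25, random_state=42)

    # gini criterion, unlimited depth, split nodes with >= 2 instances, no leaf limit
    decision_tree = DecisionTree("gini", max_depth=None, min_to_split=2, max_leaves=None)
    decision_tree.fit(train_data, train_target)
    predictions = decision_tree.predict(test_data)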

    We evaluate the model's accuracy on both the training and testing datasets.

    Code:

    train_accuracy = sklearn.metrics.accuracy_score(train_target, decision_tree.predict(train_data))
    test_accuracy = sklearn.metrics.accuracy_score(test_target, decision_tree.predict(test_data))

    print("Train accuracy: {:.1f}%".format(100 * train_accuracy))
    print("Test accuracy: {:.1f}%".format(100 * test_accuracy))

    Here's the complete implementation:

    #!/usr/bin/env python3
    import argparse
    import heapq
    import subprocess
    import numpy as np
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    parser = argparse.ArgumentParser()
    parser.add_argument("--criterion", default="gini", type=str, help="Criterion to use; either `gini` or `entropy`")
    parser.add_argument("--dataset", default="wine", type=str, help="Dataset to use")
    parser.add_argument("--max_depth", default=None, type=int, help="Maximum decision tree depth")
    parser.add_argument("--max_leaves", default=None, type=int, help="Maximum number of leaf nodes")
    parser.add_argument("--min_to_split", default=2, type=int, help="Minimum examples required to split")
    parser.add_argument("--recodex", default=False, action="store_true", help="Running in ReCodEx")
    parser.add_argument("--seed", default=42, type=int, help="Random seed")
    parser.add_argument("--test_size", default=0.25, type=lambda x: int(x) if x.isdigit() else float(x), help="Test size")
    parser.add_argument("--plot", default=False, const=True, nargs="?", type=str, help="Plot the predictions")

    class DecisionTree:
        class Node:
            def __init__(self, instances, prediction):
                self.is_leaf = True
                self.instances = instances
                self.prediction = prediction

            def split(self, feature, value, left, right):
                self.is_leaf = False
                self.feature = feature
                self.value = value
                self.left = left
                self.right = right

        def __init__(self, criterion, max_depth, min_to_split, max_leaves):
            self._criterion = getattr(self, "_criterion_" + criterion)
            self._max_depth = max_depth
            self._min_to_split = min_to_split
            self._max_leaves = max_leaves

        def fit(self, data, targets):
            self._data = data
            self._targets = targets
            self._root = self._leaf(np.arange(len(self._data)))
            if self._max_leaves is None:
                self._split_recursively(self._root, 0)
            else:
                self._split_adaptively()

        def predict(self, data):
            results = np.zeros(len(data), dtype=np.int32)
            for i in range(len(data)):
                node = self._root
                while not node.is_leaf:
                    node = node.left if data[i][node.feature] <= node.value else node.right
                results[i] = node.prediction
            return results

        def _split_recursively(self, node, depth):
            if not self._can_split(node, depth):
                return
            _, feature, value, left, right = self._best_split(node)
            node.split(feature, value, self._leaf(left), self._leaf(right))
            self._split_recursively(node.left, depth + 1)
            self._split_recursively(node.right, depth + 1)

        def _split_adaptively(self):
            def split_value(node, index, depth):
                best_split = self._best_split(node)
                return (best_split[0], index, depth, node, *best_split[1:])

            heap = [split_value(self._root, 0, 0)]
            for i in range(self._max_leaves - 1):
                _, _, depth, node, feature, value, left, right = heapq.heappop(heap)
                node.split(feature, value, self._leaf(left), self._leaf(right))
                if self._can_split(node.left, depth + 1):
                    heapq.heappush(heap, split_value(node.left, 2 * i + 1, depth + 1))
                if self._can_split(node.right, depth + 1):
                    heapq.heappush(heap, split_value(node.right, 2 * i + 2, depth + 1))
                if not heap:
                    break

        def _can_split(self, node, depth):
            return (
                (self._max_depth is None or depth < self._max_depth) and
                len(node.instances) >= self._min_to_split and
                not np.array_equiv(self._targets[node.instances], node.prediction)
            )

        def _best_split(self, node):
            best_criterion = None
            for feature in range(self._data.shape[1]):
                sorted_indices = node.instances[np.argsort(self._data[node.instances, feature])]
                for i in range(len(sorted_indices) - 1):
                    if self._data[sorted_indices[i], feature] == self._data[sorted_indices[i + 1], feature]:
                        continue
                    value = (self._data[sorted_indices[i], feature] + self._data[sorted_indices[i + 1], feature]) / 2
                    left, right = sorted_indices[:i + 1], sorted_indices[i + 1:]
                    criterion = self._criterion(left) + self._criterion(right)
                    if best_criterion is None or criterion < best_criterion:
                        best_criterion, best_feature, best_value, best_left, best_right = \
                            criterion, feature, value, left, right

            return best_criterion - self._criterion(node.instances), best_feature, best_value, best_left, best_right

        def _leaf(self, instances):
            return self.Node(instances, np.argmax(np.bincount(self._targets[instances])))

        def _criterion_gini(self, instances):
            bins = np.bincount(self._targets[instances])
            return np.sum(bins * (1 - bins / len(instances)))

        def _criterion_entropy(self, instances):
            bins = np.bincount(self._targets[instances])
            bins = bins[np.nonzero(bins)]
            return -np.sum(bins * np.log(bins / len(instances)))

    def main(args: argparse.Namespace) -> tuple[float, float]:
        data, target = getattr(sklearn.datasets, "load_{}".format(args.dataset))(return_X_y=True)
        train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
            data, target, test_size=args.test_size, random_state=args.seed)
        decision_tree = DecisionTree(args.criterion, args.max_depth, args.min_to_split, args.max_leaves)
        decision_tree.fit(train_data, train_target)
        train_accuracy = sklearn.metrics.accuracy_score(train_target, decision_tree.predict(train_data))
        test_accuracy = sklearn.metrics.accuracy_score(test_target, decision_tree.predict(test_data))
        if args.plot:
            classes = np.max(target) + 1
            feature_names = getattr(sklearn.datasets, "load_{}".format(args.dataset))().feature_names
            dot = ["digraph Tree {node [shape=box]; bgcolor=invis;"]
            def plot(index, node, parent):
                if parent is not None: dot.append("{} -> {}".format(parent, index))
                dot.append("{} [fontname=\"serif\"; label=\"{}c_{} = {:.2f}\\ninstances = {}\\ncounts = [{}]\"];".format(
                    index, "f. {} ({}) <= {:.3f}\\n".format(node.feature, feature_names[node.feature], node.value) if not node.is_leaf else "",
                    args.criterion, decision_tree._criterion(node.instances), len(node.instances),
                    ", ".join(map(str, np.bincount(decision_tree._targets[node.instances], minlength=classes)))))
                if not node.is_leaf:
                    index = plot(plot(index + 1, node.left, index), node.right, index)
                return index + 1
            plot(0, decision_tree._root, None)
            dot.append("}")
            subprocess.run(["dot", "-Txlib"] if args.plot is True else ["dot", "-Tsvg", "-o{}".format(args.plot)],
                           input="\n".join(dot), encoding="utf-8")
        return 100 * train_accuracy, 100 * test_accuracy

    if __name__ == "__main__":
        args = parser.parse_args([] if "__file__" not in globals() else None)
        train_accuracy, test_accuracy = main(args)
        print("Train accuracy: {:.1f}%".format(train_accuracy))
        print("Test accuracy: {:.1f}%".format(test_accuracy))


