Think of machine learning like cooking:
- Ingredients = your dataset (the raw data)
- Recipe = the algorithm (a step-by-step method to turn data into insight)
- Tools = Python libraries (pandas, NumPy, etc.)
- Oven = model training (where the learning happens)
- Taste test = model evaluation (to check how well it works)
Machine learning is essentially about using examples from the past to make predictions about the future.
We’ll be working with the famous Iris dataset, which contains measurements of 150 flowers from three species: setosa, versicolor, and virginica.
Every flower has the following features:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)
Our goal is to build a model that predicts the species based on these four measurements.
We start by importing libraries that help us handle data, visualize patterns, and build models.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The Iris dataset comes built into scikit-learn, and we can load it easily:
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
We then convert it into a pandas DataFrame for easier exploration:
df = pd.DataFrame(data=X, columns=feature_names)
df['species'] = y
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
Before building a model, it’s important to understand the dataset.
print(df.shape)
df.info()
print(df.describe())
print(df['species'].value_counts())
- The dataset has 150 rows and 5 columns
- There are no missing values
- Each species appears exactly 50 times, making the dataset well-balanced
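These observations can also be verified programmatically. The sketch below rebuilds the DataFrame from scratch so it runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

assert df.shape == (150, 5)                          # 150 rows, 5 columns
assert df.isnull().sum().sum() == 0                  # no missing values
assert set(df['species'].value_counts()) == {50}     # each species appears 50 times
print("dataset checks pass")
```

Checks like these are cheap insurance before modeling: they catch loading mistakes early.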
Histograms
plt.figure(figsize=(8, 6))
for i, feature in enumerate(feature_names):
    plt.subplot(2, 2, i + 1)
    sns.histplot(df[feature], bins=20, kde=True)
    plt.title(f'Histogram of {feature}')
plt.tight_layout()
plt.show()
This shows how each feature (length/width) is distributed.
Pairplot
sns.pairplot(data=df, hue='species')
plt.show()
This plot shows the relationships between features and how the species cluster based on their measurements.
We split our data into input features (X) and target labels (y):
X = df.drop('species', axis=1)
y = df['species']
Then we split them further into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
- 70% of the data is used for training
- 30% is held back for testing
We now use a Decision Tree classifier, which is easy to understand and visualize:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
A decision tree works like a flowchart. It repeatedly asks questions like:
- “Is petal length ≤ 2.45?”
- If yes → probably setosa
- If no → ask another question until a decision is made
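That flowchart can be sketched as plain Python. The 2.45 cm petal-length split is the one trees commonly learn for Iris; treat the second threshold as illustrative, not the exact rule your fitted tree will pick:

```python
def classify_iris(petal_length, petal_width):
    """Hand-written sketch of the kind of rules a decision tree learns.
    Thresholds are illustrative, not the exact fitted tree."""
    if petal_length <= 2.45:       # setosa separates cleanly on petal length
        return 'setosa'
    elif petal_width <= 1.75:      # a second question splits the remaining two
        return 'versicolor'
    else:
        return 'virginica'

print(classify_iris(1.4, 0.2))   # typical setosa measurements → 'setosa'
```

The real classifier is just a learned stack of such if/else questions, with thresholds chosen automatically from the training data.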
We check how accurate the model is on unseen (test) data:
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
With clean datasets like Iris, models often perform very well, sometimes reaching 100% accuracy.
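Accuracy alone hides which species get confused with which. A confusion matrix shows per-class results; this sketch rebuilds the same split so it runs on its own (adding `random_state=42` to the classifier for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Rows = true species, columns = predicted species
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

A perfect model puts all counts on the diagonal; off-diagonal entries tell you exactly which pair of species the model mixes up.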
This step helps you understand how the model makes its predictions.
from sklearn import tree
plt.figure(figsize=(15, 10))
tree.plot_tree(clf, feature_names=feature_names, class_names=target_names, filled=True)
plt.title("Decision Tree")
plt.show()
You’ll see how features like petal length are key to classifying the flowers.
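You can quantify this instead of eyeballing the plot: fitted trees expose `feature_importances_`. A self-contained sketch (fitting on the full dataset with a fixed `random_state`; your exact numbers will vary with the split you use):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Importances sum to 1; larger values mean the feature drove more splits
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

For Iris, the two petal features typically carry almost all of the importance, which matches what the pairplot and tree diagram suggest.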