When we deal with classification algorithms in machine learning like Logistic Regression, K-Nearest Neighbors, Support Vector Classifiers, etc., we don’t use evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
Instead, we generate a confusion matrix, and based on the confusion matrix, a classification report.
In this blog, we aim to understand what a confusion matrix is, how to calculate Accuracy, Precision, Recall and F1-Score using it, and how to choose the relevant metric based on the characteristics of the data.
To understand the confusion matrix and classification metrics, let’s use the Breast Cancer Wisconsin Dataset.
This dataset consists of 569 rows, and each row provides information on various features of a tumor along with its diagnosis, whether it’s malignant (cancerous) or benign (non-cancerous).
Now let’s build a classification model for this data to classify the tumors based on their features.
We now apply Logistic Regression to train a model on this dataset.
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
column_names = [
"id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean",
"compactness_mean", "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se",
"concave_points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst",
"perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst",
"concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]
df = pd.read_csv("C:/wdbc.data", header=None, names=column_names)
# Drop ID column
df = df.drop(columns=["id"])
# Encode target: M=1 (malignant), B=0 (benign)
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
# Split features and target
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
# Train-test split (stratified so both sets keep the same class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Scale the features (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Confusion Matrix and Classification Report
conf_matrix = confusion_matrix(y_test, y_pred, labels=[1, 0]) # 1 = Malignant, 0 = Benign
report = classification_report(y_test, y_pred, labels=[1, 0], target_names=["Malignant", "Benign"])
# Display results
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", report)
# Plot Confusion Matrix
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Purples", xticklabels=["Malignant", "Benign"], yticklabels=["Malignant", "Benign"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()
Here, after applying logistic regression to the data, we generated a confusion matrix and a classification report to evaluate the model’s performance.
First, let’s understand the confusion matrix.
From the above confusion matrix:
‘60’ represents the correctly predicted Malignant Tumors, which we refer to as “True Positives”.
‘4’ represents the incorrectly predicted Benign Tumors which are actually Malignant Tumors, which we refer to as “False Negatives”.
‘1’ represents the incorrectly predicted Malignant Tumors which are actually Benign Tumors, which we refer to as “False Positives”.
‘106’ represents the correctly predicted Benign Tumors, which we refer to as “True Negatives”.
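To make these four cells concrete, here is a minimal sketch that counts them by hand. The helper name `confusion_counts` and the toy labels are made up for illustration and are not taken from the dataset; the layout follows the one we get from `confusion_matrix(..., labels=[1, 0])`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary problem, treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # actual positive, predicted positive
        elif actual == positive:
            fn += 1  # actual positive, predicted negative
        elif predicted == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Hypothetical toy labels (1 = Malignant, 0 = Benign), not taken from the dataset
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true_toy, y_pred_toy)

# Rows are actual classes, columns are predicted classes, positive class first
print([[tp, fn], [fp, tn]])
```

With the positive class listed first, the top row holds the actual Malignant cases and the bottom row holds the actual Benign cases.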
Now let’s see what we can do with these values.
For that, we consider the classification report.

From the above classification report, we can say that:
For Malignant:
– Precision is 0.98, which means when the model predicts the tumor as Malignant, it’s correct 98% of the time.
– Recall is 0.94, which means the model correctly identified 94% of all Malignant Tumors.
– F1-score is 0.96, which balances both the precision and recall.
For Benign:
– Precision is 0.96, which means when the model predicts the tumor as Benign, it’s correct 96% of the time.
– Recall is 0.99, which means the model correctly identified 99% of all Benign Tumors.
– F1-score is 0.98.
From the report we can observe that the accuracy of the model is 97%.
We also have Macro Average and Weighted Average; let’s see how these are calculated.
Macro Average
Macro Average calculates the average of each metric (precision, recall and f1-score) across both classes, giving equal weight to each class regardless of how many samples each class contains.
We use macro average when we want to know the performance of the model across all classes, ignoring class imbalances.
For this data:
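As a rough sketch, the macro averages can be reproduced from the per-class scores in the report. The helper name `macro_average` is made up for illustration. Because the inputs are the rounded values printed in the report, the last digit may differ slightly from what scikit-learn prints.

```python
# Per-class scores as printed in the classification report (rounded to 2 decimals)
scores = {
    "Malignant": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
    "Benign": {"precision": 0.96, "recall": 0.99, "f1": 0.98},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric over all classes."""
    values = [per_class[metric] for per_class in scores.values()]
    return sum(values) / len(values)


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_average(scores, metric), 3))
```

Each class contributes equally here, even though the test set has more Benign than Malignant samples.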

Weighted Average
Weighted Average also calculates the average of each metric but gives more weight to the class with more samples.
In the above code, we used test_size = 0.3, which means we set aside 30% of the data for testing, so we are using 171 of the 569 samples as the test set.
The confusion matrix and classification report are based on this test set.
Out of the 171 samples in the test set, we have 64 Malignant tumors and 107 Benign tumors.
Now let’s see how this weighted average is calculated for all metrics.
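A minimal sketch of the weighted average, assuming the per-class scores from the report and the class counts of the test set (64 Malignant, 107 Benign). The helper name `weighted_average` is made up for illustration, and the inputs are the rounded report values.

```python
# Per-class scores as printed in the classification report (rounded to 2 decimals)
scores = {
    "Malignant": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
    "Benign": {"precision": 0.96, "recall": 0.99, "f1": 0.98},
}
# Number of test samples per class (the "support" column of the report)
support = {"Malignant": 64, "Benign": 107}


def weighted_average(scores, support, metric):
    """Mean of one metric over all classes, weighted by the number of samples in each class."""
    total = sum(support.values())
    return sum(scores[name][metric] * support[name] for name in scores) / total


for metric in ("precision", "recall", "f1"):
    print(metric, round(weighted_average(scores, support, metric), 2))
```

The Benign scores count for 107/171 of the result and the Malignant scores for 64/171, so the larger class pulls the average toward its own values.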

Weighted average gives us a more realistic performance measure when we have class-imbalanced datasets.
We now have an idea of every term in the classification report and also how to calculate the macro and weighted averages.
Now let’s see how the confusion matrix is used to generate a classification report.
In the classification report we have different metrics like accuracy, precision, etc., and these metrics are calculated using the values in the confusion matrix.
From the confusion matrix we have:
True Positives (TP) = 60
False Negatives (FN) = 4
False Positives (FP) = 1
True Negatives (TN) = 106
Now let’s calculate the classification metrics using these values.
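The calculation can be sketched in a few lines using the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall). The helper name `metrics_from_counts` is made up for illustration.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 for the class counted as positive."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Malignant as the positive class: TP=60, FN=4, FP=1, TN=106
malignant = metrics_from_counts(tp=60, fn=4, fp=1, tn=106)
# Benign as the positive class: the roles of the four cells are swapped
benign = metrics_from_counts(tp=106, fn=1, fp=4, tn=60)

for name, values in (("Malignant", malignant), ("Benign", benign)):
    accuracy, precision, recall, f1 = values
    print(f"{name}: accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these match the per-class rows of the classification report. The accuracy is the same for both classes because it counts all correct predictions.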

This is how we calculate the classification metrics using a confusion matrix.
But why do we have four different classification metrics instead of one metric like accuracy? It’s because the different metrics show different strengths and weaknesses of the classifier based on the context of the data.
Now let’s come back to the Wisconsin Breast Cancer Dataset which we used here.
When we applied a logistic regression model to this data, we got an accuracy of 97%, which is high and may make us think that the model is efficient.
But let’s consider another metric called ‘recall’, which is 0.94 for this model. This means that out of all the malignant tumors in the test set, the model was able to identify 94% of them correctly.
Here the model missed 6% of malignant cases (4 of the 64 in the test set).
In real-world scenarios, mainly healthcare applications like cancer detection, if we miss a positive case, it could delay the diagnosis and treatment.
From this we can understand that even if we have an accuracy of 97%, we need to look deeper, based on the context of the data, by considering different metrics.
So, what can we do now? Should we aim for a recall value of 1.0, which means all the malignant tumors are identified correctly? If we push recall to 1.0, the precision may drop because the model may classify more benign tumors as malignant.
When the model classifies more benign tumors as malignant, there would be unnecessary anxiety, and it may require additional tests or treatments.
Here we should aim to maximize ‘recall’ while keeping the ‘precision’ reasonably high.
We can do this by changing the threshold the classifier uses to classify the samples.
Most classifiers set the threshold to 0.5, and if we change it to 0.3, we’re saying that even if the model is only 30% confident, it should classify the tumor as malignant.
Now let’s use a custom threshold of 0.3.
Code:
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict probabilities (column 1 = probability of class 1, Malignant)
y_probs = model.predict_proba(X_test)[:, 1]
# Apply custom threshold
threshold = 0.3
y_pred_custom = (y_probs >= threshold).astype(int)
# Classification Report (default label order is 0, 1, so Benign comes first)
report = classification_report(y_test, y_pred_custom, target_names=["Benign", "Malignant"])
print("Classification Report:\n", report)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_custom, labels=[1, 0])
# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(
conf_matrix,
annot=True,
fmt="d",
cmap="Purples",
xticklabels=["Malignant", "Benign"],
yticklabels=["Malignant", "Benign"]
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Threshold = 0.3)")
plt.tight_layout()
plt.show()
Here we applied a custom threshold of 0.3 and generated a confusion matrix and a classification report.

Classification Report:

Here, the accuracy increased to 98%, the recall for malignant increased to 97%, and the precision remained the same.
We said earlier that precision can decrease if we try to maximize recall, but here the precision remains the same. This depends on the data (whether balanced or not), the preprocessing steps and the chosen threshold.
For medical datasets like this, maximizing recall is often preferred over accuracy or precision.
When we consider problems like spam detection or fraud detection, we prefer precision, and in the same way as above we try to improve precision by tuning the threshold accordingly while balancing the tradeoff between precision and recall.
We use f1-score when the data is imbalanced, and when both precision and recall matter, that is, when neither false positives nor false negatives can be ignored.
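To see how lowering the threshold can trade precision for recall, here is a minimal sketch in plain Python. The probabilities and labels below are made up for illustration and are not model outputs from this dataset; the helper name `precision_recall_at` is also hypothetical.

```python
def precision_recall_at(y_true, y_probs, threshold):
    """Precision and recall for class 1 when predicting 1 if probability >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_probs]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities of Malignant and the true labels (1 = Malignant)
toy_probs = [0.95, 0.85, 0.70, 0.55, 0.45, 0.35, 0.32, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(toy_labels, toy_probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, lowering the threshold from 0.5 to 0.3 catches the two missed positives, so recall rises, but one more negative is flagged as positive, so precision falls slightly. Whether precision actually drops on real data depends on how the predicted probabilities are distributed.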
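As a small illustration of why accuracy alone can mislead on imbalanced data, consider a made-up test set with 5 positive and 95 negative samples, and a hypothetical classifier that always predicts the negative class. None of these numbers come from the breast cancer dataset.

```python
# Hypothetical imbalanced labels: 5 positives (1) and 95 negatives (0)
y_true_imb = [1] * 5 + [0] * 95
# A classifier that always predicts the majority (negative) class
y_pred_imb = [0] * 100

pairs = list(zip(y_true_imb, y_pred_imb))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
# Guard against division by zero: this classifier never predicts the positive class
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The accuracy looks high because the negatives dominate, while recall and f1-score for the positive class are zero, which shows that the classifier never finds a positive case.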
Dataset Supply
Wisconsin Breast Cancer Dataset
Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license and is free to use for commercial or educational purposes as long as proper credit is given to the original source.
Here we discussed what a confusion matrix is and how it’s used to calculate the different classification metrics like accuracy, precision, recall and f1-score.
We also explored when to prioritize which classification metric, using the Wisconsin cancer dataset as an example, where we preferred maximizing recall.
I hope you found this blog helpful in understanding the confusion matrix and classification metrics more clearly.
Thanks for reading.