PROJECT INTRODUCTION
Diabetes is a condition that affects how your body processes sugar (glucose). Normally, your body uses insulin to help regulate blood sugar levels, but in diabetes this process gets disrupted. There are two main types:
- Type 1 Diabetes: The body doesn't produce insulin at all. It usually develops early in life and requires insulin injections.
- Type 2 Diabetes: Occurs when the body either doesn't produce enough insulin or can't use it properly. It's more common and often linked to lifestyle factors like diet and exercise.
If left unmanaged, diabetes can lead to serious health problems, but with the right care (a balanced diet, exercise, and medication) it can be controlled. That's where our Diabetes Prediction App comes in, helping people get an early indication and take action!
PROJECT AIM
The dataset for this project was downloaded from Kaggle. This project aims to develop an app that can predict whether a patient is diabetic. Data handling and visualization will also be performed to gain insight. A Logistic Regression and a Random Forest classifier model will be created, and the best-performing model will be used to determine whether a patient is diabetic.
The dataset is obtained from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
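If you have the Kaggle CLI installed and an API token configured, one way to fetch and unzip the file from a terminal (a sketch of mine, not part of the original article):

kaggle datasets download -d uciml/pima-indians-diabetes-database --unzip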
Import Required Libraries
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('whitegrid')
Load The Dataset
df = pd.read_csv('diabetes.csv')
df.head(5)
Get Dataset Info
#Info of the dataset
df.info()
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Information about dataset attributes
- Pregnancies: Number of pregnancies
- Glucose: Glucose level in blood
- BloodPressure: Blood pressure measurement
- SkinThickness: Thickness of the skin
- Insulin: Insulin level in blood
- BMI: Body mass index
- DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history
- Age: Age of the patient
- Outcome: The final result, where 1 is Yes (diabetic) and 0 is No (non-diabetic)
Dataset Statistics
#Check statistics of the dataset
df.describe().T
Observation:
- Looking at the dataset's statistics, the minimum values of Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot realistically be 0, so these are cases that need to be treated.
DATA HANDLING
We first check for missing values and handle them accordingly.
#Check for missing values
df.isna().sum()
Observation:
- There are no missing values in the dataset, although the zero entries flagged earlier behave like hidden missing values, as the sketch below shows.
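Although isna() reports nothing, a quick way to surface those hidden missing values is to map the zeros to NaN. A minimal sketch (my own addition, not in the original notebook):

import numpy as np

#Treat 0 as missing in the clinical columns and count per column
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print(df[zero_cols].replace(0, np.nan).isna().sum())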
HANDLING ZERO VALUES
In this step, we handle the zeros in the dataset.
First, we check where the zeros appear.
#Check where 0 is present in each column
print(df[df['Glucose'] == 0].shape[0])
print(df[df['BloodPressure'] == 0].shape[0])
print(df[df['SkinThickness'] == 0].shape[0])
print(df[df['Insulin'] == 0].shape[0])
print(df[df['BMI'] == 0].shape[0])
Output:
5
35
227
374
11
Next, we visualize each column's distribution.
#Check the distribution of each column in the dataset
df.hist(figsize=(20,20))
plt.show()
Observation:
- Some of the columns have a skewed distribution, so the mean is more affected by outliers than the median. Glucose and BloodPressure have roughly normal distributions, hence we replace 0 values in these columns with the mean. SkinThickness, Insulin, and BMI have skewed distributions, hence the median is a better choice as it is less affected by outliers.
#Handle zero values
df['Glucose'] = df['Glucose'].replace(0, df['Glucose'].mean())
df['BloodPressure'] = df['BloodPressure'].replace(0, df['BloodPressure'].mean())
df['SkinThickness'] = df['SkinThickness'].replace(0, df['SkinThickness'].median())
df['Insulin'] = df['Insulin'].replace(0, df['Insulin'].median())
df['BMI'] = df['BMI'].replace(0, df['BMI'].median())
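A quick sanity check (my addition) confirms no zeros remain in the treated columns. Note that the replacement means and medians above are computed while the zeros are still present, which biases them slightly downward:

#Verify that no zeros remain in the treated columns
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
assert (df[zero_cols] == 0).sum().sum() == 0
print(df[zero_cols].min())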
DATA VISUALIZATIONS
In this step, we perform a simple visualization to check the relationship between the target column (Outcome) and the other columns.
#Get numerical columns
num_col = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
           'BMI', 'DiabetesPedigreeFunction', 'Age']

#Visualize columns with respect to the outcome
#Number of rows needed (assuming you want 2 histograms per row)
nrows = (len(num_col) + 1) // 2  #this rounds up the division
fig, axes = plt.subplots(nrows=nrows, ncols=2, figsize=(10, nrows * 5))
#Flatten the axes array to make it easier to iterate over
axes = axes.flatten()
for i, col in enumerate(num_col):
    sns.histplot(data=df, x=col, hue='Outcome', ax=axes[i])
    axes[i].set_title(f'Distribution of {col} by Outcome')
#Hide any unused subplots if there is an odd number of columns
for j in range(i + 1, len(axes)):
    axes[j].axis('off')
plt.tight_layout()
plt.show()
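Box plots offer another view of the same feature-versus-Outcome relationship, making the shift in medians between the two classes easier to read. A minimal sketch (my addition, reusing num_col, df, sns, and plt from above):

#Box plots of each feature split by Outcome
fig, axes = plt.subplots(nrows=(len(num_col) + 1) // 2, ncols=2, figsize=(10, 20))
axes = axes.flatten()
for i, col in enumerate(num_col):
    sns.boxplot(data=df, x='Outcome', y=col, ax=axes[i])
    axes[i].set_title(f'{col} by Outcome')
plt.tight_layout()
plt.show()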
Correlation Heatmap
#Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True, fmt='.2f')
plt.title('CORRELATION HEATMAP')
plt.show()
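To read off which features correlate most with the target without scanning the whole grid, the Outcome column of the correlation matrix can be sorted directly (a small addition of mine):

#Features ranked by correlation with the target
print(df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False))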
Data Preparation
In this step, I first scale the dataset using StandardScaler and split it into X (feature variables) and y (target variable).
#Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(df.drop(columns=['Outcome'])), columns=df.columns[:-1])
y = df['Outcome']
y
Then, I split the dataset into train and test sets using scikit-learn's train_test_split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Observation:
- The dataset was split into feature [X] and target [y] variables.
- It was then split into train and test sets using train_test_split.
- The dataset was split into 80% training data and 20% test data. (A note on scaler fitting and leakage follows below.)
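One caveat worth flagging (my note, not part of the original workflow): because the scaler was fitted on the full dataset before splitting, test-set statistics leak into training. A leakage-free variant fits the scaler on the training split only:

#Fit the scaler on the training split only, then apply it to both splits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_raw = df.drop(columns=['Outcome'])
y = df['Outcome']
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train_raw)  #statistics come from the training split only
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)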
Model Selection and Evaluation
We used two models for this prediction project:
- Logistic Regression: a statistical method used to predict the probability of a binary outcome (like yes/no, 0/1) based on one or more independent variables, essentially predicting the likelihood of an event occurring.
- Random Forest Classifier: a machine learning algorithm that uses an ensemble of decision trees to classify data, making predictions by aggregating the predictions of the individual trees. It's a powerful and versatile tool known for its accuracy and efficiency.
Logistic Regression
Build the model for prediction.
#Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

lr = LogisticRegression()
lr.fit(X_train, y_train)

#Predictions
train_pred = lr.predict(X_train)  #prediction on the training set
test_pred = lr.predict(X_test)  #prediction on the test set

#Accuracy scores
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print('Train Set Accuracy: ', train_acc * 100)
print('Test Set Accuracy: ', test_acc * 100)
print()

#Confusion matrix and classification report
print('Confusion Matrix:\n', confusion_matrix(y_test, test_pred))
print('Classification Report:\n', classification_report(y_test, test_pred))
#Visualize the Logistic Regression confusion matrix
#Convert to a matrix (values transcribed from the printed confusion matrix above)
conf_matrix = np.array([[82, 18], [27, 27]])
#Convert to a dataframe
df_cm = pd.DataFrame(conf_matrix, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
#Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.show()
Observation:
- The model achieves 79.48% accuracy on the training set and 70.78% accuracy on the test set, indicating a moderate drop in performance, which suggests some overfitting.
- From the confusion matrix, the model correctly classifies 82 non-diabetic patients but misclassifies 18 as diabetic. It also correctly classifies only 27 diabetic patients while misclassifying 27 as non-diabetic, which may indicate difficulty in distinguishing diabetic cases.
- The classification report shows that the model has higher precision (0.75) and recall (0.82) for non-diabetic cases compared to diabetic cases (precision = 0.60, recall = 0.50). This suggests the model is better at identifying non-diabetic patients but struggles with diabetic cases, likely due to class imbalance or feature representation; one common mitigation is sketched below.
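Since the dataset has roughly twice as many non-diabetic as diabetic samples, one common mitigation (my suggestion, not part of the original workflow) is to reweight the classes inversely to their frequency:

#Logistic regression with class reweighting to compensate for the imbalance
lr_balanced = LogisticRegression(class_weight='balanced')
lr_balanced.fit(X_train, y_train)
print(classification_report(y_test, lr_balanced.predict(X_test)))

This typically trades some precision on the majority class for better recall on the diabetic class.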
Random Forest Classifier
Build the model for prediction.
#Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

#Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

#Perform grid search with cross-validation
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

#Get the best estimator
print('Best params: ', grid.best_params_)
rfc = grid.best_estimator_

#Predictions
rf_train_pred = rfc.predict(X_train)
rf_test_pred = rfc.predict(X_test)

#Accuracy scores
rf_train_acc = accuracy_score(y_train, rf_train_pred)
rf_test_acc = accuracy_score(y_test, rf_test_pred)
print('Train Set Accuracy: ', rf_train_acc * 100)
print('Test Set Accuracy: ', rf_test_acc * 100)
print()

#Confusion matrix and classification report
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_test_pred))
print('Classification Report:\n', classification_report(y_test, rf_test_pred))
#Visualize the Random Forest confusion matrix
#Convert to a matrix (values transcribed from the printed confusion matrix above)
rf_matrix = np.array([[83, 17], [19, 35]])
#Convert to a dataframe
rf_df = pd.DataFrame(rf_matrix, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
#Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(rf_df, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.show()
Observation:
- The model's training accuracy improved to 93.16%, while test accuracy increased to 76.62%, showing better generalization but still some overfitting.
- The confusion matrix indicates that the model correctly classifies 83 non-diabetic and 35 diabetic patients, with fewer misclassifications compared to the previous model. However, 17 non-diabetic and 19 diabetic patients are still misclassified.
- The classification report shows an improvement in detecting diabetic cases (precision = 0.67, recall = 0.65, f1-score = 0.66), meaning the model is now slightly better at identifying diabetes, though it still favors non-diabetic predictions (precision = 0.81, recall = 0.83).
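Random forests also expose how much each feature contributed to the trees' splits, which is a useful sanity check on the model. A short sketch (my addition, using the fitted rfc and the num_col list from earlier):

#Rank features by their importance in the fitted random forest
importances = pd.Series(rfc.feature_importances_, index=num_col).sort_values(ascending=False)
print(importances)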
Save The Model
The Random Forest classifier is the better-performing model; it will be saved using the pickle library and used to build our app. The StandardScaler will also be saved for use in the app: when a user inputs details, the app will first scale the inputs before passing them to the model for prediction.
#Import required library
import pickle

pickle.dump(rfc, open('model.pkl', 'wb'))
pickle.dump(scaler, open('scaler.pkl', 'wb'))
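A quick round-trip check (my addition) confirms that the pickled artifacts load back and reproduce the test predictions:

#Reload the saved artifacts and verify they reproduce the test predictions
loaded_model = pickle.load(open('model.pkl', 'rb'))
loaded_scaler = pickle.load(open('scaler.pkl', 'rb'))
assert (loaded_model.predict(X_test) == rf_test_pred).all()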
BUILD AND DEPLOY THE APP
Now, we build and deploy the app using STREAMLIT.
import streamlit as st
import pickle
import numpy as np
import time

# Load the trained model and scaler
model = pickle.load(open('model.pkl', 'rb'))
scaler = pickle.load(open('scaler.pkl', 'rb'))

# Streamlit app styling (custom CSS omitted in this excerpt)
st.markdown(
    """
    """,
    unsafe_allow_html=True
)

# Title
st.markdown("""
This app predicts the likelihood of diabetes based on patient medical details.
""", unsafe_allow_html=True)

# Sidebar for user inputs
st.sidebar.header("Enter Patient Details 🏥")
pregnancies = st.sidebar.slider('Pregnancies 🤰', 0, 20, 0)
glucose = st.sidebar.slider('Glucose Level 🍬', 0, 300, 120)
blood_pressure = st.sidebar.slider('Blood Pressure 💉', 0, 200, 80)
skin_thickness = st.sidebar.slider('Skin Thickness 📏', 0, 100, 20)
insulin = st.sidebar.slider('Insulin Level 💊', 0, 900, 80)
bmi = st.sidebar.slider('BMI ⚖️', 0.0, 70.0, 25.0, step=0.1)
dpf = st.sidebar.slider('Diabetes Pedigree Function 🧬', 0.0, 3.0, 0.5, step=0.01)
age = st.sidebar.slider('Age 🎂', 0, 120, 30)

# Prediction button
if st.sidebar.button('🔍 Predict Diabetes'):
    # Create the input array
    input_data = np.array([[pregnancies, glucose, blood_pressure, skin_thickness, insulin, bmi, dpf, age]])
    # Scale the data
    input_data_scaled = scaler.transform(input_data)
    # Loading animation
    with st.spinner('Analyzing medical data... ⏳'):
        time.sleep(2)
    # Make the prediction
    prediction = model.predict(input_data_scaled)
    # Display the result
    if prediction[0] == 1:
        st.markdown("""
        🚨 Prediction: DIABETES DETECTED!
        Please consult a medical professional.
        """, unsafe_allow_html=True)
    else:
        st.markdown("""
        ✅ Prediction: NO DIABETES!
        Maintain a healthy lifestyle! 🏃‍♂️🥗
        """, unsafe_allow_html=True)
Above are images of the running app, showing both the No Diabetes and Diabetes Detected predictions.
CONCLUSION
In this article, using the diabetes dataset, we demonstrated an end-to-end machine learning and deployment project from beginning to end. Data cleaning and visualization were our first steps. Then, to provide better data for training the machine learning models, the data was scaled using StandardScaler. After that, we built two models, Logistic Regression and the Random Forest Classifier, of which the Random Forest was the better-performing model; it was saved and used to build our app with Streamlit. The model can still be improved using more advanced techniques, which were not discussed here, as the main goal of this article is to show the use of the Random Forest classifier and Streamlit app building.
You can check out the GitHub file here: Raw File