PROJECT INTRODUCTION
Diabetes is a condition that affects how your body processes sugar (glucose). Normally, your body uses insulin to help regulate blood sugar levels, but in diabetes this process gets disrupted. There are two main types:
- Type 1 Diabetes: The body doesn't produce insulin at all. It usually develops early in life and requires insulin injections.
- Type 2 Diabetes: Occurs when the body either doesn't produce enough insulin or can't use it properly. It's more common and often linked to lifestyle factors like diet and exercise.
If left unmanaged, diabetes can lead to serious health problems, but with the right care (a balanced diet, exercise, and medication) it can be controlled. That's where our Diabetes Prediction App comes in, helping people get an early indication and take action!
PROJECT AIM
The dataset for this project was downloaded from Kaggle. This project aims to develop an app that can predict whether a patient is diabetic. Data handling and visualization will also be performed to gain insight. A Logistic Regression and a Random Forest classifier model will be created, and the best-performing model will be used to determine whether a patient is diabetic.
The dataset is obtained from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
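If you have the Kaggle CLI installed and an API token configured, one way to fetch and unzip the file from a terminal (a sketch of mine, not part of the original article):

kaggle datasets download -d uciml/pima-indians-diabetes-database --unzip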
Import Required Libraries
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('whitegrid')
Load The Dataset
df = pd.read_csv('diabetes.csv')
df.head(5)
Get Dataset Info
#Info of the dataset
df.info()
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Information about dataset attributes
- Pregnancies: Number of pregnancies
- Glucose: Glucose level in blood
- BloodPressure: Blood pressure measurement
- SkinThickness: Thickness of the skin
- Insulin: Insulin level in blood
- BMI: Body mass index
- DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history
- Age: Age of the patient
- Outcome: The final result, where 1 is Yes (diabetic) and 0 is No (non-diabetic)
Dataset Statistics
#Check statistics of the dataset
df.describe().T
Observation:
- Looking at the dataset's statistics, the minimum values of Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot realistically be 0, so these are cases that need to be treated.
DATA HANDLING
We first check for missing values and handle them accordingly.
#Check for missing values
df.isna().sum()
Observation:
- There are no missing values in the dataset, although the zero entries flagged earlier behave like hidden missing values, as the sketch below shows.
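Although isna() reports nothing, a quick way to surface those hidden missing values is to map the zeros to NaN. A minimal sketch (my own addition, not in the original notebook):

import numpy as np

#Treat 0 as missing in the clinical columns and count per column
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print(df[zero_cols].replace(0, np.nan).isna().sum())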
HANDLING ZERO VALUES
In this step, we handle the zeros in the dataset.
First, we check where the zeros appear.
#Check where 0 is present in each column
print(df[df['Glucose'] == 0].shape[0])
print(df[df['BloodPressure'] == 0].shape[0])
print(df[df['SkinThickness'] == 0].shape[0])
print(df[df['Insulin'] == 0].shape[0])
print(df[df['BMI'] == 0].shape[0])
Output:
5
35
227
374
11
Next, we visualize each column's distribution.
#Check the distribution of each column in the dataset
df.hist(figsize=(20,20))
plt.show()
Observation:
- Some of the columns have a skewed distribution, so the mean is more affected by outliers than the median. Glucose and BloodPressure have roughly normal distributions, hence we replace 0 values in these columns with the mean. SkinThickness, Insulin, and BMI have skewed distributions, hence the median is a better choice as it is less affected by outliers.
#Handle zero values
df['Glucose'] = df['Glucose'].replace(0, df['Glucose'].mean())
df['BloodPressure'] = df['BloodPressure'].replace(0, df['BloodPressure'].mean())
df['SkinThickness'] = df['SkinThickness'].replace(0, df['SkinThickness'].median())
df['Insulin'] = df['Insulin'].replace(0, df['Insulin'].median())
df['BMI'] = df['BMI'].replace(0, df['BMI'].median())
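A quick sanity check (my addition) confirms no zeros remain in the treated columns. Note that the replacement means and medians above are computed while the zeros are still present, which biases them slightly downward:

#Verify that no zeros remain in the treated columns
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
assert (df[zero_cols] == 0).sum().sum() == 0
print(df[zero_cols].min())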
DATA VISUALIZATIONS
In this step, we perform a simple visualization to check the relationship between the target column (Outcome) and the other columns.
#Get numerical columns
num_col = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
           'BMI', 'DiabetesPedigreeFunction', 'Age']

#Visualize columns with respect to the outcome
#Number of rows needed (assuming you want 2 histograms per row)
nrows = (len(num_col) + 1) // 2  #this rounds up the division
fig, axes = plt.subplots(nrows=nrows, ncols=2, figsize=(10, nrows * 5))
#Flatten the axes array to make it easier to iterate over
axes = axes.flatten()
for i, col in enumerate(num_col):
    sns.histplot(data=df, x=col, hue='Outcome', ax=axes[i])
    axes[i].set_title(f'Distribution of {col} by Outcome')
#Hide any unused subplots if there is an odd number of columns
for j in range(i + 1, len(axes)):
    axes[j].axis('off')
plt.tight_layout()
plt.show()
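Box plots offer another view of the same feature-versus-Outcome relationship, making the shift in medians between the two classes easier to read. A minimal sketch (my addition, reusing num_col, df, sns, and plt from above):

#Box plots of each feature split by Outcome
fig, axes = plt.subplots(nrows=(len(num_col) + 1) // 2, ncols=2, figsize=(10, 20))
axes = axes.flatten()
for i, col in enumerate(num_col):
    sns.boxplot(data=df, x='Outcome', y=col, ax=axes[i])
    axes[i].set_title(f'{col} by Outcome')
plt.tight_layout()
plt.show()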
Correlation Heatmap
#Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True, fmt='.2f')
plt.title('CORRELATION HEATMAP')
plt.show()
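To read off which features correlate most with the target without scanning the whole grid, the Outcome column of the correlation matrix can be sorted directly (a small addition of mine):

#Features ranked by correlation with the target
print(df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False))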
Data Preparation
In this step, I first scale the dataset using StandardScaler and split it into X (feature variables) and y (target variable).
#Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(df.drop(columns=['Outcome'])), columns=df.columns[:-1])
y = df['Outcome']
y
Then, I split the dataset into train and test sets using scikit-learn's train_test_split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Observation:
- The dataset was split into feature [X] and target [y] variables.
- It was then split into train and test sets using train_test_split.
- The dataset was split into 80% training data and 20% test data. (A note on scaler fitting and leakage follows below.)
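One caveat worth flagging (my note, not part of the original workflow): because the scaler was fitted on the full dataset before splitting, test-set statistics leak into training. A leakage-free variant fits the scaler on the training split only:

#Fit the scaler on the training split only, then apply it to both splits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_raw = df.drop(columns=['Outcome'])
y = df['Outcome']
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train_raw)  #statistics come from the training split only
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)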
Model Selection and Evaluation
We used two models for this prediction project:
- Logistic Regression: a statistical method used to predict the probability of a binary outcome (like yes/no, 0/1) based on one or more independent variables, essentially predicting the likelihood of an event occurring.
- Random Forest Classifier: a machine learning algorithm that uses an ensemble of decision trees to classify data, making predictions by aggregating the predictions of the individual trees. It's a powerful and versatile tool known for its accuracy and efficiency.
Logistic Regression
Build the model for prediction.
#Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

lr = LogisticRegression()
lr.fit(X_train, y_train)

#Predictions
train_pred = lr.predict(X_train)  #prediction on the training set
test_pred = lr.predict(X_test)  #prediction on the test set

#Accuracy scores
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print('Train Set Accuracy: ', train_acc * 100)
print('Test Set Accuracy: ', test_acc * 100)
print()

#Confusion matrix and classification report
print('Confusion Matrix:\n', confusion_matrix(y_test, test_pred))
print('Classification Report:\n', classification_report(y_test, test_pred))
#Visualize the Logistic Regression confusion matrix
#Convert to a matrix (values transcribed from the printed confusion matrix above)
conf_matrix = np.array([[82, 18], [27, 27]])
#Convert to a dataframe
df_cm = pd.DataFrame(conf_matrix, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
#Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.show()
Observation:
- The model achieves 79.48% accuracy on the training set and 70.78% accuracy on the test set, indicating a moderate drop in performance, which suggests some overfitting.
- From the confusion matrix, the model correctly classifies 82 non-diabetic patients but misclassifies 18 as diabetic. It also correctly classifies only 27 diabetic patients while misclassifying 27 as non-diabetic, which may indicate difficulty in distinguishing diabetic cases.
- The classification report shows that the model has higher precision (0.75) and recall (0.82) for non-diabetic cases compared to diabetic cases (precision = 0.60, recall = 0.50). This suggests the model is better at identifying non-diabetic patients but struggles with diabetic cases, likely due to class imbalance or feature representation; one common mitigation is sketched below.
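Since the dataset has roughly twice as many non-diabetic as diabetic samples, one common mitigation (my suggestion, not part of the original workflow) is to reweight the classes inversely to their frequency:

#Logistic regression with class reweighting to compensate for the imbalance
lr_balanced = LogisticRegression(class_weight='balanced')
lr_balanced.fit(X_train, y_train)
print(classification_report(y_test, lr_balanced.predict(X_test)))

This typically trades some precision on the majority class for better recall on the diabetic class.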
Random Forest Classifier
Build the model for prediction.
#Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

#Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

#Perform grid search with cross-validation
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

#Get the best estimator
print('Best params: ', grid.best_params_)
rfc = grid.best_estimator_

#Predictions
rf_train_pred = rfc.predict(X_train)
rf_test_pred = rfc.predict(X_test)

#Accuracy scores
rf_train_acc = accuracy_score(y_train, rf_train_pred)
rf_test_acc = accuracy_score(y_test, rf_test_pred)
print('Train Set Accuracy: ', rf_train_acc * 100)
print('Test Set Accuracy: ', rf_test_acc * 100)
print()

#Confusion matrix and classification report
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_test_pred))
print('Classification Report:\n', classification_report(y_test, rf_test_pred))
#Visualize the Random Forest confusion matrix
#Convert to a matrix (values transcribed from the printed confusion matrix above)
rf_matrix = np.array([[83, 17], [19, 35]])
#Convert to a dataframe
rf_df = pd.DataFrame(rf_matrix, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
#Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(rf_df, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.show()
Observation:
- The model's training accuracy improved to 93.16%, while test accuracy increased to 76.62%, showing better generalization but still some overfitting.
- The confusion matrix indicates that the model correctly classifies 83 non-diabetic and 35 diabetic patients, with fewer misclassifications compared to the previous model. However, 17 non-diabetic and 19 diabetic patients are still misclassified.
- The classification report shows an improvement in detecting diabetic cases (precision = 0.67, recall = 0.65, f1-score = 0.66), meaning the model is now slightly better at identifying diabetes, though it still favors non-diabetic predictions (precision = 0.81, recall = 0.83).
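Random forests also expose how much each feature contributed to the trees' splits, which is a useful sanity check on the model. A short sketch (my addition, using the fitted rfc and the num_col list from earlier):

#Rank features by their importance in the fitted random forest
importances = pd.Series(rfc.feature_importances_, index=num_col).sort_values(ascending=False)
print(importances)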
Save The Model
The Random Forest classifier is the better-performing model; it will be saved using the pickle library and used to build our app. The StandardScaler will also be saved for use in the app: when a user inputs details, the app will first scale the inputs before passing them to the model for prediction.
#Import required library
import pickle

pickle.dump(rfc, open('model.pkl', 'wb'))
pickle.dump(scaler, open('scaler.pkl', 'wb'))
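A quick round-trip check (my addition) confirms that the pickled artifacts load back and reproduce the test predictions:

#Reload the saved artifacts and verify they reproduce the test predictions
loaded_model = pickle.load(open('model.pkl', 'rb'))
loaded_scaler = pickle.load(open('scaler.pkl', 'rb'))
assert (loaded_model.predict(X_test) == rf_test_pred).all()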
BUILD AND DEPLOY THE APP
Now, we build and deploy the app using STREAMLIT.
import streamlit as st
import pickle
import numpy as np
import time

# Load the trained model and scaler
model = pickle.load(open('model.pkl', 'rb'))
scaler = pickle.load(open('scaler.pkl', 'rb'))

# Streamlit app styling (custom CSS omitted in this excerpt)
st.markdown(
    """
    """,
    unsafe_allow_html=True
)

# Title
st.markdown("""
This app predicts the likelihood of diabetes based on patient medical details.
""", unsafe_allow_html=True)

# Sidebar for user inputs
st.sidebar.header("Enter Patient Details 🏥")
pregnancies = st.sidebar.slider('Pregnancies 🤰', 0, 20, 0)
glucose = st.sidebar.slider('Glucose Level 🍬', 0, 300, 120)
blood_pressure = st.sidebar.slider('Blood Pressure 💉', 0, 200, 80)
skin_thickness = st.sidebar.slider('Skin Thickness 📏', 0, 100, 20)
insulin = st.sidebar.slider('Insulin Level 💊', 0, 900, 80)
bmi = st.sidebar.slider('BMI ⚖️', 0.0, 70.0, 25.0, step=0.1)
dpf = st.sidebar.slider('Diabetes Pedigree Function 🧬', 0.0, 3.0, 0.5, step=0.01)
age = st.sidebar.slider('Age 🎂', 0, 120, 30)

# Prediction button
if st.sidebar.button('🔍 Predict Diabetes'):
    # Create the input array
    input_data = np.array([[pregnancies, glucose, blood_pressure, skin_thickness, insulin, bmi, dpf, age]])
    # Scale the data
    input_data_scaled = scaler.transform(input_data)
    # Loading animation
    with st.spinner('Analyzing medical data... ⏳'):
        time.sleep(2)
    # Make the prediction
    prediction = model.predict(input_data_scaled)
    # Display the result
    if prediction[0] == 1:
        st.markdown("""
        🚨 Prediction: DIABETES DETECTED!
        Please consult a medical professional.
        """, unsafe_allow_html=True)
    else:
        st.markdown("""
        ✅ Prediction: NO DIABETES!
        Maintain a healthy lifestyle! 🏃‍♂️🥗
        """, unsafe_allow_html=True)
Above are images of the running app, showing both the No Diabetes and Diabetes Detected predictions.
CONCLUSION
In this article, using the diabetes dataset, we demonstrated an end-to-end machine learning and deployment project from beginning to end. Data cleaning and visualization were our first steps. Then, to provide better data for training the machine learning models, the data was scaled using StandardScaler. After that, we built two models, Logistic Regression and the Random Forest Classifier, of which the Random Forest was the better-performing model; it was saved and used to build our app with Streamlit. The model can still be improved using more advanced techniques, which were not discussed here, as the main goal of this article is to show the use of the Random Forest classifier and Streamlit app building.
You can check out the GitHub file here: Raw File