About the Company
Porter is India's largest marketplace for intra-city logistics, revolutionizing last-mile deliveries across a wide range of sectors. As a frontrunner in the country's $40 billion intra-city logistics market, Porter has significantly improved operational efficiency and livelihood opportunities for over 150,000 driver-partners. The company has successfully fulfilled more than 5 million customer orders, offering a seamless, technology-driven logistics experience.
Objective
Efficient food delivery is crucial for customer satisfaction, and one of the key aspects is estimating delivery time accurately. Porter collaborates with multiple restaurants and operates a fleet of delivery partners, but predicting delivery times can be complex due to numerous influencing factors. The objective of this project is to build a machine learning model that predicts the estimated delivery time based on:
- Order details: Number of items, distinct items, total price, etc.
- Restaurant information: Market ID, store category.
- Logistics data: Availability of delivery partners, outstanding orders, and order fulfilment capacity.
By leveraging historical data and machine learning techniques, we aim to develop a robust regression model that accurately predicts delivery time.
Concepts Used
This project involves several key concepts from data science and machine learning, including:
- Exploratory Data Analysis (EDA): Understanding data distribution, relationships, and feature importance.
- Feature Engineering & Preprocessing: Handling missing values, encoding categorical variables, and feature scaling.
- Regression Modelling: Using algorithms such as Neural Networks, XGBoost Regressor, and Linear Regression to predict delivery time.
- Model Evaluation: Comparing models using error metrics such as MAE (Mean Absolute Error) and R² (R-squared).
Challenges and Considerations
Some challenges in estimating delivery time include:
- Real-time Partner Availability: The number of free delivery partners varies dynamically.
- Order Complexity: Larger orders with multiple items may take longer to prepare.
- Traffic and Environmental Factors: Unpredictable conditions may impact delivery speed.
Addressing these factors requires a well-designed predictive model that captures relevant features and adapts to real-world variations.
Before building our predictive model, it is essential to explore the dataset and understand the underlying patterns. This section covers data inspection, preprocessing, and the key insights derived from the exploratory analysis.
The dataset consists of records where each row represents a unique food delivery order. The main features include:
- Order Information: Market ID, store category, order protocol, number of items, and total price.
- Logistics Data: Number of on-shift and busy delivery partners, outstanding orders.
- Timestamps: Order placement time and actual delivery time.
We start by loading the dataset and checking its structure.
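A minimal loading-and-inspection sketch is shown below; the file name porter_data.csv is an assumption, since the article does not name the source file.
import pandas as pd
import numpy as np

# Assumed file name; replace with the actual dataset path.
df = pd.read_csv('porter_data.csv')

# Structure check: column types, non-null counts, a quick preview, and missing values.
df.info()
print(df.head())
print(df.isnull().sum())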
From the output, we confirm that the dataset contains missing values, categorical variables, and timestamps that need conversion.
The created_at and actual_delivery_time columns are converted to datetime format. We then derive additional features:
- Time Taken for Delivery: Difference between actual_delivery_time and created_at, converted to minutes.
- Day and Time Features: Extracting the day of the week, month, hour, and minute from the timestamps to capture time-based trends.
# Converting date columns to datetime format
df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])

# Creating a column with the delivery time of the order (in minutes)
df['time_taken'] = df['actual_delivery_time'] - df['created_at']
df['time_taken'] = df['time_taken'].dt.total_seconds() / 60
df['time_taken'] = np.round(df['time_taken'])

# Extracting day/time features from the order and delivery timestamps
df['day_of_week'] = df['created_at'].dt.day_of_week
df['year_o'] = df['created_at'].dt.year.astype('int64')
df['month_o'] = df['created_at'].dt.month.astype('int64')
df['day_o'] = df['created_at'].dt.day.astype('int64')
df['hour_o'] = df['created_at'].dt.hour.astype('int64')
df['minute_o'] = df['created_at'].dt.minute.astype('int64')
df['second_o'] = df['created_at'].dt.second.astype('int64')
df['year_d'] = df['actual_delivery_time'].dt.year.astype('int64')
df['month_d'] = df['actual_delivery_time'].dt.month.astype('int64')
df['day_d'] = df['actual_delivery_time'].dt.day.astype('int64')
df['hour_d'] = df['actual_delivery_time'].dt.hour.astype('int64')
df['minute_d'] = df['actual_delivery_time'].dt.minute.astype('int64')
df['second_d'] = df['actual_delivery_time'].dt.second.astype('int64')
df['day_of_week_d'] = df['actual_delivery_time'].dt.day_of_week

# Dropping the date columns as we have extracted the required information.
df.drop(['created_at', 'actual_delivery_time'], axis=1, inplace=True)
To ensure data integrity, missing values are handled as follows:
- Categorical Features: Filled using the mode (most frequent value).
- Numerical Features: Filled using the median for robustness against outliers.
df['store_primary_category'] = df['store_primary_category'].fillna(df['store_primary_category'].mode()[0])
df['total_onshift_partners'] = df['total_onshift_partners'].fillna(df['total_onshift_partners'].median())
df['total_busy_partners'] = df['total_busy_partners'].fillna(df['total_busy_partners'].median())
df['total_outstanding_orders'] = df['total_outstanding_orders'].fillna(df['total_outstanding_orders'].median())
df['market_id'] = df['market_id'].fillna(df['market_id'].mode()[0])
df['order_protocol'] = df['order_protocol'].fillna(df['order_protocol'].mode()[0])
Distribution of Delivery Time
We plot a histogram of delivery times to understand the distribution:
plt.figure(figsize=(15, 6))
sns.set_style('whitegrid')
sns.histplot(df['time_taken'])
plt.xscale('log')
plt.show()
Insight: Most deliveries take between 10 and 100 minutes, indicating a right-skewed distribution.
We visualize the top 10 restaurant categories with the highest order counts:
plt.figure(figsize=(12, 8))
top_categories = df['store_primary_category'].value_counts().nlargest(10)
sns.barplot(
    x=top_categories.values,
    y=top_categories.index,
    palette='Blues_r'
)
for i, v in enumerate(top_categories.values):
    plt.text(v + 5, i, str(v), va='center', fontsize=10, color='black')
plt.title('Top 10 Store Primary Categories', fontsize=16)
plt.xlabel('Count', fontsize=14)
plt.ylabel('Category', fontsize=14)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
We analyze when orders peak during the day and across the week.
Insights:
- Orders peak during the early morning hours (2 AM to 4 AM).
- Fridays, Saturdays, and Sundays see the highest number of orders.
fig, ax = plt.subplots(figsize=(15, 8), nrows=1, ncols=2)
sns.countplot(data=df, x='hour_d', ax=ax[0], color='#82b1ff')
sns.countplot(data=df, x='day_of_week_d', ax=ax[1], color='#ffcc80')
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
ax[1].set_xticks(range(7))
ax[1].set_xticklabels(days)
ax[0].set_title('Count of orders by hour of day')
ax[1].set_title('Count of orders by day of week')
plt.show()
We examine the number of orders across different markets and order protocols.
Insights:
- Market IDs 2 and 4 receive the highest number of orders.
- Certain order protocols dominate the order placement methods.
fig2, ax = plt.subplots(figsize=(15, 6), nrows=1, ncols=2)
sns.countplot(data=df, x='market_id', ax=ax[0], color='#82b1ff')
ax[0].set_title('Number of orders by Market ID')
sns.countplot(data=df, x='order_protocol', ax=ax[1], color='#ffcc80')
ax[1].set_title('Number of orders by Order Protocol')
plt.show()
To understand the supply-demand balance, we analyze the average number of on-shift and busy partners per market:
Insight: Markets 2 and 4 have the highest partner availability and outstanding orders, indicating high demand.
avg_data_market = pd.DataFrame({
    'Avg. Onshift Partners': df.groupby('market_id')['total_onshift_partners'].mean(),
    'Avg. Busy Partners': df.groupby('market_id')['total_busy_partners'].mean(),
    'Avg. Outstanding Orders': df.groupby('market_id')['total_outstanding_orders'].mean()
}).sort_values('Avg. Outstanding Orders', ascending=False)

avg_data_market.plot(
    kind='barh',
    stacked=True,
    figsize=(12, 8),
    color=['#82b1ff', '#ffcc80', '#a5d6a7']
)
plt.title('Average Partners and Orders per Market', fontsize=16)
plt.xlabel('Average Count', fontsize=14)
plt.ylabel('Market ID', fontsize=14)
plt.legend(title='Metrics', fontsize=10, title_fontsize=12)
plt.tight_layout()
plt.show()
We check how the maximum item price varies across different markets and order methods (a sketch of this plot follows the insights below).
Insights:
- Market ID 4 receives the highest-value orders, warranting prioritized resource allocation.
- Order Protocol 1 sees high-value transactions, making it a preferred method for expensive deliveries.
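The plot behind these insights is not reproduced in the article; the sketch below shows one way it could be generated, assuming a simple groupby on max_item_price (the exact styling of the original figure may differ).
# Sketch (assumed): average maximum item price by market and by order protocol.
fig3, ax = plt.subplots(figsize=(15, 6), nrows=1, ncols=2)
df.groupby('market_id')['max_item_price'].mean().plot(kind='bar', ax=ax[0], color='#82b1ff')
ax[0].set_title('Avg. max item price by Market ID')
df.groupby('order_protocol')['max_item_price'].mean().plot(kind='bar', ax=ax[1], color='#ffcc80')
ax[1].set_title('Avg. max item price by Order Protocol')
plt.tight_layout()
plt.show()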
avg_data = pd.DataFrame({
    'Avg. Onshift Partners': df.groupby('day_of_week_d')['total_onshift_partners'].mean(),
    'Avg. Busy Partners': df.groupby('day_of_week_d')['total_busy_partners'].mean(),
    'Avg. Outstanding Orders': df.groupby('day_of_week_d')['total_outstanding_orders'].mean()
})

avg_data.plot(
    kind='bar',
    stacked=True,
    figsize=(10, 6),
    color=['#82b1ff', '#ffcc80', '#a5d6a7']
)
plt.title('Avg. Partners and Orders per Day of Week', fontsize=16)
plt.xlabel('Day of the Week', fontsize=14)
plt.ylabel('Average Count', fontsize=14)
plt.xticks(ticks=range(7), labels=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'], rotation=45)
plt.legend(title='Metrics', fontsize=10, title_fontsize=12)
plt.tight_layout()
plt.show()
We analyze whether delivery time and order value vary by day of the week.
Insights:
- Weekends see longer delivery times, likely due to high order volume.
- Higher-value orders are also placed on weekends, reinforcing the weekend demand surge.
plt.figure(figsize=(15, 6))
sns.set_style("whitegrid")
palette = sns.color_palette("Paired")
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

plt.subplot(1, 2, 1)
sns.lineplot(data=df_3, x='day_of_week_d', y='time_taken', color='mediumturquoise')
ax = plt.gca()
ax.set_xticks(range(7))
ax.set_xticklabels(days)
ax.set_title('Average time taken by orders by day of week')

plt.subplot(1, 2, 2)
sns.lineplot(data=df_3, x='day_of_week_d', y='subtotal', color='mediumturquoise')
ax = plt.gca()
ax.set_xticks(range(7))
ax.set_xticklabels(days)
ax.set_title('Average subtotal of orders by day of week')
plt.tight_layout()
plt.show()
- Delivery time is skewed, with most orders taking between 10 and 100 minutes.
- Peak order hours are between 2 AM and 4 AM, and weekends see the most orders.
- Market IDs 2 and 4 have the highest demand and partner availability.
- More expensive orders are placed via Order Protocol 1 and Market 4.
- Delivery time increases on weekends, aligning with high-value orders.
To understand the relationships between numerical variables, we first compute the correlation matrix:
df_corr = df.drop(['market_id', 'store_primary_category', 'order_protocol', 'day_of_week', 'year_o', 'month_o', 'day_o', 'hour_o', 'minute_o', 'second_o', 'year_d', 'month_d', 'day_d', 'minute_d', 'second_d'], axis=1).corr()
plt.figure(figsize=(15, 10))
sns.heatmap(df_corr, annot=True, cmap='viridis')
plt.title('Correlation Matrix', fontsize=16)
A strong correlation (0.94) is observed between total_onshift_partners, total_busy_partners, and total_outstanding_orders.
To mitigate multicollinearity, we progressively remove highly correlated features and recompute the correlation matrix.
df_corr = df.drop(['market_id','store_primary_category','order_protocol','day_of_week','year_o','month_o','day_o','hour_o','minute_o','second_o','year_d','month_d','day_d','minute_d','second_d', 'total_busy_partners', 'total_outstanding_orders'], axis=1).corr()
plt.figure(figsize=(15, 10))
sns.heatmap(df_corr, annot=True, cmap='viridis')
plt.title('Correlation Matrix', fontsize=16)
- After removal, we observe no high correlations, confirming our decision.
- We permanently drop total_busy_partners and total_outstanding_orders.
df.drop(['total_busy_partners', 'total_outstanding_orders'], axis=1, inplace=True)
Outliers can significantly affect model performance by forcing the model to learn extreme data points that do not represent the general trend. We analyze outliers in key numerical features:
num_cols = ['total_onshift_partners', 'max_item_price', 'min_item_price', 'subtotal', 'time_taken']

fig6, ax = plt.subplots(nrows=5, ncols=2, figsize=(15, 15))
for i, col in enumerate(num_cols):
    sns.distplot(df[col], ax=ax[i, 0], color='#E6A9EC')
    sns.boxplot(data=df, x=df[col], ax=ax[i, 1], color='#F08080')
    ax[i, 0].set_title(f"{col} Distribution")
    ax[i, 1].set_title(f"Boxplot of {col}")
plt.tight_layout()
- Many numerical features exhibit outliers beyond the 99th percentile.
- We decide to remove values above the 99th percentile as potential outliers.
p1 = np.percentile(df['total_onshift_partners'], 99)
p2 = np.percentile(df['max_item_price'], 99)
p3 = np.percentile(df['subtotal'], 99)
p4 = np.percentile(df['time_taken'], 99)
p5 = np.percentile(df['min_item_price'], 99)

df = df[~(df['total_onshift_partners'] > p1)]
df = df[~(df['max_item_price'] > p2)]
df = df[~(df['subtotal'] > p3)]
df = df[~(df['time_taken'] > p4)]
df = df[~(df['min_item_price'] > p5)]
Categorical features must be transformed for machine learning models. We apply target encoding to store_primary_category, replacing each category with its mean delivery time (time_taken).
# Performing target encoding on 'store_primary_category'.
df['store_primary_category'] = df.groupby('store_primary_category')['time_taken'].transform('mean')
To prepare the data for model training, we split it into training and test sets and apply standard scaling.
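The split below assumes the feature matrix X and target y have already been separated; a minimal sketch of that step (treating time_taken as the target and the remaining columns as features) would be:
# Assumed feature/target split: time_taken is the value we want to predict.
X = df.drop('time_taken', axis=1)
y = df['time_taken']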
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Validation split carved out of the training set
X_train_val, X_test_val, y_train_val, y_test_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train_val_scaled = scaler.fit_transform(X_train_val)
X_test_val_scaled = scaler.transform(X_test_val)
X_test_scaled = scaler.transform(X_test)
After completing the preprocessing steps, we proceeded with training several models to predict delivery times. We explored a Neural Network (NN), an XGBoost Regressor, and Linear Regression, comparing their Mean Absolute Error (MAE) and R² scores to evaluate performance.
Neural Network Model
We designed a deep learning model using the Keras Sequential API, consisting of several dense layers. The architecture was as follows:
- 256 neurons in the input layer
- Hidden layers: 128, 64, 32, and 16 neurons
- LeakyReLU activations for better gradient flow
- Batch Normalization to stabilize training
- Linear activation in the output layer
model = Sequential([
    Dense(256, input_shape=(X_train_val_scaled.shape[1],)),
    LeakyReLU(),
    Dense(128),
    BatchNormalization(),
    LeakyReLU(),
    Dense(64),
    LeakyReLU(),
    Dense(32),
    LeakyReLU(),
    Dense(16),
    BatchNormalization(),
    LeakyReLU(),
    Dense(1, activation="linear")
])
Training Strategy
To optimize training, we used:
- Learning Rate Scheduling: The learning rate decreases over epochs using an adaptive decay function.
- Early Stopping: Prevents overfitting by monitoring validation MAE and stopping training when performance no longer improves.
- TensorBoard Logging: Used for monitoring the model's training progress.
def advanced_lr_decay(epoch, lr):
    if epoch < 50:
        return lr * 0.97
    elif epoch < 100:
        return lr * 0.95
    elif epoch < 150:
        return lr * 0.93
    else:
        return lr * 0.90

scheduler = tf.keras.callbacks.LearningRateScheduler(advanced_lr_decay)
Model Training
model.compile(optimizer='adam', loss=Huber(delta=1.0), metrics=["mae"])

early_stop = EarlyStopping(monitor="val_mae", patience=30, restore_best_weights=True)
log_dir = "logs/tuning/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

# Checkpoint the weights with the lowest validation MAE (there is no accuracy metric in this regression setup)
ModelCheckpointCallback = tf.keras.callbacks.ModelCheckpoint(filepath='best_model_reg.h5',
                                                             monitor='val_mae',
                                                             save_best_only=True,
                                                             mode='min')

history = model.fit(X_train_val_scaled, y_train_val, epochs=200, batch_size=256,
                    validation_data=(X_test_val_scaled, y_test_val),
                    callbacks=[early_stop, scheduler, tensorboard_callback, ModelCheckpointCallback])
Training Performance
After training for 200 epochs, the model achieved:
- Validation MAE: 0.2326
- R² Score: 0.9837
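A minimal sketch of how these metrics can be computed on the held-out validation split (assuming the trained model and the scaled arrays from the previous step):
from sklearn.metrics import mean_absolute_error, r2_score

# Predict on the validation split and score the model.
y_pred_val = model.predict(X_test_val_scaled).flatten()
print('MAE:', mean_absolute_error(y_test_val, y_pred_val))
print('R2:', r2_score(y_test_val, y_pred_val))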
Loss vs. Epochs Plot
epochs = history.epoch
loss = history.history['loss']
mae = history.history['mae']
val_loss = history.history['val_loss']
val_mae = history.history['val_mae']

plt.figure()
plt.plot(epochs, loss, label="train")
plt.plot(epochs, val_loss, label="val")
plt.legend()
plt.title("Loss vs. Epochs")
plt.show()

plt.figure()
plt.plot(epochs, mae, label="train")
plt.plot(epochs, val_mae, label="validation")
plt.legend()
plt.title("MAE vs. Epochs")
plt.show()