    Building a Regression Model to Predict Delivery Durations: A Practical Guide | by Jimin Kang | Dec, 2024



    Data Preparation & Exploratory Analysis

    Now that we’ve outlined our approach, let’s take a look at our data and the kinds of features we’re working with.
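    The overview referenced below can come from a quick df.info() call; here is a minimal sketch, assuming the raw deliveries data sits in a local CSV (the path 'datasets/historical_data.csv' is an illustrative assumption, not taken from the original post).

    import pandas as pd

    # load the raw deliveries data (path is an assumed placeholder)
    df = pd.read_csv('datasets/historical_data.csv')

    # overview: row count, column dtypes, and non-null counts per feature
    df.info()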

    From the above, we see our data contains ~197,000 deliveries, with a variety of numeric & non-numeric features. None of the features are missing a large proportion of values (lowest non-null count ~181,000), so we likely won’t have to worry about dropping any features entirely.

    Let’s check whether our data contains any duplicated deliveries, and whether there are any observations for which we cannot compute the delivery time.

    print(f"Variety of duplicates: {df.duplicated().sum()} n")

    print(pd.DataFrame({'Missing Count': df[['created_at', 'actual_delivery_time']].isna().sum()}))

    We see that all the deliveries are unique. However, there are 7 deliveries that are missing a value for actual_delivery_time, which means we won’t be able to compute the delivery duration for those orders. Since there are only a handful of these, we’ll remove those observations from our data.
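    The exact removal code isn’t shown in the original; a minimal sketch of that step could look like this.

    # drop the rows missing 'actual_delivery_time', since no delivery duration can be computed for them
    df = df.dropna(subset=['actual_delivery_time']).reset_index(drop=True)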

    Now, let’s create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order (‘created_at’) and when they received the order (‘actual_delivery_time’).

    # convert columns to datetime 
    df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
    df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

    # create prediction target
    df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()

    The last thing we’ll do before splitting our data into train/test is check for missing values. We already viewed the non-null counts for each feature above, but let’s view the proportions to get a better picture.
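    The proportions appear as a displayed table in the original; a quick sketch that would reproduce that view:

    # percentage of missing values per column, largest first
    missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
    print(missing_pct.round(2))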

    We see that the market features (‘onshift_dashers’, ‘busy_dashers’, ‘outstanding_orders’) have the highest proportion of missing values (~8% missing). The feature with the second-highest missing data rate is ‘store_primary_category’ (~2%). All other features have < 1% missing.

    Since none of the features have a high missing count, we won’t remove any of them. Later on, we’ll look at the feature distributions to help us decide how to appropriately deal with the missing observations for each feature.

    But first, let’s split our data into train/test. We will proceed with an 80/20 split, and we’ll write the test data to a separate file which we won’t touch until evaluating our final model.

    from sklearn.model_selection import train_test_split
    import os

    # shuffle
    df = df.sample(frac=1, random_state=42)
    df = df.reset_index(drop=True)

    # split
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    # write test data to a separate file
    directory = 'datasets'
    file_name = 'test_data.csv'
    file_path = os.path.join(directory, file_name)
    os.makedirs(directory, exist_ok=True)
    test_df.to_csv(file_path, index=False)

    Now, let’s dive into the specifics of our train data. We’ll establish our numeric & categorical features, to make it clear which columns are being referenced in later exploratory steps.

    categorical_feats = [
        'market_id',
        'store_id',
        'store_primary_category',
        'order_protocol'
    ]

    numeric_feats = [
        'total_items',
        'subtotal',
        'num_distinct_items',
        'min_item_price',
        'max_item_price',
        'total_onshift_dashers',
        'total_busy_dashers',
        'total_outstanding_orders',
        'estimated_order_place_duration',
        'estimated_store_to_consumer_driving_duration'
    ]

    Let’s revisit the categorical features with missing values (‘market_id’, ‘store_primary_category’, ‘order_protocol’). Since there was little missing data among these features (< 3%), we’ll simply impute the missing values with an “unknown” category.

    • This way, we won’t have to remove data from other features.
    • Perhaps the absence of feature values holds some predictive power for delivery duration, i.e., these features are not missing at random.
    • Additionally, we’ll add this imputation step to our preprocessing pipeline during modeling, so that we won’t have to manually duplicate this work on our test set (see the pipeline sketch after the code below).
    missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

    train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")
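    As noted in the last bullet above, the same imputation will eventually be folded into the modeling pipeline so the test set receives identical treatment. A minimal sketch of how that could look with scikit-learn (an assumption about the later pipeline, not code shown at this point in the original):

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer

    # impute missing categorical values with the constant "unknown",
    # mirroring the manual fillna above
    categorical_imputer = ColumnTransformer(
        transformers=[
            ('cat_impute',
             SimpleImputer(strategy='constant', fill_value='unknown'),
             missing_cols_categorical)
        ],
        remainder='passthrough'
    )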

    Let’s take a look at our categorical features.

    pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')

    Since ‘market_id’ & ‘order_protocol’ have low cardinality, we can visualize their distributions easily. On the other hand, ‘store_id’ & ‘store_primary_category’ are high cardinality features. We’ll take a deeper look at these later.

    import seaborn as sns
    import matplotlib.pyplot as plt

    categorical_feats_subset = [
        'market_id',
        'order_protocol'
    ]

    # Set up the grid
    fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)

    # Create barplots for each variable
    for i, col in enumerate(categorical_feats_subset):
        sns.countplot(x=col, data=train_df, ax=axes[i])
        axes[i].set_title(f"Frequencies: {col}")

    # Adjust layout
    plt.tight_layout()
    plt.show()

    Some key things to note:

    • ~70% of orders placed have a ‘market_id’ of 1, 2, or 4
    • < 1% of orders have ‘order_protocol’ of 6 or 7

    Unfortunately, we don’t have any additional information about these variables, such as which ‘market_id’ values are associated with which cities/locations, and what each ‘order_protocol’ number represents. At this point, asking for additional data about these variables may be a good idea, as it could help with investigating trends in delivery duration across broader region/location categorizations.

    Let’s take a look at our higher cardinality categorical features. Perhaps each ‘store_primary_category’ has an associated ‘store_id’ range? If so, we may not need ‘store_id’, as ‘store_primary_category’ would already encapsulate much of the information about the store being ordered from.

    store_info = train_df[['store_id', 'store_primary_category']]

    store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])

    Clearly not the case: we see that ‘store_id’ ranges overlap across levels of ‘store_primary_category’.

    A quick look at the distinct values and associated frequencies for ‘store_id’ & ‘store_primary_category’ shows that these features have high cardinality and are sparsely distributed. In general, high cardinality categorical features can be problematic in regression tasks, particularly for regression algorithms that require only numeric data. When these high cardinality features are encoded, they can enlarge the feature space drastically, making the available data sparse and decreasing the model’s ability to generalize to new observations in that feature space. For a better & more expert explanation of the phenomenon, you can read more about it here.

    Let’s get a sense of how sparsely distributed these features are.

    store_id_values = train_df['store_id'].value_counts()

    # Plot the histogram
    plt.figure(figsize=(8, 5))
    plt.bar(store_id_values.index, store_id_values.values, color='skyblue')

    # Add titles and labels
    plt.title('Value Counts: store_id', fontsize=14)
    plt.xlabel('store_id', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
    plt.tight_layout()
    plt.show()

    We see that there are a handful of stores that have hundreds of orders, but the majority of them have far fewer than 100.

    To deal with the high cardinality of ‘store_id’, we’ll create another feature, ‘store_id_freq’, that groups the ‘store_id’ values by frequency.

    • We’ll group the ‘store_id’ values into 5 different percentile bins shown below.
    • ‘store_id_freq’ will have much lower cardinality than ‘store_id’, but will retain relevant information regarding the popularity of the store the delivery was ordered from.
    • For more inspiration behind this logic, check out this thread.
    import numpy as np

    def encode_frequency(freq, percentiles) -> str:
        if freq < percentiles[0]:
            return '[0-50)'
        elif freq < percentiles[1]:
            return '[50-75)'
        elif freq < percentiles[2]:
            return '[75-90)'
        elif freq < percentiles[3]:
            return '[90-99)'
        else:
            return '99+'

    value_counts = train_df['store_id'].value_counts()
    percentiles = np.percentile(value_counts, [50, 75, 90, 99])

    # apply encode_frequency to each store_id based on its number of orders
    train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

    pd.DataFrame({'Count': train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')

    Our encoding shows us that ~60,000 deliveries were ordered from stores categorized in the 90–99th percentile in terms of popularity, while ~12,000 deliveries were ordered from stores in the 0–50th percentile of popularity.

    Now that we’ve (tried to) capture the relevant ‘store_id’ information in a lower dimension, let’s try to do something similar with ‘store_primary_category’.

    Let’s look at the most popular ‘store_primary_category’ levels.
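    The original displays these as a table; a short sketch that would produce the same view:

    # most common store categories in the training data
    print(train_df['store_primary_category'].value_counts().head(15))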

    A quick look shows us that many of these ‘store_primary_category’ levels are not mutually exclusive (e.g., ‘american’ & ‘burger’). Further investigation shows many more examples of this kind of overlap.

    So, let’s try to map these distinct store categories into a few basic, all-encompassing groups.

    store_category_map = {
        'american': ['american', 'burger', 'sandwich', 'barbeque'],
        'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
                  'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
        'mexican': ['mexican'],
        'italian': ['italian', 'pizza'],
    }

    def map_to_category_type(category: str) -> str:
        for category_type, categories in store_category_map.items():
            if category in categories:
                return category_type
        return "other"

    train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))

    value_counts = train_df['store_category_type'].value_counts()

    # Plot pie chart
    plt.figure(figsize=(6, 6))
    value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
    plt.title('Category Distribution')
    plt.ylabel('')  # Hide y-axis label for aesthetics
    plt.show()

    This grouping is admittedly quite simple, and there may very well be a better way to group these store categories. Let’s proceed with it for now for simplicity.

    We’ve done a good deal of investigation into our categorical features. Let’s look at the distributions of our numeric features.

    # Create grid for boxplots
    fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))  # Adjust figure size
    axes = axes.flatten()  # Flatten the 5x2 axes into a 1D array for easier iteration

    # Generate boxplots for each numeric feature
    for i, column in enumerate(numeric_feats):
        sns.boxplot(y=train_df[column], ax=axes[i])
        axes[i].set_title(f"Boxplot for {column}")
        axes[i].set_ylabel(column)

    # Remove any unused subplots (if any)
    for i in range(len(numeric_feats), len(axes)):
        fig.delaxes(axes[i])

    # Adjust layout for better spacing
    plt.tight_layout()
    plt.show()

    Boxplots for a subset of our numeric features

    Many of the distributions appear more right-skewed than they really are due to the presence of outliers.

    In particular, there appears to be an order with 400+ items. This seems strange, as the next largest order is fewer than 100 items.

    Let’s look more closely at that 400+ item order.

    train_df[train_df['total_items']==train_df['total_items'].max()]


