
    The Data Scientist’s Dilemma: Answering “What If?” Questions Without Experiments | by Rémy Garnier | Jan, 2025

    By Team_AIBS News | January 9, 2025 | 6 Mins Read


    Now we have the features for our model. We’ll split our data into 3 sets:

    1- Training dataset: the set of data on which we will train our model.

    2- Test dataset: data used to evaluate the performance of our model.

    3- After-modification dataset: data used to compute the uplift using our model.

    import datetime as dt

    from sklearn.model_selection import train_test_split

    # First day on which the modification is active
    start_modification_date = dt.datetime(2024, 2, 1)

    # Split features and target around the modification date
    X_before_modification = X[X.index < start_modification_date]
    y_before_modification = y[y.index < start_modification_date].kpi
    X_after_modification = X[X.index >= start_modification_date]
    y_after_modification = y[y.index >= start_modification_date].kpi

    # Chronological split: the last 25% of the pre-modification period is the test set
    X_train, X_test, y_train, y_test = train_test_split(X_before_modification, y_before_modification, test_size=0.25, shuffle=False)

    Note: You should use a fourth subset of data to perform model selection. Here we won’t do much model selection, so it doesn’t matter a lot. But it will if you start to select your model among tens of others.

    Note 2: Cross-validation is also possible and recommended.

    Note 3: I do recommend splitting the data without shuffling (shuffle=False). It will allow you to be aware of any temporal drift of your model.
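
    The notes above can be combined: a minimal sketch of cross-validation that never shuffles, using scikit-learn’s TimeSeriesSplit. The synthetic data and the names X_demo and y_demo are assumptions made for illustration; they are not the article’s dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic daily data standing in for the pre-modification period (assumption).
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=300, freq="D")
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), index=dates, columns=["f1", "f2", "f3"])
y_demo = 10 + 2 * X_demo["f1"] + rng.normal(scale=0.5, size=300)

# Each fold trains on the past and validates on the block of days that follows it.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

scores = cross_val_score(
    RandomForestRegressor(min_samples_split=4, n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    cv=tscv,
    scoring="neg_mean_absolute_error",
)
print(-scores)  # mean absolute error per fold, oldest fold first
```

    An error that grows from the oldest fold to the most recent one is a hint of temporal drift.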

    from sklearn.ensemble import RandomForestRegressor

    # Fit the counterfactual predictor on the training period only
    model = RandomForestRegressor(min_samples_split=4)
    model.fit(X_train, y_train)
    # Predict the KPI on the held-out test period
    y_pred = model.predict(X_test)

    And here you train your predictor. We use a random forest regressor for its convenience, because it allows us to handle non-linearity, missing data, and outliers. Gradient Boosting Trees algorithms are also very good for this use.

    Many papers about Synthetic Control use linear regression here, but we think it is not useful in our case because we are not really interested in the model’s interpretability. Moreover, interpreting such a regression can be tricky.

    Counterfactual Analysis

    Our prediction will be evaluated on the testing set. The main hypothesis we make is that the performance of the model will stay the same when we compute the uplift. That is why we tend to use a lot of data in our test set. We consider 3 different key indicators to evaluate the quality of the counterfactual prediction:

    1- Bias: Bias measures the presence of a gap between your counterfactual and the real data. It is a strong limit on your ability to compute the uplift, because it won’t be reduced by waiting more time after the modification.

    # Mean forecasting error, relative to the average KPI before the modification
    bias = float((y_pred - y_test).mean() / y_before_modification.mean())
    bias
    > 0.0030433481322823257

    We usually express the bias as a percentage of the average value of the KPI. Here it is about 0.3%, well below 1%, so we should not expect to reliably measure effects smaller than that. If your bias is too big, you should check for a temporal drift (and add a trend to your prediction). You can also correct your prediction by subtracting the bias from it, provided you check the effect of this correction on fresh data.
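
    The correction described above can be sketched as follows: estimate the offset on a first slice of the test period, subtract it, and check the remaining bias on the later, untouched slice. The function name corrected_bias, the half-and-half split and the synthetic numbers are assumptions for illustration.

```python
import numpy as np

def corrected_bias(y_pred, y_true, kpi_mean, n_calibration):
    """Return (relative bias on the calibration slice, relative bias left on fresh data)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Offset estimated only on the first n_calibration points
    offset = (y_pred[:n_calibration] - y_true[:n_calibration]).mean()
    # Apply the correction to the later points, which were not used to estimate it
    fresh_error = (y_pred[n_calibration:] - offset) - y_true[n_calibration:]
    return offset / kpi_mean, fresh_error.mean() / kpi_mean

# Synthetic predictor that over-forecasts by 3 units on a KPI of about 100 (assumption).
rng = np.random.default_rng(4)
y_true_demo = 100 + rng.normal(scale=2.0, size=200)
y_pred_demo = y_true_demo + 3.0 + rng.normal(scale=2.0, size=200)

raw_bias, residual_bias = corrected_bias(y_pred_demo, y_true_demo, kpi_mean=100.0, n_calibration=100)
print(raw_bias, residual_bias)
```

    If the residual bias on the fresh slice is not clearly smaller than the raw one, the offset is probably not stable over time and a trend term is the safer fix.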

    2- Standard Deviation σ: We also want to control how dispersed the predictions are around the true values. We therefore use the standard deviation, again expressed as a percentage of the average value of the KPI.

    # Dispersion of the forecasting error, relative to the average KPI before the modification
    sigma = float((y_pred - y_test).std() / y_before_modification.mean())
    sigma
    > 0.0780972738325956

    The good news is that the uncertainty created by the deviation is reduced when the number of data points increases. We prefer a predictor without bias, so it may be necessary to accept an increase in the deviation if that allows us to limit the bias.

    It can also be interesting to examine bias and variance by looking at the distribution of the forecasting errors. This helps to see whether our calculation of bias and deviation is valid, or whether it is affected by outliers and extreme values.
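
    One way to check numerically whether outliers distort these two indicators is to compare them with outlier-resistant counterparts (the median, and the median absolute deviation rescaled by 1.4826). This is a sketch under assumptions: the helper error_summary and the synthetic errors are hypothetical, not part of the original analysis.

```python
import numpy as np

def error_summary(y_pred, y_true, kpi_mean):
    """Relative error statistics: classical (mean, std) and outlier-resistant (median, MAD)."""
    err = (np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)) / kpi_mean
    median = np.median(err)
    # 1.4826 * MAD estimates the standard deviation when errors are Gaussian
    mad_sigma = 1.4826 * np.median(np.abs(err - median))
    return {
        "bias": err.mean(),
        "sigma": err.std(ddof=1),  # ddof=1 matches the pandas default used by Series.std()
        "median": median,
        "mad_sigma": mad_sigma,
    }

# Synthetic example (assumption): Gaussian errors plus three extreme forecasting errors.
rng = np.random.default_rng(2)
y_true_demo = np.full(200, 100.0)
y_pred_demo = y_true_demo + rng.normal(scale=5.0, size=200)
y_pred_demo[:3] += 80.0

summary = error_summary(y_pred_demo, y_true_demo, kpi_mean=100.0)
print(summary)
```

    A large gap between sigma and mad_sigma suggests that a few extreme days dominate the standard deviation.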

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Histogram of the relative forecasting errors on the test set
    f, ax = plt.subplots(figsize=(8, 6))
    sns.histplot(pd.DataFrame((y_pred - y_test) / y_before_modification.mean()), x='kpi', bins=35, kde=True, stat='probability')
    f.suptitle('Relative Error Distribution')
    ax.set_xlabel('Relative Error')
    plt.show()

    3- Auto-correlation α: In general, errors are auto-correlated. This means that if your prediction is above the true value on a given day, it is more likely to be above it the next day too. It is a problem because most classical statistical tools require independence between observations: what happened on a given day should not affect the next one. We use auto-correlation as a measure of dependence between one day and the next.

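    # Forecast ("Prevision") and observed ("Real") values side by side
    df_test = pd.DataFrame(zip(y_pred, y_test), columns=['Prevision', 'Real'], index=y_test.index)
    # ecart: daily forecasting error
    df_test = df_test.assign(ecart=df_test.Prevision - df_test.Real)
    # Lag-1 auto-correlation of the error
    alpha = df_test.ecart.corr(df_test.ecart.shift(1))
    alpha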
    > 0.24554635095548982

    A high auto-correlation is problematic but can be managed. A possible cause is unobserved covariates. If, for instance, the store you want to measure organized a special event, it could increase its sales for several days. This will lead to an unexpected sequence of days above the forecast.
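
    As a sanity check on this indicator, the sketch below simulates errors with a known day-to-day dependence and recovers it with the same shift-based correlation; pandas also offers Series.autocorr as a shortcut. The simulated AR(1) process and the value 0.5 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Simulate errors where each day keeps half of the previous day's error (assumption).
rng = np.random.default_rng(3)
true_alpha = 0.5
noise = rng.normal(size=2000)
values = np.empty(2000)
values[0] = noise[0]
for t in range(1, 2000):
    values[t] = true_alpha * values[t - 1] + noise[t]
errors = pd.Series(values)

# Correlation between each error and the previous one, computed two equivalent ways
alpha_shift = errors.corr(errors.shift(1))
alpha_builtin = errors.autocorr(lag=1)

print(alpha_shift, alpha_builtin)
```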

    # 'Reel' holds the observed values, 'Prevision' the forecast
    df_test = pd.DataFrame(zip(y_pred, y_test), columns=['Prevision', 'Reel'], index=y_test.index)

    # Plot observed and forecasted KPI over time (the index is expected to be named 'date')
    f, ax = plt.subplots(figsize=(15, 6))
    sns.lineplot(data=df_test, x='date', y='Reel', label='True Value')
    sns.lineplot(data=df_test, x='date', y='Prevision', label='Forecasted Value')
    ax.axvline(start_modification_date, ls='--', color='black', label='Start of the modification')
    ax.legend()
    f.suptitle('KPI TX_1')
    plt.show()

    True value and forecasted value on the evaluation set.

    In the figure above, you can see an illustration of the auto-correlation phenomenon. In late April 2023, for several days, forecasted values are above the true value. Errors are not independent of one another.

    Impact Calculation

    Now we can compute the impact of the modification. We compare the prediction after the modification with the actual value. As always, it is expressed as a percentage of the mean value of the KPI.

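    # Counterfactual: what the KPI would have been without the modification
    y_pred_after_modification = model.predict(X_after_modification)
    # Mean gap between observed and counterfactual KPI, relative to the pre-modification average
    uplift = float((y_after_modification - y_pred_after_modification).mean() / y_before_modification.mean())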
    uplift
    > 0.04961773643584396

    We get a relative increase of about 5.0%. The “true” value (the data used were artificially modified) was 3.0%, so we are not far from it. And indeed, the true value is often above the prediction:

    True value and forecasted value after the modification

    We can compute a confidence interval for this value. If our predictor has no bias, the size of its confidence interval can be expressed with:

    Standard deviation of the estimator: $\mathrm{ec} = \dfrac{\sigma}{\sqrt{N}\,(1 - \alpha)}$

    Where σ is the standard deviation of the prediction, α its auto-correlation, and N the number of days after the modification.

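    from math import sqrt

    # Number of days observed after the modification
    N = y_after_modification.shape[0]
    # Standard deviation of the uplift estimator, inflated by the auto-correlation
    ec = sigma / (sqrt(N) * (1 - alpha))

    print('68%% CI : [%.2f %% , %.2f %%]' % (100 * (uplift - ec), 100 * (uplift + ec)))
    print('95%% CI : [%.2f %% , %.2f %%]' % (100 * (uplift - 2 * ec), 100 * (uplift + 2 * ec)))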

    68% CI : [3.83 % , 6.09 %]
    95% CI : [2.70 % , 7.22 %]

    The range of the 95% CI is around 4.5% for 84 days. This is reasonable for many applications, because it is possible to run an experiment or a proof of concept for 3 months.
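
    The interval above can be reproduced from the printed values of uplift, sigma and alpha with N = 84. This is a small sketch: the function name uplift_interval and the parameter z (number of standard deviations, 1 for 68% and 2 for 95% as in the article) are introduced here for illustration.

```python
from math import sqrt

def uplift_interval(uplift, sigma, alpha, n_days, z=1.0):
    """Interval uplift +/- z * ec, with ec = sigma / (sqrt(n_days) * (1 - alpha))."""
    ec = sigma / (sqrt(n_days) * (1 - alpha))
    return uplift - z * ec, uplift + z * ec

# Values printed earlier in the article
uplift = 0.04961773643584396
sigma = 0.0780972738325956
alpha = 0.24554635095548982
n_days = 84

low68, high68 = uplift_interval(uplift, sigma, alpha, n_days, z=1.0)
low95, high95 = uplift_interval(uplift, sigma, alpha, n_days, z=2.0)

print("68%% CI : [%.2f %% , %.2f %%]" % (100 * low68, 100 * high68))
print("95%% CI : [%.2f %% , %.2f %%]" % (100 * low95, 100 * high95))
```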

    Note: the confidence interval is very sensitive to the deviation of the initial predictor. That is why it is a good idea to take some time to perform model selection (on the training set only) before settling on a good model.
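
    To make this sensitivity concrete, the formula ec = σ / (√N (1 − α)) can be inverted to estimate how many days are needed to reach a target half-width. The sketch below is only an illustration: the function days_needed and the 2-point target are assumptions, and it relies on the same no-bias hypothesis as the interval itself.

```python
from math import ceil

def days_needed(sigma, alpha, half_width, z=2.0):
    """Smallest N such that z * sigma / (sqrt(N) * (1 - alpha)) <= half_width."""
    return ceil((z * sigma / ((1 - alpha) * half_width)) ** 2)

sigma = 0.0780972738325956
alpha = 0.24554635095548982

# Days needed for a 95% interval of +/- 2 points with the article's predictor...
n_current = days_needed(sigma, alpha, half_width=0.02)
# ...and with a hypothetical predictor whose deviation is half as large
n_better = days_needed(sigma / 2, alpha, half_width=0.02)

print(n_current, n_better)
```

    Since N grows with the square of σ, halving the deviation of the predictor divides the required duration by roughly four.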

    Mathematical formulation of the model

    So far we have tried to avoid maths, to make things easier to follow. In this section, we will present the mathematical model underlying the method.


