    Behind the Crash: A Data Scientist’s Journey to Understand Road Fatalities | by Rahaditya Zuhdi | Jun, 2025

    By Team_AIBS News | June 29, 2025 | 5 min read


    With the dataset properly split and stratified, the next step was to identify the most relevant predictors for modeling accident severity.

    Effective feature selection is critical in this context: it improves model performance and reduces overfitting, and it ensures that the resulting insights are interpretable and actionable for real-world road safety applications.

    The feature selection process in this project involved three key components:

    1. Predictive Power Analysis: The Information Value (IV) of each variable was calculated using supervised optimal binning, implemented via the OptBinning Python library (Navas-Palencia, 2022). This method leverages convex optimization to generate binning schemes that best separate the binary outcome classes. IV scores served as a quantitative measure of each variable’s predictive power in distinguishing fatal from non-fatal outcomes.
    2. Redundancy & Multicollinearity Analysis: To address redundancy and multicollinearity, Pearson correlation coefficients were computed for numerical features, while Normalized Mutual Information (NMI) was used for categorical ones. Variables with high correlation or dependency were removed, with priority given to those offering stronger predictive value and better interpretability.
    3. Forward Stepwise Selection: As a final refinement step, a forward stepwise feature selection process was applied. Guided by ROC-AUC performance, this approach incrementally added variables that contributed meaningfully to the model’s discriminative ability, yielding a compact and effective feature set for subsequent modeling.

    Phase 1: Predictive Power Analysis

    The first phase of feature selection focused on evaluating the individual predictive power of each feature using Information Value (IV), a widely used metric in credit risk modeling. IV quantifies how well a variable separates “good” and “bad” outcomes based on the distribution of binned values, and is especially useful when combined with Weight of Evidence (WOE) transformation for interpretability.
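For reference, IV and WOE can be computed directly from per-bin counts. The sketch below assumes the common credit-scoring convention, WOE = ln(share of goods / share of bads) per bin and IV = sum over bins of (share of goods - share of bads) x WOE. The function name and the example counts are illustrative and not taken from the project's data; OptBinning's own implementation may differ in details such as the handling of empty bins.

```python
import math

def woe_iv(bins):
    """Compute per-bin WOE and total IV.

    bins: list of (n_good, n_bad) counts, one tuple per bin.
    Assumes every bin has at least one good and one bad case.
    """
    if any(good == 0 or bad == 0 for good, bad in bins):
        raise ValueError("each bin needs non-zero good and bad counts")
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        p_good = good / total_good   # share of all goods falling in this bin
        p_bad = bad / total_bad      # share of all bads falling in this bin
        woe = math.log(p_good / p_bad)
        woes.append(woe)
        iv += (p_good - p_bad) * woe
    return woes, iv

# Made-up counts for one binned feature: (non-fatal, fatal) per bin.
woes, iv = woe_iv([(400, 10), (300, 30), (300, 60)])
print([round(w, 3) for w in woes], round(iv, 3))
```

Bins whose share of goods equals their share of bads get a WOE of zero and contribute nothing to IV, which is why a feature with flat fatality rates across bins scores near zero.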

    Table 2: Information value classification and feature distribution

    IV was computed for each of the initial 51 features in the dataset using supervised optimal binning, a technique that partitions variables in a way that maximizes class separation. Based on this analysis (Table 2), 41 out of 51 features demonstrated IV values greater than 0.02, indicating at least weak predictive power. These were retained for the next stage of analysis.

    The remaining 10 features, with IV values at or below 0.02, were excluded as non-predictive, in line with thresholds recommended by Siddiqi (2006). This helped reduce noise and focus the model on features most likely to contribute meaningful insights.
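The IV cut-off described above amounts to a simple filter. A minimal sketch, assuming the IV scores are already available in a dictionary; the feature names and values are placeholders, not the project's actual results:

```python
IV_THRESHOLD = 0.02  # features at or below this value are treated as non-predictive

def filter_by_iv(iv_scores, threshold=IV_THRESHOLD):
    """Return the features with IV strictly above the threshold, highest IV first."""
    retained = {name: iv for name, iv in iv_scores.items() if iv > threshold}
    return sorted(retained, key=retained.get, reverse=True)

# Placeholder scores for illustration only.
iv_scores = {"feature_a": 0.35, "feature_b": 0.02, "feature_c": 0.004, "feature_d": 0.09}
ranked = filter_by_iv(iv_scores)
print(ranked)
```

Returning the survivors in descending IV order is a convenience: the same ordering is reused later when features are added one at a time in the stepwise phase.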

    Phase 2: Redundancy & Multicollinearity Analysis

    After filtering based on Information Value, the next step was to remove redundant features that could introduce multicollinearity, a problem that can distort model interpretation and inflate the variance of parameter estimates.

    This phase treated numerical and categorical variables separately, using statistical measures suited to their data types:

    • For numerical features, pairwise Pearson correlation coefficients were computed. Pairs with an absolute correlation above 0.7 (|r| > 0.7) were flagged as highly correlated, a common threshold in predictive modeling practice.
    • For categorical features, inter-variable dependence was assessed using Normalized Mutual Information (NMI), which captures both linear and nonlinear associations. A threshold of NMI > 0.4 was used to identify strongly dependent pairs.
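Both dependence measures can be written in a few lines of plain Python. This is a sketch under assumptions: the article does not say which NMI normalization was used, so the version below divides the mutual information by the arithmetic mean of the two entropies (one common choice), and the sample values are made up.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def entropy(labels):
    """Shannon entropy (natural log) of a sequence of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Mutual information of two categorical sequences, normalized by mean entropy."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (count_a[u] * count_b[v]))
             for (u, v), c in joint.items())
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 0.0

# Made-up samples: a perfectly dependent pair and an independent pair.
dependent = nmi(["wet", "wet", "dry", "dry"], ["rain", "rain", "clear", "clear"])
independent = nmi(["wet", "wet", "dry", "dry"], ["day", "night", "day", "night"])
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(round(dependent, 3), round(independent, 3), round(r, 3))
```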

    In both cases, when a highly dependent pair was identified, the feature with the lower IV score was removed, preserving variables with higher predictive power.

    Figure 1: Correlation matrix for numerical features
    Figure 2: NMI matrix for categorical features

    As a result of this process:

    • 2 out of 17 numerical features were removed due to high Pearson correlation (see Figure 1).
    • 6 out of 24 categorical features were excluded based on high NMI (see Figure 2).

    This step ensured the final feature set was not only predictive but also free of redundancy, helping produce more stable and interpretable models.
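One way to apply this rule is a greedy pass over the features in descending IV order, keeping a feature only if it is not strongly dependent on anything already kept. The article does not spell out the exact order in which pairs were resolved, so treat this as an assumed procedure; the feature names, IV scores, and correlation value are hypothetical.

```python
def prune_redundant(iv, dependence, threshold):
    """Keep the higher-IV feature from every pair whose dependence exceeds threshold.

    iv: dict mapping feature name to IV score.
    dependence: callable (feature_a, feature_b) -> non-negative dependence score,
                e.g. absolute Pearson correlation or NMI.
    """
    kept = []
    for feature in sorted(iv, key=iv.get, reverse=True):
        if all(dependence(feature, other) <= threshold for other in kept):
            kept.append(feature)
    return kept

# Hypothetical numerical features with one highly correlated pair.
pair_r = {frozenset(("speed_limit", "road_class_code")): 0.82}

def abs_r(a, b):
    # Pairs that are not listed are treated as uncorrelated for this illustration.
    return abs(pair_r.get(frozenset((a, b)), 0.0))

iv = {"speed_limit": 0.31, "road_class_code": 0.12, "driver_age": 0.08}
kept = prune_redundant(iv, abs_r, threshold=0.7)
print(kept)
```

The same function serves the categorical case by passing an NMI lookup and a threshold of 0.4.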

    Phase 3: Forward Stepwise Feature Selection

    To further refine the feature set, a forward stepwise selection strategy was employed, guided by model performance as measured by ROC-AUC. The goal of this phase was to identify a subset of features that meaningfully improved the model’s ability to distinguish between fatal and non-fatal outcomes.

    The procedure used 5-fold stratified cross-validation on the training dataset, ensuring that the proportion of fatal (“bad”) and non-fatal (“good”) cases remained consistent in each fold. Within each fold, features were added one at a time to a logistic regression model, following a descending order of their Information Value (IV).

    After each feature was added, the model was retrained and evaluated on the validation split, and the resulting ROC-AUC was recorded. This process was repeated across all 5 folds.
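The loop described above can be sketched with scikit-learn. This is an illustration on synthetic data, not the author's code: it assumes the features are already numeric (the article does not state how categorical variables were encoded for the logistic regression), and the function name, random seed, and `max_iter` setting are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def auc_trajectory(X, y, order, n_splits=5, seed=0):
    """Mean validation ROC-AUC after adding each feature in `order`.

    X: 2-D numeric array; y: binary labels;
    order: column indices sorted by descending IV.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = np.zeros((n_splits, len(order)))
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        for step in range(len(order)):
            cols = order[: step + 1]          # features included so far
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx][:, cols], y[train_idx])
            proba = model.predict_proba(X[val_idx][:, cols])[:, 1]
            scores[fold, step] = roc_auc_score(y[val_idx], proba)
    return scores.mean(axis=0)                # average across folds

# Synthetic example: column 0 carries signal, column 1 is pure noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
noise = rng.normal(size=400)
y = (signal + 0.5 * rng.normal(size=400) > 0).astype(int)
X = np.column_stack([signal, noise])
trajectory = auc_trajectory(X, y, order=[0, 1])
print(trajectory)
```

Each entry of the returned array corresponds to one point on an AUC trajectory: the cross-validated score once that many features have been added.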

    To evaluate the results, an AUC trajectory plot was generated (Figure 3), showing the average ROC-AUC score across folds as features were incrementally added. This plot helped visualize how each feature influenced the model’s performance as the feature set grew.

    Figure 3: AUC trajectory plot for forward stepwise selection

    Rather than relying strictly on the order of addition, final feature selection was guided by visual inspection of the AUC trajectory. Features that contributed to notable upward shifts in performance were prioritized, allowing for a more flexible, judgment-driven inclusion process that balanced both early and late-stage predictors.

    As a result of this phase, the following 14 features were selected for use in subsequent modeling stages:

    Table 3: Selected final features
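The selection here was made by eye, but the same idea can be approximated numerically by looking at the AUC gain each feature produced when it was added. The sketch below is a stand-in for visual inspection, not the author's procedure: the `min_gain` cut-off, the 0.5 baseline for the first feature, and the example trajectory are all assumptions.

```python
def features_with_gain(names, trajectory, min_gain=0.005, baseline=0.5):
    """Return features whose addition raised mean ROC-AUC by at least min_gain.

    names: feature names in the order they were added.
    trajectory: mean ROC-AUC after each addition (same length as names).
    baseline: AUC of a model with no features (0.5 means random ranking).
    """
    selected, previous = [], baseline
    for name, auc in zip(names, trajectory):
        if auc - previous >= min_gain:
            selected.append(name)
        previous = auc
    return selected

# Made-up trajectory: the third feature adds almost nothing, the fourth helps late.
names = ["a", "b", "c", "d"]
trajectory_example = [0.70, 0.74, 0.741, 0.78]
selected = features_with_gain(names, trajectory_example)
print(selected)
```

A rule like this keeps a late feature such as "d" that produces a clear jump, which mirrors the point about balancing early and late-stage predictors.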


