
    Automated Data Preprocessing for Machine Learning of Tabular dataset | by Akshay Paranjape | Mar, 2025

    By Team_AIBS News | March 20, 2025 | 4 Mins Read


    Today’s data-driven world demands efficient methods to prepare and analyze vast amounts of data. While many advances have focused on improving model architectures or creating automated Machine Learning (AutoML) pipelines, comparatively few solutions address advanced data preprocessing, a foundational step that can make or break a predictive model.

    In this article, we’ll explore a framework for automated data preprocessing that goes beyond the typical data cleaning and encoding steps, delving into more sophisticated techniques like feature selection, feature engineering, and a novel sampling approach. We’ll also see how these techniques can be combined in a single pipeline to boost model accuracy and reduce computational overhead.

    Preprocessing ensures that your dataset is “model-ready.” Typical approaches usually stop at:

    • Handling missing values
    • Encoding categorical variables
    • Normalization or standardization

    But these steps only scratch the surface. Advanced preprocessing can create entirely new features, remove wasteful ones, and resample large datasets without losing important data patterns.

    In the paper, the authors present an automated preprocessing pipeline composed of four main steps:

    1. Feature Selection: Identify and remove redundant, highly correlated, or statistically insignificant features.
    2. Sampling: Use a novel “Bin-Based” sampling method to create smaller yet representative datasets.
    3. Target Discretization: Convert continuous targets into discrete classes when regression fails to yield satisfactory performance.
    4. Hybrid Feature Engineering: Generate new features automatically using unary or binary transformations, then prune them to keep only the most informative.
    Figure: Block diagram of the preprocessing pipeline along with the integration of the AutoML module

    Often, datasets have columns with:

    • Constant values (e.g., “Machine No.” is always the same number),
    • As many distinct values as there are rows (e.g., IDs, names, or emails),
    • Very high correlations (redundant from an information perspective).

    The paper outlines a method combining variance analysis, correlation thresholds, and an ANOVA F-test (for classification) or a univariate linear regression test (for regression) to detect and remove these less significant features. These checks ensure the dataset retains only the columns that carry meaningful information for the prediction task.
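
As a rough illustration of the first three checks, the sketch below filters a small synthetic table with NumPy. The function name `select_features`, the `corr_threshold` value of 0.95, and the column names are assumptions made for this example and do not come from the paper. The ANOVA F-test and the univariate regression test are left out to keep the sketch short.

```python
import numpy as np


def select_features(columns, corr_threshold=0.95):
    """Return the names of columns that survive three simple filters.

    columns: dict mapping column name -> 1-D NumPy array (all the same length).
    corr_threshold: illustrative cut-off, not a value taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        n_unique = np.unique(values).size
        if n_unique <= 1:
            continue  # constant column carries no information
        is_float = np.issubdtype(values.dtype, np.floating)
        if not is_float and n_unique == values.size:
            continue  # one distinct value per row: ID-like (IDs, names, emails)
        if np.issubdtype(values.dtype, np.number):
            # Drop the column if it is nearly collinear with one already kept.
            redundant = any(
                np.issubdtype(columns[k].dtype, np.number)
                and abs(np.corrcoef(values, columns[k])[0, 1]) > corr_threshold
                for k in kept
            )
            if redundant:
                continue
        kept.append(name)
    return kept


# Synthetic demo data (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(70.0, 5.0, n)
data = {
    "machine_no": np.full(n, 7),                # constant
    "row_id": np.arange(n),                     # unique per row
    "temperature_c": temperature,
    "temperature_f": temperature * 1.8 + 32.0,  # linear copy of temperature_c
    "pressure": rng.normal(1.0, 0.1, n),
}
selected = select_features(data)
print(selected)
```

On this toy data only `temperature_c` and `pressure` survive. The ID-like check here is a heuristic: it would also drop a legitimate integer column that happens to be unique in every row, so a real pipeline would likely need a more careful rule.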

    Working with full datasets can be computationally expensive, especially during time-intensive tasks like feature engineering. Traditional sampling (like random sampling) may fail to preserve the data distribution. Stratified sampling is good but can be very expensive computationally.

    That’s why the authors propose Bin-Based sampling:

    1. Divide continuous features into bins (based on their distribution).
    2. Draw random samples from each bin.
    3. Combine the resulting samples from each feature.

    This method maintains distributional characteristics better than pure random sampling, while being far more efficient than fully stratified approaches.
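
The three steps above can be sketched as follows. This is one possible reading, not the authors' implementation: equal-width bins, the same sampling fraction in every bin, and a union of row indices across features are all assumptions, as are the names `bin_based_sample`, `frac`, and `n_bins`.

```python
import numpy as np


def bin_based_sample(X, frac=0.1, n_bins=10, seed=0):
    """Return sorted row indices of a bin-based sample of X (rows x features).

    frac and n_bins are illustrative; equal-width bins are an assumption.
    """
    rng = np.random.default_rng(seed)
    chosen = set()
    for j in range(X.shape[1]):
        col = X[:, j]
        edges = np.histogram_bin_edges(col, bins=n_bins)
        # Use interior edges only, so bin ids run from 0 to n_bins - 1.
        bin_ids = np.digitize(col, edges[1:-1])
        for b in np.unique(bin_ids):
            rows = np.flatnonzero(bin_ids == b)
            # Take the same fraction from every bin, but at least one row.
            k = max(1, int(round(frac * rows.size)))
            chosen.update(rng.choice(rows, size=k, replace=False).tolist())
    # Step 3: combine the per-feature samples (here, a union of row indices).
    return np.array(sorted(chosen))


# Synthetic demo: one skewed feature and one roughly symmetric feature.
rng = np.random.default_rng(1)
X = np.column_stack([rng.exponential(2.0, 5000), rng.normal(0.0, 1.0, 5000)])
idx = bin_based_sample(X, frac=0.1)
sample = X[idx]
print(X.shape[0], "->", sample.shape[0], "rows")
```

Because the per-feature samples are merged, the final sample can be larger than `frac` of the data (up to roughly `frac` times the number of features). Sparse tail bins still contribute at least one row, which is what helps the sample keep the shape of each feature's distribution.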

    Figure: Bin-based sampling. Comparison of the distribution of a feature before and after bin-based sampling

    When a regression model’s R²-score is too low, the authors suggest a simple fix: convert the continuous target into discrete buckets. This essentially transforms the problem into a classification task, which often yields better predictive performance if fine-grained regression proves unreliable.

    This step is optional, but it’s a straightforward way to avoid discarding datasets that aren’t well-suited to direct regression modeling.
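
A minimal sketch of the idea, assuming equal-frequency (quantile) buckets; the article does not say how the buckets are chosen, and `discretize_target` and `n_classes` are names invented for this example.

```python
import numpy as np


def discretize_target(y, n_classes=3):
    """Map a continuous target to integer class labels 0 .. n_classes - 1.

    Quantile (equal-frequency) buckets and n_classes=3 are assumptions.
    """
    # Interior quantile cut points, e.g. the 1/3 and 2/3 quantiles for 3 classes.
    cuts = np.quantile(y, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    return np.digitize(y, cuts)


y = np.array([1.2, 3.4, 0.5, 9.8, 4.1, 7.7, 2.2, 6.0, 5.3])
labels = discretize_target(y, n_classes=3)
print(labels)
```

The resulting labels can be passed to any classifier in place of the original continuous target. Quantile buckets keep the classes balanced; equal-width buckets would be an alternative when the target's scale matters more than class balance.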

    Often, real performance gains come not from fancy ML models but from better features. The paper draws inspiration from approaches like Cognito and ExploreKit but streamlines them by introducing a Hybrid Feature Engineering (HFE) approach:

    1. Unary transformations on numerical columns (e.g., log, inverse, square).
    2. Binary transformations combining two columns (e.g., multiplication, ratio).
    3. Pruning new features quickly via the same feature selection logic used earlier.
    4. Ranking model: Evaluate each newly generated feature’s impact on a baseline model, retaining only those that improve metrics.

    The net result? Potentially significant boosts in model accuracy (or R²-score) for datasets where interactions between features matter a lot.
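
To make the generate-then-rank loop concrete, here is a simplified sketch. It swaps in an ordinary least-squares fit scored on a hold-out split as the baseline model, skips the pruning step, and accepts features greedily; none of these choices is claimed to match the paper. The names `hybrid_feature_engineering`, `candidate_features`, `r2_holdout`, and the `min_gain` threshold are hypothetical.

```python
import numpy as np


def r2_holdout(X, y, split):
    """Fit least squares on the first `split` rows; return R² on the rest."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A[:split], y[:split], rcond=None)
    resid = y[split:] - A[split:] @ coef
    total = y[split:] - y[split:].mean()
    return 1.0 - (resid @ resid) / (total @ total)


def candidate_features(X, names):
    """Yield (name, values) pairs for unary and binary transformations."""
    for j, name in enumerate(names):
        col = X[:, j]
        yield f"square({name})", col ** 2
        if np.all(col > 0):  # log and inverse only where they are defined
            yield f"log({name})", np.log(col)
            yield f"inverse({name})", 1.0 / col
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            yield f"{names[i]}*{names[j]}", X[:, i] * X[:, j]
            if np.all(X[:, j] != 0):
                yield f"{names[i]}/{names[j]}", X[:, i] / X[:, j]


def hybrid_feature_engineering(X, y, names, min_gain=0.01):
    """Greedily add generated features that raise hold-out R² by min_gain."""
    split = int(0.7 * len(X))
    best = r2_holdout(X, y, split)
    kept = []
    for name, values in candidate_features(X, names):
        trial = np.column_stack([X, values])
        score = r2_holdout(trial, y, split)
        if score - best > min_gain:  # keep only features that clearly help
            X, best = trial, score
            kept.append(name)
    return X, kept, best


# Synthetic demo: the target depends on the interaction a * b.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(500, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.1, 500)
X_new, kept, score = hybrid_feature_engineering(X, y, ["a", "b"])
print(kept, round(score, 3))
```

In this toy setup the product feature is the one that matters, which mirrors the point above: engineered interactions help most when the target depends on combinations of columns that a baseline model cannot represent on its own.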

    By chaining these four steps, and optionally using AutoML libraries like AutoSklearn, H2O, or AutoGluon afterwards, the paper shows:

    • A 4–7% performance improvement in many benchmark tasks using a baseline RandomForest model.
    • Marginal but consistent improvements when appended to existing AutoML frameworks (which often skip advanced feature engineering).

    For those interested in the full details and experimental validation, the original paper is:
    Paranjape A., Katta P., Ohlenforst M. (2022). “Automated Data Preprocessing for Machine Learning Based Analyses.” COLLA 2022.


