    Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data | by Chris Lettieri | Jan, 2025



    Notice how the chosen samples capture more varied writing styles and edge cases.

    In some examples, like clusters 1, 3, and 8, the furthest point simply looks like a more varied instance of the prototypical center.

    Cluster 6 is an interesting case, showing how some images are difficult even for a human to identify. But you can still see how this image could end up in a cluster whose centroid is an 8.

    Recent research on neural scaling laws helps explain why data pruning using a “furthest-from-centroid” strategy works, especially on the MNIST dataset.

    Data Redundancy

    Many training examples in large datasets are highly redundant.

    Think about MNIST: how many nearly identical ‘7’s do we really need? The key to data pruning isn’t having more examples; it’s having the right examples.

    Selection Strategy vs. Dataset Size

    One of the most interesting findings from the above paper is how the optimal data selection strategy changes based on your dataset size:

    • With “a lot” of data: select harder, more diverse examples (furthest from cluster centers).
    • With scarce data: select easier, more typical examples (closest to cluster centers).

    This explains why our “furthest-from-centroid” strategy worked so well.

    With MNIST’s 60,000 training examples, we were in the “abundant data” regime, where selecting diverse, challenging examples proved most beneficial.
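    As a rough sketch of that rule (the function name and the "abundant"/"scarce" labels are illustrative, not taken from the paper), selection within a single cluster might look like this:

    import numpy as np

    def select_within_cluster(distances, n_keep, regime):
        """Pick sample indices from one cluster given each point's distance to its centroid.

        regime: "abundant" keeps the hardest (furthest) points,
                "scarce" keeps the most typical (closest) points.
        """
        order = np.argsort(distances)      # closest to the centroid first
        if regime == "abundant":
            return order[-n_keep:]         # furthest-from-centroid selection
        return order[:n_keep]              # closest-to-centroid selection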

    Inspiration and Goals

    I was inspired by these two recent papers (and the fact that I’m a data engineer):

    Both explore various ways we can use data selection strategies to train performant models on less data.

    Methodology

    I used LeNet-5 as my model architecture.

    Then, using one of the strategies below, I pruned the MNIST training dataset and trained a model. Testing was done against the full test set.
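    The full code is in the GitHub repo linked below; here is a minimal, self-contained sketch of that setup in PyTorch (the optimizer, epoch count, and the placeholder keep_indices are my illustrative choices, not necessarily what the original runs used):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, Subset
    from torchvision import datasets, transforms

    class LeNet5(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(),  # 28x28 -> 28x28
                nn.AvgPool2d(2),                                       # -> 14x14
                nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),            # -> 10x10
                nn.AvgPool2d(2),                                       # -> 5x5
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
                nn.Linear(120, 84), nn.Tanh(),
                nn.Linear(84, 10),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    transform = transforms.ToTensor()
    train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
    test_set = datasets.MNIST("data", train=False, download=True, transform=transform)

    # keep_indices would come from one of the pruning strategies below;
    # keeping every other image is just a placeholder.
    keep_indices = list(range(0, len(train_set), 2))
    train_loader = DataLoader(Subset(train_set, keep_indices), batch_size=64, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)

    model = LeNet5()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):                 # illustrative epoch count
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

    # evaluate against the full test set
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    print(f"test accuracy: {correct / len(test_set):.4f}")

    With keep_indices swapped out for the output of each pruning strategy, the same harness covers every experiment.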

    Due to time constraints, I only ran 5 tests per experiment.

    Full code and results are available here on GitHub.

    Strategy #1: Baseline, Full Dataset

    • Standard LeNet-5 architecture
    • Trained using 100% of the training data

    Strategy #2: Random Sampling

    • Randomly sample individual images from the training dataset
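    A minimal sketch of this strategy (the 50% keep fraction and the seed are just examples):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n_total = 60_000                      # MNIST training set size
    keep_fraction = 0.5                   # e.g. keep half the data
    keep_indices = rng.choice(n_total, size=int(n_total * keep_fraction), replace=False)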

    Strategy #3: K-means Clustering with Different Selection Strategies

    Here’s how this worked:

    1. Preprocess the images with PCA to reduce the dimensionality. This simply means each image was reduced from 784 values (28×28 pixels) to only 50 values. PCA does this while retaining the most important patterns and removing redundant information.
    2. Cluster using k-means. The number of clusters was fixed at 50 and 500 in different tests. My poor CPU couldn’t handle much beyond 500 given all the experiments.
    3. I then tested different selection methods once the data was clustered (a condensed code sketch follows the notes below):
    • Closest-to-centroid: these represent a “typical” example of the cluster.
    • Furthest-from-centroid: more representative of edge cases.
    • Random from each cluster: randomly select within each cluster.
    Example of Clustering Selection. Image by author.
    • PCA reduced noise and computation time. At first I was simply flattening the images. Both the results and compute improved with PCA, so I kept it for the full experiment.
    • I switched from standard K-means to MiniBatchKMeans clustering for better speed. The standard algorithm was too slow for my CPU given all the tests.
    • Setting up a proper test harness was key. Moving experiment configs to a YAML file, automatically saving results to a file, and having o1 write my visualization code made life much easier.
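    Putting the clustering pipeline together, here is a condensed sketch of the approach described above (the function name and defaults are illustrative; the exact implementation is in the GitHub repo):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import MiniBatchKMeans

    def prune_with_kmeans(images, keep_fraction=0.5, n_clusters=50, method="furthest", seed=0):
        """Cluster MNIST images and keep a fraction of each cluster.

        images: array of shape (n_samples, 28, 28) or (n_samples, 784).
        method: "closest", "furthest", or "random".
        """
        X = images.reshape(len(images), -1).astype(np.float32) / 255.0

        # 1. PCA: 784 pixel values -> 50 components
        X_pca = PCA(n_components=50, random_state=seed).fit_transform(X)

        # 2. MiniBatchKMeans for speed over standard K-means
        km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
        labels = km.fit_predict(X_pca)
        dists = np.linalg.norm(X_pca - km.cluster_centers_[labels], axis=1)

        # 3. Select within each cluster
        rng = np.random.default_rng(seed)
        keep = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            n_keep = max(1, int(len(idx) * keep_fraction))
            if method == "closest":
                keep.extend(idx[np.argsort(dists[idx])[:n_keep]])
            elif method == "furthest":
                keep.extend(idx[np.argsort(dists[idx])[-n_keep:]])
            else:                           # random within each cluster
                keep.extend(rng.choice(idx, size=n_keep, replace=False))
        return np.array(keep)

    The returned indices can then be passed to the Subset in the training sketch above, e.g. keep_indices = prune_with_kmeans(train_images, method="furthest").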

    Median Accuracy & Run Time

    Below are the median results, comparing our baseline LeNet-5 trained on the full dataset with two different strategies that used 50% of the dataset.

    Median Results. Image by author.
    Median Accuracies. Image by author.

    Accuracy vs. Run Time: Full Results

    The charts below show the results of my four pruning strategies compared to the baseline in red.

    Median Accuracy across Data Pruning methods. Image by author.
    Median Run Time across Data Pruning methods. Image by author.

    Key findings across multiple runs:

    • Furthest-from-centroid consistently outperformed other methods
    • There is definitely a sweet spot between compute time and model accuracy if you want to find it for your use case. More work needs to be done here.

    I’m still surprised that simply randomly reducing the dataset gives acceptable results if efficiency is what you’re after.

    Future Plans

    1. Test this on my second brain. I want to fine-tune an LLM on my full Obsidian vault and test data pruning together with hierarchical summarization.
    2. Explore other embedding methods for clustering. I could try training an autoencoder to embed the images rather than using PCA (a rough sketch follows this list).
    3. Test this on larger, more complex datasets (CIFAR-10, ImageNet).
    4. Experiment with how model architecture affects the performance of data pruning strategies.
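    For plan #2, a minimal sketch of what that autoencoder could look like (the architecture and the 50-dimensional latent size are assumptions chosen to mirror the PCA setup, not something I have run yet):

    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        """Small convolutional autoencoder; the encoder output would replace the 50-d PCA vector."""
        def __init__(self, latent_dim=50):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
                nn.Flatten(),
                nn.Linear(32 * 7 * 7, latent_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
                nn.Unflatten(1, (32, 7, 7)),
                nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    Encoding each image with the trained encoder would then replace the PCA step in the clustering sketch above.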

    These findings suggest we need to rethink our approach to dataset curation:

    1. More data isn’t always better; there seem to be diminishing returns to bigger data and bigger models.
    2. Strategic pruning can actually improve results.
    3. The optimal strategy depends on your starting dataset size.

    As people start sounding the alarm that we’re running out of data, I can’t help but wonder if less data is actually the key to useful, cost-effective models.

    I intend to continue exploring this space. Please reach out if you find this interesting; happy to connect and talk more 🙂


