
    Papers Explained 338: Large-Scale Data Selection for Instruction Tuning | by Ritvik Rastogi | Mar, 2025

By Team_AIBS News · March 26, 2025


This work presents a systematic study of how well data selection methods scale. It finds that:

• Many recently proposed methods fall short of random selection in this setting (while using more compute), and can even degrade in performance when given access to larger pools of data to select over.
• A variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested, while being more compute-efficient.

The following data selection methods are explored. Each aims to select n instances from a data pool D using a query set V (containing tens to hundreds of samples) drawn from the same distribution as the evaluation set. Each method assigns a score to every data point d ∈ D, either directly or by aggregating scores over pairs (v, d) ∈ V × D.

• Random Selection: This baseline randomly samples n instances from D. A "balanced" variant samples uniformly from the different data sources within D until a source is exhausted, then distributes the remaining budget equally among the remaining sources.
• Perplexity: Computes the loss of each d ∈ D under the original base model. "Mid-ppl" selects points in the middle of the loss distribution, while "top-ppl" selects those with the highest loss.
• IFD (Instruction-Following Difficulty): Trains a model on representative samples from D, then scores each d ∈ D using the ratio of the answer loss given the question to the loss of the answer alone (the IFD score).
• LESS (Low-rank gradiEnt Similarity Search): Trains LoRAs on a random subset of D, then scores each pair (v, d) ∈ V × D based on the gradient-based influence of d on v.
• Embedding: Scores each pair (v, d) ∈ V × D by the cosine similarity of their embeddings, using either the NV-Embed-v2 or GTR-base embedding model.
• RDS+ (Representation-based Data Similarity): A custom variant of RDS. Computes the cosine similarity for each pair (v, d) ∈ V × D using a position-weighted mean pool of the last hidden layer states of the model being trained.
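The RDS+ scoring step can be sketched as below. This is a minimal illustration, not the authors' code: the exact weighting scheme is an assumption (weights proportional to token position, so later tokens count more), and `position_weighted_pool` / `rds_scores` are hypothetical helper names.

```python
import numpy as np

def position_weighted_pool(hidden_states, attention_mask):
    """Pool per-token hidden states into a single vector, weighting
    token i proportionally to its position i+1 (later tokens heavier)."""
    # hidden_states: (seq_len, dim); attention_mask: (seq_len,) of 0/1
    positions = np.arange(1, hidden_states.shape[0] + 1) * attention_mask
    weights = positions / positions.sum()
    return weights @ hidden_states  # (dim,)

def rds_scores(query_embs, pool_embs):
    """Cosine similarity between every query v and every candidate d.
    Returns a (|V|, |D|) score matrix."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    return q @ p.T
```

In practice the hidden states would come from the last layer of the model being trained; here they are just arrays.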

Selection & Aggregation

For methods that score pairs (v, d) ∈ V × D, the |V| scores for each d ∈ D are aggregated. A round-robin approach iteratively adds the highest-scoring point for each v ∈ V to the selected pool until the desired size n is reached.
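The round-robin aggregation can be sketched as follows; this is a minimal sketch under an assumed data layout (a ranked score list per query), not the paper's implementation.

```python
def round_robin_select(scores, n):
    """scores: dict mapping query id -> list of (score, data_id) pairs.
    Cycle through queries, each time taking that query's best
    not-yet-selected point, until n points are chosen."""
    ranked = {v: sorted(lst, reverse=True) for v, lst in scores.items()}
    cursors = {v: 0 for v in ranked}
    selected, seen = [], set()
    while len(selected) < n:
        progress = False
        for v, lst in ranked.items():
            i = cursors[v]
            # skip points already taken by another query
            while i < len(lst) and lst[i][1] in seen:
                i += 1
            cursors[v] = i
            if i < len(lst):
                d = lst[i][1]
                seen.add(d)
                selected.append(d)
                cursors[v] = i + 1
                progress = True
                if len(selected) == n:
                    return selected
        if not progress:
            break  # pool exhausted before reaching n
    return selected
```

For example, with two queries whose top choice is the same point, the second query falls through to its next-best candidate.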

For multi-task scenarios, task-level aggregation is also performed. A score S[t, d] for each data point d and task t is computed as the maximum score across all query points vt within the query set Vt for that task. A round-robin procedure then iterates over tasks, picking the highest-scoring data point for each task until the desired dataset size is reached (after deduplication).
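The task-level max aggregation might look like the following; the matrix layout and index mapping are assumptions for illustration only.

```python
import numpy as np

def task_scores(pair_scores, task_queries):
    """pair_scores: (num_queries, num_data) matrix of query-point scores.
    task_queries: dict mapping task name -> list of query row indices
    belonging to that task's query set V_t.
    Returns dict task -> (num_data,) vector with S[t, d] = max over V_t."""
    return {t: pair_scores[rows].max(axis=0)
            for t, rows in task_queries.items()}
```

The resulting per-task score vectors can then feed the same round-robin procedure, iterating over tasks instead of individual queries.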

Data Pool

Experiments are conducted on two large, diverse, and unbalanced data pools: TÜLU 2 unfiltered and TÜLU 3 unfiltered. These pools contain millions of samples from various sources, primarily FLAN and Open Orca, and are significantly larger than those used in prior work. Exact-match deduplication was performed to ensure sample uniqueness.
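Exact-match deduplication of this kind can be done by hashing a canonical serialization of each sample; `exact_dedup` is a hypothetical helper for illustration, not from the paper's codebase.

```python
import hashlib
import json

def exact_dedup(samples):
    """Drop exact duplicates, keeping the first occurrence of each sample.
    Hashing the canonical JSON form keeps memory bounded for large pools."""
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(
            json.dumps(s, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```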

The experimental design extends TÜLU 2. Since TÜLU 2 is finetuned starting from Llama 2 base models, the primary experiments use the Llama 2 7B model. Additionally, results are reported using the TÜLU 3 mixture and Llama 3.1.

For finetuning, the models are fully fine-tuned for 2 epochs with a batch size of 1, 128 gradient accumulation steps, a learning rate of 2e−5 (1e−5 for 70B-size models), linear learning-rate warmup for the first 3% of training steps, and linear cooldown for the remainder of training.
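The described schedule (linear warmup over the first 3% of steps, then linear decay for the rest) can be sketched as a per-step function; the exact interpolation endpoints are an assumption.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.03):
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then linear decay to zero over the remaining steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # ramp up: reaches peak_lr at the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    # cool down: reaches zero at the final step
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step - 1) / remaining)
```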

The mean across three random runs (including reselecting data) is reported for the random baselines, and single-run scores for the other settings.

Single-Task Data Selection

Models are trained on 10k samples selected by each method and evaluated separately for each task. Two pool sizes are used: a smaller pool (200k samples) and the full pool (5.8M samples). LESS is not evaluated on the full pool due to computational constraints.

Figure: Single-task performance of different data selection methods on the TÜLU 2 unfiltered set.
Figure: Performance versus estimated compute cost of various data selection methods when selecting 10k points from pools of 200k (left) and 5.8M (right) data points.
• RDS+ performed best on average across both pool sizes.
• RDS+ achieved the best performance on every individual task (except SQuAD, where it was second best) when selecting from the full pool.
• Several methods (PPL, Random, IFD, Embed (NV)) performed worse with the larger pool, indicating scaling issues.
• Both RDS+ and Embed (GTR) improved with the larger pool.

Multi-task Selection

Figure: Multi-task performance of dataset selection methods when selecting 326k samples from the full TÜLU 2 unfiltered pool.
• RDS+ consistently outperforms other data selection methods, including human-curated mixtures and random selection.
• Embedding-based methods generally perform better than non-embedding methods for data selection.
Figure: Multi-task performance of RDS+ against baselines when finetuning from Llama 3.1 8B base and selecting 939k samples from the TÜLU 3 unfiltered mixture.
• RDS+ maintains strong performance even on out-of-distribution tasks, suggesting good generalization. Using a high-quality selection set like Arena Hard yields results comparable to using task-specific data.
• RDS+ also performs well across different data pools and base models, as shown by its outperforming the official TÜLU 3 SFT model when using TÜLU 3 data to fine-tune Llama 3.1 models.

Scaling Multi-task Selection

• RDS+ consistently outperforms balanced random selection across different selection sizes.
• RDS+ achieves performance comparable to training on the entire dataset while using only a fraction (around 6%) of the data. It even outperforms training on all data when selecting over 1 million samples.
• When accounting for the computational cost of both selection and training, RDS+ becomes more efficient than random selection when selecting larger datasets (≥ 326k samples).
• The cost of RDS+ could potentially be reduced further through optimizations such as reusing embeddings or using smaller models for selection.

Paper: Large-Scale Data Selection for Instruction Tuning (arXiv: 2503.01807)


