
    Papers Explained 314: vdr Embeddings | by Ritvik Rastogi | Feb, 2025

    By Team_AIBS News | February 20, 2025 | 4 Mins Read


    vdr embeddings are dense, single-vector representations of document page screenshots. These embeddings are designed to capture the visual and textual content of a document page, allowing efficient search and retrieval of visually rich documents without relying on Optical Character Recognition (OCR) or complex data extraction pipelines. The key advantage of vdr embeddings is their ability to perform semantic search directly on document images, enabling users to query documents based on their content and visual layout, regardless of the language.
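
    As a rough illustration of what single-vector retrieval means in practice, the sketch below ranks pages by cosine similarity between one query vector and one vector per page. The 4-dimensional vectors, the page ids, and the `search` helper are hypothetical stand-ins; in a real setup the vectors would come from the embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, page_vecs, top_k=3):
    """Rank page ids by cosine similarity to the query, best first."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real page embeddings (hypothetical data).
pages = {
    "invoice_p1": [0.9, 0.1, 0.0, 0.1],
    "chart_p7": [0.1, 0.8, 0.3, 0.0],
    "manual_p3": [0.0, 0.2, 0.9, 0.2],
}
query = [0.2, 0.9, 0.2, 0.0]  # stand-in for an embedded text query

results = search(query, pages, top_k=2)
for page_id, score in results:
    print(page_id, round(score, 3))
```

    Because every page is a single vector, the whole index is one flat list of vectors, which is what makes this kind of search cheap compared with multi-vector approaches.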

    The models and datasets are available at HuggingFace.

    The vdr-2b-multi-v1 and vdr-2b-v1 models are based on the MrLight/dse-qwen2-2b-mrl-v1 architecture, a transformer-based model fine-tuned for visual document retrieval tasks. The models take document page screenshots as input and output a fixed-size vector representation.

    The models are trained on a large, custom-built dataset of query-image pairs. The dataset is designed to be high-quality and diverse, covering a wide range of topics and document types.

    • Multilingual Search: For each language (Italian, Spanish, English, French, and German), a list of search queries covering various topics is generated. These queries are used to search for PDFs using the language-specific filtering capabilities of search engines.
    • Document Layout Analysis: Each page of the scraped PDFs is analyzed with a document layout analysis model to determine whether the page contains more textual or visual elements. Pages are classified as text-only, visual-only, or mixed.
    • Balanced Sampling: Approximately 100k pages are sampled, ensuring an even distribution across the three page types (text-only, visual-only, and mixed).
    • Query Generation: Queries are generated using Gemini-1.5-Pro and Qwen2-VL-72B. The models are tasked with generating both specific and general (broad) questions related to the document page. Only the specific questions are used for training.
    • Query Cleaning: The generated queries are cleaned to ensure they are suitable for training. This includes:
    1. Ensuring the language is correct.
    2. Fixing formatting problems.
    3. Removing markdown.
    4. Ensuring only one question is posed.
    5. Removing grounding phrases.
    • Query Filtering: To filter out bad questions, each broad query is embedded and indexed using the voyage-3 embedding model. For each specific question, the index is searched. A query is marked as 'good' if its associated broad question appears in the top 100 results.
    • Hard-Negative Mining: Hard negatives are mined using voyage-3 on the specific questions, with a fixed threshold of 0.75.
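
    The last two steps can be sketched as follows. This is a minimal illustration with toy vectors, not the authors' pipeline: the helper names are hypothetical, real vectors would come from voyage-3, and the role of the 0.75 threshold is an assumption (here, candidates scoring at or above it are skipped as likely false negatives, since the text does not say how the threshold is applied).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(query_vec, index):
    """Return (id, score) pairs from `index`, most similar first."""
    scored = [(key, cosine(query_vec, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def is_good_query(page_id, specific_vec, broad_index, top_k=100):
    """Keep a specific query if its own page's broad query ranks in the top_k."""
    top_ids = [key for key, _ in rank(specific_vec, broad_index)[:top_k]]
    return page_id in top_ids

def mine_hard_negatives(page_id, specific_vec, index, threshold=0.75, n=2):
    """ASSUMPTION: candidates scoring >= threshold are skipped as likely
    false negatives; the n most similar remaining pages are returned."""
    negatives = []
    for key, score in rank(specific_vec, index):
        if key == page_id or score >= threshold:
            continue
        negatives.append(key)
        if len(negatives) == n:
            break
    return negatives

# Toy text-embedding vectors keyed by page id (hypothetical data).
broad_index = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.3, 0.0],
    "p3": [0.0, 1.0, 0.0],
    "p4": [0.0, 0.0, 1.0],
}
specific_p1 = [0.95, 0.1, 0.05]  # specific question generated for page p1

good = is_good_query("p1", specific_p1, broad_index, top_k=2)
negs = mine_hard_negatives("p1", specific_p1, broad_index)
print(good, negs)
```

    In this toy run, page p2 is too close to the query to be trusted as a negative under the assumed rule, so the mined negatives are the next most similar pages.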

    The training dataset, vdr-multilingual-train, consists of 496,167 PDF pages, of which 280,679 are associated with filtered queries. The dataset is divided into five language subsets (Italian, Spanish, English, French, and German).

    The models are trained using the DSE (Document Screenshot Embedding) approach. Hard-mined negatives are used during training to improve the model's ability to distinguish between similar documents.

    Matryoshka Representation Learning (MRL): The loss function is calibrated to track performance across all dimensions, leading the model to frontload the most important identifying information. This allows the embedding dimensions to be shrunk according to scale and budget.
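
    One common way to realize this idea is to compute a contrastive loss on several nested prefixes of the embedding and average the results. The sketch below does that with an InfoNCE-style loss and one hard negative. The choice of InfoNCE, the temperature, the prefix sizes, and the equal weighting are all assumptions for illustration; the article does not specify them.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + negatives."""
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

def matryoshka_loss(query, positive, negatives, dims):
    """Average the contrastive loss over nested prefixes of the embedding."""
    losses = [
        info_nce(query[:d], positive[:d], [n[:d] for n in negatives])
        for d in dims
    ]
    return sum(losses) / len(losses)

# Toy 8-dimensional embeddings (hypothetical data).
query = [1.0, 0.2, 0.1, 0.0, 0.3, 0.0, 0.1, 0.0]
positive = [0.9, 0.3, 0.0, 0.1, 0.2, 0.1, 0.0, 0.1]
hard_negative = [0.1, 1.0, 0.2, 0.0, 0.0, 0.3, 0.1, 0.0]

loss_good = matryoshka_loss(query, positive, [hard_negative], dims=(2, 4, 8))
loss_bad = matryoshka_loss(query, hard_negative, [positive], dims=(2, 4, 8))
print(loss_good < loss_bad)
```

    Because every prefix contributes to the loss, the first few dimensions are pushed to carry as much discriminative information as possible, which is what makes later truncation viable.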

    The fashions are evaluated utilizing the ViDoRe benchmark.

    • The multilingual model outperforms the base model in every language and every page type, on average by +2.3%. On the ViDoRe benchmark, it also performs slightly better (+0.5%).
    • The fine-tuned vdr-2b-multi-v1 makes large leaps in performance, especially on non-English visual-only or mixed pages, for example a +6.33% NDCG@5 improvement for German visual-only retrieval over the base model.

    Faster Inference

    • The English-only vdr-2b-v1 model also matches the performance of the base model on the ViDoRe benchmark synthetic datasets, while using only 30% of the image tokens (768 vs. 2560).
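
    For readers unfamiliar with the metric, here is a minimal sketch of NDCG@5 for a single query. It uses linear gain, which is identical to the exponential-gain variant when relevance labels are binary, and it assumes the ranked list contains every relevant page so the ideal ordering can be derived from it. The function names are illustrative, not taken from the benchmark code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Relevance of each returned page, in ranked order (1 = the page the query was written for).
hit_at_rank_1 = ndcg_at_k([1, 0, 0, 0, 0])
hit_at_rank_3 = ndcg_at_k([0, 0, 1, 0, 0])
miss_in_top_5 = ndcg_at_k([0, 0, 0, 0, 0, 1])
print(hit_at_rank_1, hit_at_rank_3, miss_in_top_5)
```

    With one relevant page per query, the score depends only on the rank at which that page appears, and a benchmark figure is the mean of this value over all queries.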

    Cross-Lingual Retrieval

    • The model is significantly better across all document types, with an average improvement of +2.3%.
    • These retrieval capabilities are essential for real-world use cases, especially in linguistically fragmented continents such as Europe.

    MRL and Binary Embeddings

    NDCG@5 (float).
    NDCG@5 (binary).
    • 1024-dimension float vectors offer a very good balance between quality and size. They are ~30% smaller but still retain 99% of the retrieval performance.
    • This is also true for the 1536-dimension binary vectors, which have 10x fewer bytes per vector but still retain 97% of their retrieval quality.
    • It is also interesting to see that 1536-dimension binary vectors almost match the performance of the base model's 1536-dimension float vectors.
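
    Both size reductions can be sketched in a few lines. The code below truncates a vector to its first 1024 dimensions and, separately, packs a 1536-dimension vector into one bit per dimension, compared with Hamming distance. The random vector is a stand-in for a real embedding, and thresholding at zero is an assumption; the article does not describe the exact quantization rule.

```python
import math
import random

def truncate(vec, dim):
    """Keep the first `dim` values and re-normalize to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def binarize(vec):
    """One bit per dimension (1 where the value is positive), packed into bytes."""
    bits = [1 if x > 0 else 0 for x in vec]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)

def hamming(a, b):
    """Number of differing bits between two packed binary vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

rng = random.Random(0)
full = [rng.gauss(0, 1) for _ in range(1536)]  # stand-in for a real embedding

short = truncate(full, 1024)
packed = binarize(full)
print(len(short), len(packed))  # dimensions kept, bytes in the binary vector
```

    Truncation only works well because the Matryoshka training concentrates information in the leading dimensions; binary vectors trade a small amount of ranking quality for a much smaller index and very fast bitwise comparison.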

    Visual Document Retrieval Goes Multilingual


