What I Learned Benchmarking GPU-Powered Vector Search with cuVS and Milvus

By Alex Chen | July 2025


I've been knee-deep in performance tuning lately while building out a semantic search system. One problem kept surfacing: CPU-bound vector search doesn't scale as smoothly as I hoped, especially when pushing past 100 million vectors. So I started exploring GPU-accelerated indexing, specifically NVIDIA's cuVS library and the CAGRA algorithm.

Here's what I learned after some hands-on testing and analysis.

Let's say you're building a RAG pipeline. The vector search step, where you retrieve the top-k semantically similar chunks, is often the latency bottleneck. Once your embedding volume crosses into the tens or hundreds of millions, search gets expensive in both memory and time.

We can't just scale vertically forever. That's why libraries like cuVS exist: they take the computational core of vector search and move it to the GPU, giving us the kind of throughput CPUs simply can't match.

When embedding a document corpus, we typically chunk the data and run each chunk through an embedding model like all-MiniLM-L6-v2, which outputs 384-dimensional vectors. Similarity is computed using cosine distance, Euclidean distance, or dot product.
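
As a concrete reference point, here is a minimal sketch of that embedding step using the sentence-transformers library (my assumed tooling; the exact pipeline may differ):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "GPU-accelerated vector search with cuVS.",
    "CAGRA is a graph-based ANN index built for GPUs.",
]
# Normalizing the vectors lets a plain dot product act as cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)

query = model.encode(["How do I speed up vector search?"], normalize_embeddings=True)
cosine_scores = query @ embeddings.T  # cosine similarity via dot product
print(cosine_scores)
```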

Most real-world search systems rely on an index to make this lookup fast. Without an index, brute-force scanning millions of vectors is painfully slow.
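
To make that concrete, this is what the brute-force baseline looks like in NumPy (a toy sketch with random data; at tens of millions of vectors this full scan is exactly what becomes too slow):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 384, 100_000
vectors = rng.standard_normal((n, dim), dtype=np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize for cosine

query = rng.standard_normal(dim, dtype=np.float32)
query /= np.linalg.norm(query)

# Brute force: score the query against every vector, then take the top-k.
scores = vectors @ query
k = 10
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]
print(top_k, scores[top_k])
```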

Hierarchical Navigable Small World (HNSW) is a common CPU-based indexing method. It's a multi-layer proximity graph that lets you traverse quickly to approximate nearest neighbors.
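
For reference, this is roughly how an HNSW index is built and queried on CPU with the hnswlib library (my choice of library for illustration; M and ef_construction are typical values, not the settings used in the benchmark):

```python
import hnswlib
import numpy as np

dim, n = 128, 100_000                      # the benchmark below uses 128-dim vectors
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))        # graph construction is the expensive part

index.set_ef(64)                           # higher ef -> better recall, higher latency
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)                        # (5, 10)
```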

I tested it on a 10M-vector dataset (128 dimensions) and measured the build times.

Even with decent parallelism, build times scale poorly. And at higher recall levels, latency becomes a real concern.

NVIDIA's cuVS includes a few algorithms, but the one I focused on was CAGRA, a GPU-first graph indexing method that resembles HNSW in principle but is optimized for massively parallel execution.
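
Here is a rough sketch of building and querying a CAGRA index through the cuVS Python bindings. Treat the exact parameter and return conventions as assumptions; they vary across cuVS releases, so check the version you install:

```python
import cupy as cp
from cuvs.neighbors import cagra

# Dataset lives on the GPU as float32; 128 dims to mirror the benchmark above.
dataset = cp.random.random((1_000_000, 128), dtype=cp.float32)
queries = cp.random.random((1_000, 128), dtype=cp.float32)

# Build the CAGRA graph; graph_degree / intermediate_graph_degree trade quality vs. build cost.
build_params = cagra.IndexParams(graph_degree=64, intermediate_graph_degree=128)
index = cagra.build(build_params, dataset)

# Search: itopk_size trades recall against latency, similar in spirit to HNSW's ef.
search_params = cagra.SearchParams(itopk_size=128)
distances, neighbors = cagra.search(search_params, index, queries, k=10)
print(cp.asarray(neighbors).shape)  # (1000, 10)
```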

I re-ran the same benchmarks on an A10G GPU.

Speedup: ~7–10x depending on recall level.

Throughput was also dramatically higher.

What surprised me was this: even if you don't want to deploy your search engine on a GPU (due to cost, ops complexity, and so on), you can still use the GPU for index building.

CAGRA-built graphs can be exported and used as input to HNSW-like traversal logic on CPU. I found this yielded better latency than natively built HNSW graphs once vector dimensionality got large (>512D).
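
Recent cuVS releases expose a conversion path from a CAGRA index to an hnswlib-compatible CPU index; the sketch below shows the build-on-GPU, serve-on-CPU pattern, but the exact module and function names (cuvs.neighbors.hnsw, from_cagra) are assumptions on my part and should be checked against your cuVS version:

```python
import cupy as cp
import numpy as np
from cuvs.neighbors import cagra, hnsw  # hnsw submodule name is an assumption

dataset = cp.random.random((1_000_000, 768), dtype=cp.float32)

# 1) Build the graph on the GPU, where construction is fast.
gpu_index = cagra.build(cagra.IndexParams(graph_degree=64), dataset)

# 2) Convert it into an HNSW-style index that can be served from CPU memory
#    (exact conversion API is an assumption).
cpu_index = hnsw.from_cagra(hnsw.IndexParams(), gpu_index)

# 3) Query on CPU; no GPU is needed at serving time.
queries = np.random.random((100, 768)).astype(np.float32)
distances, neighbors = hnsw.search(hnsw.SearchParams(ef=128), cpu_index, queries, k=10)
```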

cuVS also supports quantization through an extension called CAGRA-Q. This is especially useful when:

• You have memory-constrained GPUs (e.g., 8–16 GB consumer cards)
• You want to offload graphs to CPU memory while keeping vector data on GPU

Quantization does reduce precision slightly, but in my testing it held up surprisingly well until you push below 8-bit.
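
CAGRA-Q itself is configured through cuVS build parameters, which I won't reproduce here. But the memory/precision trade-off it exploits is easy to see with plain 8-bit scalar quantization in NumPy (a generic illustration, not the CAGRA-Q codepath):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 384)).astype(np.float32)

# 8-bit scalar quantization: map each dimension's range onto 256 levels.
lo, hi = vectors.min(axis=0), vectors.max(axis=0)
scale = (hi - lo) / 255.0
codes = np.round((vectors - lo) / scale).astype(np.uint8)   # 4x smaller than float32
reconstructed = codes.astype(np.float32) * scale + lo

# Relative reconstruction error stays small at 8 bits; it grows quickly below that.
err = np.linalg.norm(vectors - reconstructed) / np.linalg.norm(vectors)
print(f"memory: {codes.nbytes / vectors.nbytes:.2f}x, relative error: {err:.4f}")
```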

I also tested cuVS-backed indexing in Milvus, which can offload both index-building and query-node computation to the GPU. The architecture supports this kind of split natively.

For example, building IVF-PQ + CAGRA indexes on large datasets scaled linearly with GPU count.
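
In Milvus this mostly comes down to choosing a GPU index type when you create the index. Here is a minimal pymilvus sketch, assuming a GPU-enabled Milvus deployment; the endpoint, collection, and field names are placeholders, and the tuning values are illustrative rather than the benchmark's settings:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # placeholder endpoint

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",            # placeholder vector field
    index_type="GPU_CAGRA",            # cuVS-backed CAGRA index built on the GPU
    metric_type="L2",
    params={
        "intermediate_graph_degree": 128,
        "graph_degree": 64,
    },
)
client.create_index(collection_name="docs", index_params=index_params)
```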

Even after normalizing for hardware cost (A10G at ~$9.68/hr versus a typical CPU node at ~$0.78/hr), I found a 12.5x better time-to-cost ratio for GPU-based indexing.

At large scale, the numbers get even starker:

635M vectors, 1024 dimensions

• 8× H100 GPUs (IVF-PQ): 56 minutes
• CPU-only: ~6.22 days

The performance ceiling on CPU-based vector search is real. If you're working with dense, high-dimensional embeddings, you'll likely hit it sooner than you think.

cuVS and CAGRA offer a well-engineered, modular path to scaling search workloads without rewriting everything from scratch. Even just using the GPU for offline index builds can make a huge difference.

Next, I'll be digging into how this plays with hybrid retrieval (dense + sparse) and exploring GPU-powered filtering in multi-tenant scenarios. Curious to see where the real trade-offs emerge.


