I’ve been knee-deep in performance tuning lately while building out a semantic search system. One problem kept surfacing: CPU-bound vector search doesn’t scale as easily as I’d hoped, especially once you push past 100 million vectors. So I started exploring GPU-accelerated indexing, in particular NVIDIA’s cuVS library and the CAGRA algorithm.
Here’s what I learned after some hands-on testing and analysis.
Say you’re building a RAG pipeline. The vector search step, where you retrieve the top-k semantically similar chunks, is often the latency bottleneck. Once your embedding count crosses into the tens or hundreds of millions, search gets expensive, both in memory and in time.
We can’t just scale vertically forever. That’s why libraries like cuVS exist: they take the computational core of vector search and move it to the GPU, giving us the kind of throughput CPUs simply can’t match.
When embedding a document corpus, we typically chunk the data and run each chunk through an embedding model like all-MiniLM-L6-v2, which outputs 384-dimensional vectors. Similarity is then computed using cosine distance, Euclidean distance, or dot product.
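In code, that stage looks roughly like this (a minimal sketch; the chunking helper and the doc.txt path are placeholders I’ve made up for illustration):

```python
# Minimal sketch: chunk a document and embed the chunks with all-MiniLM-L6-v2.
# Assumes sentence-transformers is installed; the fixed-size chunker and the
# doc.txt path are illustrative placeholders, not a production pipeline.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    # Naive fixed-size character chunks, purely for illustration.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")               # 384-dimensional output
chunks = chunk_text(open("doc.txt").read())
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 384)

# With normalized vectors, cosine similarity reduces to a dot product.
query = model.encode(["what is vector search?"], normalize_embeddings=True)
scores = embeddings @ query.T
```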
Most real-world search systems rely on an index to make this lookup fast. Without an index, brute-force scanning millions of vectors is painfully slow.
Hierarchical Navigable Small World (HNSW) is a common CPU-based indexing method. It’s a multi-layer proximity graph that lets you traverse quickly toward approximate nearest neighbors.
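For reference, here’s what a CPU-side HNSW build looks like, sketched with hnswlib (any HNSW library follows the same shape; these are not my exact benchmark settings):

```python
# Minimal CPU baseline: an HNSW index built with hnswlib (an assumed dependency;
# any HNSW implementation looks similar). Parameters are common starting points.
import numpy as np
import hnswlib

dim, num_vectors = 128, 1_000_000                  # scaled down from the 10M run
data = np.random.rand(num_vectors, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)         # "cosine" and "ip" also supported
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(data, num_threads=-1)              # use all available cores

index.set_ef(64)                                   # higher ef -> better recall, more latency
labels, distances = index.knn_query(data[:10], k=10)
```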
I tested it on a 10M-vector dataset (128 dimensions) and measured the index build times.
Even with decent parallelism, build times scale poorly. And at higher recall levels, query latency becomes a real concern.
NVIDIA’s cuVS includes a few algorithms, but the one I focused on was CAGRA, a GPU-first graph indexing method that resembles HNSW in principle but is optimized for massively parallel execution.
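Building and searching a CAGRA index takes only a few lines with the cuVS Python API. The sketch below follows the cuvs.neighbors.cagra module as I understand it; exact parameter names vary between releases, so treat the arguments as assumptions:

```python
# Minimal sketch of a CAGRA build + search with the cuVS Python API
# (cuvs.neighbors.cagra). Parameter names can shift between cuVS releases.
import cupy as cp
from cuvs.neighbors import cagra

dim, num_vectors = 128, 1_000_000                  # scaled down from the 10M benchmark
dataset = cp.random.random((num_vectors, dim), dtype=cp.float32)  # vectors on the GPU

# graph_degree trades recall against memory footprint and search speed.
build_params = cagra.IndexParams(
    metric="sqeuclidean",
    intermediate_graph_degree=128,
    graph_degree=64,
)
index = cagra.build(build_params, dataset)

# itopk_size controls the internal candidate list: the main recall/latency knob.
queries = dataset[:1000]
search_params = cagra.SearchParams(itopk_size=64)
distances, neighbors = cagra.search(search_params, index, queries, k=10)
```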
I re-ran the same benchmarks on an A10G GPU:
Speedup: ~7–10x, depending on recall level.
Throughput was also dramatically higher.
What surprised me was this: even if you don’t want to deploy your search engine on a GPU (due to cost, ops complexity, and so on), you can still use a GPU for index building.
CAGRA-built graphs can be exported and used as input to HNSW-like traversal logic on the CPU. I found this yielded better latency than natively built HNSW graphs once vector dimensionality got large (>512D).
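A rough sketch of that build-on-GPU, serve-on-CPU flow, assuming cuVS’s hnsw conversion helper (signatures differ between cuVS releases, so verify against the docs for your version):

```python
# Sketch: build with CAGRA on the GPU, then convert the graph into an
# HNSW-style index that is searched entirely on the CPU.
# Assumes cuvs.neighbors.hnsw and its from_cagra() helper exist in this form.
import numpy as np
from cuvs.neighbors import cagra, hnsw

dataset = np.random.random((1_000_000, 768)).astype(np.float32)  # high-dim host data
gpu_index = cagra.build(cagra.IndexParams(graph_degree=64), dataset)

# Convert the GPU-built graph into a CPU-searchable HNSW index.
cpu_index = hnsw.from_cagra(hnsw.IndexParams(), gpu_index)

queries = dataset[:100]
distances, neighbors = hnsw.search(hnsw.SearchParams(ef=64), cpu_index, queries, k=10)
```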
cuVS also supports quantization via an extension called CAGRA-Q. This is especially useful when:
- You have memory-constrained GPUs (e.g., 8–16GB consumer cards)
- You want to offload graphs to CPU memory while keeping vector data on the GPU
Quantization does reduce precision slightly, but in my testing recall held up surprisingly well until you push below 8-bit.
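To see why this matters, here’s the back-of-the-envelope memory math for a 100M-vector, 384-dimensional corpus (plain arithmetic, nothing cuVS-specific):

```python
# Back-of-the-envelope memory math: 100M vectors at 384 dimensions,
# full-precision float32 vs. 8-bit quantized storage (graph overhead excluded).
num_vectors, dim = 100_000_000, 384

fp32_bytes = num_vectors * dim * 4    # 4 bytes per float32 component
int8_bytes = num_vectors * dim * 1    # ~1 byte per component at 8-bit

print(f"fp32:  {fp32_bytes / 2**30:.1f} GiB")   # ~143.1 GiB
print(f"8-bit: {int8_bytes / 2**30:.1f} GiB")   # ~35.8 GiB
```

Even at 8-bit the full corpus is substantial, which is why the option to keep the graph in CPU memory while the vector data stays on the GPU matters.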
I also tested cuVS-backed indexing on Milvus, which can offload both index-node and query-node computations to the GPU. The architecture supports this kind of split natively.
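On the Milvus side, switching to a GPU-built index is mostly an index_params change. A hedged sketch with pymilvus, using made-up connection, collection, and field names:

```python
# Sketch: creating a GPU-built CAGRA index in Milvus (2.4+) via pymilvus.
# The connection details, collection name, and field name are hypothetical;
# double-check the GPU_CAGRA parameters against your Milvus version's docs.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("doc_chunks")                 # assumed existing collection

index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "L2",
    "params": {
        "intermediate_graph_degree": 128,             # graph built on GPU, then pruned
        "graph_degree": 64,
    },
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
```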
For example, building IVF-PQ and CAGRA indexes on large datasets scaled linearly with GPU count.
Even after normalizing for hardware cost (A10G at ~$9.68/hr vs. a typical CPU instance at ~$0.78/hr), I found a 12.5x better time-to-cost ratio for GPU-based indexing.
At larger scale, the numbers get even starker. Building an index over 635M vectors at 1024 dimensions:
- 8× H100 GPUs (IVF-PQ): 56 minutes
- CPU-only: ~6.22 days
The performance ceiling on CPU-based vector search is real. If you’re working with dense, high-dimensional embeddings, you’ll likely hit it sooner than you think.
cuVS and CAGRA offer a well-engineered, modular path to scaling search workloads without rewriting everything from scratch. Even just using a GPU for offline index builds can make a huge difference.
Next, I’ll be digging into how this plays with hybrid retrieval (dense + sparse) and exploring GPU-powered filtering in multi-tenant scenarios. Curious to see where the real trade-offs emerge.