These days, it’s all about agents, and I’m all for it. Agents go beyond basic vector search by giving LLMs access to a variety of tools:
- Web search
- Various API calls
- Querying different databases
While there’s a surge in new MCP servers being developed, there’s surprisingly little evaluation happening. Sure, you can hook an LLM up to all sorts of tools, but do you really know how it’s going to behave? That’s why I’m planning a series of blog posts focused on evaluating both off-the-shelf and custom graph MCP servers, specifically those that retrieve information from Neo4j.
Model Context Protocol (MCP) is Anthropic’s open standard that works like “a USB-C port for AI applications,” standardizing how AI systems connect to external data sources through lightweight servers that expose specific capabilities to clients. The key insight is reusability. Instead of building custom integrations for every data source, developers build reusable MCP servers once and share them across multiple AI applications.
An MCP server implements the Model Context Protocol, exposing tools and data to an AI client via structured JSON-RPC calls. It handles requests from the client, executes them against local or remote APIs, and returns results to enrich the AI’s context.
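To make that concrete, here is a minimal sketch of an MCP server built with the official Python SDK’s FastMCP helper. The server name and the placeholder schema tool are made up for illustration, not part of any real project.

```python
# Minimal sketch of an MCP server built with the official Python SDK's
# FastMCP helper (pip install mcp). The server name and the placeholder
# schema string are made up for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-graph-server")

@mcp.tool()
def get_schema() -> str:
    """Return a (hypothetical) description of the graph schema."""
    return "(:Person {name})-[:ACTED_IN]->(:Movie {title})"

if __name__ == "__main__":
    # Serve over stdio; an MCP client invokes the tool via JSON-RPC.
    mcp.run()
```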
To evaluate MCP servers and their retrieval methods, the first step is to generate an evaluation dataset, something we’ll use an LLM to help with. In the second stage, we’ll take an off-the-shelf mcp-neo4j-cypher server and test it against the benchmark dataset we created.
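As a rough illustration of that first step, the drafting can be as simple as prompting an LLM with the graph schema. The prompt wording, model name, and output format below are assumptions, and in practice the expected answers would still need to be verified against the live database.

```python
# Hypothetical sketch: ask an LLM to draft benchmark question-answer pairs
# from a graph schema. Model name, prompt, and JSON format are illustrative.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

schema = "(:Person {name})-[:ACTED_IN]->(:Movie {title, released})"
prompt = (
    "Given this Neo4j graph schema:\n"
    f"{schema}\n"
    "Propose five benchmark questions that need one or two hops to answer. "
    'Return JSON: [{"question": ..., "expected_answer": ..., "hops": ...}]'
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
candidate_pairs = json.loads(response.content[0].text)
```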

The goal for now is to establish a solid dataset and framework so we can consistently compare different retrievers throughout the series.
Code is available on GitHub.
Evaluation dataset
Last year, Neo4j released the Text2Cypher (2024) Dataset, which was designed around a single-step approach to Cypher generation. In single-step Cypher generation, the system receives a natural language question and must produce one complete Cypher query that directly answers it, essentially a one-shot translation from text to database query.
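For illustration, a single-step pair looks roughly like this; the question, schema, and Cypher are made up for the example rather than taken from the dataset.

```python
# Illustrative single-step text2cypher pair: one question, one complete
# Cypher query that answers it in a single shot (schema and data made up).
single_step_example = {
    "question": "Which actors appeared in The Matrix?",
    "cypher": (
        "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'The Matrix'}) "
        "RETURN p.name"
    ),
}
```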
However, this approach doesn’t reflect how agents actually work with graph databases in practice. Agents operate through multi-step reasoning: they can execute multiple tools iteratively, generate several Cypher statements in sequence, analyze intermediate results, and combine findings from different queries to build up to a final answer. This iterative, exploratory approach is a fundamentally different paradigm from the prescribed single-step model.

The current benchmark dataset fails to capture how MCP servers actually get used in agentic workflows. The benchmark needs updating to evaluate multi-step reasoning capabilities rather than just single-shot text2cypher translation. This will better reflect how agents navigate complex information retrieval tasks that require breaking down problems, exploring data relationships, and synthesizing results across multiple database interactions.
Evaluation metrics
The most important shift when moving from single-step text2cypher evaluation to an agentic approach lies in how we measure accuracy.

In traditional text2query tasks like text2cypher, evaluation typically involves comparing the database response directly to a predefined ground truth, often checking for exact matches or equivalence.
However, agentic approaches introduce a key change. The agent may perform multiple retrieval steps, choose different query paths, or even rephrase the original intent along the way. Consequently, there may be no single correct query. Instead, we shift our focus to evaluating the final answer generated by the agent, regardless of the intermediate queries it used to arrive there.
To assess this, we use an LLM-as-a-judge setup, comparing the agent’s final answer against the expected answer. This lets us evaluate the semantic quality and usefulness of the output rather than the internal mechanics or specific query results.
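A minimal judging function might look like the following; the prompt wording, 0–1 scale, and model choice are assumptions for the sketch rather than the exact setup used later in the benchmark.

```python
# Hypothetical LLM-as-a-judge: score the agent's final answer against the
# expected answer for semantic correctness instead of exact string matching.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, expected: str, answer: str) -> float:
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent answer: {answer}\n"
        "Score how well the agent answer matches the expected answer on a "
        "scale from 0 to 1. Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```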
Result Granularity and Agent Behavior
Another important consideration in agentic evaluation is the amount of data returned from the database. In traditional text2cypher tasks, it’s common to allow or even expect large query results, since the goal is to test whether the correct data is retrieved. However, this approach doesn’t translate well to evaluating agentic workflows.
In an agentic setting, we’re not just testing whether the agent can access the correct data, but whether it can generate a concise, accurate final answer. If the database returns too much information, the evaluation becomes entangled with other variables, such as the agent’s ability to summarize or navigate large outputs, rather than focusing on whether it understood the user’s intent and retrieved the correct information.
Introducing Real-World Noise
To further align the benchmark with real-world agentic usage, we also introduce controlled noise into the evaluation prompts.

This includes elements such as:
- Typographical errors in named entities (e.g., “Andrwe Carnegie” instead of “Andrew Carnegie”),
- Colloquial phrasing or informal language (e.g., “show me what’s up with Tesla’s board” instead of “list the members of Tesla’s board of directors”),
- Overly broad or under-specified intents that require follow-up reasoning or clarification.
These variations reflect how users actually interact with agents in practice. In real deployments, agents must handle messy inputs, incomplete formulations, and conversational shorthand, conditions that are rarely captured by clean, canonical benchmarks.
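For the first kind of noise, a simple character swap is enough to simulate a typo. This helper is purely illustrative, since the benchmark prompts themselves were drafted with an LLM.

```python
# Illustrative noise injection: swap two adjacent characters in an entity
# name to simulate a user typo, e.g. "Andrew" -> "Andrwe".
import random

def add_typo(name: str) -> str:
    if len(name) < 2:
        return name
    i = random.randrange(len(name) - 1)
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

noisy_question = f"Tell me about {add_typo('Andrew Carnegie')}."
```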
To better reflect these insights around evaluating agentic approaches, I’ve created a new benchmark using Claude 4.0. Unlike traditional benchmarks that focus on Cypher query correctness, this one is designed to assess the quality of the final answers produced by multi-step agents.
Databases
To ensure a variety of evaluations, we use a few different databases that are available on the Neo4j demo server. Examples include:

MCP-Neo4j-Cypher server
mcp-neo4j-cypher is a ready-to-use MCP tool interface that allows agents to interact with Neo4j through natural language. It supports three core capabilities: viewing the graph schema, running Cypher queries to read data, and executing write operations to update the database. Results are returned in a clean, structured format that agents can easily understand and use.

It works out of the box with any framework that supports MCP servers, making it simple to plug into existing agent setups without extra integration work. Whether you’re building a chatbot, data assistant, or custom workflow, this tool lets your agent safely and intelligently work with graph data.
Benchmark
Finally, let’s run the benchmark evaluation.
We used LangChain to host the agent and connect it to the mcp-neo4j-cypher server, which is the only tool provided to the agent. This setup keeps the evaluation simple and realistic: the agent must rely entirely on natural language interaction with the MCP interface to retrieve and manipulate graph data.
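The wiring looks roughly like this. It’s a sketch that assumes the langchain-mcp-adapters package and LangGraph’s prebuilt ReAct agent; the connection details, environment variable names, and credentials are placeholders that may differ from the actual setup.

```python
# Sketch: connect a LangGraph ReAct agent to the mcp-neo4j-cypher server
# over stdio via langchain-mcp-adapters. URI and credentials are placeholders.
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

async def main():
    client = MultiServerMCPClient(
        {
            "neo4j": {
                "command": "uvx",
                "args": ["mcp-neo4j-cypher"],
                "transport": "stdio",
                "env": {
                    "NEO4J_URI": "neo4j+s://demo.neo4jlabs.com",
                    "NEO4J_USERNAME": "recommendations",
                    "NEO4J_PASSWORD": "recommendations",
                    "NEO4J_DATABASE": "recommendations",
                },
            }
        }
    )
    tools = await client.get_tools()  # schema, read-cypher, write-cypher tools
    agent = create_react_agent("anthropic:claude-3-7-sonnet-latest", tools)
    result = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "How many movies are in the database?"}]}
    )
    print(result["messages"][-1].content)

asyncio.run(main())
```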
For the evaluation, we used Claude 3.7 Sonnet as the agent and GPT-4o mini as the judge.
The benchmark dataset includes roughly 200 natural language question-answer pairs, categorized by number of hops (1-hop, 2-hop, and so on) and by whether the questions contain distracting or noisy information. This structure helps assess the agent’s reasoning accuracy and robustness in both clean and noisy contexts. The evaluation code is available on GitHub.
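Each entry looks roughly like this; the field names and values are illustrative of the structure rather than the exact schema in the repository.

```python
# Illustrative benchmark entry: question, expected answer, hop count, and a
# flag for whether noise was injected. Field names are assumptions.
example_entry = {
    "question": "Wich movies did Tom Hnaks act in after 2000?",  # noisy variant
    "expected_answer": "Cast Away, The Polar Express, ...",
    "hops": 1,
    "noisy": True,
    "database": "recommendations",
}
```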
Let’s examine the results together.

The evaluation shows that an agent using only the mcp-neo4j-cypher interface can effectively answer complex natural language questions over graph data. Across a benchmark of around 200 questions, the agent achieved a mean score of 0.71, with performance dropping as question complexity increased. The presence of noise in the input significantly reduced accuracy, revealing the agent’s sensitivity to typos in named entities and the like.
On the tool usage side, the agent averaged 3.6 tool calls per question. This is consistent with the current requirement to make at least one call to fetch the schema and another to execute the first Cypher query. Most questions fell within a 2–4 call range, showing the agent’s ability to reason and act efficiently. Notably, a small number of questions were answered with only one or even zero tool calls, anomalies that may suggest early stopping, incorrect planning, or agent bugs, and are worth further analysis. Looking ahead, the tool count could be reduced further if schema access were embedded directly via MCP resources, eliminating the need for an explicit schema-fetch step.
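Counting tool calls per question is straightforward if you keep the message history around. This snippet continues the earlier sketch and assumes `result` is the return value of `agent.ainvoke`.

```python
# Count how many tool calls the agent made in one run by inspecting the AI
# messages in the LangGraph message history returned by agent.ainvoke.
from langchain_core.messages import AIMessage

tool_call_count = sum(
    len(m.tool_calls) for m in result["messages"] if isinstance(m, AIMessage)
)
print(f"Tool calls used: {tool_call_count}")
```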
The real value of having a benchmark is that it opens the door to systematic iteration. Once baseline performance is established, you can start tweaking parameters, observing their impact, and making targeted improvements. For example, if agent execution is expensive, you might want to test whether capping the number of allowed steps at 10 using a LangGraph recursion limit has a measurable effect on accuracy. With the benchmark in place, these trade-offs between performance and efficiency can be explored quantitatively rather than guessed.
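With LangGraph, that cap is a one-line config change. The snippet below continues the earlier agent sketch (with `question` standing in for a benchmark question string) and uses LangGraph’s standard recursion_limit setting.

```python
# Cap the agent at 10 graph steps with LangGraph's recursion limit; runs
# that need more steps raise GraphRecursionError instead of continuing.
result = await agent.ainvoke(
    {"messages": [{"role": "user", "content": question}]},
    config={"recursion_limit": 10},
)
```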

With a 10-step limit in place, performance dropped noticeably. The mean evaluation score fell to 0.535. Accuracy decreased sharply on more complex (3-hop+) questions, suggesting the step limit cut off deeper reasoning chains. Noise continued to degrade performance, with noisy questions averaging lower scores than clean ones.
Summary
We’re living in an exciting moment for AI, with the rise of autonomous agents and emerging standards like MCP dramatically expanding what LLMs can do, especially when it comes to structured, multi-step tasks. But while the capabilities are growing fast, robust evaluation is still lagging behind. That’s where this GRAPE project comes in.
The goal is to build a practical, evolving benchmark for graph-based question answering using the MCP interface. Over time, I plan to refine the dataset, experiment with different retrieval strategies, and explore how to extend or adapt the Cypher MCP for better accuracy. There’s still a lot of work ahead, from cleaning data and improving retrieval to tightening evaluation. However, having a clear benchmark means we can track progress meaningfully, test ideas systematically, and push the boundaries of what these agents can reliably do.