
ZDNET's key takeaways
- Even the best AI models are challenged to carry out tasks via MCP.
- New benchmarks show models struggle when tasks become more complex.
- AI models need more training that's specific to MCP use.
An emerging category of artificial intelligence middleware known as Model Context Protocol is meant to make generative AI programs such as chatbots more powerful by letting them connect with various resources, including packaged software such as databases.
Multiple studies, however, reveal that even the best AI models struggle to use Model Context Protocol. Top AI models such as Google's Gemini 2.5 Pro require many, many rounds of interaction with the external programs, leading to long delays before the models deliver results.
Also: What is Model Context Protocol? The emerging standard bridging AI and data, explained
"Even state-of-the-art models struggle with different capabilities," writes Zhenting Wang and team at consulting firm Accenture, the MIT-IBM Watson AI Lab, and the University of California at Berkeley in an August work that introduced MCP-Bench, a set of 250 tasks for AI agents employing MCP.
"Performance generally declines as tasks transition from Single Server to Multi Server scopes," writes Zikang Guo and team at the University of Science and Technology of China last month when they tested several AI models on their own benchmark test, MCP-AgentBench.
Even the best models today, including OpenAI's GPT-5, have "failure cases" arising from "repetitive or exploratory interactions that fail to make meaningful progress," write lead author Zijian Wu and the team at the National University of Singapore and collaborating institutions in the paper announcing their benchmark, MCPMark, last month.
Where an AI model can go wrong with MCP
MCP is a kind of middleware that turns an AI program's access to outside resources into client-server interactions. It was introduced last year by gen AI startup Anthropic (maker of the Claude family of large language models and chatbots) as a secure, industry-standard way to connect LLMs and AI agents to external software resources such as databases and customer relationship management software.
As ZDNET's Steven Vaughan-Nichols explains, middleware like MCP can reduce the number of connections that an AI program has to initiate to connect to multiple external resources.
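To give a rough sense of what those client-server interactions look like on the wire, here is a minimal sketch, in Python, of the kind of JSON-RPC messages MCP is built on. The tool name ("query_database") and its arguments are invented for illustration and are not taken from any particular MCP server.

```python
import json

# The client (the AI application) asks an MCP server to run one of its tools.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        # Hypothetical tool and arguments, standing in for whatever a real
        # MCP server actually exposes (a database, a CRM, a search index).
        "name": "query_database",
        "arguments": {"table": "customers", "limit": 10},
    },
}

# The server replies with the tool's output, which is fed back to the model.
tool_call_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "10 rows returned ..."}],
        "isError": False,
    },
}

print(json.dumps(tool_call_request, indent=2))
```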
Also: ChatGPT can now connect to MCP servers - here's how, and what to watch for
However, having a standard does not mean that an AI model, whose functionality includes a heavy dose of chance ("probability" in technical terms), will faithfully implement MCP.
An AI model plugged into MCP has to generate output that achieves several things: formulating a plan to answer a query, choosing which external resources to access and in what order to contact the MCP servers that lead to those external applications, and then structuring several requests for information to produce a final output that answers the query.
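As a concrete illustration of how many things have to go right, here is a simplified sketch of the loop such an agent runs: discover the available tools, ask the model to plan the next step, call the chosen tool, and feed the result back until a final answer emerges. The helper functions and the Step type are hypothetical placeholders, not part of any real MCP SDK; the "turns" the benchmarks count correspond to passes through a loop of this kind.

```python
from dataclasses import dataclass, field

MAX_TURNS = 20  # each pass through the loop is one "turn" the benchmarks count


@dataclass
class Step:
    kind: str                  # "tool_call" or "final_answer"
    text: str = ""
    server: str = ""
    tool_name: str = ""
    arguments: dict = field(default_factory=dict)


# The three helpers below are hypothetical stand-ins, not any real MCP SDK:
# list_tools would hit a server's tool catalog, call_tool would send a tool
# request, and ask_model would prompt the LLM with the conversation so far.

def list_tools(server: str) -> list[dict]:
    return [{"server": server, "name": "search", "description": "web search"}]


def call_tool(server: str, name: str, arguments: dict) -> str:
    return f"[{server}] {name}({arguments}) -> ...results..."


def ask_model(history: list[dict], tools: list[dict]) -> Step:
    # A real agent would call an LLM here; this stub answers immediately.
    return Step(kind="final_answer", text="(the model's answer)")


def answer_query(query: str, servers: list[str]) -> str:
    # 1. Discover what tools each MCP server exposes.
    tools = [tool for server in servers for tool in list_tools(server)]
    history = [{"role": "user", "content": query}]

    for _ in range(MAX_TURNS):
        # 2. Ask the model to plan its next step: call a tool, or answer.
        step = ask_model(history, tools)
        if step.kind == "final_answer":
            return step.text

        # 3. Execute the chosen tool call and feed the result back, so the
        #    next planning step can build on it.
        result = call_tool(step.server, step.tool_name, step.arguments)
        history.append({"role": "tool", "content": result})

    return "Stopped: too many turns without a final answer."


print(answer_query("Plan a hiking loop from Denver", ["maps", "parks"]))
```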
The various studies show that while top-of-the-line models such as Gemini 2.5 Pro and GPT-5 can do better than less impressive programs, all models are still limited in their ability to manage all those challenges. Issues across all the models include taking an excessive number of steps to retrieve the information, even when the model's initial plan was sound.
What the benchmarks tell us
All the benchmark tests take a similar approach: They assemble a set of challenging information-seeking queries, a collection of MCP servers to which the AI models can gain access, and the information resources those servers expose.
The resources in these tests are often publicly available resources such as Google Search, Wikipedia, or some other widely available repository of information.
An example problem from the Accenture work of Wang and team was to retrieve online information to plan a week-long hiking and camping trip. The prompt began with "I'm trying to plan a week-long hiking and camping loop that starts and ends in Denver, and I'm hoping you can really nerd out with me on the details," and then went on to specify several requirements, such as which parks to visit, visitor hours, chances of rain, and so on.
The request was to be sent to multiple MCP server-enabled information services, including Google Maps and the US national park websites, and to specific tools such as "findParks, getParkDetails, getAlerts, getVisitorCenters, getCampgrounds, getEvents."
Also: Anthropic now lets developers use Claude Code with any remote MCP server
All of the benchmarks are meant to move the measurement of AI models beyond simple function-calling challenges. The benchmarks require the AI models to achieve multiple requirements, including turning the natural-language prompt into requests that respect the schema, meaning the structure of messages that MCP defines in the JSON format on which it is built.
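To make "respecting the schema" concrete: each MCP tool advertises a JSON Schema describing the arguments it accepts, and the model's call has to match it. The sketch below uses a made-up schema for the getCampgrounds tool named in the hiking example above; the real tool's schema may differ.

```python
# Hypothetical description of the getCampgrounds tool, invented for
# illustration. Real MCP tools advertise their inputs the same general way.
tool_description = {
    "name": "getCampgrounds",
    "description": "List campgrounds in a given national park",
    "inputSchema": {
        "type": "object",
        "properties": {
            "parkCode": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["parkCode"],
    },
}

# A well-formed call names the tool correctly and supplies the required field.
valid_call = {"name": "getCampgrounds", "arguments": {"parkCode": "romo", "limit": 5}}

# A typical failure: the required parkCode argument is missing.
invalid_call = {"name": "getCampgrounds", "arguments": {"limit": 5}}


def respects_schema(call: dict, tool: dict) -> bool:
    """Minimal check: correct tool name and all required arguments present.
    (A real validator would also enforce argument types.)"""
    required = tool["inputSchema"].get("required", [])
    return call["name"] == tool["name"] and all(
        key in call["arguments"] for key in required
    )


print(respects_schema(valid_call, tool_description))    # True
print(respects_schema(invalid_call, tool_description))  # False
```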
Respecting the schema is just the lowest level of achievement. At a higher level, "agents must identify the correct tools from large, heterogeneous tool spaces when confronted with ambiguous or underspecified task descriptions," write Wang and team. "This requires disambiguating semantic variants, coping with naming inconsistencies, and avoiding traps posed by superficially plausible but irrelevant tools."
The benchmarks typically measure how many different resources a program will tap into, and how many "turns" are required, a measure of how efficiently an AI model uses those resources.
Also: Is AI even worth it for your business? 5 expert tips to help prove ROI
As Wang and team describe it, MCP-Bench "measures structural coherence, dependency awareness, parallelism efficiency, and reflective adaptation. Tasks include not only linear workflows but also complex compositions requiring concurrent interactions across multiple servers with multiple objectives." All of that is taken as a measure of each model's greater or lesser ability to engage in what's called "long-horizon planning."
If an AI model has to take more and more turns to get the information it needs from an MCP server, that may suggest the model is not able to properly plan how to use the available resources.
All of these benchmarks employ multiple large language models to compare how the current landscape of offerings performs on a relative basis.
The good news is that all three studies mentioned here reported that bigger, more powerful AI models scored better than smaller models. That suggests that as models get better in many respects, they can also improve on MCP-related challenges.
Zijian Wu and team at the National University of Singapore also note that top-of-the-line models plan better, writing, "stronger models succeed through better decision making and targeted exploration, not blind trial-and-error."
Wang and team find that "the real differentiator is robustness to scaling, where top-tier models demonstrate clear advantages in handling long-horizon, cross-server tasks."
Guo and team find some open-source models (such as Qwen3-235B) take top scores, noting a "surprising and significant trend: the leading open-source models demonstrate exceptional capabilities, rivaling and even surpassing their proprietary counterparts."
But there are also pitfalls for all the models. Wang and team relate that their MCP-Bench tasks "are inherently multi-step and often involve chaining heterogeneous tools across servers," and find that "even strong [AI] models typically require several rounds of interaction," and "struggle with different capabilities such as dependency chain compliance, tool selection under noisy environment, and long-horizon planning."
Also: AI's not 'reasoning' at all - how this team debunked the industry hype
Likewise, Guo and team call out the problems that crop up with the rising complexity of MCP interactions, noting that across all models, "performance generally declines as tasks transition from single-server to multi-server scopes […] a similar drop occurs as call dependency increases from simple single to complex sequential calls."
Overall, it would appear that as tasks get more complex with MCP, all AI models have a harder time, even if some do much better than others.
What can be done to make models better?
The immediate takeaway from the various benchmarks is that AI models have entered a new era in which using MCP well is itself a challenge, and they may have to evolve in new directions to meet it.
All three studies identify a problem: Performance degrades as the AI models have to access more MCP servers. The complexity of multiple resources starts to overwhelm even the models that can best plan what steps to take at the outset.
As Wu and team put it in their MCPMark paper, the complexity of all those MCP servers strains any AI model's ability to keep track of it all.
Also: Consumers more likely to pay for 'responsible' AI tools, Deloitte survey says
They identify a key challenge in "the agent's ability to manage an ever-growing history" of MCP interactions, and a "core unreliability that can only be solved by building agents with robust error-handling and self-correction capabilities."
The most immediate route to closing AI models' performance gap may be to train them specifically for MCP.
Using a form of fine-tuning, meaning training AI models a second time after the main pre-training stage, scholars at the University of Washington and the MIT-IBM Watson AI Lab have developed a data set consisting of millions of examples of MCP interactions between an AI program and external tools. As they put it, it is "the largest publicly available tool-agentic dataset to date."
Introduced this month, the data set, called Toucan, was able to make relatively small AI models such as the open-source Qwen3-32B perform better overall at MCP tasks than much larger AI models such as DeepSeek V3 and OpenAI's o3-mini, on the same benchmark tests developed by Wang and others.
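To picture what such training data involves, here is a generic, hypothetical illustration of the kind of record a tool-agentic fine-tuning set contains: a user request, the model's tool call, the tool's result, and the final answer. This is not Toucan's actual format, only the general shape such examples tend to take.

```python
# Hypothetical training record for tool-use fine-tuning. The tool names echo
# the hiking example above; the contents are invented for illustration.
training_example = {
    "tools": ["findParks", "getAlerts", "getCampgrounds"],
    "messages": [
        {"role": "user",
         "content": "Are there any alerts affecting Rocky Mountain National Park?"},
        {"role": "assistant",
         "tool_call": {"name": "getAlerts", "arguments": {"parkCode": "romo"}}},
        {"role": "tool", "name": "getAlerts",
         "content": "Trail Ridge Road closed above Many Parks Curve."},
        {"role": "assistant",
         "content": "Yes: Trail Ridge Road is currently closed above Many Parks Curve."},
    ],
}
```

The idea is that seeing millions of traces like this during fine-tuning teaches a smaller model the conventions of calling tools through MCP, rather than leaving it to work them out at inference time.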
As encouraging as Toucan is, a big open question is what to do with all the non-public, non-standard resources to which MCP may be connected in private data centers. For example, if AI models are fine-tuned to work with MCP more efficiently in the greatest number of cases, will that necessarily improve a particular AI model's performance on XYZ Corp.'s on-premises installation of Salesforce CRM, or an Oracle database?
We won't know until CIOs implement MCP and find out.