One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit.
To overcome this challenge, researchers at the University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains.
Experiments show that RecursiveMAS improves accuracy across complex domains such as code generation, medical reasoning, and search, while also speeding up inference and slashing token usage.
RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.
The challenges of improving multi-agent systems
Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time.
Prompt-based adaptation improves agent interactions by iteratively refining the shared context provided to the agents. By updating the prompts, the system acts as a director, guiding the agents to generate responses that are more aligned with the overarching goal. The fundamental limitation is that the capabilities of the models underlying each agent remain static.
A more sophisticated approach is to train the agents by updating the weights of the underlying models. However, training an entire system of agents is difficult: updating all the parameters across multiple models is computationally expensive.
Even if an engineering team commits to training its models, the standard method of agents communicating via text-based interactions creates major bottlenecks. Because agents rely on sequential text generation, latency compounds: each model must wait for the previous one to finish generating its text before it can begin its own processing.
Forcing models to spell out their intermediate reasoning token-by-token just so the next model can read it is highly inefficient. It severely inflates token usage, drives up compute costs, and makes iterative learning across the whole system painfully slow to scale.
How RecursiveMAS works
Instead of trying to improve each agent as an isolated, standalone component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as a single integrated whole.
The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. A recursive language model instead reuses a shared set of layers, feeding their output back in as input. By looping the computation, the model can deepen its reasoning without adding parameters.
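The difference is easy to see in code. Below is a minimal, illustrative PyTorch sketch of the looping idea; the block definition and layer sizes are our own assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One transformer-style block whose weights are reused at every recursion step."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))

# A standard LM stacks N distinct blocks; a recursive LM loops one shared
# block, deepening computation without adding parameters.
block = SharedBlock(dim=512)
x = torch.randn(1, 16, 512)   # (batch, sequence, hidden)
for _ in range(4):            # four recursion steps, one set of weights
    x = block(x)
```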

RecursiveMAS architecture (source: arXiv)
RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this setup, each agent functions like a layer in a recursive language model. Rather than generating text, the agents iteratively pass their continuous latent representations to the next agent in the sequence, creating a looped hidden stream of information flowing through the system.
This latent hand-off continues down the line through all the agents. When the final agent finishes its processing, its latent outputs are fed directly back to the very first agent, kicking off a new recursion round.
This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in the latent space, with only the very last agent producing a textual output in the final round. It is like the agents are communicating telepathically as a unified whole and the last agent provides the final response as text.
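As a rough sketch, the collaboration loop can be written as below, where `agents`, `decode_final`, and `prompt_latents` are hypothetical stand-ins for the system's components:

```python
def recursive_mas(agents, decode_final, prompt_latents, num_rounds=3):
    """Illustrative sketch of the looped latent hand-off.

    agents: callables mapping latent states to latent states (each a frozen
            LLM wrapped with its RecursiveLink modules).
    decode_final: decodes the last agent's latents into text after the final round.
    """
    latents = prompt_latents
    for _ in range(num_rounds):
        for agent in agents:
            latents = agent(latents)   # latent hand-off; no text is generated
        # after the last agent, the latents loop back to the first agent
    return decode_final(latents)       # only the final round produces text
```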
The architecture of latent collaboration
To make continuous latent space collaboration possible, the authors introduce a specialized architectural component called the RecursiveLink. This is a lightweight, two-layer module designed to transmit and refine a model's latent states rather than forcing it to decode text.
A language model's last-layer hidden states contain the rich, semantic representation of its reasoning process. The RecursiveLink is designed to preserve and transmit this high-dimensional information from one embedding space to another.
To avoid the cost of updating every parameter across multiple large language models, the framework keeps the models' parameters frozen. Instead, it optimizes the system by only training the parameters of the RecursiveLink modules.
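In PyTorch terms, this amounts to freezing the backbone and handing the optimizer only the link parameters. A minimal sketch with stand-in modules:

```python
import torch
import torch.nn as nn

# Stand-ins: a frozen backbone and a lightweight, trainable link module.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
link = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))

for param in backbone.parameters():
    param.requires_grad = False        # the LLM's weights never change

# Gradients flow through the frozen backbone but only update the link.
optimizer = torch.optim.AdamW(link.parameters(), lr=1e-4)
```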

Recursive learning process (source: arXiv)
To handle both internal reasoning and external communication, the system uses two variations of the module. The inner RecursiveLink operates inside an agent during its reasoning phase. It takes the model's newly generated embeddings and maps them directly back into its own input embedding space, letting the agent produce a continuous stream of latent thoughts without emitting discrete text tokens.
The outer RecursiveLink serves as the bridge between agents. Because agents in a real-world system might use different model architectures and sizes, their internal embedding spaces have entirely different dimensions. The outer RecursiveLink includes an additional layer designed to match the embeddings from one agent's hidden dimension with the next agent's embedding space.
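The paper describes the links as lightweight two-layer modules; the exact layers and activation below are assumptions, but the shape of the idea looks like this:

```python
import torch
import torch.nn as nn

class InnerRecursiveLink(nn.Module):
    """Maps an agent's output embeddings back into its own input embedding
    space so it can keep reasoning in latents without decoding text."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)

class OuterRecursiveLink(nn.Module):
    """Bridges two agents whose backbones have different hidden sizes;
    the extra projection matches the source dimension to the target's."""
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, src_dim),
            nn.GELU(),
            nn.Linear(src_dim, dst_dim),   # dimension-matching layer
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)

# Example: hand off from a 4096-dim backbone to a 3072-dim backbone.
bridge = OuterRecursiveLink(src_dim=4096, dst_dim=3072)
states = bridge(torch.randn(1, 16, 4096))   # -> shape (1, 16, 3072)
```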
During training, first, the inner links are trained independently to warm up each agent's ability to think in continuous latent embeddings. Then, the system enters outer-loop training, where the diverse, frozen models are chained together in a loop, and the system is evaluated based on the final textual output of the last agent.
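Put together, the two-phase schedule might look like the following toy sketch. The reconstruction and regression losses are placeholders; the real system warms up each agent's latent reasoning and then supervises the final agent's textual output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
# Stand-ins for three frozen agents and their trainable links.
backbones = [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True) for _ in range(3)]
links = [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)) for _ in range(3)]
for b in backbones:
    for p in b.parameters():
        p.requires_grad = False                       # backbones stay frozen

opt = torch.optim.AdamW([p for l in links for p in l.parameters()], lr=1e-4)

# Phase 1 (warm-up): each agent's link learns to feed the agent's own
# latents back to it. One reconstruction-style step per agent as a stand-in.
for backbone, link in zip(backbones, links):
    h = torch.randn(2, 16, dim)
    F.mse_loss(link(backbone(h)), h).backward()       # stand-in warm-up objective
opt.step(); opt.zero_grad()

# Phase 2 (outer loop): chain the frozen agents; the loss is computed only
# on the final agent's output, and gradients reach only the link parameters.
h, target = torch.randn(2, 16, dim), torch.randn(2, 16, dim)
for backbone, link in zip(backbones, links):
    h = link(backbone(h))                             # latent hand-off
F.mse_loss(h, target).backward()                      # stand-in for the text loss
opt.step(); opt.zero_grad()
```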
Only the RecursiveLink parameters are updated during training; the original model weights remain unchanged, similar to low-rank adaptation (LoRA). Another advantage of the system comes into effect when multiple agents sit on top of the same backbone model.
If two agents in a multi-agent system are built on the exact same foundation model but act in different roles, you do not need to load two copies of the model into GPU memory, nor train them separately. The agents share the same backbone as the brain and use their RecursiveLinks as the connective tissue.
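A short sketch of that sharing pattern, with hypothetical "planner" and "critic" roles:

```python
import torch
import torch.nn as nn

# One copy of the backbone in GPU memory serves both agent roles.
shared_backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
planner_link = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
critic_link = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))

def planner(latents: torch.Tensor) -> torch.Tensor:
    return planner_link(shared_backbone(latents))   # role 1: shared brain

def critic(latents: torch.Tensor) -> torch.Tensor:
    return critic_link(shared_backbone(latents))    # role 2: same brain, different link
```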
RecursiveMAS in action
The researchers evaluated RecursiveMAS across nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They created a multi-agent system using open-weights models including Qwen, Llama-3, Gemma3, and Mistral. These models were assigned roles to form different agent collaboration patterns such as sequential reasoning and mixture-of-experts collaboration.

RecursiveMAS improves inference speed by 1.2-2.2X (source: GitHub)
RecursiveMAS was compared to baselines under identical training budgets, including standalone models enhanced with LoRA or full supervised fine-tuning, alternative multi-agent frameworks like Mixture-of-Agents and TextGrad, and recursive baselines like LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces the agents to explicitly communicate via text.
RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the strongest baselines across the benchmarks. It excelled particularly on reasoning-heavy tasks, outperforming text-based optimization methods like TextGrad by 18.1% on AIME2025 and 13% on AIME2026.

RecursiveMAS reduces token consumption by up to 75% (source: GitHub)
Because it avoids generating text at every step, RecursiveMAS achieved a 1.2x to 2.4x end-to-end inference speedup. It is also far more token-efficient than its text-based counterpart: compared to Recursive-TextMAS, it reduces token usage by 34.6% in the first round of the recursion, and by round three it achieves a 75.6% token reduction.

RecursiveMAS also proved remarkably cheap to train. Because it only updates the lightweight RecursiveLink modules, roughly 13 million parameters, or about 0.31% of the frozen models' parameters, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.
Enterprise adoption
The efficiency gains — lower token consumption, reduced GPU memory requirements, and faster inference — are intended to make complex multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agentic deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.