Alibaba's proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic's Claude Code

4 hours ago 3

The AI industry has fully entered the "agent era," a paradigm where AI models do far more than generate text — they now actively plan, execute, and course-correct complex tasks over days rather than seconds.

Thus, it's perhaps unsurprising to see Chinese e-commerce giant Alibaba's famed Qwen Team of AI researchers release a model capable of performing autonomous agentic AI work over multiple days: that model has arrived in the form of Qwen3.7-Max which the company reports in a blog post achieved "~35 hours of continuous autonomous execution" — albeit, in a proprietary, not open source format, as prior Qwen Team releases were.

This is also to be expected — it's what many analysts and industry experts feared in the wake of the departure of several key Qwen Team leaders earlier this year. But it makes sense for Alibaba financially, at least in the short term: training AI models, especially ones as powerful as Qwen3.7-Max, is expensive, and giving them away essentially for free, as open source models are, does not immediately help recoup any costs.

In that sense, Alibaba is simply aligning its efforts with American AI giants like OpenAI and Google by offering the latest and greatest models only through paid APIs and subscription or paid web plan bundles, and slightly less performant ones through open source.

Still, the arrival of Qwen3.7-Max offers further optionality to enterprises and individual users, and more competition for American AI labs — rarely a bad thing for consumers at all budget levels. Yet, the fact that the model is only accessible from Chinese-based endpoints means it may be limited in its appeal to American and European enterprises seeking to maximize compliance and security posturing when fulfilling government contracts, or even just attempting to comply with all relevant state, local, and national data sovereignty regulations.

The marathon AI era

To understand why Qwen3.7-Max is a departure from previous models, one must look at how it was trained and how it operates in practice.

Language models typically degrade when forced to maintain a single train of thought over thousands of conversational turns; they forget instructions, hallucinate variables, or simply get stuck in logical loops. Qwen3.7-Max was specifically designed as a "versatile agent foundation" capable of "long-horizon reasoning" to overcome this exact bottleneck.

The starkest demonstration of this capability is an autonomous engineering task detailed by the Qwen team. The model was given access to an isolated server equipped with a T-Head ZW-M890 PPU—a hardware architecture the model had never encountered during its training. Its task was to optimize an attention kernel.

Over the course of 35 straight hours, Qwen3.7-Max operated entirely autonomously. It executed 1,158 distinct tool calls, performed 432 kernel evaluations, diagnosed compilation failures, and iteratively improved the code to achieve a 10.0x geometric mean speedup.

By comparison, Chinese competitor models like z.ai's GLM-5.1 and Moonshot's Kimi K2.6 capped out at 7.3x and 5.0x speedups respectively, often voluntarily terminating their sessions when they failed to make progress. However, both are available open source.

This endurance is achieved through what Alibaba calls "environment scaling". Just as early LLMs grew smarter by ingesting more diverse text, Qwen3.7-Max was trained across a vast, scaled array of dynamic agentic environments.

It is capable of simulating a one-year lifecycle of a startup in the "YC-Bench" evaluation, navigating hundreds of decision-making rounds encompassing personnel management and contract screening. In this simulation, the model managed to generate $2.08 million in virtual revenue, nearly doubling the performance of the prior generation, Qwen3.6-Plus.

Furthermore, the model has built-in reward-hacking self-monitoring, autonomously detecting when it attempts to cheat a training environment and adding heuristic rules to correct its own behavior.

A brain for any scaffold

From a product perspective, Qwen3.7-Max is designed to be the cognitive engine for modern software development and enterprise automation.

The model offers a massive 1-million-token context window and a 64K maximum output limit, providing immense overhead for processing sprawling codebases or lengthy technical documents.

One of its most compelling features is "cross-harness generalization". Rather than being hardcoded to work best within a specific proprietary interface, Qwen3.7-Max is built to act as a drop-in intelligence layer for diverse agent frameworks. It supports the Anthropic API protocol natively, allowing developers to plug it directly into existing tools like Claude Code or OpenClaw.

The benchmark data provided by Alibaba indicates that this generalized approach has paid massive dividends.

On the Apex Math Reasoning benchmark, Qwen3.7-Max scored 44.5, eclipsing Claude Opus-4.6 Max's score of 34.5 and DeepSeek V4-Pro Max's 38.3. It also posted dominant scores on Humanity's Last Exam (41.4) and the realistic coding agent benchmark MCP-Atlas (76.4).

Alibaba Qwen3.7-Max benchmark comparison table

Alibaba Qwen3.7-Max benchmark comparison table. Credit: Alibaba Qwen

This translates into tangible utility for end-users. Through open source Model Context Protocol (MCP) integrations, the model can operate as an autonomous office assistant, capable of reading university formatting specs and automatically reformatting a messy Word document via command-line tools without human intervention.

Running this level of intelligence comes at a distinct cost. Developers accessing the API via Alibaba Cloud Model Studio will pay $2.50 per 1 million input tokens and $7.50 per 1 million output tokens. The platform also features explicit cache creation and read pricing, as well as a $10 fee per 1,000 calls for integrated web searches, though code interpreter tools remain free for a limited time.

Qwen3.7-Max occupies a strategic middle ground in the current API economy. While it demands a notable premium over aggressively priced domestic rivals—costing nearly double DeepSeek V4 Pro ($5.22) and Z.ai's GLM-5.1 ($5.80)—it drastically undercuts the Western frontier giants it routinely matches on benchmarks.

For context, running heavy agentic workflows through OpenAI's GPT-5.4 or Anthropic's Claude Opus 4.7 will run developers $17.50 and $30.00 per million tokens, respectively. See VentureBeat's pricing chart below:

Model

Input

Output

Total Cost

Source

MiMo-V2.5 Flash

$0.10

$0.30

$0.40

Xiaomi MiMo

MiniMax M2.7

$0.30

$1.20

$1.50

MiniMax

Gemini 3.1 Flash-Lite

$0.25

$1.50

$1.75

Google

MiMo-V2.5

$0.40

$2.00

$2.40

Xiaomi MiMo

Kimi-K2.6

$0.95

$4.00

$4.95

Moonshot/Kimi

GLM-5

$1.00

$3.20

$4.20

Z.ai

Grok 4.3 (low context)

$1.25

$2.50

$3.75

xAI

DeepSeek V4 Pro

$1.74

$3.48

$5.22

DeepSeek

GLM-5.1

$1.40

$4.40

$5.80

Z.ai

Claude Haiku 4.5

$1.00

$5.00

$6.00

Anthropic

Grok 4.3 (high context)

$2.50

$5.00

$7.50

xAI

Qwen3.7-Max

$2.50

$7.50

$10.00

Alibaba Cloud

Gemini 3.5 Flash

$1.50

$9.00

$10.50

Google

Gemini 3.1 Pro Preview (≤200K)

$2.00

$12.00

$14.00

Google

GPT-5.4

$2.50

$15.00

$17.50

OpenAI

Gemini 3.1 Pro Preview (>200K)

$4.00

$18.00

$22.00

Google

Claude Opus 4.7

$5.00

$25.00

$30.00

Anthropic

GPT-5.5

$5.00

$30.00

$35.00

OpenAI

By positioning Qwen3.7-Max just below Google's Gemini 3.5 Flash ($10.50) but well above budget-tier models, Alibaba is signaling that this isn't a commodity release; it’s a flagship reasoning engine priced to lure enterprise workloads away from Silicon Valley's most expensive offerings.

Licensing remains proprietary for now

For all its technical brilliance, the most controversial aspect of Qwen3.7-Max is how it is distributed. Qwen is billing the release as a "proprietary model". It is strictly API-only.

Historically, Alibaba’s Qwen has been a hero to the open-source and local LLM communities. Previous iterations, like Qwen 2.5 and Qwen 3.6, released their weights publicly. Open weights allow developers, researchers, and enterprises to download the model, run it on their own hardware, and fine-tune it for highly specific or data-sensitive use cases without sending proprietary information to a third-party server.

By locking Qwen3.7-Max behind an API, Alibaba is pivoting to the standard commercial playbook utilized by OpenAI (with GPT-4) and Anthropic (with Claude). For enterprise users, this means utilizing Qwen3.7-Max requires trusting Alibaba Cloud with their data streams and relying entirely on internet connectivity to run their agentic workflows. For the open-source community, it means losing access to what is currently one of the most capable models on the planet.

Community reactions split between awe and disappointment

The reaction from the developer community has been swift, characterized by a mix of profound respect for the engineering achievement and frustration over the licensing model.

Prominent AI commentator Sudo su (@sudoingX) captured the prevailing sentiment on X (formerly Twitter). "qwen is unreal," they wrote. "they just dropped 3.7 max and it is beating opus 4.6 max on most of the benchmarks they ran".

The technical metrics, particularly the model's endurance, have left many in the field stunned. "the apex math number, 44.5 against opus 34.5, that is not a small gap," Sudo su noted. "the 35 hours straight on a kernel optimization task with 1000+ tool calls is the part i keep rereading. that is the agent era thing actually happening, not a slide".

The speed of Alibaba's iteration is also drawing notice. With Qwen 3.6 released just last month, the leap to 3.7-Max highlights a relentless development cadence. As Sudo su observed, "nobody else is moving like this".

Yet, the praise is heavily caveated by the shift to a closed ecosystem. The loss of the model weights is seen as a blow to the localized AI movement, which relies on state-of-the-art open models to push the boundaries of what can be done on consumer hardware or private enterprise clusters.

"one thing though, please open source this one too," Sudo su pleaded in their post. "3.6 dense made the entire local llm ecosystem better. the max tier going api only would close a door we have been keeping open. give us the weights eventually".

Qwen3.7-Max proves that the autonomous agent era is no longer a theoretical projection; it is a present reality capable of executing complex engineering feats while humans sleep. The only question now is whether this new frontier of AI will be a democratized resource you can download to your laptop, or an intelligence utility rented strictly from the cloud. For now, with Qwen3.7-Max, it is undeniably the latter.

Read Entire Article