Chinese company Z.ai has updated its flagship open-weight language model, and the new version, GLM-5.1, rewrites the rules of the game in agentic AI: artificial intelligence capable of carrying out complex tasks over long periods without continuous human supervision. While most of today's models operate within a fixed token budget or stop once they judge that further reasoning will not change the outcome, GLM-5.1 can work autonomously on a single task for up to eight hours.

The key is a different approach to thinking. The model goes through a loop of planning, execution, evaluation of intermediate results, and re-evaluation of the chosen strategy - and repeats this loop hundreds of times until it decides that the task is complete. If it recognizes that the current approach is not leading to the goal, it changes the entire strategy.
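
The loop described above can be illustrated with a toy sketch. This is not Z.ai's implementation: the function names, the `patience` threshold, and the toy task are all hypothetical, chosen only to show how an agent might notice stalled progress and switch strategy.

```python
from typing import Callable, Dict


def make_toy_task() -> Callable[[int], int]:
    """Hypothetical task: strategy 0 plateaus at 30, strategy 1 can reach 100."""
    progress: Dict[int, int] = {0: 0, 1: 0}

    def execute(strategy: int) -> int:
        cap = 30 if strategy == 0 else 100
        progress[strategy] = min(cap, progress[strategy] + 10)
        return progress[strategy]

    return execute


def run_agent(execute: Callable[[int], int], n_strategies: int, target: int,
              max_steps: int = 500, patience: int = 3) -> dict:
    """Execute, evaluate, and switch strategy after `patience` steps without progress."""
    strategy, last_score, stalled, switches = 0, -1, 0, 0
    for step in range(1, max_steps + 1):
        score = execute(strategy)          # plan and execute one step
        if score >= target:                # evaluate: is the task complete?
            return {"done": True, "steps": step,
                    "switches": switches, "strategy": strategy}
        stalled = 0 if score > last_score else stalled + 1
        last_score = score
        if stalled >= patience:            # dead end: change the strategy
            strategy = (strategy + 1) % n_strategies
            last_score, stalled = -1, 0
            switches += 1
    return {"done": False, "steps": max_steps,
            "switches": switches, "strategy": strategy}


result = run_agent(make_toy_task(), n_strategies=2, target=100)
print(result)
```

In this toy run the agent spends six steps on the strategy that plateaus, switches once, and finishes with the second strategy after 16 steps in total.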

In Z.ai's internal tests, the model made thousands of tool calls over several hours. Experts say this ability to recognize dead ends and back out of them is exactly what today's benchmarks fail to capture reliably.

From a technical point of view, it is an impressive machine. GLM-5.1 is built on a mixture-of-experts transformer architecture with a total of 754 billion parameters, of which 40 billion are active per token. The context window holds up to 200,000 input tokens, and the output can reach 128,000 tokens. The model handles reasoning, function calling, and structured output. The weights are freely available on Hugging Face under the MIT license, for both commercial and non-commercial use.

The benchmark results are convincing, especially in programming. On the Artificial Analysis Intelligence Index, GLM-5.1 scores 51 points in reasoning mode, the highest among open-weight models, though behind the proprietary Gemini 3.1 Pro Preview and GPT-5.4 (both 57 points) and Claude Opus 4.6 (53 points).

On the Arena Code leaderboard, where models compete in anonymous pairwise battles rated by programmers, GLM-5.1 came in third with an Elo rating of 1,530, behind Claude Opus 4.6 (1,542) and Claude Opus 4.6 in reasoning mode (1,548). On real software problems from GitHub tested by the SWE-Bench Pro benchmark, GLM-5.1 even led with 58.4 percent, ahead of GPT-5.4 (57.7 percent), Claude Opus 4.6 (57.3 percent), and Gemini 3.1 Pro (54.2 percent).
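
To put such small rating gaps in perspective, the standard Elo formula converts a rating difference into an expected win rate. The sketch below assumes the leaderboard's ratings follow the conventional 400-point logistic Elo scale, which the figures above do not confirm; the function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the article.
glm_51 = 1530
opus_46_reasoning = 1548

p = elo_expected_score(glm_51, opus_46_reasoning)
print(f"Expected GLM-5.1 score vs. Claude Opus 4.6 (reasoning): {p:.3f}")
```

Under that assumption, an 18-point deficit corresponds to an expected score only slightly below one half, so the top three models are close to evenly matched in head-to-head votes.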

Weaknesses are evident in mathematics and scientific reasoning. On GPQA Diamond, a test of graduate-level science questions, GLM-5.1 scored 86.2 percent, while Gemini 3.1 Pro scored 94.3 percent. On the AIME 2026 competition math problems, GLM-5.1 scored 95.3 percent, behind GPT-5.4 at 98.7 percent.

The price remains significantly lower than that of the proprietary alternatives: $1.40 per million input tokens versus $5 for Claude Opus 4.6. However, Z.ai has raised prices relative to the previous version: token prices are up by about 40 percent, and subscriptions for programmers have roughly doubled. The gap is narrowing.
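
As a rough illustration of the input-price difference, the following sketch computes the cost of filling a 200,000-token context at the two quoted rates. It covers input tokens only, since output prices are not given here, and the helper name is hypothetical.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# Input prices quoted in the article (USD per million input tokens).
GLM_51_INPUT_PRICE = 1.40
CLAUDE_OPUS_46_INPUT_PRICE = 5.00

full_context = 200_000  # GLM-5.1's stated maximum input length
glm_cost = input_cost_usd(full_context, GLM_51_INPUT_PRICE)
opus_cost = input_cost_usd(full_context, CLAUDE_OPUS_46_INPUT_PRICE)
print(f"GLM-5.1: ${glm_cost:.2f}, Claude Opus 4.6: ${opus_cost:.2f}")
```

At these rates GLM-5.1's input cost is 28 percent of Claude Opus 4.6's for the same number of tokens.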

The broader context is crucial. According to the independent testing organization METR, the length of tasks that AI agents can complete autonomously doubles approximately every seven months. However, even the best models still successfully complete only about a quarter of the long-running programming tasks in benchmarks designed to measure persistence. GLM-5.1 pushes this ceiling, and if its ability to strategically re-evaluate is confirmed in independent tests, this will be a qualitative shift, not just a performance gain.
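
The doubling trend can be written as simple exponential growth. The sketch below takes the seven-month doubling period at face value and assumes it stays constant, which is an extrapolation rather than a measured result; the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period cited from METR


def growth_factor(months: float) -> float:
    """How many times longer autonomous tasks become after `months`."""
    return 2.0 ** (months / DOUBLING_MONTHS)


def months_to_multiply(factor: float) -> float:
    """Months needed for task length to grow by `factor`."""
    return DOUBLING_MONTHS * math.log2(factor)


print(growth_factor(14))       # two doubling periods
print(months_to_multiply(8))   # three doublings
```

If the trend held, task length would quadruple in 14 months and grow eightfold in 21 months.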

deeplearning.ai/gnews.cz - GH