GLM-5.1 API on Novita AI: Long-Horizon Agentic Model

Most coding models hit a wall after a few dozen tool calls. They try the obvious approaches, run out of ideas, and plateau. More time doesn’t help — the model has already exhausted what it knows how to try.

What’s actually new in GLM-5.1

GLM-5.1 is a 754B-parameter Mixture-of-Experts model with 40B parameters active per inference pass and a 204,800-token context window.

The real change is in how it behaves on long-horizon tasks. Z.ai calls it a staircase pattern: the model refines within a fixed strategy until it hits a ceiling, then shifts to a structurally different approach and climbs again. Six such shifts happened in a single VectorDBBench run. Each one was initiated by the model after it analyzed its own benchmark logs and identified what was blocking further progress.

That’s different from having a longer context window. It’s the model actively managing its own strategy.
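
One way to picture the staircase pattern is as a best-so-far curve with long flat stretches broken by a few large jumps. The sketch below is purely illustrative: the `find_steps` helper, the 25% jump threshold, and the toy score series are assumptions made for this example, not Z.ai's analysis method or data.

```python
def find_steps(scores, min_jump=0.25):
    """Return the indices where the best-so-far score rises by at least
    min_jump (as a fraction of the previous best). Scores are assumed positive."""
    steps = []
    best = scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best:
            if (score - best) / best >= min_jump:
                steps.append(i)  # large jump: likely a change of strategy
            best = score         # small gain: refinement within a strategy
    return steps


# Toy series (made-up numbers): slow refinement, then two structural jumps.
toy_scores = [100, 103, 105, 106, 180, 184, 186, 187, 320, 325]
print(find_steps(toy_scores))  # [4, 8]
```

Small gains inside a strategy never cross the threshold, so only the two large jumps are reported.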

GLM-5.1 benchmark results: coding and agentic tasks

GLM-5.1 posts the top score on several coding and agentic benchmarks, including SWE-Bench Pro, CyberGym, and BrowseComp. On most reasoning benchmarks, Gemini 3.1 Pro and GPT-5.4 are ahead.

Reasoning

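Benchmark | GLM-5.1 | GLM-5 | Qwen3.6-Plus | MiniMax M2.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.4
HLE | 31.0 | 30.5 | 28.8 | 28.0 | 25.1 | 31.5 | 36.7 | 45.0 | 39.8
HLE (w/ Tools) | 52.3, 50.4, 50.6, 40.8, 51.8, 53.1, 51.4, 52.1 (8 of 9 cells reported)
AIME 2026 | 95.3 | 95.4 | 95.1 | 89.8 | 95.1 | 94.5 | 95.6 | 98.2 | 98.7
HMMT Nov. 2025 | 94.0 | 96.9 | 94.6 | 81.0 | 90.2 | 91.1 | 96.3 | 94.8 | 95.8
HMMT Feb. 2026 | 82.6 | 82.8 | 87.8 | 72.7 | 79.9 | 81.3 | 84.3 | 87.3 | 91.8
IMOAnswerBench | 83.8 | 82.5 | 83.8 | 66.3 | 78.3 | 81.8 | 75.3 | 81.0 | 91.4
GPQA-Diamond | 86.2 | 86.0 | 90.4 | 87.0 | 82.4 | 87.6 | 91.3 | 94.3 | 92.0

Coding

Benchmark | GLM-5.1 | GLM-5 | Qwen3.6-Plus | MiniMax M2.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.4
SWE-Bench Pro | 58.4, 55.1, 56.6, 56.2, 53.8, 57.3, 54.2, 57.7 (8 of 9 cells reported)
NL2Repo | 42.7, 35.9, 37.9, 39.8, 32.0, 49.8, 33.4, 41.3 (8 of 9 cells reported)
Terminal-Bench 2.0 (Terminus-2) | 63.5, 56.2, 61.6, 39.3, 50.8, 65.4, 68.5 (7 of 9 cells reported)
Terminal-Bench 2.0 (best harness) | 69.0 (Claude Code), 56.2 (Claude Code), 57.0 (Claude Code), 46.4 (Claude Code), 75.1 (Codex) (5 of 9 cells reported)
CyberGym | 68.7, 48.3, 17.3, 41.3, 66.6 (5 of 9 cells reported)

Agentic

Benchmark | GLM-5.1 | GLM-5 | Qwen3.6-Plus | MiniMax M2.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.4
BrowseComp | 68.0, 62.0, 51.4, 60.6 (4 of 9 cells reported)
BrowseComp (w/ Context Manage) | 79.3, 75.9, 67.6, 74.9, 84.0, 85.9, 82.7 (7 of 9 cells reported)
τ³-Bench | 70.6 | 69.2 | 70.7 | 67.6 | 69.2 | 66.0 | 72.4 | 67.1 | 72.9
MCP-Atlas (Public Set) | 71.8 | 69.2 | 74.1 | 48.8 | 62.2 | 63.8 | 73.8 | 69.2 | 67.2
Tool-Decathlon | 40.7 | 38.0 | 39.8 | 46.3 | 35.2 | 27.8 | 47.2 | 48.8 | 54.6
Vending Bench 2 | $5,634, $4,432, $5,115, $1,034, $1,198, $8,018, $911, $6,144 (8 of 9 cells reported)

Rows marked with a cell count had blank cells in the original tables. The blank positions are not marked, so those rows list the reported values in column order without a one-to-one mapping to the model columns.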

SWE-Bench Pro (58.4) is the headline: the highest score among the models reported in this comparison, open-source and proprietary alike. CyberGym shows the sharpest generation-over-generation jump, from 48.3 on GLM-5 to 68.7. One caveat on Terminal-Bench 2.0: the “best harness” row reflects each team’s self-reported result using its preferred execution environment. GLM-5.1 reaches 69.0 with Claude Code; GPT-5.4 reaches 75.1 with Codex.

What long-horizon agentic execution looks like in practice

Single-pass benchmark numbers don’t capture what happens when you let a model run for hours. Z.ai ran three scenarios with progressively less structured feedback to show what GLM-5.1 does differently.

Scenario 1: vector database optimization, 600+ iterations

VectorDBBench gives the model a Rust skeleton with HTTP endpoints and empty implementation stubs. Using tool-call-based agents, it reads and writes files, compiles, tests, and profiles — normally within a 50-turn budget. The best result under that constraint: 3,547 QPS, by Claude Opus 4.6.

Z.ai removed the cap. In each iteration, GLM-5.1 could use as many tool calls as needed, then submit a new version to benchmark. It ran 655 iterations with 6,000+ tool calls and reached 21.5k QPS — roughly 6x the single-session best.

Two transitions illustrate how it got there. Around iteration 90, it shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, jumping to 6.4k QPS. Around iteration 240, it introduced a two-stage pipeline—u8 prescoring followed by f16 reranking—reaching 13.4k QPS. Six such structural transitions occurred over the full run, each initiated by the model after analyzing its own benchmark logs and identifying the current bottleneck. 
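
To make the two-stage idea concrete, here is a minimal pure-Python sketch of coarse u8 prescoring followed by f16 reranking. It is not the Rust code GLM-5.1 produced and it leaves out IVF clustering entirely. The helper names, the fixed [-1, 1] quantization range, the shortlist size, and the random test corpus are all assumptions made for illustration.

```python
import math
import random
import struct


def to_f16(x):
    """Round-trip a float through IEEE half precision to mimic f16 storage."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_u8(x, lo=-1.0, hi=1.0):
    """Affine-quantize x from [lo, hi] to an integer code in 0..255."""
    clipped = min(max(x, lo), hi)
    return round((clipped - lo) / (hi - lo) * 255)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, codes_u8, vecs_f16, shortlist=20, k=3):
    # Stage 1: prescore every vector on its u8 codes. Dequantization is the
    # same affine map for all vectors, so ranking by dot(query, codes) gives
    # the same order as ranking by the dequantized dot product.
    order = sorted(range(len(codes_u8)),
                   key=lambda i: dot(query, codes_u8[i]), reverse=True)
    candidates = order[:shortlist]
    # Stage 2: rerank only the shortlist with the more precise f16 vectors.
    candidates.sort(key=lambda i: dot(query, vecs_f16[i]), reverse=True)
    return candidates[:k]


def random_unit_vector(rng, dim):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
corpus = [random_unit_vector(rng, 32) for _ in range(200)]
codes_u8 = [[to_u8(x) for x in v] for v in corpus]
vecs_f16 = [[to_f16(x) for x in v] for v in corpus]

query = corpus[42]  # a query with a known exact match in the corpus
top = two_stage_search(query, codes_u8, vecs_f16)
print(top[0])       # index of the best match after reranking
```

The coarse pass touches every vector but only reads one byte per dimension, while the more precise pass runs on a small shortlist. That trade is the reason a two-stage layout can raise throughput.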

Scenario 2: GPU kernel optimization, 1,000+ turns

KernelBench asks the model to take a reference PyTorch implementation and produce a faster GPU kernel with identical outputs. Level 3 covers 50 full-model problems, including MobileNet, VGG, MiniGPT, and Mamba. For reference, torch.compile reaches a 1.15x speedup and its max-autotune mode reaches 1.49x.

Z.ai ran four models on Level 3, tracking geometric mean speedup across tool-use turns:

  • GLM-5 improves quickly early and levels off
  • Claude Opus 4.5 continues longer, then also tapers
  • GLM-5.1 finishes at 3.6x and keeps making progress well into the run
  • Claude Opus 4.6 is the strongest at 4.2x, still showing headroom at the end

GLM-5.1 doesn’t match Claude Opus 4.6 here. But it clearly extends the useful run duration beyond GLM-5, which is the point.
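
For readers unfamiliar with the metric, the sketch below shows how a geometric mean speedup can be computed across problems. It assumes speedup is defined as reference time divided by optimized time. The timings are hypothetical, and `speedup_summary` is a name chosen for this example rather than part of KernelBench.

```python
from statistics import geometric_mean


def speedup_summary(baseline_ms, optimized_ms):
    """Geometric mean of per-problem speedups (baseline time / optimized time)."""
    speedups = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return geometric_mean(speedups)


# Hypothetical timings for three problems: speedups of 2x, 4x, and 1x.
baseline_ms = [10.0, 40.0, 9.0]
optimized_ms = [5.0, 10.0, 9.0]
print(round(speedup_summary(baseline_ms, optimized_ms), 3))  # 2.0
```

The geometric mean is the usual choice for ratios because one outlier speedup moves it less than it would move an arithmetic mean, which is about 2.33 for the same three values.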

Scenario 3: building a Linux desktop, 8 hours autonomous

The first two scenarios have a number to optimize. This one doesn’t. The prompt: build a Linux-style desktop environment as a web application. No starter code, no design mockups, no intermediate feedback.

Most models produce a basic skeleton — static taskbar, a placeholder window — then declare it done.

GLM-5.1 ran inside a simple harness: after each execution round, the model reviews its own output, identifies what’s missing or broken, and continues. Over 8 hours, it built a file browser, terminal, text editor, system monitor, calculator, and functional games, each integrated into a coherent UI. Styling got more polished with each pass. Edge cases got handled. The model decided the whole roadmap itself.
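
The source does not publish the harness, but the loop it describes can be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the rule of stopping when a review finds nothing left to fix, and the stub `fake_work` and `fake_review` callables that stand in for real model calls.

```python
import time


def run_self_review_loop(work, review, max_seconds, max_rounds=1000):
    """Alternate work and self-review rounds until the time budget, the
    round budget, or the list of open issues runs out."""
    deadline = time.monotonic() + max_seconds
    notes = []    # issues reported by the most recent review
    history = []  # (round number, output, issues) for each round
    for round_no in range(1, max_rounds + 1):
        if time.monotonic() >= deadline:
            break
        output = work(notes)    # e.g. a model call that edits the project
        notes = review(output)  # e.g. a model call listing gaps and bugs
        history.append((round_no, output, notes))
        if not notes:
            break
    return history


# Stub callables so the loop can run without a model behind it.
todo = ["file browser", "terminal", "text editor"]
built = []


def fake_work(notes):
    if todo:
        built.append(todo.pop(0))
    return list(built)


def fake_review(output):
    return [f"missing: {name}" for name in todo]


history = run_self_review_loop(fake_work, fake_review, max_seconds=5)
print(len(history), history[-1][1])
```

In a real run, `work` and `review` would each wrap a chat completion request, and the time budget would be hours rather than seconds.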

What GLM-5.1 is built for

GLM-5.1 makes the most sense for tasks where additional runtime actually produces better output:

  • Long-running coding agents — multi-file refactors, migrations, full system builds
  • Agentic coding tools — works with Claude Code, OpenClaw, Trae, Cursor, Codex, and Cline
  • Terminal automation — 63.5 on Terminal-Bench 2.0 (Terminus-2), up from 56.2 on GLM-5
  • Cybersecurity — 68.7 on CyberGym, the highest in this benchmark set
  • Web research — 68.0 on BrowseComp, also the highest here

GLM-5.1 API pricing on Novita AI

Input: $1.40 / M tokens
Cache Read: $0.26 / M tokens
Output: $4.40 / M tokens

Pay per token, no monthly commitment. Full pricing at novita.ai/pricing.
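
As a rough budgeting aid, the prices listed above can be turned into a per-run estimate. This sketch assumes that cached prompt tokens are billed at the cache-read rate instead of the input rate and are counted separately from `input_tokens`. Check the pricing page for the exact billing rules. The function name and the token counts are made up for the example.

```python
# USD per million tokens, taken from the prices listed above.
PRICE_PER_M = {"input": 1.40, "cache_read": 0.26, "output": 4.40}


def estimate_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimated cost in USD. cached_tokens are assumed to be billed at the
    cache-read rate and are not included in input_tokens."""
    total = (
        input_tokens * PRICE_PER_M["input"]
        + cached_tokens * PRICE_PER_M["cache_read"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical agent run: 2M fresh input, 8M cached input, 0.5M output tokens.
print(f"${estimate_cost(2_000_000, 500_000, cached_tokens=8_000_000):.2f}")  # $7.08
```

Long agentic runs resend much of the same context on every turn, so the cache-read rate tends to dominate the input side of the bill.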

Getting started: OpenAI and Anthropic SDK compatible

Novita AI’s API works with both the OpenAI and Anthropic SDKs. Point the client at Novita’s base URL, supply your API key, and set the model ID; the rest of your existing setup runs as-is. GLM-5.1 can be called directly from Claude Code, OpenClaw, Trae, Cursor, Codex, and any platform that accepts an OpenAI- or Anthropic-compatible endpoint.

Try GLM-5.1 on Playground  |  View API Docs

Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    api_key="<Your Novita API Key>",
    base_url="https://api.novita.ai/openai"
)

response = client.chat.completions.create(
    model="zai-org/glm-5.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Refactor this module to use async/await throughout."}
    ],
    max_tokens=131072,
    temperature=0.7
)

print(response.choices[0].message.content)

TypeScript (OpenAI SDK):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "<Your Novita API Key>",
  baseURL: "https://api.novita.ai/openai",
});

const response = await client.chat.completions.create({
  model: "zai-org/glm-5.1",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Build a CLI tool for parsing JSON logs." }
  ],
  max_tokens: 131072,
});

console.log(response.choices[0].message.content);
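
Both examples set max_tokens to 131072, which is worth checking against the 204,800-token context window. The sketch below assumes that prompt and completion tokens share that window, which is common for chat APIs but is not stated in the source. `prompt_budget` is a hypothetical helper name.

```python
CONTEXT_WINDOW = 204_800  # tokens, from the model description
MAX_OUTPUT = 131_072      # max_tokens used in the examples above


def prompt_budget(max_tokens=MAX_OUTPUT, context_window=CONTEXT_WINDOW):
    """Tokens left for the prompt, assuming input and output share the window."""
    if not 0 < max_tokens <= context_window:
        raise ValueError("max_tokens must be between 1 and the context window")
    return context_window - max_tokens


print(prompt_budget())        # 73728
print(prompt_budget(32_768))  # 172032
```

Under that assumption, reserving the full output length leaves about 73K tokens for the prompt, so agents that carry long histories may want a smaller max_tokens.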

Use cases for developers

GLM-5.1 is most useful where the task can’t be solved in a single pass and benefits from iterative refinement:

  • Autonomous coding agents — Assign a repo-level task and let the model plan, implement, test, and iterate without check-ins
  • CI/CD pipeline automation — Function calling makes it straightforward to wire GLM-5.1 into build/test/debug loops
  • Long-form technical document generation — 204K context and 131K output handle large, coherent documents in a single call
  • GPU kernel and ML performance optimization — 3.6× speedup on KernelBench translates directly to ML infrastructure work
  • Web application scaffolding — GLM-5.1 built a full desktop UI from one natural-language prompt; the same loop applies to any complex frontend or backend task
  • Security engineering — 68.7 on CyberGym puts it among the strongest available models for autonomous security tasks
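
For the CI/CD case, a function-calling round might look like the sketch below. It uses the standard OpenAI Chat Completions `tools` format. The `run_tests` tool, its schema, and the stub handler are hypothetical, and `client` is assumed to be an `openai.OpenAI` instance configured as in the Python example earlier.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" format.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return a summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"},
            },
            "required": ["path"],
        },
    },
}


def run_tests(path):
    # Stub: a real pipeline would invoke its test runner here.
    return {"path": path, "status": "stub: test runner not wired up"}


HANDLERS = {"run_tests": run_tests}


def handle_tool_calls(message):
    """Execute each tool call on an assistant message and build the
    matching 'tool' role replies."""
    replies = []
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)
        result = HANDLERS[call.function.name](**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return replies


def agent_turn(client, messages):
    """One request/response round: call the model, then answer its tool calls."""
    response = client.chat.completions.create(
        model="zai-org/glm-5.1",
        messages=messages,
        tools=[RUN_TESTS_TOOL],
    )
    message = response.choices[0].message
    messages.append(message)
    messages.extend(handle_tool_calls(message))
    return message
```

Calling `agent_turn` repeatedly until the returned message has no tool calls gives a basic build/test/debug loop. A production version would also need error handling for unknown tool names and malformed arguments.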

Bottom line

Open-source models have narrowed the gap on reasoning benchmarks, though the tables above still show Gemini 3.1 Pro and GPT-5.4 ahead on most of them. The larger remaining gap is in long-horizon execution: staying coherent and productive across hundreds of tool calls and hours of autonomous work. GLM-5.1 is the clearest evidence so far that this gap is closeable.

If you’re running serious agentic workloads and want to avoid proprietary lock-in, it’s the most capable open-source option right now for coding and agent tasks. On Novita AI, you get it with OpenAI and Anthropic SDK compatibility, pay-per-token pricing, and no infrastructure overhead.

Novita AI is an AI & agent cloud platform helping developers and startups build, deploy, and scale models and agentic applications with high performance, reliability, and cost efficiency.

Frequently Asked Questions

What changed between GLM-5 and GLM-5.1?

The biggest change is in long-horizon execution. GLM-5 plateaus after a few dozen iterations; GLM-5.1 keeps finding new strategies through hundreds of rounds. The staircase pattern — structural shifts triggered by self-analysis — is what makes the difference. Coding benchmark scores also improved across the board.

Is GLM-5.1 open-source?

Yes, MIT license. Weights are on Hugging Face. You can use it commercially, fine-tune it, and self-host.

How does GLM-5.1 compare to Claude Opus 4.6?

On SWE-Bench Pro, GLM-5.1 scores 58.4 vs Claude Opus 4.6’s 57.3. On KernelBench long-horizon GPU optimization, Claude Opus 4.6 leads at 4.2× vs GLM-5.1’s 3.6×. For most agentic coding tasks, the two are closely matched — GLM-5.1 has an open-weight and cost advantage.
