GLM-5.1 API on Novita AI: Long-Horizon Agentic Model

Table Of Contents

What’s actually new in GLM-5.1
GLM-5.1 benchmark results: coding and agentic tasks
What long-horizon agentic execution looks like in practice
What GLM-5.1 is built for
GLM-5.1 API pricing on Novita AI
Getting started: OpenAI and Anthropic SDK compatible
Use cases for developers
Bottom line

Most coding models hit a wall after a few dozen tool calls. They try the obvious approaches, run out of ideas, and plateau. More time doesn’t help — the model has already exhausted what it knows how to try.

GLM-5.1, Z.ai’s latest flagship, is built around a different assumption: that useful optimization should compound over time, not taper off. In Z.ai’s own benchmarks, it ran 655 iterations on a vector search problem and reached 21.5k QPS, roughly 6x the best result recorded under the standard 50-turn budget. In another test, it ran for 8 hours building a Linux-style desktop environment as a web application from scratch, deciding for itself what to add next.

GLM-5.1 is now available on Novita AI through OpenAI- and Anthropic-compatible APIs, with pay-per-token pricing.


What’s actually new in GLM-5.1

GLM-5.1 is a 754B-parameter Mixture-of-Experts model with 40B parameters active per inference pass and a 204,800-token context window.

The real change is in how it behaves on long-horizon tasks. Z.ai calls it a staircase pattern: the model refines within a fixed strategy until it hits a ceiling, then shifts to a structurally different approach and climbs again. Six such shifts happened in a single VectorDBBench run. Each one was initiated by the model after it analyzed its own benchmark logs and identified what was blocking further progress.

That’s different from having a longer context window. It’s the model actively managing its own strategy.
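One way to picture the staircase pattern is as plateau detection over a series of benchmark scores: small gains count as refinement within a strategy, and a large jump counts as a structural shift. The sketch below is not Z.ai's method. The function name `find_strategy_shifts`, the `jump_ratio` threshold, and the sample scores are all illustrative assumptions.

```python
from typing import List


def find_strategy_shifts(scores: List[float], jump_ratio: float = 1.3) -> List[int]:
    """Return the indices where a score beats the previous best by `jump_ratio` or more.

    Smaller gains are treated as refinement within one strategy; a large
    jump is treated as the start of a new "stair".
    """
    shifts = []
    best = scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score >= best * jump_ratio:
            shifts.append(i)
        best = max(best, score)
    return shifts


# Synthetic scores for illustration only (not taken from any real run).
scores = [100, 104, 107, 108, 180, 185, 188, 320, 326]
shifts = find_strategy_shifts(scores)
print(shifts)  # [4, 7]
```

On this toy series the function flags two shifts, at the points where the score leaves one plateau and lands on the next.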

GLM-5.1 benchmark results: coding and agentic tasks

GLM-5.1 is strongest on coding and agentic benchmarks, where it posts the top reported scores on SWE-Bench Pro, CyberGym, and BrowseComp. On reasoning, Gemini 3.1 Pro and GPT-5.4 are ahead.

Reasoning

Benchmark	GLM-5.1	GLM-5	Qwen3.6-Plus	MiniMax M2.7	DeepSeek-V3.2	Kimi K2.5	Claude Opus 4.6	Gemini 3.1 Pro	GPT-5.4
HLE	31.0	30.5	28.8	28.0	25.1	31.5	36.7	45.0	39.8
HLE (w/ Tools)	52.3	50.4	50.6	—	40.8	51.8	53.1	51.4	52.1
AIME 2026	95.3	95.4	95.1	89.8	95.1	94.5	95.6	98.2	98.7
HMMT Nov. 2025	94.0	96.9	94.6	81.0	90.2	91.1	96.3	94.8	95.8
HMMT Feb. 2026	82.6	82.8	87.8	72.7	79.9	81.3	84.3	87.3	91.8
IMOAnswerBench	83.8	82.5	83.8	66.3	78.3	81.8	75.3	81.0	91.4
GPQA-Diamond	86.2	86.0	90.4	87.0	82.4	87.6	91.3	94.3	92.0

Coding

Benchmark	GLM-5.1	GLM-5	Qwen3.6-Plus	MiniMax M2.7	DeepSeek-V3.2	Kimi K2.5	Claude Opus 4.6	Gemini 3.1 Pro	GPT-5.4
SWE-Bench Pro	58.4	55.1	56.6	56.2	—	53.8	57.3	54.2	57.7
NL2Repo	42.7	35.9	37.9	39.8	—	32.0	49.8	33.4	41.3
Terminal-Bench 2.0 (Terminus-2)	63.5	56.2	61.6	—	39.3	50.8	65.4	68.5	—
Terminal-Bench 2.0 (best harness)	69.0 (Claude Code)	56.2 (Claude Code)	—	57.0 (Claude Code)	46.4 (Claude Code)	—	—	—	75.1 (Codex)
CyberGym	68.7	48.3	—	—	17.3	41.3	66.6	—	—

Agentic

Benchmark	GLM-5.1	GLM-5	Qwen3.6-Plus	MiniMax M2.7	DeepSeek-V3.2	Kimi K2.5	Claude Opus 4.6	Gemini 3.1 Pro	GPT-5.4
BrowseComp	68.0	62.0	—	—	51.4	60.6	—	—	—
BrowseComp (w/ Context Manage)	79.3	75.9	—	—	67.6	74.9	84.0	85.9	82.7
τ³-Bench	70.6	69.2	70.7	67.6	69.2	66.0	72.4	67.1	72.9
MCP-Atlas (Public Set)	71.8	69.2	74.1	48.8	62.2	63.8	73.8	69.2	67.2
Tool-Decathlon	40.7	38.0	39.8	46.3	35.2	27.8	47.2	48.8	54.6
Vending Bench 2	$5,634	$4,432	$5,115	—	$1,034	$1,198	$8,018	$911	$6,144

SWE-Bench Pro (58.4) is the headline — the highest score across all nine models in this comparison, open-source and proprietary alike. CyberGym is the sharpest jump generation-over-generation: 48.3 on GLM-5 to 68.7. Worth noting on Terminal-Bench 2.0: the “best harness” row reflects each team’s self-reported result using their preferred execution environment. GLM-5.1 hits 69.0 with Claude Code; GPT-5.4 hits 75.1 with Codex.
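To check leader claims like these against the table, it helps to treat each row as a list of optional scores. The sketch below copies three rows from the coding table above; the `leader` helper is a hypothetical name, and unreported cells are stored as `None`.

```python
from typing import Dict, List, Optional

MODELS = ["GLM-5.1", "GLM-5", "Qwen3.6-Plus", "MiniMax M2.7", "DeepSeek-V3.2",
          "Kimi K2.5", "Claude Opus 4.6", "Gemini 3.1 Pro", "GPT-5.4"]

# Rows copied from the coding table; None marks a score that was not reported.
ROWS: Dict[str, List[Optional[float]]] = {
    "SWE-Bench Pro": [58.4, 55.1, 56.6, 56.2, None, 53.8, 57.3, 54.2, 57.7],
    "NL2Repo": [42.7, 35.9, 37.9, 39.8, None, 32.0, 49.8, 33.4, 41.3],
    "CyberGym": [68.7, 48.3, None, None, 17.3, 41.3, 66.6, None, None],
}


def leader(benchmark: str) -> str:
    """Return the model with the highest reported score on one benchmark."""
    reported = [(score, model) for score, model in zip(ROWS[benchmark], MODELS)
                if score is not None]
    return max(reported, key=lambda pair: pair[0])[1]


for name in ROWS:
    print(name, "->", leader(name))
```

Skipping `None` cells matters here: a leader on a sparsely reported row such as CyberGym is only the leader among the models that published a score.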

What long-horizon agentic execution looks like in practice

Single-pass benchmark numbers don’t capture what happens when you let a model run for hours. Z.ai ran three scenarios with progressively less structured feedback to show what GLM-5.1 does differently.

Scenario 1: vector database optimization, 600+ iterations

VectorDBBench gives the model a Rust skeleton with HTTP endpoints and empty implementation stubs. Acting as a tool-calling agent, the model reads and writes files, compiles, tests, and profiles, normally within a 50-turn budget. The best result under that constraint is 3,547 QPS, by Claude Opus 4.6.

Z.ai removed the cap. In each iteration, GLM-5.1 could use as many tool calls as needed, then submit a new version to benchmark. It ran 655 iterations with 6,000+ tool calls and reached 21.5k QPS — roughly 6x the single-session best.

Two transitions illustrate how it got there. Around iteration 90, it shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, jumping to 6.4k QPS. Around iteration 240, it introduced a two-stage pipeline—u8 prescoring followed by f16 reranking—reaching 13.4k QPS. Six such structural transitions occurred over the full run, each initiated by the model after analyzing its own benchmark logs and identifying the current bottleneck.
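The two-stage idea can be sketched in a few lines: score every vector cheaply on coarse integer codes, then rerank only a short candidate list at higher precision. This is a minimal pure-Python illustration, not the code GLM-5.1 wrote (the benchmark is in Rust). The function names, the squared-L2 metric, the shortlist size, and the use of ordinary Python floats as a stand-in for f16 are all assumptions.

```python
import random
from typing import List, Sequence


def quantize_u8(vec: Sequence[float]) -> List[int]:
    """Map floats in [-1, 1] to integer codes in [0, 255]."""
    return [min(255, max(0, round((x + 1.0) * 127.5))) for x in vec]


def l2_sq(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, k=3, shortlist=20) -> List[int]:
    """Stage 1: prescore all vectors on u8 codes. Stage 2: rerank the shortlist."""
    q_code = quantize_u8(query)
    # Stage 1: cheap approximate distances over the whole corpus.
    coarse = sorted(range(len(vectors)), key=lambda i: l2_sq(q_code, codes[i]))[:shortlist]
    # Stage 2: higher-precision distances over the shortlist only.
    return sorted(coarse, key=lambda i: l2_sq(query, vectors[i]))[:k]


random.seed(0)
vectors = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(500)]
codes = [quantize_u8(v) for v in vectors]
query = vectors[42]  # query with a known exact match in the corpus
top = two_stage_search(query, vectors, codes)
print(top)
```

The trade-off is the shortlist size: too small and the coarse stage may drop a true neighbor before reranking, too large and the expensive stage does most of the work again.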

Scenario 2: GPU kernel optimization, 1,000+ turns

KernelBench asks the model to take a reference PyTorch implementation and produce a faster GPU kernel with identical outputs. Level 3 covers 50 full-model problems, including MobileNet, VGG, MiniGPT, and Mamba. Baselines: torch.compile reaches 1.15x, and its max-autotune mode reaches 1.49x.

Z.ai ran four models on Level 3, tracking geometric mean speedup across tool-use turns:

GLM-5 improves quickly early and levels off
Claude Opus 4.5 continues longer, then also tapers
GLM-5.1 finishes at 3.6x and keeps making progress well into the run
Claude Opus 4.6 is the strongest at 4.2x, still showing headroom at the end

GLM-5.1 doesn’t match Claude Opus 4.6 here. But it clearly extends the useful run duration beyond GLM-5, which is the point.
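For readers unfamiliar with the metric, a geometric mean speedup multiplies the per-problem speedups and takes the n-th root, so a single outlier does not dominate the result. The timings below are made up for illustration, and whether KernelBench computes its figure in exactly this way is an assumption.

```python
from statistics import geometric_mean
from typing import Sequence


def geomean_speedup(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-problem speedups (baseline time / optimized time)."""
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return geometric_mean(ratios)


# Hypothetical timings: per-problem speedups of 2x, 4x, and 8x.
baseline = [10.0, 20.0, 40.0]
optimized = [5.0, 5.0, 5.0]
result = geomean_speedup(baseline, optimized)
print(round(result, 2))  # 4.0
```

The arithmetic mean of the same three speedups would be about 4.67, which is why the two averages should not be compared directly.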

Scenario 3: building a Linux desktop, 8 hours autonomous

The first two scenarios have a number to optimize. This one doesn’t. The prompt: build a Linux-style desktop environment as a web application. No starter code, no design mockups, no intermediate feedback.

Most models produce a basic skeleton — static taskbar, a placeholder window — then declare it done.

GLM-5.1 ran inside a simple harness: after each execution round, the model reviews its own output, identifies what’s missing or broken, and continues. Over 8 hours, it built a file browser, terminal, text editor, system monitor, calculator, and functional games, each integrated into a coherent UI. Styling got more polished with each pass. Edge cases got handled. The model decided the whole roadmap itself.
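The harness described here is essentially a loop that alternates execution with self-review. The sketch below shows that shape with stub functions in place of real model calls. The names `run_self_review_loop`, `fake_execute`, and `fake_review` are hypothetical, and the actual harness Z.ai used is not published in this article.

```python
from typing import Callable, List


def run_self_review_loop(
    execute: Callable[[List[str]], str],
    review: Callable[[str], List[str]],
    max_rounds: int = 10,
) -> List[str]:
    """Alternate execution and self-review until the review finds nothing to do."""
    todo: List[str] = ["initial build"]
    history: List[str] = []
    for _ in range(max_rounds):
        if not todo:
            break
        output = execute(todo)   # the model works on the open items
        history.append(output)
        todo = review(output)    # the model lists what is missing or broken
    return history


# Stubs standing in for model calls: "build" one feature per round.
FEATURES = ["file browser", "terminal", "text editor"]
built: List[str] = []


def fake_execute(todo: List[str]) -> str:
    if len(built) < len(FEATURES):
        built.append(FEATURES[len(built)])
    return ", ".join(built)


def fake_review(output: str) -> List[str]:
    return [feature for feature in FEATURES if feature not in output]


history = run_self_review_loop(fake_execute, fake_review)
print(history[-1])  # file browser, terminal, text editor
```

The `max_rounds` cap is worth keeping in any real version, since a review step that never returns an empty list would otherwise run indefinitely.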

What GLM-5.1 is built for

GLM-5.1 makes the most sense for tasks where additional runtime actually produces better output:

Long-running coding agents: multi-file refactors, migrations, full system builds
Agentic coding tools: works with Claude Code, OpenClaw, Trae, Cursor, Codex, and Cline
Terminal automation: 63.5 on Terminal-Bench 2.0 (Terminus-2), up from 56.2 on GLM-5
Cybersecurity: 68.7 on CyberGym, the highest reported score in this benchmark set
Web research: 68.0 on BrowseComp, the highest among the four models with a reported score

GLM-5.1 API pricing on Novita AI

| | Price |
|---|---|
| Input | $1.40 / M tokens |
| Cache Read | $0.26 / M tokens |
| Output | $4.40 / M tokens |

Pay per token, no monthly commitment. Full pricing at novita.ai/pricing.
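A quick way to budget a long agentic run is to plug token counts into the rates above. This is a rough estimator, assuming cached tokens are a subset of the input tokens and are billed at the cache-read rate instead of the input rate. The function name `estimate_cost` is illustrative, and actual billing may differ.

```python
INPUT_PER_M = 1.40       # USD per million input tokens
CACHE_READ_PER_M = 0.26  # USD per million cache-read tokens
OUTPUT_PER_M = 4.40      # USD per million output tokens


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request or one whole run.

    Assumption: `cached_tokens` is the part of `input_tokens` served from cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (uncached * INPUT_PER_M
             + cached_tokens * CACHE_READ_PER_M
             + output_tokens * OUTPUT_PER_M)
    return total / 1_000_000


no_cache = estimate_cost(1_000_000, 200_000)
half_cached = estimate_cost(1_000_000, 200_000, cached_tokens=500_000)
print(f"{no_cache:.2f} {half_cached:.2f}")  # 2.28 1.71
```

For agent loops that resend a growing conversation on every turn, the cache-read share tends to dominate the input side, so the cached estimate is usually the more relevant one.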

Getting started: OpenAI and Anthropic SDK compatible

Novita AI’s API works with both the OpenAI and Anthropic SDKs. Set the base URL, API key, and model ID, and the rest of your existing setup runs unchanged. GLM-5.1 can be called directly from Claude Code, OpenClaw, Trae, Cursor, Codex, and any platform that accepts an OpenAI- or Anthropic-compatible endpoint.

Try GLM-5.1 on Playground | View API Docs

Python (OpenAI SDK):

from openai import OpenAI

# Point the standard OpenAI client at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<Your Novita API Key>",
    base_url="https://api.novita.ai/openai"
)

# Send a chat completion request to GLM-5.1.
response = client.chat.completions.create(
    model="zai-org/glm-5.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Refactor this module to use async/await throughout."}
    ],
    max_tokens=131072,  # upper bound on generated tokens
    temperature=0.7
)

# Print the text of the first returned choice.
print(response.choices[0].message.content)

TypeScript (OpenAI SDK):

import OpenAI from "openai";

// Point the standard OpenAI client at Novita AI's OpenAI-compatible endpoint.
const client = new OpenAI({
  apiKey: "<Your Novita API Key>",
  baseURL: "https://api.novita.ai/openai",
});

// Top-level await requires an ES module context.
const response = await client.chat.completions.create({
  model: "zai-org/glm-5.1",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Build a CLI tool for parsing JSON logs." }
  ],
  max_tokens: 131072, // upper bound on generated tokens
});

// Print the text of the first returned choice.
console.log(response.choices[0].message.content);
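Both examples request `max_tokens` of 131072. If the prompt and the completion share the 204,800-token context window, which is an assumption here rather than something this article states, a long prompt leaves less room for output. A small hypothetical helper, `clamp_max_tokens`, can keep requests within that budget:

```python
CONTEXT_WINDOW = 204_800  # tokens, from the model description
MAX_OUTPUT = 131_072      # tokens, the max_tokens value used in the examples


def clamp_max_tokens(prompt_tokens: int, requested: int = MAX_OUTPUT) -> int:
    """Return a max_tokens value that fits alongside the prompt.

    Assumption: prompt tokens plus generated tokens must fit in CONTEXT_WINDOW.
    """
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested, remaining, MAX_OUTPUT)


print(clamp_max_tokens(10_000))   # 131072
print(clamp_max_tokens(100_000))  # 104800
```

Under that assumption, the full 131,072-token output is only available while the prompt stays at or below 73,728 tokens.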

Use cases for developers

GLM-5.1 is most useful where the task can’t be solved in a single pass and benefits from iterative refinement:

Autonomous coding agents: assign a repo-level task and let the model plan, implement, test, and iterate without check-ins
CI/CD pipeline automation: function calling makes it straightforward to wire GLM-5.1 into build/test/debug loops
Long-form technical document generation: 204K context and 131K output handle large, coherent documents in a single call
GPU kernel and ML performance optimization: the 3.6x geometric mean speedup on KernelBench Level 3 suggests it can help with ML infrastructure work
Web application scaffolding: GLM-5.1 built a full desktop UI from one natural-language prompt, and the same loop applies to any complex frontend or backend task
Security engineering: 68.7 on CyberGym puts it among the strongest available models for autonomous security tasks
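For the CI/CD case, the local half of function calling is a tool schema plus a dispatcher that runs whatever the model asks for. The sketch below uses the OpenAI function-calling schema format. The `run_tests` tool, its placeholder body, and the `dispatch_tool_call` helper are hypothetical names for illustration, not part of any Novita AI or Z.ai API.

```python
import json
from typing import Any, Callable, Dict

# Tool schema in the OpenAI function-calling format; `run_tests` is hypothetical.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return a summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test directory"},
            },
            "required": ["path"],
        },
    },
}]


def run_tests(path: str) -> Dict[str, Any]:
    # Placeholder: a real pipeline would invoke the test runner here.
    return {"path": path, "passed": True}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {"run_tests": run_tests}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call and return a JSON string for the tool-role reply."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**json.loads(arguments_json)))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names are reported back to the model.
        return json.dumps({"error": str(exc)})


result = json.loads(dispatch_tool_call("run_tests", '{"path": "tests/"}'))
print(result)
```

With the OpenAI SDK, `TOOLS` would be passed as the `tools` argument to `client.chat.completions.create`, and each returned tool call's `function.name` and `function.arguments` would be fed to the dispatcher. Returning errors as JSON rather than raising lets the model see the failure and try again.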

Bottom line

Open-source models have largely closed the gap on many reasoning benchmarks. The remaining gap is in long-horizon execution: staying coherent and productive across hundreds of tool calls and hours of autonomous work. GLM-5.1 is the clearest evidence so far that this gap is closeable.

If you’re running serious agentic workloads and want to avoid proprietary lock-in, it’s the most capable open-source option right now for coding and agent tasks. On Novita AI, you get it with OpenAI and Anthropic SDK compatibility, pay-per-token pricing, and no infrastructure overhead.


Novita AI is an AI & agent cloud platform helping developers and startups build, deploy, and scale models and agentic applications with high performance, reliability, and cost efficiency.

Frequently Asked Questions

What changed between GLM-5 and GLM-5.1?

The biggest change is in long-horizon execution. GLM-5 plateaus after a few dozen iterations; GLM-5.1 keeps finding new strategies through hundreds of rounds. The staircase pattern — structural shifts triggered by self-analysis — is what makes the difference. Coding benchmark scores also improved across the board.

Is GLM-5.1 open-source?

Yes, MIT license. Weights are on Hugging Face. You can use it commercially, fine-tune it, and self-host.

How does GLM-5.1 compare to Claude Opus 4.6?

On SWE-Bench Pro, GLM-5.1 scores 58.4 vs Claude Opus 4.6’s 57.3. On KernelBench long-horizon GPU optimization, Claude Opus 4.6 leads at 4.2× vs GLM-5.1’s 3.6×. For most agentic coding tasks, the two are closely matched — GLM-5.1 has an open-weight and cost advantage.
