GLM 5.2 API Quick Start on Novita AI


This quick start shows how to call GLM 5.2 on Novita AI through the OpenAI-compatible chat completions API. Start with the verified model ID zai-org/glm-5.2, the Novita AI base URL, and a small first request. Once that works, move on to the capabilities shown in the current model listing: the 1,048,576-token context window, 131,072-token max output, function calling, structured outputs, reasoning support, and Anthropic-compatible access.

GLM 5.2 API quick start prerequisites

GLM 5.2 is Z.AI’s flagship model for long-horizon autonomous work. The Novita AI model page describes it as a model built for sustained tasks such as planning, execution, iterative optimization, coding, and delivery of production-grade results. For developers, the practical point is simple: GLM 5.2 is not just another short-chat model. It is positioned for workflows where the model needs enough context to keep a large task, codebase, document set, or agent state in view.

On Novita AI, GLM 5.2 is exposed through serverless model APIs. That matters if you want to evaluate the model without standing up GPU infrastructure, routing traffic through a custom inference stack, or managing long-context serving yourself. You use Novita AI’s API key, the OpenAI-compatible endpoint, and the exact model ID:

zai-org/glm-5.2

The current Novita AI LLM API guide documents the platform’s OpenAI-compatible approach for chat and completion tasks. The chat completions API reference documents the REST path used by the examples below:

https://api.novita.ai/openai/v1/chat/completions

Use the model page for model-specific details such as context length, max output, pricing, modalities, and supported endpoint families. Use the API reference for request parameters, authentication, streaming, and chat message structure.

GLM 5.2 API specs and pricing

The current Novita AI listing for GLM 5.2 shows a serverless text-in, text-out model with long-context and agent-oriented feature support.

| Field | Current Novita AI value |
| --- | --- |
| Display name | GLM 5.2 |
| API model ID | zai-org/glm-5.2 |
| Access path | Serverless |
| Context window | 1,048,576 tokens |
| Max output | 131,072 tokens |
| Input modalities | Text |
| Output modalities | Text |
| Endpoint families | chat/completions, Anthropic-compatible endpoint |
| Function calling | Supported |
| Structured outputs | Supported |
| Reasoning | Supported |
| Input price | $1.40 per million tokens |
| Cached-read input price | $0.26 per million tokens |
| Output price | $4.40 per million tokens |

Pricing is listed per million tokens. For a quick estimate, multiply prompt tokens by the input rate and generated tokens by the output rate. Cached-read pricing can reduce cost when your application repeatedly sends the same reusable context, such as a system prompt, tool schema, policy block, or stable repository summary.

For example, a request with 100,000 uncached input tokens and 5,000 output tokens would be estimated as:

| Component | Calculation | Estimated cost |
| --- | --- | --- |
| Input | 0.1 million tokens × $1.40 | $0.14 |
| Output | 0.005 million tokens × $4.40 | $0.022 |
| Total | Input + output | $0.162 |

This is only a simple token-rate estimate. Production cost also depends on prompt reuse, retries, truncation, streaming behavior, response length, and whether your application repeatedly includes large context blocks that could be cached or summarized.
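
The same arithmetic can be wrapped in a small helper. This is a minimal sketch that uses the three per-million-token rates from the listing above. The function name `estimate_cost` and the `cached_tokens` parameter are illustrative, and the sketch assumes cached tokens are the portion of the input billed at the cached-read rate; check your actual usage data for how the split is reported.

```python
# Rates in USD per million tokens, copied from the GLM 5.2 listing above.
INPUT_RATE = 1.40
CACHED_READ_RATE = 0.26
OUTPUT_RATE = 4.40


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumption: `cached_tokens` is the part of `input_tokens` billed at the
    cached-read rate; the remainder is billed at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * INPUT_RATE
        + cached_tokens * CACHED_READ_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000
    # Round to micro-dollars to hide floating-point noise.
    return round(total, 6)
```

Calling `estimate_cost(100_000, 5_000)` reproduces the $0.162 figure from the table. Passing `cached_tokens=80_000` for the same request shows how much a reusable context block could save under the stated assumption.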

How to make your first GLM 5.2 API request

Start with a small prompt before testing the full 1M-token context window. That gives you a clean baseline for authentication, model routing, response shape, and latency.

Install the OpenAI Python SDK and store your Novita AI key in an environment variable:

pip install openai
export NOVITA_API_KEY="YOUR_NOVITA_API_KEY"

Then call GLM 5.2 with the Novita AI base URL:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["NOVITA_API_KEY"],
    base_url="https://api.novita.ai/openai",
)

response = client.chat.completions.create(
    model="zai-org/glm-5.2",
    messages=[
        {
            "role": "system",
            "content": "You are a practical software architecture assistant.",
        },
        {
            "role": "user",
            "content": "Review this migration plan and list the highest-risk steps.",
        },
    ],
    max_tokens=1200,
    temperature=0.3,
)

print(response.choices[0].message.content)

If you prefer a direct REST call, use the chat completions path:

curl --request POST \
  --url https://api.novita.ai/openai/v1/chat/completions \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "zai-org/glm-5.2",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise engineering reviewer."
      },
      {
        "role": "user",
        "content": "Create a release-risk checklist for a payments API change."
      }
    ],
    "max_tokens": 1200,
    "temperature": 0.3
  }'

For longer responses, enable streaming so your application can start receiving tokens before the full completion is finished:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["NOVITA_API_KEY"],
    base_url="https://api.novita.ai/openai",
)

stream = client.chat.completions.create(
    model="zai-org/glm-5.2",
    messages=[
        {
            "role": "user",
            "content": "Draft a phased plan for refactoring a monolith into services.",
        }
    ],
    max_tokens=2000,
    temperature=0.3,
    stream=True,
)

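for chunk in stream:
    # Some chunks (for example, a trailing usage chunk) can arrive with no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

print()  # finish the streamed output with a newline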

Keep API keys out of source control, set explicit max_tokens values, and log usage data when available. Long-context models make it easy to send very large prompts, so cost control starts with measuring prompt and completion tokens from the first prototype.
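
One way to start measuring tokens from the first prototype is to read the `usage` object on each non-streaming response. The sketch below assumes the response follows the OpenAI SDK shape, with `usage.prompt_tokens` and `usage.completion_tokens`; the helper names `usage_summary` and `log_usage` are illustrative, not part of any SDK.

```python
import logging

logger = logging.getLogger("glm_usage")


def usage_summary(response):
    """Return token counts from a chat completion response, or None if usage is absent."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return None
    prompt = usage.prompt_tokens or 0
    completion = usage.completion_tokens or 0
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }


def log_usage(response, label="glm-5.2"):
    """Log one line of token usage per request so cost trends show up early."""
    summary = usage_summary(response)
    if summary is None:
        logger.warning("%s: no usage data returned", label)
    else:
        logger.info(
            "%s: prompt=%d completion=%d total=%d",
            label,
            summary["prompt_tokens"],
            summary["completion_tokens"],
            summary["total_tokens"],
        )
    return summary
```

Call `log_usage(response)` right after `client.chat.completions.create(...)` returns. Streaming responses may report usage differently, so treat a missing `usage` object as expected rather than as an error.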

When to use GLM 5.2

GLM 5.2 is a strong fit when your task is too large for a normal chat context or when the model needs to coordinate multiple steps with tools, files, or structured outputs.

Good evaluation targets include:

  • Repository analysis: ask the model to review architecture notes, file maps, dependency descriptions, and selected code excerpts in one request.
  • Coding agents: keep task goals, constraints, tool schemas, previous decisions, and working notes in context while the agent iterates.
  • Long-document synthesis: summarize policies, technical specifications, contracts, research notes, or product documents without aggressive chunking.
  • Migration planning: give the model a system map, constraints, rollout plan, and risk register, then ask for gaps or sequencing issues.
  • Structured extraction: combine long source documents with a strict JSON schema for downstream systems.

GLM 5.2 is not automatically the right model for every request. For short classification, basic chat, simple extraction, or high-volume low-latency traffic, compare smaller models in the Novita AI model library and current rates on the Novita AI pricing page. A 1M-token model is most valuable when you actually need the context, output ceiling, or agent-oriented features.
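
A simple router can make that choice explicit. The following is a sketch only: `SMALL_MODEL` is a hypothetical placeholder rather than a real model ID, and the 16,000-token threshold is an arbitrary assumption you would tune against your own traffic.

```python
LONG_CONTEXT_MODEL = "zai-org/glm-5.2"
SMALL_MODEL = "your-smaller-model-id"  # hypothetical placeholder, not a real model ID
SMALL_MODEL_TOKEN_LIMIT = 16_000  # assumed threshold; tune for your workload


def pick_model(estimated_prompt_tokens: int, multi_step: bool = False) -> str:
    """Route short, single-step requests to a smaller model.

    Anything that is large or needs multi-step agent work goes to GLM 5.2.
    """
    if multi_step or estimated_prompt_tokens > SMALL_MODEL_TOKEN_LIMIT:
        return LONG_CONTEXT_MODEL
    return SMALL_MODEL
```

The returned string can be passed as the `model` argument of the chat completions call, which keeps the routing rule in one place where it is easy to measure and adjust.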

Function calling and structured outputs

The GLM 5.2 listing shows function calling and structured outputs support. These features are useful when the model should return something your application can act on, not just prose.

Function calling is a good fit when your application exposes controlled tools such as:

  • retrieving a customer record,
  • opening a ticket,
  • checking deployment status,
  • searching an internal knowledge base,
  • calculating a quote,
  • or routing a request to a specialized service.

Here is a minimal tool-calling pattern:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["NOVITA_API_KEY"],
    base_url="https://api.novita.ai/openai",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "create_release_ticket",
            "description": "Create a release ticket after risk review.",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "risk_level": {
                        "type": "string",
                        "enum": ["low", "medium", "high"],
                    },
                    "summary": {"type": "string"},
                },
                "required": ["title", "risk_level", "summary"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="zai-org/glm-5.2",
    messages=[
        {
            "role": "user",
            "content": "Assess this release and create a ticket if risk is medium or high.",
        }
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=1000,
)

print(response.choices[0].message)
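
When the model decides to call a tool, the returned message carries `tool_calls` instead of plain content, and your application is responsible for running them. The sketch below assumes the OpenAI SDK message shape (`tool_calls[i].id`, `.function.name`, `.function.arguments` as a JSON string). The local `create_release_ticket` handler is a hypothetical stand-in for a real ticketing integration.

```python
import json


def create_release_ticket(title: str, risk_level: str, summary: str) -> dict:
    """Hypothetical local handler; a real one would call your ticketing system."""
    return {"status": "created", "title": title, "risk_level": risk_level}


# Map the tool names the model may request to local Python callables.
TOOL_HANDLERS = {"create_release_ticket": create_release_ticket}


def run_tool_calls(message) -> list:
    """Run each requested tool call and build `tool` role messages for the next turn."""
    results = []
    for call in getattr(message, "tool_calls", None) or []:
        try:
            handler = TOOL_HANDLERS.get(call.function.name)
            if handler is None:
                raise ValueError(f"unknown tool: {call.function.name}")
            # Arguments arrive as a JSON string generated by the model.
            args = json.loads(call.function.arguments)
            output = handler(**args)
        except (TypeError, ValueError) as exc:
            # Bad JSON, wrong argument names, or an unknown tool: report it back.
            output = {"error": str(exc)}
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results
```

To continue the conversation, append the assistant message and the messages returned by `run_tool_calls` to your `messages` list, then call the API again so the model can summarize the result. Returning errors as tool output, rather than raising, gives the model a chance to correct its own arguments.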

Structured outputs are useful when you want the response to fit a predictable schema. Even when you ask for JSON, keep validation in your application. Treat the model’s output as a generated candidate, parse it, validate required fields, and handle errors with a repair prompt or a fallback path.
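
The parse-and-validate step can be sketched with the standard library alone. The field names below reuse the ticket schema from the tool example purely for illustration, and `validate_ticket` is a hypothetical helper; a production system might use a schema library instead.

```python
import json

# Expected fields and their Python types (illustrative schema).
REQUIRED_FIELDS = {"title": str, "risk_level": str, "summary": str}
ALLOWED_RISK_LEVELS = {"low", "medium", "high"}


def validate_ticket(raw: str):
    """Parse model output and return (data, errors); data is None when invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    risk = data.get("risk_level")
    if isinstance(risk, str) and risk not in ALLOWED_RISK_LEVELS:
        errors.append(f"invalid risk_level: {risk}")
    return (None, errors) if errors else (data, [])
```

If `errors` is non-empty, one option is to send the list back to the model in a repair prompt and retry a bounded number of times before falling back to a manual or default path.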

For more background on tool design, see Novita AI’s guide to function calling and structured outputs and the GLM-focused guide to GLM function calling.

Production notes for long-context usage

The headline context window is the ceiling, not the default operating mode. A 1,048,576-token request can be useful, but most applications should earn their way up to that size.

Start with these controls:

  • Budget the prompt: split stable instructions, volatile user input, retrieval results, and tool schemas so you can see which part is driving token count.
  • Use retrieval before full stuffing: send the most relevant files or passages first, then expand context only when the task needs more evidence.
  • Cap output length: GLM 5.2 supports a high max output, but most workflows do not need 131,072 generated tokens. Set max_tokens to the smallest useful value.
  • Stream long responses: streaming improves user experience and lets your service handle long completions more gracefully.
  • Validate structured results: schemas reduce ambiguity, but your application still needs parser checks, retries, and clear error handling.
  • Track cache opportunities: repeated context blocks can be expensive if sent as fresh input every time. Identify reusable prompts, policies, and tool definitions early.
  • Keep a smaller-model fallback: many routing systems use a smaller model for easy cases and reserve long-context models for tasks that need their full capacity.
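
The prompt-budget control above can be sketched as a per-part breakdown. This example uses a rough four-characters-per-token heuristic, which is an assumption and not GLM 5.2's actual tokenizer; prefer the token counts reported in API usage data when you have them. The names `rough_tokens` and `budget_report` are illustrative.

```python
def rough_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return len(text) // 4 if len(text) >= 4 else (1 if text else 0)


def budget_report(parts: dict) -> dict:
    """Estimate tokens per prompt part and flag the largest contributor."""
    counts = {name: rough_tokens(text) for name, text in parts.items()}
    largest = max(counts, key=counts.get) if counts else None
    return {"counts": counts, "total": sum(counts.values()), "largest": largest}
```

Passing a dictionary such as `{"system": ..., "tools": ..., "retrieval": ..., "user": ...}` shows at a glance which part is driving the prompt size before the request is sent.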

For coding agents, one practical pattern is to keep durable project context outside the prompt, retrieve only the files relevant to the current task, and ask GLM 5.2 to produce a bounded plan or patch review rather than an open-ended essay. This keeps costs legible while still giving the model enough context to reason across the parts of the system that matter.
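
A minimal sketch of that retrieval step is a greedy selection under a token budget. It assumes you already have a relevance score and a token estimate for each candidate file; how those are computed is outside this example, and the function name `select_within_budget` is hypothetical.

```python
def select_within_budget(candidates, token_budget):
    """Pick files by descending relevance, skipping any that no longer fit.

    `candidates` is a list of (name, relevance_score, estimated_tokens) tuples.
    Returns the chosen names and the total tokens they use.
    """
    chosen, used = [], 0
    for name, _score, tokens in sorted(candidates, key=lambda c: c[1], reverse=True):
        if used + tokens <= token_budget:
            chosen.append(name)
            used += tokens
    return chosen, used
```

Greedy selection is not optimal in every case, but it is predictable and easy to log, which helps when you need to explain why a given file was or was not in the prompt.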

Frequently asked questions

Is GLM 5.2 available on Novita AI?

Yes. GLM 5.2 is listed on Novita AI as a serverless model with the API model ID zai-org/glm-5.2.

What is the context window for GLM 5.2 on Novita AI?

The current Novita AI listing shows a 1,048,576-token context window for GLM 5.2.

What is the max output for GLM 5.2?

The current Novita AI listing shows a 131,072-token max output for GLM 5.2. Set a smaller max_tokens value unless your workflow truly needs a very long response.

How much does GLM 5.2 cost on Novita AI?

The current pricing page lists GLM 5.2 at $1.40 per million input tokens, $0.26 per million cached-read input tokens, and $4.40 per million output tokens.

Does GLM 5.2 support function calling?

Yes. The current GLM 5.2 listing shows function calling support. Use it when the model should choose from controlled application tools instead of returning only natural-language text.

Does GLM 5.2 support structured outputs?

Yes. The current GLM 5.2 listing shows structured outputs support. Validate generated JSON or schema-shaped responses in your application before using them downstream.