GLM-4.6V on Novita AI: Vision AI with Native Tool Calling

GLM-4.6V is now available on the Novita AI platform, bringing Zhipu AI’s advanced vision-language model and its breakthrough multimodal capabilities to developers. With 106B parameters in its foundation version and a 128K-token context window, GLM-4.6V achieves state-of-the-art visual understanding among models of similar parameter scale.

This latest release integrates native Function Calling capabilities for the first time, effectively bridging the gap between visual perception and executable action. Whether you’re building multimodal agents, processing complex documents, or developing visual editing applications, GLM-4.6V delivers the capabilities you need through Novita AI’s developer-friendly infrastructure.

What is GLM-4.6V?

GLM-4.6V is Zhipu AI’s advanced vision-language model that delivers comprehensive multimodal understanding and generation capabilities. Part of the GLM-V model family, it represents a significant advancement in bridging visual perception with actionable intelligence through native function calling integration.

Dual Model Architecture: GLM-4.6V comes in two versions: the 106B-parameter foundation model designed for cloud and high-performance cluster scenarios, and GLM-4.6V-Flash with 9B parameters optimized for local deployment and low-latency applications. Both models provide powerful multimodal capabilities scaled to different deployment needs.

Extended Context Window: GLM-4.6V features a 128K token context window, allowing it to process multi-document or long-document input while directly interpreting richly formatted pages as images. This expanded context enables handling complex, image-heavy documents without requiring prior conversion to plain text.

Native Function Calling: For the first time in the GLM-V series, GLM-4.6V integrates native Function Calling capabilities. This breakthrough effectively bridges visual perception and executable action, providing a unified technical foundation for multimodal agents in real-world business scenarios.

State-of-the-Art Performance: GLM-4.6V achieves state-of-the-art performance in visual understanding among models of similar parameter scale across major multimodal benchmarks, demonstrating exceptional capability in processing and interpreting visual information.

Key Features and Capabilities

GLM-4.6V introduces several specialized capabilities that make it particularly effective for multimodal applications.

Multimodal Document Understanding

GLM-4.6V processes up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. The model understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents. This capability eliminates the need for preprocessing or text extraction, allowing direct analysis of PDFs, reports, presentations, and other visual documents.
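As a minimal sketch of what this looks like in practice, the request below sends two rendered document pages alongside a question through Novita AI’s OpenAI-compatible API. It assumes the endpoint accepts the standard OpenAI image_url content format; the page URLs are placeholders.

from openai import OpenAI

client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai"
)

# Two rendered report pages (placeholder URLs) plus a question, in one request.
response = client.chat.completions.create(
    model="zai-org/glm-4.6v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/report-page-1.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/report-page-2.png"}},
            {"type": "text", "text": "Compare the revenue tables on these two pages and summarize the differences."}
        ]
    }]
)

print(response.choices[0].message.content)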

Frontend Replication & Visual Editing

The model reconstructs pixel-accurate HTML/CSS from UI screenshots and supports natural-language-driven edits. GLM-4.6V detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions. This makes it valuable for rapid prototyping, design-to-code workflows, and automated UI generation.
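Here is a minimal design-to-code sketch, assuming the standard OpenAI base64 data-URL image format; the screenshot path and prompt are illustrative.

import base64
from openai import OpenAI

client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai"
)

# Encode a local UI screenshot (illustrative path) as a base64 data URL.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="zai-org/glm-4.6v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Reproduce this UI as a single self-contained HTML file with inline CSS."}
        ]
    }]
)

print(response.choices[0].message.content)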

Interleaved Image-Text Content Generation

GLM-4.6V supports high-quality mixed media creation from complex multimodal inputs. The model takes multimodal context spanning documents, user inputs, and tool-retrieved images, then synthesizes coherent, interleaved image-text content tailored to the task. During generation, it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.

Native Tool Integration

The integrated Function Calling capabilities enable GLM-4.6V to autonomously invoke external tools during processing. This allows the model to fetch real-time information, access databases, retrieve images, or trigger actions based on visual analysis. The native integration makes it particularly effective for building sophisticated multimodal agent systems.
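To make the flow concrete, here is a hedged sketch of a tool-calling request through the OpenAI-compatible endpoint, using the standard OpenAI tools schema. The search_product_catalog tool and the image URL are hypothetical, defined only for illustration.

from openai import OpenAI

client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai"
)

# Hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "search_product_catalog",
        "description": "Look up a product by name and return its price and stock level.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Product name to search for"}
            },
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="zai-org/glm-4.6v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/shelf-photo.jpg"}},
            {"type": "text", "text": "Identify the circled product and check its current price."}
        ]
    }],
    tools=tools
)

# If the model decided a tool is needed, inspect the call it proposed.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)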

Performance and Architecture

GLM-4.6V demonstrates strong performance across comprehensive multimodal evaluations.

Model Architecture

GLM-4.6V employs a sophisticated architecture optimized for multimodal understanding, building on the technical foundations of the GLM-V series:

  • Foundation Model (GLM-4.6V): 106B total parameters designed for cloud deployment and maximum capability
  • Lightweight Model (GLM-4.6V-Flash): 9B parameters optimized for edge deployment and reduced latency
  • Context Length: 128K tokens for processing extensive multimodal inputs
  • Vision Encoder: Spatial patch size of 14 with temporal patch size of 2 for efficient visual processing
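For intuition, here is a back-of-envelope sketch (our own, not from Zhipu’s documentation) of how many visual patches those patch sizes imply for a given input, assuming one patch per 14×14 pixel tile and ignoring any internal token merging or resizing the model may apply.

import math

# Rough upper-bound patch count: one patch per 14x14 pixel tile.
# Real token counts may be lower if the model merges or resamples patches.
def estimate_image_patches(width: int, height: int, patch: int = 14) -> int:
    return math.ceil(width / patch) * math.ceil(height / patch)

# For video, frames are grouped in pairs (temporal patch size of 2).
def estimate_video_patches(width: int, height: int, frames: int,
                           patch: int = 14, temporal_patch: int = 2) -> int:
    return estimate_image_patches(width, height, patch) * math.ceil(frames / temporal_patch)

print(estimate_image_patches(1920, 1080))  # 138 * 78 = 10,764 patches for a 1080p screenshot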

Getting Started with GLM-4.6V on Novita AI

Novita AI offers multiple ways to access GLM-4.6V, designed for different skill levels and use cases.

Use the Playground (No Coding Required)

Sign up and start experimenting with GLM-4.6V in seconds through an interactive interface. Upload images or documents, test multimodal prompts, and see outputs in real-time with the full 128K context window. Perfect for prototyping and understanding what the model can do before building full implementations.

Integrate via API (For Developers)

Connect GLM-4.6V to your applications using Novita AI’s unified REST API.

Direct API Integration (Python Example)

from openai import OpenAI

# Point the standard OpenAI client at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai"
)

# A simple text-only request; the multimodal examples above show how to
# send images alongside text in the same messages format.
response = client.chat.completions.create(
    model="zai-org/glm-4.6v",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=32768,   # generation budget; the context window is 128K tokens
    temperature=0.7
)

print(response.choices[0].message.content)

Multi-Agent Workflows with OpenAI Agents SDK

Build sophisticated multimodal agent systems through plug-and-play integration with the OpenAI Agents SDK, which brings handoffs, routing, and tool use on top of GLM-4.6V’s native function calling and full 128K context window.
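As a hedged sketch (not an official integration guide), the snippet below wires GLM-4.6V into the OpenAI Agents SDK (pip install openai-agents) as a custom model provider; the agent name and instructions are illustrative.

from agents import Agent, Runner, OpenAIChatCompletionsModel, set_tracing_disabled
from openai import AsyncOpenAI

# Reuse Novita AI's OpenAI-compatible endpoint as a custom model provider.
novita_client = AsyncOpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai"
)
set_tracing_disabled(True)  # the SDK's tracing defaults to the OpenAI backend

assistant = Agent(
    name="vision-assistant",  # illustrative name
    instructions="You answer questions about images and documents.",
    model=OpenAIChatCompletionsModel(
        model="zai-org/glm-4.6v",
        openai_client=novita_client
    )
)

result = Runner.run_sync(assistant, "Introduce yourself in one sentence.")
print(result.final_output)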

Connect with Third-Party Platforms

Agent Frameworks: Easily connect Novita AI with partner platforms like Continue, AnythingLLM, LangChain, Dify, and Langflow through official connectors and step-by-step integration guides.

Hugging Face: Novita AI is an official inference provider for Hugging Face, ensuring broad ecosystem compatibility.

OpenAI-Compatible API: Enjoy hassle-free migration and integration with tools built for the OpenAI API standard, such as Cline, Cursor, Trae, and Qwen Code.

Anthropic-Compatible API: Seamlessly integrate with Claude Code for agentic coding workflows and other Anthropic API-compatible tools.

Conclusion

GLM-4.6V on Novita AI delivers Zhipu AI’s advanced vision-language model with 106B parameters and a 128K-token context window, achieving state-of-the-art performance in multimodal understanding among models of its scale. With native Function Calling integration and specialized capabilities for document analysis, UI replication, and mixed-media generation, GLM-4.6V provides a unified foundation for building sophisticated multimodal AI applications.

Start exploring GLM-4.6V today through Novita AI’s playground, API, or third-party integrations to enhance your applications with advanced visual understanding, document processing, and multimodal reasoning capabilities. Build the next generation of AI-powered solutions with GLM-4.6V’s breakthrough vision-language intelligence.

Novita AI is a leading AI cloud platform that provides developers with easy-to-use APIs and affordable, reliable GPU infrastructure for building and scaling AI applications.

