Are you considering deploying GLM-4.5 locally but concerned about the substantial GPU resources required? In FP8 precision, the full GLM-4.5 model needs 8 NVIDIA H100 GPUs or 4 H200 GPUs for inference, rising to 16 H100s or 8 H200s to use its full 128K-token context, while the more resource-efficient GLM-4.5-Air variant runs on just 2 H100 GPUs or a single H200 GPU.
In this article, we will explore the VRAM requirements for GLM-4.5, discuss the feasibility of local deployment, and examine alternative methods to effectively utilize this powerful language model.
GLM 4.5 VRAM Requirements
GLM-4.5 is the latest advancement in the GLM family, featuring a sophisticated Mixture-of-Experts (MoE) architecture and optimization for agentic applications. The model comes in two variants: the flagship GLM-4.5 with 355 billion total parameters (32 billion active), and the efficient GLM-4.5-Air with 106 billion total parameters (12 billion active).
Key architectural innovations include a deeper model structure with reduced width and increased depth for enhanced reasoning, pre-training on a massive 15 trillion token corpus for comprehensive knowledge, and the open-source “slime” RL infrastructure designed for scalable, large-scale agentic reinforcement learning.

How much VRAM does GLM 4.5 need for inference?
The models can run under the configurations in the table below:
| Model | Precision | GPU Type and Count | Test Framework |
|---|---|---|---|
| GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang |
| GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang |
Under the configurations in the table below, the models can utilize their full 128K context length:
| Model | Precision | GPU Type and Count | Test Framework |
|---|---|---|---|
| GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang |
| GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang |
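To see why these GPU counts are needed, a rough back-of-the-envelope estimate of the weight memory alone is helpful. The sketch below only multiplies parameter count by bytes per parameter; the KV cache, activations, and framework overhead come on top of this, which is why real deployments need headroom beyond the raw weight size.

# Rough weight-memory estimate (weights only; KV cache, activations and
# framework overhead are extra).
def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    return total_params_billion * 1e9 * bytes_per_param / 1e9

for name, params_b in [("GLM-4.5", 355), ("GLM-4.5-Air", 106)]:
    for precision, bytes_pp in [("BF16", 2.0), ("FP8", 1.0)]:
        print(f"{name} {precision}: ~{weight_memory_gb(params_b, bytes_pp):.0f} GB of weights")

# GLM-4.5 in BF16: ~710 GB of weights -> 16 x H100 (80 GB each) = 1,280 GB total
# GLM-4.5 in FP8:  ~355 GB of weights ->  8 x H100 (80 GB each) =   640 GB total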
How much VRAM does GLM 4.5 need for fine-tuning?
Fine-tuning can be run under the configurations in the table below using Llama Factory:
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H100 x 16 | LoRA | 1 |
| GLM-4.5-Air | H100 x 4 | LoRA | 1 |
Fine-tuning can be run under the configurations in the table below using Swift:
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H20 (96GiB) x 16 | LoRA | 1 |
| GLM-4.5-Air | H20 (96GiB) x 4 | LoRA | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |
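The gap between the LoRA and SFT rows comes down to optimizer memory: LoRA freezes the base weights and trains only a small adapter, so gradients and Adam states are needed for the adapter alone, whereas full fine-tuning keeps them for every parameter. The sketch below is a back-of-the-envelope illustration; the adapter size and the 16-bytes-per-trainable-parameter rule of thumb are our assumptions, not figures published by the GLM team.

# Rough training-memory comparison (weights and optimizer states only;
# activations, MoE buffers and framework overhead are extra).
TOTAL_PARAMS = 355e9      # GLM-4.5 total parameters
ADAPTER_PARAMS = 0.5e9    # assumed LoRA adapter size (depends on rank/targets)

# ~16 bytes per trainable parameter is the usual mixed-precision Adam rule of
# thumb (weights + gradients + two optimizer moments + master weights).
full_ft_gb = TOTAL_PARAMS * 16 / 1e9
lora_gb = (TOTAL_PARAMS * 2 + ADAPTER_PARAMS * 16) / 1e9  # frozen BF16 weights + adapter states

print(f"Full fine-tuning: ~{full_ft_gb:,.0f} GB")  # ~5,680 GB -> on the order of 128 x H20 (96 GB)
print(f"LoRA:             ~{lora_gb:,.0f} GB")     # ~718 GB   -> fits 16 x H100 (80 GB)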
GLM 4.5 VRAM usage with different batch sizes
| Model | Precision | Batch Size (per GPU) | VRAM |
|---|---|---|---|
| GLM-4.5 | FP16 | 1 | 945.36GB |
| GLM-4.5 | FP16 | 8 | 1128.49GB |
| GLM-4.5 | FP16 | 16 | 1137.79GB |
| GLM-4.5 | FP16 | 32 | 1756.38GB |
| GLM-4.5-Air | FP16 | 1 | 288.68GB |
| GLM-4.5-Air | FP16 | 8 | 343.58GB |
| GLM-4.5-Air | FP16 | 16 | 406.33GB |
| GLM-4.5-Air | FP16 | 32 | 531.83GB |
What are the Hardware Requirements for GLM 4.5?
- GPUs:
- Inference: 8 × H100/4 × H200 (FP8) or 16 × H100/8 × H200 (BF16) for the full model; roughly a quarter of that for the Air variant (2 × H100 in FP8, 4 × H100 in BF16).
- Fine-Tuning: GPUs with ≥ 80 GB VRAM.
- CPU & System:
- ≥ 1 TB RAM to load models and manage offload buffers.
- High-bandwidth interconnect (NVLink/HPC switch) for multi-GPU tensor parallelism.
- Precision:
- FP8 for minimal VRAM usage (requires GPUs with native FP8 support).
- BF16 as alternative on GPUs without FP8.
- Software:
- SGLang or vLLM for inference, with support for speculative decoding and CPU offload; Llama Factory or Swift for fine-tuning.
Optimizing GLM 4.5 for lower VRAM consumption
- Model Variants: Choose GLM-4.5-Air (106 B total/12 B active parameters) when GPU budget is limited; it needs roughly a quarter of the hardware the full model does.
- When to Choose GLM-4.5-Air:
- Significantly Faster Generation:
- GLM-4.5-Air achieves an output rate of around 160 tokens per second, nearly twice as fast as the Full-size model (approximately 88 tokens/s). This makes Air ideal for latency-sensitive applications.
- Extremely Low First-Token Latency (TTFT):
- Air outputs its first token in about 0.58 seconds, compared to 0.68 seconds for Full-size. In some tests, Full-size latency can reach 22–23 seconds when including “thinking” time.
- Shorter End-to-End Response Time:
- Air delivers end-to-end responses (input processing, inference, and output) in about 16 seconds, while Full-size takes nearly 29 seconds, making Full-size less suitable for real-time interactions.
- Slightly Lower Scores on Complex Reasoning Tasks:
- On reasoning benchmarks such as MMLU-Pro, GPQA, and AIME, Air scores about 2–3% lower than Full-size, but still maintains industry-leading performance.
- Recommended for Most Use Cases:
- For the majority of text generation, summarization, basic reasoning, and code-assist tasks, the full-size model is not necessary; Air delivers more than enough performance and responsiveness.

- Layer Offloading: Offload select MoE experts or feed-forward layers to CPU memory.
- KV-Cache Quantization: Reduce cache precision to save VRAM at minor quality cost.
- Batch Size = 1: Limit to single-sample inference per GPU to minimize activation memory (a sketch combining these options follows below).
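To illustrate how these levers can be combined, here is a minimal sketch using vLLM's offline Python API. The specific values (tensor-parallel degree, offload budget, context length) are assumptions for an 8 × H100 FP8 setup, not an official reference configuration; tune them to your hardware and check your vLLM version's documentation for the options it supports.

from vllm import LLM, SamplingParams

# Minimal sketch (assumed settings, not an official config): shard GLM-4.5
# across 8 GPUs, quantize the KV cache, offload part of the weights to CPU
# RAM, and cap the context window to limit KV-cache growth.
llm = LLM(
    model="zai-org/GLM-4.5",     # assumed Hugging Face model ID
    tensor_parallel_size=8,      # shard weights across 8 GPUs over NVLink
    kv_cache_dtype="fp8",        # quantized KV cache to save VRAM
    cpu_offload_gb=16,           # offload up to 16 GiB of weights per GPU to CPU
    max_model_len=32768,         # shorter context -> smaller KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain Mixture-of-Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)

Batch size is handled implicitly here: sending one prompt at a time keeps activation memory to a minimum, which is what the "Batch Size = 1" recommendation above amounts to.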
Another Cost-Effective Option: API
Here’s a simplified comparison between deploying GLM 4.5 via an API and running it locally:
| Aspect | API Deployment | Local Deployment |
|---|---|---|
| Cost | Pay-per-use pricing; for example, on Novita AI input tokens cost $0.60 per million and output tokens $2.20 per million. | High initial investment in hardware (e.g., multiple NVIDIA H100/H200 GPUs); potentially lower costs over time for heavy usage. |
| Performance | Scalable with potential network latency; suitable for applications where slight delays are acceptable. | Lower latency and consistent performance; ideal for real-time applications requiring immediate responses. |
| Scalability | Easily scalable without managing infrastructure; provider handles scaling. | Scaling requires additional hardware and infrastructure management. |
| Data Privacy | Data is processed externally, which may raise privacy concerns, especially in regulated industries. | Data remains in-house, offering greater control and compliance with data protection regulations. |
| Operational Complexity | Minimal setup and maintenance; provider manages updates and infrastructure. | Requires technical expertise for setup, maintenance, and security; offers greater customization. |
| Customization | Limited to provider’s configurations; less flexibility for specific needs. | Full control over model customization, fine-tuning, and integration with existing systems. |
| Use Case Suitability | Ideal for applications with variable or low usage, rapid development needs, or limited technical resources. | Best for applications with high, consistent usage, stringent data privacy requirements, or need for extensive customization. |
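To make the cost row concrete, here is a small illustrative calculation comparing the per-token API pricing quoted in this article against a hypothetical always-on self-hosted cluster. The GPU rental rate and the monthly token volumes are assumptions for illustration only.

# Illustrative break-even sketch; only the per-token API prices come from this article.
API_INPUT_PER_M = 0.60    # USD per million input tokens (Novita AI)
API_OUTPUT_PER_M = 2.20   # USD per million output tokens (Novita AI)

GPU_HOURLY_RATE = 2.50    # assumed USD per H100-hour (varies widely by provider)
NUM_GPUS = 8              # FP8 inference configuration for the full model
HOURS_PER_MONTH = 730

monthly_input_m = 500     # assumed traffic: 500M input tokens per month
monthly_output_m = 200    # assumed traffic: 200M output tokens per month

api_cost = monthly_input_m * API_INPUT_PER_M + monthly_output_m * API_OUTPUT_PER_M
local_cost = GPU_HOURLY_RATE * NUM_GPUS * HOURS_PER_MONTH

print(f"API:   ${api_cost:,.0f}/month")    # $740 at this traffic level
print(f"Local: ${local_cost:,.0f}/month")  # $14,600 for an always-on 8 x H100 cluster

At this assumed traffic level the API is clearly cheaper; self-hosting starts to pay off only with much higher, sustained utilization, or when data-privacy requirements rule out an external provider.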
How to Access GLM 4.5 via Novita AI?
Novita AI provides a GLM-4.5 API with a 131K-token context window, priced at $0.60 per million input tokens and $2.20 per million output tokens, delivering strong support for maximizing GLM-4.5's code-agent potential.
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model
Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial
Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key
To authenticate with the API, you need an API key. Go to the "Settings" page and copy your API key as indicated in the image.

Step 5: Install the Client Library
Install the client library using the package manager for your programming language; for Python, the example below uses the openai package (pip install openai).
After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with the Novita AI LLM API. Here is an example of using the Chat Completions API in Python.
from openai import OpenAI

# Initialize the client against Novita AI's OpenAI-compatible endpoint.
# Replace the api_key value with the key copied from the Settings page.
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR_NOVITA_API_KEY>",
)

model = "zai-org/glm-4.5"
stream = True  # or False
max_tokens = 65536

# Sampling parameters
system_content = "Be a helpful assistant"
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        },
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    # Parameters outside the standard OpenAI schema are passed via extra_body.
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    # Print tokens as they arrive.
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
GLM-4.5 and its Air variant provide powerful solutions for agentic applications, with varying VRAM requirements to suit different deployment scenarios. Assessing your specific needs and resources will guide you in choosing between local deployment and API-based solutions.
Frequently Asked Questions
Who should use GLM-4.5?
GLM-4.5 is ideal for developers, researchers, and businesses seeking advanced AI agent capabilities, especially for coding, automation, and knowledge tasks.
What is GLM-4.5?
GLM-4.5 is an advanced large language model featuring a Mixture-of-Experts architecture, optimized for agentic applications requiring complex reasoning and tool integration.
Can I use GLM-4.5 without deploying it locally?
Yes, utilizing GLM-4.5 through an API is an alternative that reduces the need for significant hardware investment, though it may involve considerations regarding data privacy and network latency.
What is Novita AI?
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
Recommended Reading
- Novita Kimi K2 API Support Function Calling Now!
- Why Kimi K2 VRAM Requirements Are a Challenge for Everyone?
- Qwen 3 in RAG Pipelines: All-in-One LLM, Embedding, and Reranking Models