Are you considering deploying GLM-4.5 locally but concerned about the substantial GPU resources required? In FP8 precision, the full GLM-4.5 model needs 8 NVIDIA H100 GPUs or 4 H200 GPUs for inference, rising to 16 H100s or 8 H200s to use its full 128K-token context, while the more resource-efficient GLM-4.5-Air variant runs on just 2 H100 GPUs or a single H200 GPU.
In this article, we will explore the VRAM requirements for GLM-4.5, discuss the feasibility of local deployment, and examine alternative methods to effectively utilize this powerful language model.
GLM 4.5 VRAM Requirements
GLM-4.5 is the latest advancement in the GLM family, featuring a sophisticated Mixture-of-Experts (MoE) architecture and optimization for agentic applications. The model comes in two variants: the flagship GLM-4.5 with 355 billion total parameters (32 billion active), and the efficient GLM-4.5-Air with 106 billion total parameters (12 billion active).
Key architectural innovations include a deeper model structure with reduced width and increased depth for enhanced reasoning, pre-training on a massive 15 trillion token corpus for comprehensive knowledge, and the open-source “slime” RL infrastructure designed for scalable, large-scale agentic reinforcement learning.

How much VRAM does GLM 4.5 need for inference?
The models can run under the configurations in the table below:
| Model | Precision | GPU Type and Count | Test Framework |
|---|---|---|---|
| GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang |
| GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang |
Under the configurations in the table below, the models can utilize their full 128K context length:
| Model | Precision | GPU Type and Count | Test Framework |
|---|---|---|---|
| GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang |
| GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang |
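To see why these GPU counts are needed, a rough back-of-the-envelope estimate of the weight memory alone is helpful. The sketch below only multiplies parameter count by bytes per parameter; the KV cache, activations, and framework overhead come on top of this, which is why real deployments need headroom beyond the raw weight size.

# Rough weight-memory estimate (weights only; KV cache, activations and
# framework overhead are extra).
def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    return total_params_billion * 1e9 * bytes_per_param / 1e9

for name, params_b in [("GLM-4.5", 355), ("GLM-4.5-Air", 106)]:
    for precision, bytes_pp in [("BF16", 2.0), ("FP8", 1.0)]:
        print(f"{name} {precision}: ~{weight_memory_gb(params_b, bytes_pp):.0f} GB of weights")

# GLM-4.5 in BF16: ~710 GB of weights -> 16 x H100 (80 GB each) = 1,280 GB total
# GLM-4.5 in FP8:  ~355 GB of weights ->  8 x H100 (80 GB each) =   640 GB total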
How much VRAM does GLM 4.5 need for fine-tuning?
Fine-tuning can be run under the configurations in the table below using Llama Factory:
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H100 x 16 | LoRA | 1 |
| GLM-4.5-Air | H100 x 4 | LoRA | 1 |
Fine-tuning can be run under the configurations in the table below using Swift:
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H20 (96GiB) x 16 | LoRA | 1 |
| GLM-4.5-Air | H20 (96GiB) x 4 | LoRA | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |
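The gap between the LoRA and SFT rows comes down to optimizer memory: LoRA freezes the base weights and trains only a small adapter, so gradients and Adam states are needed for the adapter alone, whereas full fine-tuning keeps them for every parameter. The sketch below is a back-of-the-envelope illustration; the adapter size and the 16-bytes-per-trainable-parameter rule of thumb are our assumptions, not figures published by the GLM team.

# Rough training-memory comparison (weights and optimizer states only;
# activations, MoE buffers and framework overhead are extra).
TOTAL_PARAMS = 355e9      # GLM-4.5 total parameters
ADAPTER_PARAMS = 0.5e9    # assumed LoRA adapter size (depends on rank/targets)

# ~16 bytes per trainable parameter is the usual mixed-precision Adam rule of
# thumb (weights + gradients + two optimizer moments + master weights).
full_ft_gb = TOTAL_PARAMS * 16 / 1e9
lora_gb = (TOTAL_PARAMS * 2 + ADAPTER_PARAMS * 16) / 1e9  # frozen BF16 weights + adapter states

print(f"Full fine-tuning: ~{full_ft_gb:,.0f} GB")  # ~5,680 GB -> on the order of 128 x H20 (96 GB)
print(f"LoRA:             ~{lora_gb:,.0f} GB")     # ~718 GB   -> fits 16 x H100 (80 GB)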
GLM 4.5 VRAM usage with different batch sizes
| Model | Precision | Batch Size (per GPU) | VRAM |
|---|---|---|---|
| GLM-4.5 | FP16 | 1 | 945.36GB |
| GLM-4.5 | FP16 | 8 | 1128.49GB |
| GLM-4.5 | FP16 | 16 | 1137.79GB |
| GLM-4.5 | FP16 | 32 | 1756.38GB |
| GLM-4.5-Air | FP16 | 1 | 288.68GB |
| GLM-4.5-Air | FP16 | 8 | 343.58GB |
| GLM-4.5-Air | FP16 | 16 | 406.33GB |
| GLM-4.5-Air | FP16 | 32 | 531.83GB |
What are the Hardware Requirements for GLM 4.5?
- GPUs:
- Inference: 8 × H100/4 × H200 (FP8) or 16 × H100/8 × H200 (BF16) for the full model; roughly a quarter of that for the Air variant (2 × H100 in FP8, 4 × H100 in BF16).
- Fine-Tuning: GPUs with ≥ 80 GB VRAM.
- CPU & System:
- ≥ 1 TB RAM to load models and manage offload buffers.
- High-bandwidth interconnect (NVLink/HPC switch) for multi-GPU tensor parallelism.
- Precision:
- FP8 for minimal VRAM usage (requires GPUs with native FP8 support).
- BF16 as alternative on GPUs without FP8.
- Software:
- SGLang or vLLM for inference, with support for speculative decoding and CPU offload; Llama Factory or Swift for fine-tuning.
Optimizing GLM 4.5 for lower VRAM consumption
- Model Variants: Choose GLM-4.5-Air (106 B total/12 B active parameters) when GPU budget is limited; it needs roughly a quarter of the hardware the full model does.
- When to Choose GLM-4.5-Air:
- Significantly Faster Generation:
- GLM-4.5-Air achieves an output rate of around 160 tokens per second, nearly twice as fast as the Full-size model (approximately 88 tokens/s). This makes Air ideal for latency-sensitive applications.
- Extremely Low First-Token Latency (TTFT):
- Air outputs its first token in about 0.58 seconds, compared to 0.68 seconds for Full-size. In some tests, Full-size latency can reach 22–23 seconds when including “thinking” time.
- Shorter End-to-End Response Time:
- Air delivers end-to-end responses (input processing, inference, and output) in about 16 seconds, while Full-size takes nearly 29 seconds, making Full-size less suitable for real-time interactions.
- Slightly Lower Scores on Complex Reasoning Tasks:
- On reasoning benchmarks such as MMLU-Pro, GPQA, and AIME, Air scores about 2–3% lower than Full-size, but still maintains industry-leading performance.
- Recommended for Most Use Cases:
- For the majority of text generation, summarization, basic reasoning, and code-assist tasks, the full-size model is not necessary; Air delivers more than enough performance and responsiveness.

- Layer Offloading: Offload select MoE experts or feed-forward layers to CPU memory.
- KV-Cache Quantization: Reduce cache precision to save VRAM at minor quality cost.
- Batch Size = 1: Limit to single-sample inference per GPU to minimize activation memory (a sketch combining these options follows below).
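To illustrate how these levers can be combined, here is a minimal sketch using vLLM's offline Python API. The specific values (tensor-parallel degree, offload budget, context length) are assumptions for an 8 × H100 FP8 setup, not an official reference configuration; tune them to your hardware and check your vLLM version's documentation for the options it supports.

from vllm import LLM, SamplingParams

# Minimal sketch (assumed settings, not an official config): shard GLM-4.5
# across 8 GPUs, quantize the KV cache, offload part of the weights to CPU
# RAM, and cap the context window to limit KV-cache growth.
llm = LLM(
    model="zai-org/GLM-4.5",     # assumed Hugging Face model ID
    tensor_parallel_size=8,      # shard weights across 8 GPUs over NVLink
    kv_cache_dtype="fp8",        # quantized KV cache to save VRAM
    cpu_offload_gb=16,           # offload up to 16 GiB of weights per GPU to CPU
    max_model_len=32768,         # shorter context -> smaller KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain Mixture-of-Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)

Batch size is handled implicitly here: sending one prompt at a time keeps activation memory to a minimum, which is what the "Batch Size = 1" recommendation above amounts to.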
Another Cost-Effective Option: API
Here’s a simplified comparison between deploying GLM 4.5 via an API and running it locally:
| Aspect | API Deployment | Local Deployment |
|---|---|---|
| Cost | Pay-per-use pricing; for example, on Novita AI input tokens cost $0.60 per million and output tokens $2.20 per million. | High initial investment in hardware (e.g., multiple NVIDIA H100/H200 GPUs); potentially lower costs over time for heavy usage. |
| Performance | Scalable with potential network latency; suitable for applications where slight delays are acceptable. | Lower latency and consistent performance; ideal for real-time applications requiring immediate responses. |
| Scalability | Easily scalable without managing infrastructure; provider handles scaling. | Scaling requires additional hardware and infrastructure management. |
| Data Privacy | Data is processed externally, which may raise privacy concerns, especially in regulated industries. | Data remains in-house, offering greater control and compliance with data protection regulations. |
| Operational Complexity | Minimal setup and maintenance; provider manages updates and infrastructure. | Requires technical expertise for setup, maintenance, and security; offers greater customization. |
| Customization | Limited to provider’s configurations; less flexibility for specific needs. | Full control over model customization, fine-tuning, and integration with existing systems. |
| Use Case Suitability | Ideal for applications with variable or low usage, rapid development needs, or limited technical resources. | Best for applications with high, consistent usage, stringent data privacy requirements, or need for extensive customization. |
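To make the cost row concrete, here is a small illustrative calculation comparing the per-token API pricing quoted in this article against a hypothetical always-on self-hosted cluster. The GPU rental rate and the monthly token volumes are assumptions for illustration only.

# Illustrative break-even sketch; only the per-token API prices come from this article.
API_INPUT_PER_M = 0.60    # USD per million input tokens (Novita AI)
API_OUTPUT_PER_M = 2.20   # USD per million output tokens (Novita AI)

GPU_HOURLY_RATE = 2.50    # assumed USD per H100-hour (varies widely by provider)
NUM_GPUS = 8              # FP8 inference configuration for the full model
HOURS_PER_MONTH = 730

monthly_input_m = 500     # assumed traffic: 500M input tokens per month
monthly_output_m = 200    # assumed traffic: 200M output tokens per month

api_cost = monthly_input_m * API_INPUT_PER_M + monthly_output_m * API_OUTPUT_PER_M
local_cost = GPU_HOURLY_RATE * NUM_GPUS * HOURS_PER_MONTH

print(f"API:   ${api_cost:,.0f}/month")    # $740 at this traffic level
print(f"Local: ${local_cost:,.0f}/month")  # $14,600 for an always-on 8 x H100 cluster

At this assumed traffic level the API is clearly cheaper; self-hosting starts to pay off only with much higher, sustained utilization, or when data-privacy requirements rule out an external provider.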
How to Access GLM 4.5 via Novita AI?
Novita AI provides a GLM-4.5 API with a 131K-token context window, priced at $0.60 per million input tokens and $2.20 per million output tokens, delivering strong support for maximizing GLM-4.5's code-agent potential.
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model
Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial
Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key
To authenticate with the API, you need an API key. Go to the "Settings" page and copy your API key as indicated in the image.

Step 5: Install the Client Library
Install the client library using the package manager for your programming language; for Python, the example below uses the openai package (pip install openai).
After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with the Novita AI LLM API. Here is an example of using the Chat Completions API in Python.
from openai import OpenAI

# Initialize the client against Novita AI's OpenAI-compatible endpoint.
# Replace the api_key value with the key copied from the Settings page.
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR_NOVITA_API_KEY>",
)

model = "zai-org/glm-4.5"
stream = True  # or False
max_tokens = 65536

# Sampling parameters
system_content = "Be a helpful assistant"
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        },
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    # Parameters outside the standard OpenAI schema are passed via extra_body.
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    # Print tokens as they arrive.
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
GLM-4.5 and its Air variant provide powerful solutions for agentic applications, with varying VRAM requirements to suit different deployment scenarios. Assessing your specific needs and resources will guide you in choosing between local deployment and API-based solutions.
Frequently Asked Questions
Who should use GLM-4.5?
GLM-4.5 is ideal for developers, researchers, and businesses seeking advanced AI agent capabilities, especially for coding, automation, and knowledge tasks.
What is GLM-4.5?
GLM-4.5 is an advanced large language model featuring a Mixture-of-Experts architecture, optimized for agentic applications requiring complex reasoning and tool integration.
Can I use GLM-4.5 without deploying it locally?
Yes, utilizing GLM-4.5 through an API is an alternative that reduces the need for significant hardware investment, though it may involve considerations regarding data privacy and network latency.
What is Novita AI?
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
Recommended Reading
- Novita Kimi K2 API Support Function Calling Now!
- Why Kimi K2 VRAM Requirements Are a Challenge for Everyone?
- Qwen 3 in RAG Pipelines: All-in-One LLM, Embedding, and Reranking Models