Maximizing GLM 4.5 VRAM for Advanced AI Tasks


Are you considering deploying GLM-4.5 locally but concerned about the substantial GPU resources required? The full GLM-4.5 model demands configurations such as 8 NVIDIA H100 GPUs or 4 H200 GPUs in FP8 precision (16 H100s or 8 H200s in BF16), while the more resource-efficient GLM-4.5-Air variant runs on 2 H100 GPUs or 1 H200 GPU in FP8. Doubling these configurations allows the models to serve their full context length of up to 128K tokens.

In this article, we will explore the VRAM requirements for GLM-4.5, discuss the feasibility of local deployment, and examine alternative methods to effectively utilize this powerful language model.

GLM 4.5 VRAM Requirements

GLM-4.5 is the latest advancement in the GLM family, featuring a sophisticated Mixture-of-Experts (MoE) architecture and optimization for agentic applications. The model comes in two variants: the flagship GLM-4.5 with 355 billion total parameters (32 billion active), and the efficient GLM-4.5-Air with 106 billion total parameters (12 billion active).

Key architectural innovations include a deeper model structure with reduced width and increased depth for enhanced reasoning, pre-training on a massive 15 trillion token corpus for comprehensive knowledge, and the open-source “slime” RL infrastructure designed for scalable, large-scale agentic reinforcement learning.

[Figure: GLM-4.5 benchmark results, from Z.AI]

How much VRAM does GLM 4.5 need for inference?

The models can run under the configurations in the table below:

Model | Precision | GPU Type and Count | Test Framework
GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang
GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang
GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang
GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang
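
As a sanity check on these GPU counts, a back-of-envelope estimate of weight memory alone (parameter count times bytes per parameter) is sketched below. It deliberately ignores KV cache, activations, and serving-framework overhead, which is why the table's recommended GPU counts are higher than the weight-only minimum:

# Back-of-envelope weight memory for GLM-4.5 (illustrative only; ignores
# KV cache, activations, and serving-framework overhead).
H100_VRAM_GB = 80

def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    # MoE caveat: all 355B / 106B parameters must sit in VRAM even though
    # only ~32B / ~12B are active per token; sparsity saves compute, not memory.
    return total_params_billion * bytes_per_param  # billions of params x bytes = GB

for name, params_b in [("GLM-4.5", 355), ("GLM-4.5-Air", 106)]:
    for precision, nbytes in [("BF16", 2), ("FP8", 1)]:
        gb = weight_memory_gb(params_b, nbytes)
        min_gpus = -(-gb // H100_VRAM_GB)  # ceiling division
        print(f"{name} {precision}: ~{gb:.0f} GB of weights, "
              f">= {min_gpus:.0f} H100s for weights alone")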

Under the configurations in the table below, the models can utilize their full 128K context length:

Model | Precision | GPU Type and Count | Test Framework
GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang
GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang
GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang
GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang
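
The jump in GPU counts for full 128K context is driven largely by the KV cache, which grows linearly with sequence length. Below is a generic KV-cache estimate; the layer count, KV-head count, and head dimension are illustrative placeholders rather than GLM-4.5's published architecture:

# Generic KV-cache size estimate. The architecture numbers below are
# hypothetical placeholders, not GLM-4.5's actual configuration.
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int) -> float:
    # 2x for keys and values, stored per layer and per KV head.
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elems * bytes_per_elem / 1e9

# Example: one 128K-token sequence, 90 layers, 8 KV heads of dim 128, BF16.
print(f"~{kv_cache_gb(128_000, 90, 8, 128, 2):.1f} GB of KV cache per sequence")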

How much VRAM does GLM 4.5 need for fine-tuning?

The code can run under the configurations in the table below using Llama Factory:

Model | GPU Type and Count | Strategy | Batch Size (per GPU)
GLM-4.5 | H100 x 16 | LoRA | 1
GLM-4.5-Air | H100 x 4 | LoRA | 1

The code can run under the configurations in the table below using Swift (the ms-swift training framework):

Model | GPU Type and Count | Strategy | Batch Size (per GPU)
GLM-4.5 | H20 (96 GiB) x 16 | LoRA | 1
GLM-4.5-Air | H20 (96 GiB) x 4 | LoRA | 1
GLM-4.5 | H20 (96 GiB) x 128 | SFT | 1
GLM-4.5-Air | H20 (96 GiB) x 32 | SFT | 1
GLM-4.5 | H20 (96 GiB) x 128 | RL | 1
GLM-4.5-Air | H20 (96 GiB) x 32 | RL | 1
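
For readers who prefer plain Hugging Face tooling over Llama Factory or Swift, the sketch below shows roughly what a LoRA setup looks like with the peft library. The target module names and hyperparameters are assumptions for illustration; consult the model card for the actual projection layer names:

# Minimal LoRA sketch with Hugging Face transformers + peft.
# Module names and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.5-Air",      # the Air variant needs far fewer GPUs
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shard across all visible GPUs
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension; small r keeps VRAM low
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # LoRA trains only a tiny fraction of weights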

GLM 4.5 VRAM usage with different batch sizes

Model | Precision | Batch Size (per GPU) | VRAM
GLM-4.5 | FP16 | 1 | 945.36 GB
GLM-4.5 | FP16 | 8 | 1128.49 GB
GLM-4.5 | FP16 | 16 | 1137.79 GB
GLM-4.5 | FP16 | 32 | 1756.38 GB
GLM-4.5-Air | FP16 | 1 | 288.68 GB
GLM-4.5-Air | FP16 | 8 | 343.58 GB
GLM-4.5-Air | FP16 | 16 | 406.33 GB
GLM-4.5-Air | FP16 | 32 | 531.83 GB
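
A useful way to read this table is to split the fixed cost (weights plus overhead at batch size 1) from the marginal cost of each additional concurrent sequence. The short script below derives that split from the table's own GLM-4.5-Air numbers:

# Split fixed vs. per-sequence VRAM cost using the GLM-4.5-Air rows above.
measurements = {1: 288.68, 8: 343.58, 16: 406.33, 32: 531.83}  # batch -> GB

base = measurements[1]  # weights plus overhead, single sequence
for bs in (8, 16, 32):
    per_seq = (measurements[bs] - base) / (bs - 1)
    print(f"batch {bs}: ~{per_seq:.1f} GB extra VRAM per added sequence")
# Growth is almost perfectly linear at ~7.8 GB per sequence, dominated
# by each sequence's KV cache and activations.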

What are the Hardware Requirements for GLM 4.5?

  • GPUs:
    • Inference: 8 × H100/4 × H200 (FP8) or 16 × H100/8 × H200 (BF16) for full model; half for Air variant.
    • Fine-Tuning: GPUs with ≥ 80 GB VRAM.
  • CPU & System:
    • ≥ 1 TB RAM to load models and manage offload buffers.
    • High-bandwidth interconnect (NVLink/HPC switch) for multi-GPU tensor parallelism.
  • Precision:
    • FP8 for minimal VRAM usage (requires GPUs with native FP8 support; see the capability check after this list).
    • BF16 as an alternative on GPUs without FP8.
  • Software:
    • vLLM or SGLang for inference; Llama Factory or Swift for fine-tuning; support for speculative decoding and CPU offload.
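
A minimal pre-flight check before attempting deployment is sketched below, assuming PyTorch is installed. Native FP8 support roughly corresponds to NVIDIA compute capability 8.9 (Ada) or 9.0 (Hopper) and newer:

# Report each visible GPU's VRAM and whether it likely has native FP8 support.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gib = props.total_memory / 1024**3
    cc = (props.major, props.minor)
    fp8_ok = cc >= (8, 9)  # Ada (8.9) and Hopper (9.0) expose FP8 tensor cores
    print(f"GPU {i}: {props.name}, {vram_gib:.0f} GiB, "
          f"compute capability {cc[0]}.{cc[1]}, native FP8 likely: {fp8_ok}")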

Optimizing GLM 4.5 for lower VRAM consumption

  • Model Variants: Choose GLM 4.5-Air (106 B total/12 B active) for 32–64 GB GPU setups.
  • When to Choose GLM-4.5-Air
    • Significantly Faster Generation:
      • GLM-4.5-Air achieves an output rate of around 160 tokens per second, nearly twice as fast as the Full-size model (approximately 88 tokens/s). This makes Air ideal for latency-sensitive applications.
    • Extremely Low First-Token Latency (TTFT):
      • Air outputs its first token in about 0.58 seconds, compared to 0.68 seconds for Full-size. In some tests, Full-size latency can reach 22–23 seconds when including “thinking” time.
    • Shorter End-to-End Response Time:
      • Air delivers end-to-end responses (input processing, inference, and output) in about 16 seconds, while Full-size takes nearly 29 seconds, making Full-size less suitable for real-time interactions.
    • Slightly Lower Scores on Complex Reasoning Tasks:
      • On reasoning benchmarks such as MMLU-Pro, GPQA, and AIME, Air scores about 2–3% lower than Full-size, but still maintains industry-leading performance.
    • Recommended for Most Use Cases:
      • For the majority of text generation, summarization, basic reasoning, and code-assist tasks, the Full-size model is not necessary—Air is sufficient for high performance and responsiveness.
[Figure: GLM-4.5 vs GLM-4.5-Air comparison]
  • Layer Offloading: Offload select MoE experts or feed-forward layers to CPU memory.
  • KV-Cache Quantization: Reduce cache precision to save VRAM at a minor quality cost (see the vLLM sketch after this list).
  • Batch Size = 1: Limit to single-sample inference per GPU to minimize activations.
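
A minimal vLLM sketch combining several of these levers follows. kv_cache_dtype and cpu_offload_gb are real vLLM engine arguments, but the specific values here are illustrative assumptions rather than tuned settings:

# Illustrative vLLM setup combining KV-cache quantization, CPU offload,
# and a reduced context window; values are assumptions, not tuned settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.5-Air",  # the lighter variant
    tensor_parallel_size=2,       # split weights across 2 GPUs
    kv_cache_dtype="fp8",         # quantize the KV cache to save VRAM
    cpu_offload_gb=32,            # spill up to 32 GB of weights to CPU RAM
    max_model_len=32_768,         # cap context well below the full 128K
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain Mixture-of-Experts in two sentences."], params)
print(outputs[0].outputs[0].text)

Capping max_model_len well below 128K is often the single biggest saving, since the KV cache scales linearly with it.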

Another Cost-Effective Option: API

Here’s a simplified comparison between deploying GLM 4.5 via an API and running it locally:

Aspect | API Deployment | Local Deployment
Cost | Pay-per-use pricing; for example, input tokens at $0.6 per million and output tokens at $2.2 per million on Novita AI. | High initial investment in hardware (e.g., NVIDIA H100 GPUs); potentially lower costs over time for heavy usage.
Performance | Scalable with potential network latency; suitable for applications where slight delays are acceptable. | Lower latency and consistent performance; ideal for real-time applications requiring immediate responses.
Scalability | Easily scalable without managing infrastructure; the provider handles scaling. | Scaling requires additional hardware and infrastructure management.
Data Privacy | Data is processed externally, which may raise privacy concerns, especially in regulated industries. | Data remains in-house, offering greater control and compliance with data protection regulations.
Operational Complexity | Minimal setup and maintenance; the provider manages updates and infrastructure. | Requires technical expertise for setup, maintenance, and security; offers greater customization.
Customization | Limited to the provider's configurations; less flexibility for specific needs. | Full control over model customization, fine-tuning, and integration with existing systems.
Use Case Suitability | Ideal for applications with variable or low usage, rapid development needs, or limited technical resources. | Best for applications with high, consistent usage, stringent data privacy requirements, or a need for extensive customization.
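
To make the cost row concrete, here is a rough break-even sketch; the hardware price, operating costs, and token volumes are assumptions for illustration only:

# Rough API-vs-local break-even estimate; every input here is an assumption.
api_per_m_input = 0.60     # USD per million input tokens (Novita AI pricing)
api_per_m_output = 2.20    # USD per million output tokens (Novita AI pricing)

monthly_input_m = 20_000   # assumed heavy usage: 20B input tokens per month
monthly_output_m = 4_000   # assumed 4B output tokens per month

api_monthly = monthly_input_m * api_per_m_input + monthly_output_m * api_per_m_output

hardware_cost = 250_000    # assumed multi-GPU server price, USD
local_monthly_opex = 3_000 # assumed power, hosting, and maintenance, USD

# Break-even is only meaningful when the API bill exceeds local running costs.
months = hardware_cost / (api_monthly - local_monthly_opex)
print(f"API: ${api_monthly:,.0f}/month; local hardware pays off in ~{months:.0f} months")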

How to Access GLM 4.5 via Novita AI?

Novita AI provides an OpenAI-compatible API with a 131K-token context window, priced at $0.6 per million input tokens and $2.2 per million output tokens, delivering strong support for maximizing GLM-4.5's code agent potential.


Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.


Step 2: Choose Your Model

Browse through the available options and select the model that suits your needs.


Step 3: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.


Step 4: Get Your API Key

To authenticate with the API, you will need an API key. Go to the Settings page and copy your API key as indicated in the image.


Step 5: Install the SDK

Install the OpenAI-compatible SDK using the package manager for your programming language; for Python, run pip install openai.

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with Novita AI's LLM API. Here is an example of using the chat completions API in Python.

from openai import OpenAI

client = OpenAI(
    # Novita AI exposes an OpenAI-compatible endpoint.
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR_NOVITA_API_KEY>",  # paste the key from the Settings page
)

model = "zai-org/glm-4.5"
stream = True  # or False
max_tokens = 65536
system_content = "Be a helpful assistant"
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
        # Sampling parameters outside the standard OpenAI schema go in extra_body.
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p
    }
)

if stream:
    # Streamed responses arrive as incremental chunks.
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
  
  

GLM-4.5 and its Air variant provide powerful solutions for agentic applications, with varying VRAM requirements to suit different deployment scenarios. Assessing your specific needs and resources will guide you in choosing between local deployment and API-based solutions.

Frequently Asked Questions

Who should use GLM 4.5?

GLM-4.5 is ideal for developers, researchers, and businesses seeking advanced AI agent capabilities, especially for coding, automation, and knowledge tasks.

What is GLM-4.5? 

GLM-4.5 is an advanced large language model featuring a Mixture-of-Experts architecture, optimized for agentic applications requiring complex reasoning and tool integration.

Can I deploy GLM-4.5 without extensive hardware?

Yes, utilizing GLM-4.5 through an API is an alternative that reduces the need for significant hardware investment, though it may involve considerations regarding data privacy and network latency.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
