Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang


TL;DR

Novita AI has developed a suite of production-tested, high-impact optimizations for deploying GLM4-MoE models on SGLang. We introduce an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline, from kernel execution efficiency to cross-node data transfer scheduling. By integrating Shared Experts Fusion and Suffix Decoding, we observe substantial gains in key production metrics under agentic coding workloads, including:

  • up to 65% reduction in Time-to-First-Token (TTFT)
  • 22% improvement in Time-Per-Output-Token (TPOT)

All results were validated on H200 clusters under TP8 and FP8 configurations, providing a battle-tested blueprint for achieving both optimal throughput and low latency in demanding production environments.

How We Implemented Core Production Optimizations for GLM-MoE

1. Shared Experts Fusion

Figure: Shared Experts Fusion

Full credit for this optimization belongs to the original work on the DeepSeek models. As illustrated in the figure above, MoE models such as GLM-4.7 route every input token through a shared expert, while each token is also routed to its own set of top-k routed experts selected by the model's router. The outputs from all experts are then weighted and aggregated. GLM-4.7, for instance, employs 160 routed experts alongside a single shared expert, selecting the top 8 routed experts per token. In earlier implementations, these two components were handled separately. Since they share identical tensor shapes and computational procedures, it is natural to unify them by merging the shared expert into the routed MoE structure: the kernel selects the top 9 of 161 total experts, with the shared expert always occupying the 9th slot.
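Conceptually, this amounts to appending the shared expert as one extra entry in the routed expert table and forcing it into every token's selection. Below is a minimal PyTorch sketch of the routing side only; the function name and shapes are illustrative and not the actual SGLang kernel code.

import torch

def fused_topk_with_shared_expert(router_logits: torch.Tensor,
                                  num_routed_experts: int = 160,
                                  top_k: int = 8):
    """Select top-k routed experts per token, then append the shared expert
    as a fixed (k+1)-th selection so that a single grouped MoE kernel can
    process shared and routed experts together.

    router_logits: [num_tokens, num_routed_experts]
    Returns: topk_ids and topk_weights, each [num_tokens, top_k + 1].
    """
    num_tokens = router_logits.shape[0]

    # Standard routed-expert selection (weight normalization omitted for brevity).
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, k=top_k, dim=-1)

    # The shared expert is stored as expert index `num_routed_experts`
    # (the 161st expert) and is always selected, here with weight 1.0.
    shared_ids = torch.full((num_tokens, 1), num_routed_experts,
                            dtype=topk_ids.dtype, device=topk_ids.device)
    shared_weights = torch.ones((num_tokens, 1),
                                dtype=topk_weights.dtype,
                                device=topk_weights.device)

    topk_ids = torch.cat([topk_ids, shared_ids], dim=-1)        # top 9 of 161
    topk_weights = torch.cat([topk_weights, shared_weights], dim=-1)
    return topk_ids, topk_weights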

As documented in the PR, this optimization achieves gains of up to 23.7% in TTFT and 20.8% in ITL (inter-token latency). These gains are expected: under TP8 and FP8 configurations, where the intermediate size is only 192 (relatively small for H200 hardware), the fused operation substantially boosts Streaming Multiprocessor (SM) utilization and significantly reduces memory I/O overhead.

2. QK-Norm-RoPE Fusion

Figure: QK-Norm-RoPE Fusion

This migration builds upon the corresponding optimization for Qwen-MoE. The underlying idea is straightforward: since both operators (the head-wise QK-Norm and the rotary position embedding) work per attention head, it is natural to fuse them into a single kernel. Our contribution lies in adapting this fused kernel to the GLM4-MoE variant's specific case, where only half of the dimensions within each head are rotated.
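As a reference for what the kernel fuses, the unfused math is sketched below in PyTorch: a head-wise RMSNorm on q and k, followed by a rotary embedding applied to only the first half of each head. The learned norm scales and the exact rotary layout are simplified, so treat this as an illustration rather than the production kernel.

import torch

def qk_norm_partial_rope(q, k, cos, sin, eps: float = 1e-6):
    """Reference (unfused) math for the fused QK-Norm + RoPE kernel.

    q, k     : [num_tokens, num_heads, head_dim]
    cos, sin : [num_tokens, rotary_dim // 2] with rotary_dim = head_dim // 2,
               i.e. only half of each head is rotated (the GLM4-MoE case).
    """
    head_dim = q.shape[-1]
    rotary_dim = head_dim // 2

    def rms_norm(x):
        # Head-wise RMSNorm over the last dimension (learned scale omitted).
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    def partial_rope(x):
        rot, keep = x[..., :rotary_dim], x[..., rotary_dim:]
        x1, x2 = rot.chunk(2, dim=-1)             # each [tokens, heads, rotary_dim/2]
        c, s = cos[:, None, :], sin[:, None, :]   # broadcast over heads
        rotated = torch.cat([x1 * c - x2 * s, x2 * c + x1 * s], dim=-1)
        return torch.cat([rotated, keep], dim=-1) # un-rotated half passes through

    return partial_rope(rms_norm(q)), partial_rope(rms_norm(k))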

3. Async Transfer

PR: https://github.com/sgl-project/sglang/pull/14782

Figure: Async Transfer

In scenarios where PD disaggregation with overlapped scheduling is applied, throughput gains of about 10% come at the cost of a significant TTFT regression. We observed that in the current prefill implementation, the data transfer is delayed until after the kernel launches for the next batch. For a model like GLM-4.7, which consists of 92 layers, kernel launch without CUDA Graph can be time-consuming, often taking hundreds of milliseconds and sometimes more than one second.

To address this, our modification advances the transfer step: it is scheduled right after its corresponding GPU operations complete, and the transfer runs in a separate thread. By carefully handling potential data races, it can proceed without blocking the main thread.
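A hedged sketch of this scheduling change is shown below (not the actual SGLang code from the PR): a CUDA event is recorded right after a layer's GPU work is enqueued, and a background thread waits on that event and performs the transfer, so the main thread immediately continues launching kernels for the next batch. The send_kv callback is a hypothetical stand-in for the disaggregation backend's transfer API.

import queue
import threading

import torch

transfer_queue: "queue.Queue" = queue.Queue()

def transfer_worker(send_kv):
    while True:
        layer_id, kv, ready = transfer_queue.get()
        if kv is None:          # sentinel: shut down the worker
            break
        ready.synchronize()     # block this thread until the layer's GPU work is done
        send_kv(layer_id, kv)   # cross-node transfer happens off the launch thread

def schedule_transfer(layer_id: int, kv: torch.Tensor) -> None:
    # Called right after the layer's kernels are enqueued on the current stream:
    # record an event and hand off to the worker so the main thread keeps launching.
    ready = torch.cuda.Event()
    ready.record()
    transfer_queue.put((layer_id, kv, ready))

# Usage sketch: start one worker with the backend's send function, e.g.
#   threading.Thread(target=transfer_worker, args=(my_send_kv,), daemon=True).start()
# then call schedule_transfer(i, kv_i) per layer during prefill, and finally
# transfer_queue.put((None, None, None)) to stop the worker.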

The gain is substantial for models with many kernel launches. Under heavy workloads, this optimization can save up to 1 second of TTFT, as shown below.

Production Benchmark Results

After implementing the approaches described above, we observed significant performance improvements for GLM-MoE models, as demonstrated by the benchmark results below.

Benchmark configuration (an example benchmark invocation is sketched after this list)

  • Input length: 4096
  • Output length: 1000
  • Request rate: 14 req/s
  • Model: GLM-4.7 FP8 (TP8)
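A hedged example of driving this configuration with SGLang's serving benchmark is shown below; the flag names follow upstream sglang.bench_serving, and the --num-prompts value is an assumption rather than a number from our runs. The exact harness we used is published in the repository linked in the How to Reproduce section.

import subprocess

# Hedged sketch: exercise the configuration above with SGLang's serving
# benchmark using fixed-length random prompts.
cmd = [
    "python", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--dataset-name", "random",
    "--random-input-len", "4096",   # input length
    "--random-output-len", "1000",  # output length
    "--request-rate", "14",         # 14 req/s
    "--num-prompts", "512",         # assumed request count, not from the post
]
subprocess.run(cmd, check=True)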

Results

Figure: TTFT & E2E Latency
Figure: TPOT & Inter-Token Latency

These optimizations are not just experimental — they have already been deployed and validated in Novita.ai’s production inference service. If you are looking for a reliable, low-latency GLM-MoE backend for real-world workloads, you’re welcome to try it directly on novita.ai.

Suffix Decoding

Agentic coding scenarios (like Cursor and Claude Code) exhibit a high volume of reusable code patterns, allowing for targeted performance optimizations such as Suffix Decoding.

Background: The Inference Bottleneck in Agentic Coding

LLM Agents excel at code generation tasks, but latency remains a significant challenge. Traditional Speculative Decoding accelerates inference by predicting multiple tokens in advance, but common approaches require training additional draft models, introducing engineering complexity.

How Suffix Decoding Works

Figure: How Suffix Decoding Works

Suffix Decoding takes a fundamentally different, completely model-free approach (a minimal sketch follows the list below):

  • No dependency on additional model weights
  • Leverages patterns from previously generated output sequences to predict upcoming tokens
  • When the current request’s suffix matches a historical pattern, it continues along that historical sequence for speculation
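Here is the minimal sketch referenced above: a toy, dictionary-based version of the matching step. The production implementation maintains a suffix tree with frequency statistics over past outputs; the function and parameter names below are purely illustrative.

def build_continuations(history: list[list[int]], max_depth: int = 64,
                        draft_len: int = 8) -> dict:
    """Index previously generated outputs: map each suffix (up to max_depth
    tokens) to the tokens that followed it. A real implementation uses a
    suffix tree with frequency statistics; this dictionary is a toy version."""
    table: dict[tuple[int, ...], list[int]] = {}
    for seq in history:
        for i in range(1, len(seq)):
            for d in range(1, max_depth + 1):
                if i - d < 0:
                    break
                key = tuple(seq[i - d:i])
                # keep the first observed continuation for this suffix
                table.setdefault(key, seq[i:i + draft_len])
    return table

def propose_draft(table: dict, current_output: list[int],
                  max_depth: int = 64) -> list[int]:
    """Match the longest suffix of the current output against history and
    return the historical continuation as speculative draft tokens, which
    the target model then verifies in a single forward pass."""
    for d in range(min(max_depth, len(current_output)), 0, -1):
        key = tuple(current_output[-d:])
        if key in table:
            return table[key]
    return []  # no match: fall back to regular decoding

The max_depth knob here plays the same role as the --speculative-suffix-cache-max-depth flag listed in the How to Reproduce section.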

Data Validation: Output Pattern Repetition Analysis

By analyzing 22 Claude Code sessions (17,487 conversation turns), we discovered:

  • 39.3% output pattern repetition: High frequency of similar tool calls and response patterns
  • Highly structured agentic behaviors: Fixed phrases like “Let me…”, “Now let me…” appear frequently

To support further research, we have open-sourced the evaluation dataset on Hugging Face: https://huggingface.co/datasets/novita/agentic_code_dataset_22
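As a rough illustration of the second observation, the snippet below counts how often turns in this dataset begin with those fixed phrases. The split name and the output column are assumptions about the dataset schema, so check the dataset card for the actual field names.

from collections import Counter

from datasets import load_dataset  # pip install datasets

# Hedged sketch: measure how often agentic turns start with fixed phrases.
# The split name and the "output" column are assumed, not taken from the
# dataset card; adjust them to the real schema before running.
ds = load_dataset("novita/agentic_code_dataset_22", split="train")

phrases = ("Let me", "Now let me")
counts = Counter()
total = 0
for row in ds:
    text = str(row.get("output", "")).lstrip()
    total += 1
    for phrase in phrases:
        if text.startswith(phrase):
            counts[phrase] += 1
            break

for phrase, n in counts.most_common():
    print(f"{phrase!r}: {n}/{total} turns ({100 * n / max(total, 1):.1f}%)")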

Performance Comparison

Compared with the built-in MTP acceleration, Suffix Decoding further reduces TPOT by 22% (from 25.13 ms to 19.63 ms):

Metric       | MTP      | Suffix Decoding | Change
Mean TPOT    | 25.13 ms | 19.63 ms        | -21.90%
Median TPOT  | 25.95 ms | 20.05 ms        | -22.70%

Conclusion

The combination of these optimizations provides comprehensive performance improvements for SGLang deployments:

  1. Shared Experts Fusion addresses compute efficiency in MoE models
  2. QK-Norm-RoPE Fusion reduces kernel launch overhead
  3. Async Transfer optimizes data movement in disaggregated deployments
  4. Suffix Decoding leverages output pattern repetition for model-free speculative decoding in agentic coding workloads

Most components are already merged upstream or undergoing integration; feel free to check them out on the SGLang repo.

How to Reproduce

Only the key performance-relevant parameters are shown here.

Full launch scripts (baseline vs. optimized), the benchmark harness, and profiling traces are published in our GitHub repository: https://github.com/novitalabs/sglang/tree/glm_suffix. A sketch of how the flags below combine into a single launch command follows the flag lists.

  • Core Optimization Flags (SGLang Runtime)
--tp-size 8
--kv-cache-dtype fp8_e4m3
--attention-backend fa3
--chunked-prefill-size 16384
--enable-flashinfer-allreduce-fusion
--enable-fused-qk-norm-rope
--enable-shared-experts-fusion
--disaggregation-async-transfer
  • Speculative decoding configuration (agentic coding workload)
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
  • Suffix Decoding configuration (optional)
--speculative-algorithm SUFFIX
--speculative-suffix-cache-max-depth 64
--speculative-suffix-max-spec-factor 1.0
--speculative-suffix-min-token-prob 0.1
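For convenience, here is a hedged sketch of assembling the flags above into one server launch. The sglang.launch_server entry point and --model-path exist in upstream SGLang, while the optimization-specific flags assume the novitalabs fork linked above; the model path is a placeholder.

import subprocess

MODEL_PATH = "<path-or-hf-id-of-GLM-4.7-FP8>"  # placeholder, not a real ID

# Hedged sketch: combine the core and speculative flags listed above into one
# launch command. The optimization-specific flags assume the novitalabs
# glm_suffix fork; flag names may differ in upstream SGLang releases.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL_PATH,
    "--tp-size", "8",
    "--kv-cache-dtype", "fp8_e4m3",
    "--attention-backend", "fa3",
    "--chunked-prefill-size", "16384",
    "--enable-flashinfer-allreduce-fusion",
    "--enable-fused-qk-norm-rope",
    "--enable-shared-experts-fusion",
    "--disaggregation-async-transfer",
    # MTP-style speculative decoding for agentic coding workloads:
    "--speculative-algorithm", "NEXTN",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]
subprocess.run(cmd, check=True)

To benchmark Suffix Decoding instead of MTP, swap the four NEXTN-related flags for the SUFFIX configuration shown above.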

References

  1. SGLang PR #13873: Shared Experts Optimization
  2. Snowflake Engineering Blog: SuffixDecoding at Production Scale
  3. NeurIPS Paper: SuffixDecoding
  4. Arctic Inference Repository

Novita AI is a leading AI cloud platform that provides developers with easy-to-use APIs and affordable, reliable GPU infrastructure for building and scaling AI applications.

