Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang


TL;DR

Novita AI has developed a suite of production-tested, high-impact optimizations for deploying GLM4-MoE models on SGLang. We introduce an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline, from kernel execution efficiency to cross-node data transfer scheduling. By integrating Shared Experts Fusion and Suffix Decoding, we observe substantial gains in key production metrics under agentic coding workloads, including:

  • up to 65% reduction in Time-to-First-Token (TTFT)
  • 22% improvement in Time-Per-Output-Token (TPOT)

All results were validated on H200 clusters under TP8 and FP8 configurations, providing a battle-tested blueprint for achieving both optimal throughput and low latency in demanding production environments.

How We Implemented Core Production Optimizations for GLM-MoE

1. Shared Experts Fusion

Figure: Shared Experts Fusion

Full credit for this optimization belongs to the original work on the DeepSeek models. As illustrated in the figure above, MoE models such as GLM-4.7 route every input token through a shared expert, while each token is also routed to its own set of top-k routed experts selected by the model's router. The outputs from all experts are then weighted and aggregated. GLM-4.7, for instance, employs 160 routed experts alongside a single shared expert, selecting the top 8 routed experts per token. In earlier implementations, these two components were handled separately. Since they share identical tensor shapes and computational procedures, it is natural to unify them by merging the shared expert into the routed MoE structure: the kernel selects the top 9 of 161 total experts, with the shared expert always occupying the 9th slot.
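Conceptually, this amounts to appending the shared expert as one extra entry in the routed expert table and forcing it into every token's selection. Below is a minimal PyTorch sketch of the routing side only; the function name and shapes are illustrative and not the actual SGLang kernel code.

import torch

def fused_topk_with_shared_expert(router_logits: torch.Tensor,
                                  num_routed_experts: int = 160,
                                  top_k: int = 8):
    """Select top-k routed experts per token, then append the shared expert
    as a fixed (k+1)-th selection so that a single grouped MoE kernel can
    process shared and routed experts together.

    router_logits: [num_tokens, num_routed_experts]
    Returns: topk_ids and topk_weights, each [num_tokens, top_k + 1].
    """
    num_tokens = router_logits.shape[0]

    # Standard routed-expert selection (weight normalization omitted for brevity).
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, k=top_k, dim=-1)

    # The shared expert is stored as expert index `num_routed_experts`
    # (the 161st expert) and is always selected, here with weight 1.0.
    shared_ids = torch.full((num_tokens, 1), num_routed_experts,
                            dtype=topk_ids.dtype, device=topk_ids.device)
    shared_weights = torch.ones((num_tokens, 1),
                                dtype=topk_weights.dtype,
                                device=topk_weights.device)

    topk_ids = torch.cat([topk_ids, shared_ids], dim=-1)        # top 9 of 161
    topk_weights = torch.cat([topk_weights, shared_weights], dim=-1)
    return topk_ids, topk_weights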

As documented in the PR, this optimization achieves gains of up to 23.7% in TTFT and 20.8% in ITL (inter-token latency). These gains are expected: under TP8 and FP8 configurations, where the intermediate size is only 192 (relatively small for H200 hardware), the fused operation substantially boosts Streaming Multiprocessor (SM) utilization and significantly reduces memory I/O overhead.

2. QK-Norm-RoPE Fusion

Figure: QK-Norm-RoPE Fusion

This migration builds upon the corresponding optimization for Qwen-MoE. The underlying idea is straightforward: since both operators (the head-wise QK-Norm and the rotary position embedding) work per attention head, it is natural to fuse them into a single kernel. Our contribution lies in adapting this fused kernel to the GLM4-MoE variant's specific case, where only half of the dimensions within each head are rotated.
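As a reference for what the kernel fuses, the unfused math is sketched below in PyTorch: a head-wise RMSNorm on q and k, followed by a rotary embedding applied to only the first half of each head. The learned norm scales and the exact rotary layout are simplified, so treat this as an illustration rather than the production kernel.

import torch

def qk_norm_partial_rope(q, k, cos, sin, eps: float = 1e-6):
    """Reference (unfused) math for the fused QK-Norm + RoPE kernel.

    q, k     : [num_tokens, num_heads, head_dim]
    cos, sin : [num_tokens, rotary_dim // 2] with rotary_dim = head_dim // 2,
               i.e. only half of each head is rotated (the GLM4-MoE case).
    """
    head_dim = q.shape[-1]
    rotary_dim = head_dim // 2

    def rms_norm(x):
        # Head-wise RMSNorm over the last dimension (learned scale omitted).
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    def partial_rope(x):
        rot, keep = x[..., :rotary_dim], x[..., rotary_dim:]
        x1, x2 = rot.chunk(2, dim=-1)             # each [tokens, heads, rotary_dim/2]
        c, s = cos[:, None, :], sin[:, None, :]   # broadcast over heads
        rotated = torch.cat([x1 * c - x2 * s, x2 * c + x1 * s], dim=-1)
        return torch.cat([rotated, keep], dim=-1) # un-rotated half passes through

    return partial_rope(rms_norm(q)), partial_rope(rms_norm(k))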

3. Async Transfer

PR: https://github.com/sgl-project/sglang/pull/14782

Figure: Async Transfer

In scenarios where PD disaggregation with overlapped scheduling is applied, throughput gains of about 10% come at the cost of a significant TTFT regression. We observed that in the current prefill implementation, the data transfer is delayed until after the kernel launches for the next batch. For a model like GLM-4.7, which consists of 92 layers, kernel launch without CUDA Graph can be time-consuming, often taking hundreds of milliseconds and sometimes more than one second.

To address this, our modification advances the transfer step: it is scheduled right after its corresponding GPU operations complete, and the transfer runs in a separate thread. By carefully handling potential data races, it can proceed without blocking the main thread.
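A hedged sketch of this scheduling change is shown below (not the actual SGLang code from the PR): a CUDA event is recorded right after a layer's GPU work is enqueued, and a background thread waits on that event and performs the transfer, so the main thread immediately continues launching kernels for the next batch. The send_kv callback is a hypothetical stand-in for the disaggregation backend's transfer API.

import queue
import threading

import torch

transfer_queue: "queue.Queue" = queue.Queue()

def transfer_worker(send_kv):
    while True:
        layer_id, kv, ready = transfer_queue.get()
        if kv is None:          # sentinel: shut down the worker
            break
        ready.synchronize()     # block this thread until the layer's GPU work is done
        send_kv(layer_id, kv)   # cross-node transfer happens off the launch thread

def schedule_transfer(layer_id: int, kv: torch.Tensor) -> None:
    # Called right after the layer's kernels are enqueued on the current stream:
    # record an event and hand off to the worker so the main thread keeps launching.
    ready = torch.cuda.Event()
    ready.record()
    transfer_queue.put((layer_id, kv, ready))

# Usage sketch: start one worker with the backend's send function, e.g.
#   threading.Thread(target=transfer_worker, args=(my_send_kv,), daemon=True).start()
# then call schedule_transfer(i, kv_i) per layer during prefill, and finally
# transfer_queue.put((None, None, None)) to stop the worker.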

The gain is substantial for models with many kernel launches. Under heavy workloads, this optimization can save up to 1 second of TTFT, as shown below.

Production Benchmark Results

After implementing the approaches described above, we observed significant performance improvements for GLM-MoE models, as demonstrated by the benchmark results below.

Benchmark configuration (an example benchmark invocation is sketched after this list)

  • Input length: 4096
  • Output length: 1000
  • Request rate: 14 req/s
  • Model: GLM-4.7 FP8 (TP8)
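A hedged example of driving this configuration with SGLang's serving benchmark is shown below; the flag names follow upstream sglang.bench_serving, and the --num-prompts value is an assumption rather than a number from our runs. The exact harness we used is published in the repository linked in the How to Reproduce section.

import subprocess

# Hedged sketch: exercise the configuration above with SGLang's serving
# benchmark using fixed-length random prompts.
cmd = [
    "python", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--dataset-name", "random",
    "--random-input-len", "4096",   # input length
    "--random-output-len", "1000",  # output length
    "--request-rate", "14",         # 14 req/s
    "--num-prompts", "512",         # assumed request count, not from the post
]
subprocess.run(cmd, check=True)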

Results

Figure: TTFT & E2E Latency
Figure: TPOT & Inter-Token Latency

These optimizations are not just experimental — they have already been deployed and validated in Novita.ai’s production inference service. If you are looking for a reliable, low-latency GLM-MoE backend for real-world workloads, you’re welcome to try it directly on novita.ai.

Suffix Decoding

Agentic coding scenarios (like Cursor and Claude Code) exhibit a high volume of reusable code patterns, allowing for targeted performance optimizations such as Suffix Decoding.

Background: The Inference Bottleneck in Agentic Coding

LLM Agents excel at code generation tasks, but latency remains a significant challenge. Traditional Speculative Decoding accelerates inference by predicting multiple tokens in advance, but common approaches require training additional draft models, introducing engineering complexity.

How Suffix Decoding Works

Figure: How Suffix Decoding Works

Suffix Decoding takes a fundamentally different, completely model-free approach (a minimal sketch follows the list below):

  • No dependency on additional model weights
  • Leverages patterns from previously generated output sequences to predict upcoming tokens
  • When the current request’s suffix matches a historical pattern, it continues along that historical sequence for speculation
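Here is the minimal sketch referenced above: a toy, dictionary-based version of the matching step. The production implementation maintains a suffix tree with frequency statistics over past outputs; the function and parameter names below are purely illustrative.

def build_continuations(history: list[list[int]], max_depth: int = 64,
                        draft_len: int = 8) -> dict:
    """Index previously generated outputs: map each suffix (up to max_depth
    tokens) to the tokens that followed it. A real implementation uses a
    suffix tree with frequency statistics; this dictionary is a toy version."""
    table: dict[tuple[int, ...], list[int]] = {}
    for seq in history:
        for i in range(1, len(seq)):
            for d in range(1, max_depth + 1):
                if i - d < 0:
                    break
                key = tuple(seq[i - d:i])
                # keep the first observed continuation for this suffix
                table.setdefault(key, seq[i:i + draft_len])
    return table

def propose_draft(table: dict, current_output: list[int],
                  max_depth: int = 64) -> list[int]:
    """Match the longest suffix of the current output against history and
    return the historical continuation as speculative draft tokens, which
    the target model then verifies in a single forward pass."""
    for d in range(min(max_depth, len(current_output)), 0, -1):
        key = tuple(current_output[-d:])
        if key in table:
            return table[key]
    return []  # no match: fall back to regular decoding

The max_depth knob here plays the same role as the --speculative-suffix-cache-max-depth flag listed in the How to Reproduce section.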

Data Validation: Output Pattern Repetition Analysis

By analyzing 22 Claude Code sessions (17,487 conversation turns), we discovered:

  • 39.3% output pattern repetition: High frequency of similar tool calls and response patterns
  • Highly structured agentic behaviors: Fixed phrases like “Let me…”, “Now let me…” appear frequently

To support further research, we have open-sourced the evaluation dataset on Hugging Face: https://huggingface.co/datasets/novita/agentic_code_dataset_22
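As a rough illustration of the second observation, the snippet below counts how often turns in this dataset begin with those fixed phrases. The split name and the output column are assumptions about the dataset schema, so check the dataset card for the actual field names.

from collections import Counter

from datasets import load_dataset  # pip install datasets

# Hedged sketch: measure how often agentic turns start with fixed phrases.
# The split name and the "output" column are assumed, not taken from the
# dataset card; adjust them to the real schema before running.
ds = load_dataset("novita/agentic_code_dataset_22", split="train")

phrases = ("Let me", "Now let me")
counts = Counter()
total = 0
for row in ds:
    text = str(row.get("output", "")).lstrip()
    total += 1
    for phrase in phrases:
        if text.startswith(phrase):
            counts[phrase] += 1
            break

for phrase, n in counts.most_common():
    print(f"{phrase!r}: {n}/{total} turns ({100 * n / max(total, 1):.1f}%)")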

Performance Comparison

Compared with the built-in MTP acceleration, Suffix Decoding further reduces TPOT by 22% (from 25.13 ms to 19.63 ms):

Metric       | MTP      | Suffix Decoding | Change
Mean TPOT    | 25.13 ms | 19.63 ms        | -21.90%
Median TPOT  | 25.95 ms | 20.05 ms        | -22.70%

Conclusion

The combination of these optimizations provides comprehensive performance improvements for SGLang deployments:

  1. Shared Experts Fusion addresses compute efficiency in MoE models
  2. QK-Norm-RoPE Fusion reduces kernel launch overhead
  3. Async Transfer optimizes data movement in disaggregated deployments
  4. Suffix Decoding leverages output pattern repetition for model-free speculative decoding in agentic coding workloads

Most components are already merged upstream or undergoing integration; feel free to check them out on the SGLang repo.

How to Reproduce

Only the key performance-relevant parameters are shown here.

Full launch scripts (baseline vs. optimized), the benchmark harness, and profiling traces are published in our GitHub repository: https://github.com/novitalabs/sglang/tree/glm_suffix. A sketch of how the flags below combine into a single launch command follows the flag lists.

  • Core Optimization Flags (SGLang Runtime)
--tp-size 8
--kv-cache-dtype fp8_e4m3
--attention-backend fa3
--chunked-prefill-size 16384
--enable-flashinfer-allreduce-fusion
--enable-fused-qk-norm-rope
--enable-shared-experts-fusion
--disaggregation-async-transfer
  • Speculative decoding configuration (agentic coding workload)
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
  • Suffix Decoding configuration (optional)
--speculative-algorithm SUFFIX
--speculative-suffix-cache-max-depth 64
--speculative-suffix-max-spec-factor 1.0
--speculative-suffix-min-token-prob 0.1
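For convenience, here is a hedged sketch of assembling the flags above into one server launch. The sglang.launch_server entry point and --model-path exist in upstream SGLang, while the optimization-specific flags assume the novitalabs fork linked above; the model path is a placeholder.

import subprocess

MODEL_PATH = "<path-or-hf-id-of-GLM-4.7-FP8>"  # placeholder, not a real ID

# Hedged sketch: combine the core and speculative flags listed above into one
# launch command. The optimization-specific flags assume the novitalabs
# glm_suffix fork; flag names may differ in upstream SGLang releases.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL_PATH,
    "--tp-size", "8",
    "--kv-cache-dtype", "fp8_e4m3",
    "--attention-backend", "fa3",
    "--chunked-prefill-size", "16384",
    "--enable-flashinfer-allreduce-fusion",
    "--enable-fused-qk-norm-rope",
    "--enable-shared-experts-fusion",
    "--disaggregation-async-transfer",
    # MTP-style speculative decoding for agentic coding workloads:
    "--speculative-algorithm", "NEXTN",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]
subprocess.run(cmd, check=True)

To benchmark Suffix Decoding instead of MTP, swap the four NEXTN-related flags for the SUFFIX configuration shown above.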

References

  1. SGLang PR #13873: Shared Experts Optimization
  2. Snowflake Engineering Blog: SuffixDecoding at Production Scale
  3. NeurIPS Paper: SuffixDecoding
  4. Arctic Inference Repository

Novita AI is a leading AI cloud platform that provides developers with easy-to-use APIs and affordable, reliable GPU infrastructure for building and scaling AI applications.

