TL;DR
Novita AI has developed a suite of production-tested, high-impact optimizations for deploying GLM4-MoE models on SGLang. We introduce an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline, from kernel execution efficiency to cross-node data transfer scheduling. By integrating Shared Experts Fusion and Suffix Decoding, we observe substantial gains in key production metrics under agentic coding workloads, including:
- up to 65% reduction in Time-to-First-Token (TTFT)
- 22% improvement in Time-Per-Output-Token (TPOT)
All results were validated on H200 clusters under TP8 and FP8 configurations, providing a battle-tested blueprint for achieving both optimal throughput and low latency in demanding production environments.
How We Implemented Core Production Optimizations for GLM-MoE
1. Shared Experts Fusion

Full credit for this optimization belongs to the original work on the DeepSeek models. As illustrated in the figure above, MoE models such as GLM-4.7 route every input token through a shared expert, while each token is also routed to its own set of top-k routed experts selected by the model's router. The outputs from all experts are then weighted and aggregated. GLM-4.7, for instance, employs 160 routed experts alongside a single shared expert, selecting the top 8 routed experts per token. In earlier implementations, these two components were handled separately. Since they share identical tensor shapes and computational procedures, it is natural to unify them by merging the shared expert into the routed MoE structure: each token selects the top 9 out of 161 experts, with the shared expert consistently occupying the 9th slot.
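For illustration, the routing side of this fusion can be sketched as follows. This is a simplified reconstruction rather than the actual SGLang kernel, and it assumes the shared expert's weights have already been concatenated after the 160 routed experts so that it can be addressed as expert index 160 with a fixed weight of 1.0:

```python
import torch

# Illustrative GLM-4.7-style MoE dimensions (assumptions for this sketch).
NUM_ROUTED = 160          # routed experts
TOP_K = 8                 # routed experts selected per token
SHARED_ID = NUM_ROUTED    # shared expert becomes the 161st expert after fusion

def fused_topk_with_shared(router_logits: torch.Tensor):
    """Select the top-8 routed experts, then append the shared expert as a
    fixed 9th slot so a single grouped MoE kernel can process all of them.

    router_logits: [num_tokens, NUM_ROUTED]
    returns: (topk_ids, topk_weights), each of shape [num_tokens, TOP_K + 1]
    """
    # Softmax scoring shown for simplicity; the real router scoring and
    # weight normalization follow the model's configuration.
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, TOP_K, dim=-1)

    num_tokens = router_logits.shape[0]
    shared_ids = torch.full((num_tokens, 1), SHARED_ID,
                            dtype=topk_ids.dtype, device=topk_ids.device)
    shared_weights = torch.ones((num_tokens, 1),
                                dtype=topk_weights.dtype,
                                device=topk_weights.device)

    # The shared expert is always selected and carries a fixed weight of 1.0.
    topk_ids = torch.cat([topk_ids, shared_ids], dim=-1)
    topk_weights = torch.cat([topk_weights, shared_weights], dim=-1)
    return topk_ids, topk_weights
```

With the shared expert folded in, one grouped MoE computation handles all nine experts per token instead of running a separate dense MLP for the shared expert.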
As documented in SGLang PR #13873, this optimization achieves performance gains of up to 23.7% in TTFT and 20.8% in Inter-Token Latency (ITL). These gains are expected: under TP8 and FP8 configurations, the intermediate size is only 192, which is relatively small for H200 hardware, so the fused operation substantially boosts Streaming Multiprocessor (SM) utilization and significantly reduces memory I/O overhead.
2. QK-Norm + RoPE Fusion

This migration builds upon the corresponding optimization for Qwen-MoE. The underlying idea is straightforward: since QK-Norm and RoPE both perform head-wise computations, it is natural to fuse them into a single kernel. Our contribution lies in adapting the fused kernel to the GLM4-MoE variant's specific case, where only half of the dimensions within each head are rotated.
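As a rough reference for what the fused kernel computes (not the kernel itself), the unfused math looks like the sketch below; the NeoX-style rotate-half layout and the tensor shapes are assumptions made for illustration:

```python
import torch

def qknorm_partial_rope_reference(q, k, q_norm_w, k_norm_w, cos, sin,
                                  rotary_dim, eps=1e-6):
    """Unfused reference of per-head RMSNorm followed by partial RoPE.

    q, k:                [num_tokens, num_heads, head_dim]
    q_norm_w, k_norm_w:  [head_dim] RMSNorm weights applied per head
    cos, sin:            [num_tokens, rotary_dim // 2]
    rotary_dim:          dims rotated per head (half of head_dim for GLM4-MoE)
    """
    def rms_norm(x, w):
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(var + eps) * w

    def partial_rope(x):
        # Only the first `rotary_dim` dims of each head are rotated;
        # the remaining dims pass through unchanged.
        rot, passthrough = x[..., :rotary_dim], x[..., rotary_dim:]
        x1, x2 = rot.chunk(2, dim=-1)
        c, s = cos.unsqueeze(1), sin.unsqueeze(1)   # broadcast over heads
        rotated = torch.cat([x1 * c - x2 * s, x2 * c + x1 * s], dim=-1)
        return torch.cat([rotated, passthrough], dim=-1)

    return partial_rope(rms_norm(q, q_norm_w)), partial_rope(rms_norm(k, k_norm_w))
```

Fusing the two steps saves kernel launches and avoids a round trip through global memory between the norm and the rotation.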
3. Async Transfer
https://github.com/sgl-project/sglang/pull/14782

In scenarios where PD disaggregation with an overlapping schedule is applied, throughput gains about 10%, but TTFT degrades significantly. We observed that in the current prefill implementation, the KV data transfer is deferred until after the kernel launches for the next batch. For a model like GLM-4.7, which consists of 92 layers, kernel launch without CUDA Graph can be time-consuming, often taking hundreds of milliseconds or even more than a second.
To address this, our modification moves the transfer step earlier, scheduling it right after its corresponding GPU operations complete. The transfer is also placed on a separate thread; by carefully handling the shared data structures that could otherwise race, it proceeds without blocking the main thread.
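The idea can be sketched with a small helper that hands each batch's KV transfer to a background thread as soon as the corresponding GPU work has been enqueued; the class and the `send_fn` callable below are hypothetical stand-ins for SGLang's actual disaggregation transfer path:

```python
import queue
import threading
import torch

class AsyncKVTransfer:
    """Minimal sketch: offload the KV-cache transfer to a background thread.

    `send_fn` is a hypothetical callable standing in for the real cross-node
    transfer (e.g. an RDMA/NCCL send in the production system).
    """

    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.queue: queue.Queue = queue.Queue()
        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()

    def _worker(self):
        while True:
            item = self.queue.get()
            if item is None:              # shutdown sentinel
                return
            done_event, kv_handle = item
            done_event.synchronize()      # wait on the GPU only in this thread
            self.send_fn(kv_handle)

    def submit(self, kv_handle):
        # Called right after a batch's prefill kernels are enqueued: the
        # recorded event marks when the KV data will be ready to send.
        event = torch.cuda.Event()
        event.record()
        self.queue.put((event, kv_handle))

    def shutdown(self):
        self.queue.put(None)
        self.thread.join()
```

The scheduler calls `submit()` immediately after enqueuing a batch's GPU work, so the CPU is free to start launching kernels for the next batch while the event wait and the transfer proceed off the critical path.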
The gain is substantial for models with many kernel launches: under heavy workloads, this optimization saves up to 1 second of TTFT, as shown below.

Production Benchmark Results
After implementing the approaches described above, we observed significant performance improvements for GLM-MOE models, as clearly demonstrated by the benchmark results below.
Benchmark configuration
- Input length: 4096
- Output length: 1000
- Request rate: 14 req/s
- Model: GLM-4.7 FP8 (TP8)
Results


These optimizations are not just experimental — they have already been deployed and validated in Novita.ai’s production inference service. If you are looking for a reliable, low-latency GLM-MoE backend for real-world workloads, you’re welcome to try it directly on novita.ai.
Suffix Decoding
Agentic coding scenarios (like Cursor and Claude Code) exhibit a high volume of reusable code patterns, allowing for targeted performance optimizations such as Suffix Decoding.
Background: The Inference Bottleneck in Agentic Coding
LLM Agents excel at code generation tasks, but latency remains a significant challenge. Traditional Speculative Decoding accelerates inference by predicting multiple tokens in advance, but common approaches require training additional draft models, introducing engineering complexity.
How Suffix Decoding Works

Suffix Decoding takes a fundamentally different approach—it is completely model-free:
- No dependency on additional model weights
- Leverages patterns from previously generated output sequences to predict upcoming tokens
- When the current request’s suffix matches a historical pattern, it continues along that historical sequence for speculation
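As a rough illustration of the matching logic only, the sketch below uses a flat n-gram index; the production implementation maintains a suffix tree with frequency statistics, controlled by the suffix-decoding flags listed in the reproduction section, and the class here is a simplified, hypothetical stand-in:

```python
from collections import defaultdict

class SuffixSpeculator:
    """Model-free draft-token proposer based on suffix matching (sketch)."""

    def __init__(self, max_depth: int = 64, max_draft: int = 8):
        self.max_depth = max_depth          # longest suffix we try to match
        self.max_draft = max_draft          # draft tokens proposed per step
        # Maps a tuple of context tokens -> continuations seen after it.
        self.index = defaultdict(list)

    def add_history(self, tokens):
        """Index a previously generated output sequence."""
        for i in range(1, len(tokens)):
            for depth in range(1, min(self.max_depth, i) + 1):
                ctx = tuple(tokens[i - depth:i])
                self.index[ctx].append(tokens[i:i + self.max_draft])

    def propose(self, current_output):
        """Propose draft tokens by matching the longest suffix of the
        current output against previously seen sequences."""
        for depth in range(min(self.max_depth, len(current_output)), 0, -1):
            ctx = tuple(current_output[-depth:])
            if ctx in self.index:
                return self.index[ctx][0]   # longest match wins in this sketch
        return []                           # no match: fall back to normal decoding
```

Proposed draft tokens still go through the standard speculative-decoding verification step, so a wrong match only costs the wasted draft and never affects output correctness.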
Data Validation: Output Pattern Repetition Analysis
By analyzing 22 Claude Code sessions (17,487 conversation turns), we discovered:
- 39.3% output pattern repetition: High frequency of similar tool calls and response patterns
- Highly structured agentic behaviors: Fixed phrases like “Let me…”, “Now let me…” appear frequently
To support further research, we have open-sourced the evaluation dataset on Hugging Face: https://huggingface.co/datasets/novita/agentic_code_dataset_22
Performance Comparison
Compared with the built-in MTP acceleration, Suffix Decoding further reduces TPOT by 22% (from 25.13 ms to 19.63 ms):
| Metric | MTP | Suffix Decoding | Change |
| --- | --- | --- | --- |
| Mean TPOT | 25.13 ms | 19.63 ms | -21.90% |
| Median TPOT | 25.95 ms | 20.05 ms | -22.70% |
Conclusion
The combination of these optimizations provides comprehensive performance improvements for SGLang deployments:
- Shared Experts Fusion addresses compute efficiency in MoE models
- QK-Norm-RoPE Fusion reduces kernel launch overhead
- Async Transfer optimizes data movement in disaggregated deployments
- Suffix Decoding leverages output-pattern repetition for model-free speculative decoding in agentic coding workloads
Most components are already merged upstream or undergoing integration; feel free to check them out on the SGLang repo.
How to Reproduce
Only the key performance-relevant parameters are shown here.
Full launch scripts (baseline vs optimized), benchmark harness, and profiling traces are published on our GitHub: https://github.com/novitalabs/sglang/tree/glm_suffix.
- Core Optimization Flags (SGLang Runtime)
--tp-size 8 --kv-cache-dtype fp8_e4m3 --attention-backend fa3 --chunked-prefill-size 16384 --enable-flashinfer-allreduce-fusion --enable-fused-qk-norm-rope --enable-shared-experts-fusion --disaggregation-async-transfer
- Speculative decoding configuration (agentic coding workload)
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
- Suffix Decoding configuration (optional)
--speculative-algorithm SUFFIX --speculative-suffix-cache-max-depth 64 --speculative-suffix-max-spec-factor 1.0 --speculative-suffix-min-token-prob 0.1
References
- SGLang PR #13873: Shared Experts Optimization
- Snowflake Engineering Blog: SuffixDecoding at Production Scale
- NeurIPS Paper: SuffixDecoding
- Arctic Inference Repository
Novita AI is a leading AI cloud platform that provides developers with easy-to-use APIs and affordable, reliable GPU infrastructure for building and scaling AI applications.