PegaFlow External KV Cache for vLLM
2.15x faster startup is the headline number from the joint vLLM and Novita AI PegaFlow article, but the deeper point is architectural: production LLM serving needs KV cache ownership outside a single inference engine process. PegaFlow makes KV cache a standalone service so vLLM deployments can preserve, share, and scale cache across restarts, local instances, and remote nodes.
This post gives the Novita AI perspective on why we built PegaFlow, what the public vLLM integration shows, which claims are already source-backed, and how developers can inspect the open-source implementation today.
Explore the PegaFlow GitHub repository or read the joint vLLM x Novita AI article for the complete technical walkthrough.
What problem does PegaFlow solve for vLLM serving?
PegaFlow addresses the fragility of process-local KV cache in high-throughput LLM inference. When KV cache lives only inside one vLLM engine process, useful cache state can disappear during restarts, remain trapped inside one instance, or fail to move efficiently across nodes.
That becomes expensive when workloads reuse long prompts, route similar requests across replicas, or separate prefill and decode work. The cache may already contain work the system should not recompute, but the serving topology cannot always reuse it.
PegaFlow changes that boundary. It runs as an external KV cache service, implemented with a Rust core, and connects to vLLM through the external KV connector mechanism rather than a long-lived fork.
How does PegaFlow integrate with vLLM?
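PegaFlow integrates with vLLM through three pieces of the external KV connector mechanism: the kv_transfer_config setting, the PegaKVConnector connector it names, and kv_connector_module_path, which tells vLLM which external package to load that connector from. In the published article, the connector lets PegaFlow take over key KV cache operations at runtime while vLLM continues to handle scheduling, model execution, batching, and the OpenAI-compatible serving path.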
The public repository currently lists vLLM as ready in its framework integration table and shows this connector configuration in the quick start:
vllm serve Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector": "PegaKVConnector", "kv_role": "kv_both", "kv_connector_module_path": "pegaflow.connector"}'
The practical benefit is a cleaner ownership model: vLLM remains the serving engine, while PegaFlow owns external KV cache storage, transfer, sharing, and related cache observability.
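For deployments that generate launch commands from scripts, the same quick-start configuration can be assembled programmatically to avoid hand-quoting JSON in a shell. The following is a minimal sketch: the keys and values are copied from the quick start above, while the helper name build_serve_command is hypothetical. It only builds the command string and does not launch vLLM.

```python
import json
import shlex


def build_serve_command(model: str) -> str:
    """Assemble the quick-start `vllm serve` command line (hypothetical helper)."""
    # Keys and values taken verbatim from the README quick start.
    kv_transfer_config = {
        "kv_connector": "PegaKVConnector",
        "kv_role": "kv_both",
        "kv_connector_module_path": "pegaflow.connector",
    }
    args = [
        "vllm",
        "serve",
        model,
        "--kv-transfer-config",
        json.dumps(kv_transfer_config),
    ]
    # shlex.join quotes each argument so the JSON survives shell parsing.
    return shlex.join(args)


command = build_serve_command("Qwen/Qwen3-0.6B")
print(command)
```

Serializing with json.dumps and quoting with shlex.join keeps the JSON valid even if the configuration later grows more fields.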
What does the Novita AI architecture add?
The Novita AI design goal is to make KV cache behave like production serving infrastructure, not temporary process memory. That means PegaFlow is designed around a standalone service boundary, a Rust data path, shared cache pools, and multi-tier storage.
| Architecture choice | Why it matters for developers | Public source |
|---|---|---|
| Independent sidecar service | KV cache can survive inference engine restarts and scale separately from the vLLM process. | PegaFlow README |
| GIL-free Rust core | The cache hot path avoids Python overhead and keeps inference engine threads focused on serving. | PegaFlow README |
| Pinned host memory, RDMA remote memory, and SSD cache | The cache can span faster local memory, remote node memory, and larger SSD-backed capacity. | vLLM article |
| Prometheus metrics and OTLP export | Operators can observe cache behavior rather than treating KV reuse as a hidden engine detail. | PegaFlow README |
Last verified: 2026-05-20. These details come from the joint vLLM article and the public novitalabs/pegaflow README.
What performance results are public?
The public performance claims should be read as PegaFlow evaluation results from the joint vLLM article and repository benchmark, not generic guarantees for every workload. Cache hit rate, prompt reuse, model shape, hardware, network topology, and request routing all affect real deployments.
| Scenario | Reported result | Source |
|---|---|---|
| vLLM startup with a 500 GiB host KV pool already owned by PegaFlow | 2.15x faster startup | Joint vLLM article |
| Eight Qwen3-8B instances sharing one host cache | 56% higher throughput | Joint vLLM article |
| DeepSeek-V3.2 MLA with TP8 | 72% higher throughput | Joint vLLM article |
| Internal RDMA cluster remote reads | 194 GB/s average remote-read throughput | Joint vLLM article |
| H800 reference benchmark, Llama-3.1-8B, warm versus cold cache | TTFT mean reduced from 572.5 ms to 61.5 ms; P99 TTFT reduced from 1113.7 ms to 77.0 ms | PegaFlow README |
Last verified: 2026-05-20. The RDMA number is described in the source article as an internal cluster result, so it should stay framed as reported evaluation data rather than a universal throughput promise.
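To make the README's warm-versus-cold TTFT figures easier to compare with the throughput percentages, the sketch below converts them into speedup factors and percentage reductions. The input numbers are the ones in the table; the function names are illustrative, and the arithmetic says nothing about other hardware or workloads.

```python
def speedup(cold_ms: float, warm_ms: float) -> float:
    """How many times lower the warm-cache latency is than the cold one."""
    return cold_ms / warm_ms


def reduction_pct(cold_ms: float, warm_ms: float) -> float:
    """Percentage of cold-cache latency removed by a warm cache."""
    return 100.0 * (1.0 - warm_ms / cold_ms)


# (cold, warm) TTFT in milliseconds, as reported in the PegaFlow README.
readme_ttft_ms = {"mean": (572.5, 61.5), "p99": (1113.7, 77.0)}

summary = {
    name: (round(speedup(cold, warm), 2), round(reduction_pct(cold, warm), 1))
    for name, (cold, warm) in readme_ttft_ms.items()
}

for name, (factor, pct) in summary.items():
    print(f"{name}: {factor}x lower TTFT, {pct}% reduction")
```

By this arithmetic, mean TTFT is about 9.3x lower (an 89.3% reduction) and P99 TTFT about 14.5x lower (a 93.1% reduction) in that reference benchmark.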
When is external KV cache most useful?
External KV cache is most useful when prompt reuse is high enough that recomputation becomes visible in latency, throughput, or GPU utilization. It is less useful for workloads where nearly every request is unique and cache reuse is naturally low.
- Frequent restarts: keeping cache outside the engine can reduce restart penalties when cache state remains useful.
- Multi-instance serving: sharing host cache can reduce duplicate prefill work across local vLLM instances.
- Multi-node deployments: RDMA-backed remote cache can make useful KV blocks available beyond one machine.
- Prefill/decode disaggregation: external cache can give the serving system a clearer handoff point between stages.
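One rough way to reason about when reuse becomes visible is to blend cold and warm time-to-first-token by the fraction of requests that hit the cache. The sketch below assumes a simple linear blend, which is a planning simplification and not a PegaFlow model. The function name is hypothetical, and the endpoints are the README's H800 mean TTFT values used purely as illustrative inputs.

```python
def expected_ttft_ms(hit_rate: float, cold_ms: float, warm_ms: float) -> float:
    """Blend cold and warm TTFT by request-level cache hit rate.

    Assumption: each request is either fully warm or fully cold, so the
    expected latency is a linear mix of the two.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * warm_ms + (1.0 - hit_rate) * cold_ms


# Illustrative endpoints: README H800 mean TTFT, cold versus warm cache.
COLD_MS, WARM_MS = 572.5, 61.5

for hit_rate in (0.0, 0.1, 0.5, 0.9, 1.0):
    blended = expected_ttft_ms(hit_rate, COLD_MS, WARM_MS)
    print(f"hit rate {hit_rate:.0%}: expected mean TTFT {blended:.1f} ms")
```

Under this assumption a 10% hit rate barely moves the mean, while a 50% hit rate roughly halves it, which is why the workload's reuse profile matters more than the best-case benchmark.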
For Novita AI, this is part of a broader infrastructure principle: production AI systems need the serving engine, memory layer, routing layer, and observability layer to evolve independently when traffic patterns become complex.
How can developers inspect PegaFlow today?
Developers can inspect the public GitHub repository and install the published packages referenced by the README. The repository documents a CUDA 12 package, a CUDA 13 package, a vLLM connector example, server configuration, P2P RDMA setup, prefill/decode routing, metrics, and project goals.
uv pip install pegaflow-llm # CUDA 12
uv pip install pegaflow-llm-cu13 # CUDA 13
The simplest local server command in the README is:
pegaflow-server
For production evaluation, start with your own prompt reuse profile, target model, GPU topology, memory capacity, and RDMA or SSD assumptions. PegaFlow is infrastructure for cache reuse; the workload determines how much value there is to capture.
What should platform teams verify before adopting it?
Platform teams should validate PegaFlow against their own serving topology before treating public benchmark numbers as planning inputs. The right test is not only cold versus warm cache, but whether cache reuse appears in the traffic pattern that actually drives cost or latency.
- Measure prompt reuse and expected KV cache hit rate under real routing.
- Compare restart behavior with and without externally owned KV cache.
- Test single-node multi-instance sharing before expanding to RDMA.
- Verify observability: cache hits, misses, transfer latency, memory pressure, and SSD behavior.
- Confirm version compatibility with the vLLM connector path used in your deployment.
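The first checklist item can be approximated offline before any deployment. The sketch below estimates a block-level prefix hit rate from a prompt trace by chaining hashes over fixed-size token blocks. It rests on several loud assumptions: whitespace splitting stands in for a real tokenizer, the cache is treated as unbounded, and the block size and hashing scheme are illustrative rather than the ones vLLM or PegaFlow actually use. The function name is hypothetical.

```python
import hashlib
from typing import Iterable


def estimate_block_hit_rate(prompts: Iterable[str], block_size: int = 16) -> float:
    """Fraction of full prompt blocks whose whole prefix was seen earlier."""
    seen = set()
    hits = 0
    total = 0
    for prompt in prompts:
        tokens = prompt.split()  # assumption: whitespace split, not a real tokenizer
        prefix_hash = b""
        # Walk only full blocks; a trailing partial block is ignored.
        for start in range(0, len(tokens) - block_size + 1, block_size):
            block = " ".join(tokens[start:start + block_size]).encode("utf-8")
            # Chain the hash so a block matches only if its entire prefix matches.
            prefix_hash = hashlib.sha256(prefix_hash + block).digest()
            total += 1
            if prefix_hash in seen:
                hits += 1
            else:
                seen.add(prefix_hash)
    return hits / total if total else 0.0


# Tiny synthetic trace: two prompts share an 8-token system prefix.
system = " ".join(f"s{i}" for i in range(8))
trace = [system + " a b c d", system + " e f g h", "w x y z"]
rate = estimate_block_hit_rate(trace, block_size=4)
print(f"estimated block hit rate: {rate:.2%}")
```

Running this over a replay of production prompts, in the order the router would deliver them to a given cache, gives a first estimate of whether reuse is high enough to justify further testing.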
This is also why the open-source boundary matters. Developers can inspect the connector, server configuration, metrics, and benchmark setup rather than relying on a black-box cache service.
FAQ
What is PegaFlow?
PegaFlow is an open-source KV cache storage engine for LLM inference from Novita AI. It runs as an independent service and connects to vLLM through the external KV connector path.
Does PegaFlow require a vLLM fork?
No. The published vLLM article describes PegaFlow connecting through kv_transfer_config and PegaKVConnector, with external packages loaded through kv_connector_module_path.
What performance results are public?
The joint vLLM article reports 2.15x faster startup, 56% higher throughput in a shared-host-cache setup, 72% higher throughput for a DeepSeek-V3.2 MLA setup, and 194 GB/s average remote-read throughput in an internal RDMA cluster. The README also reports H800 TTFT reductions for a warm-cache reference benchmark.
Where can developers try PegaFlow?
Developers can review the public novitalabs/pegaflow repository, install pegaflow-llm for CUDA 12 or pegaflow-llm-cu13 for CUDA 13, and follow the repository quick start.
Conclusion
PegaFlow is Novita AI’s external KV cache work for production LLM inference with vLLM: a standalone cache service, a Rust data path, shared cache pools, and a connector boundary that avoids a vLLM fork. The key takeaway is simple: when KV cache becomes infrastructure rather than process-local state, serving teams get more control over restarts, sharing, scaling, and observability. Review the PegaFlow repository, compare the public results with your own workload, and use Novita AI’s broader developer infrastructure when you need model APIs, agent execution, or GPU workflows around that serving stack.