Tag: Research - Novita

Scaling Kimi Inference with DSpark Speculative Decoding in vLLM

See how DSpark improves Kimi-K2.6 and Kimi-K2.7-Code throughput as the speculative window scales from n=3 to n=7 in vLLM.

By Novita AI / July 10, 2026 / 5 minutes of reading

Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang

As the state-of-the-art GLM 4.7 model continues to lead in coding performance, Novita AI remains committed to delivering a reliable, efficient, and production-grade GLM service to

By Novita AI / January 21, 2026 / 5 minutes of reading

Revolutionizing Large Language Model Inference: Speculative Decoding and Low-Precision Quantization

Learn how speculative sampling and low-precision quantization cut costs and speed up inference, offering practical solutions for scalable AI deployment.

By Novita AI / December 18, 2024 / 9 minutes of reading

Dynamic KV Cache compression based on vLLM framework

Novita AI speeds up Llama-70B loading with KV sparsity, reducing memory, computation, and I/O overhead for faster inference and minimal accuracy loss.

By Novita AI / December 13, 2024 / 3 minutes of reading

How to Select the Best GPU for LLM Inference: Benchmarking Insights

Discover how to select cost-effective GPUs for large model inference, focusing on performance metrics and best practices to enhance efficiency.

By Novita AI / November 5, 2024 / 14 minutes of reading

How KV Sparsity Achieves 1.5x Acceleration for vLLM

Boost AI inference speed with KV sparsity. Understand how it works and optimize your models for real-world applications.

By Novita AI / October 25, 2024 / 13 minutes of reading

Dynamic allocation of GPU resources for Kubernetes workloads

Currently, scheduling GPU Pods in Kubernetes (k8s) relies on various extension mechanisms, including Device Plugin, Extended Resource, scheduler extender, scheduler fram

By Novita AI / October 24, 2024 / 4 minutes of reading

Dynamically Adding Port Mappings to Running Docker Containers

Port mapping is a crucial aspect of developing and deploying containerized applications. Typically, we establish a connection between a container's internal port and a port on the

By Novita AI / October 21, 2024 / 4 minutes of reading

GPU Container Core Binding Strategy Based on Affinity

Introduction to Optimizing CPU and GPU Performance. In high-performance computing and large-scale parallel task processing, GPUs have become indispensable accelerators. To fully uti

By Novita AI / August 26, 2024 / 4 minutes of reading

Will Speculative Decoding Harm LLM Inference Accuracy?

Mitchell Stern et al. (2018) introduced the prototype concept of speculative decoding. This method has since been further developed and refined by various approaches, including Looka

By Novita AI / August 26, 2024 / 3 minutes of reading

Quantization Methods for 100X Speedup in Large Language Model Inference

Discover how selecting the best data types and optimizing GPU hardware support unlocks new pathways for speeding up quantized inference.

By Novita AI / February 2, 2024 / 16 minutes of reading