R&D Talk

How to Select the Best GPU for LLM Inference: Benchmarking Insights

Key Highlights

* High Inference Costs: Large-scale model inference remains expensive, limiting scalability despite decreasing overall costs.
* GPU Selection Challenges: The variety of available GPUs complicates the selection process, often leading to suboptimal choices based on superficial metrics.
* Objective Evaluation Framework: A standardized evaluation method helps identify cost-effective GPU solutions tailored