Deploy Custom Base Model with LLM Dedicated Endpoint: Flexible, Reliable, Scalable

LLM Dedicated Endpoint

Novita AI’s LLM Dedicated Endpoint is a newly launched service that empowers you to deploy your own custom or fine-tuned Hugging Face models with ease.

With dedicated H100 GPUs starting at $1.86/hr and H200 from $2.99/hr, Novita AI delivers highly competitive pricing—often more cost-effective than alternatives such as Together AI, Fireworks AI, and Friendli AI.

Enjoy flexible LoRA support, a 99.5% SLA, and scalable GPU options. Set up production-ready LLM endpoints in minutes and confidently manage your resources with transparent, predictable pricing.

What is LLM Dedicated Endpoint?

An LLM Dedicated Endpoint provides a private, cloud-based API for running large language models on infrastructure reserved solely for your use. This setup ensures your models operate with consistent performance, high reliability, and complete resource isolation—unlike shared or serverless alternatives.

With a dedicated endpoint, you can deploy both open-source and private models on Hugging Face, including your custom or fine-tuned variants. Sensitive data and intellectual property remain protected, as your models and traffic are never exposed to other users.

Why Choose LLM Dedicated Endpoint?

With Novita AI’s LLM Dedicated Endpoint, you get a robust and flexible environment for your AI workloads:

  • Custom Model Deployment: Easily serve any Hugging Face model, including private and fine-tuned versions, within an isolated, dedicated environment.
  • Flexible LoRA Adapter Management: Attach and switch between multiple LoRA adapters on a single endpoint. Experiment, iterate, and support diverse tasks without redeploying your base model.
  • Predictable Performance: Dedicated resources ensure consistent throughput and low latency, unaffected by other users. There are no hard rate limits; your endpoint’s capacity is determined by your chosen hardware and configuration.
  • Scalable Hardware: Scale from idle (0 replicas) to up to 10 replicas per endpoint, and choose the GPU type that fits your requirements. Each user can access up to 8 GPUs, with enterprise expansion available.
  • Transparent Pricing: H100 from $1.86/hr, H200 from $2.99/hr—pay only for what you use. Dedicated endpoints are often more cost-effective than serverless solutions under high or sustained usage.
  • User-Friendly Management: Intuitive web console for deployment and management, plus instant Playground testing for rapid validation.
  • Production-Ready Reliability: 99.5% uptime guarantee, fully managed by Novita AI for peace of mind.

How to Choose: Dedicated Endpoint vs. Serverless Endpoint

Selecting the right type of LLM inference endpoint depends on your use case, workload, and operational requirements. Here’s a quick guide to help you decide:

Choose LLM Serverless Endpoint if:

  • You want fast, flexible access to public LLMs without infrastructure management.
  • Your usage is low, variable, or for prototyping.
  • You want simple, pay-per-use pricing.

Choose LLM Dedicated Endpoint if:

  • You want to deploy any Hugging Face model (including private, fine-tuned, or gated).
  • You need to configure LoRA adapters and parameters flexibly.
  • You require dedicated hardware, stable high throughput, and production-grade reliability.
  • You want to optimize for the lowest GPU cost in the industry.
  • You need up to 8 GPUs per user (or more, via enterprise expansion).

If you require more resources, please contact our sales team for a custom enterprise solution.

| Aspect | LLM Serverless Endpoint | LLM Dedicated Endpoint (DE) |
|---|---|---|
| Billing Model | Pay-per-use (by token) | Pay-per-GPU per hour |
| Resource Type | Shared, serverless (multi-tenant) | Dedicated, user-controlled (single-tenant) |
| Performance Consistency | May fluctuate (shared load) | Predictable, not affected by other users |
| Rate Limits | Yes (TPM, RPM by user tier) | No hard rate limits; limited by user GPU quota |
| Model Selection | Public models only | Load custom base models from Hugging Face repositories (public, private, or gated); supports LoRA parameter configuration |
| Hardware Choice | Not selectable | Flexible: H100, H200, 4090, etc. |
| Deployment Region | Not user-selectable | User can choose region |
| SLA | No formal guarantee | 99.5% SLA |
| High Utilization Cost | More expensive at scale | Cheaper at high utilization |
| Security & Data Isolation | Shared environment | Full tenant isolation, private endpoints |
| Best For | Startups, prototyping, fluctuating usage | Enterprise, production, stable high throughput, custom base models |
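The "cheaper at high utilization" row can be made concrete with a back-of-the-envelope break-even calculation. Here is a minimal sketch in Python; the serverless per-million-token price used below is a hypothetical placeholder for illustration, not an actual published rate:

```python
def breakeven_tokens_per_hour(gpu_hourly_usd: float, serverless_usd_per_mtok: float) -> float:
    """Tokens/hour above which a dedicated GPU beats per-token billing."""
    return gpu_hourly_usd / serverless_usd_per_mtok * 1_000_000

# H100 at $1.86/hr vs a hypothetical serverless rate of $0.50 per million tokens
print(f"{breakeven_tokens_per_hour(1.86, 0.50):,.0f} tokens/hour")
```

Above roughly 3.7 million tokens per hour of sustained throughput (under that placeholder rate), the dedicated H100 becomes the cheaper option; below it, pay-per-token serverless wins.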

Dedicated Endpoint GPU Price Comparison

When choosing a provider, cost efficiency is crucial—especially for production-scale deployments. Novita AI offers the lowest hourly rates for dedicated H100 and H200 GPUs among leading providers:

| Provider | H100 ($/card/hr) | H200 ($/card/hr) |
|---|---|---|
| Novita AI | $1.86 | $2.99 |
| Fireworks AI | $5.80 | $9.99 |
| Friendli AI | $4.90 | $5.90 |
| Together AI | $3.36 | $4.99 |
| Deepinfra | $2.40 | $3.00 |

As shown above, Novita AI consistently offers the most competitive pricing for both H100 and H200 GPUs—up to 60% lower than other popular providers.

This means you can significantly reduce infrastructure costs for high-throughput or long-running LLM deployments by choosing Novita AI.
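To put the table's hourly rates into monthly terms, here is a quick sketch assuming a single H100 running 24×7 (roughly 720 hours per month):

```python
HOURS_PER_MONTH = 720  # one GPU running 24x7

# H100 hourly rates from the comparison table above ($/card/hr)
h100_rates = {
    "Novita AI": 1.86,
    "Deepinfra": 2.40,
    "Together AI": 3.36,
    "Friendli AI": 4.90,
    "Fireworks AI": 5.80,
}

for provider, rate in sorted(h100_rates.items(), key=lambda kv: kv[1]):
    print(f"{provider:<14} ${rate * HOURS_PER_MONTH:>8,.2f}/month")
```

At these rates, a single always-on H100 costs about $1,339/month on Novita AI versus about $4,176/month on the most expensive provider listed.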

How to Get Started with Novita AI LLM Dedicated Endpoints

1. Access the Console

Log in to the Novita AI console and navigate to the LLM Dedicated Endpoint page.

2. Create a New Endpoint

  • Click the + New Endpoint button in the upper right corner.
[Screenshot: Create a New Endpoint]

3. Configure Your Endpoint

Fill out the configuration form with the following options:

[Screenshot: Configure Your Endpoint]
  • Endpoint Name: Give your deployment a unique and descriptive name.
  • Base Model: Enter the Hugging Face repository name for your base model (only Hugging Face models are supported, including public, private, or gated).
  • LoRA Adapters (optional): Add one or more Hugging Face Model IDs to attach LoRA adapters to your base model.
  • Instance Type: Select the GPU hardware (e.g., H100, H200, RTX4090). Each user can use up to 8 GPUs across all endpoints.
  • Autoscaling Configuration:
    • Minimum Replicas: Set to 0 to allow the endpoint to sleep when idle (cost saving), or a higher value to always keep a minimum number of active replicas.
    • Maximum Replicas: Set the maximum number of replicas for scaling (up to 10).
    • Cooldown Period: Set the delay (in seconds) before scaling down replicas to avoid premature downscaling during brief traffic drops.
  • Engine Configuration:
    • Engine Type: Choose the inference engine (vLLM or SGLang).
    • Engine Version: Use the default (latest) or specify a version.
    • Context Length: Optionally set the maximum token context length; if omitted, it is derived from the model config.
    • Max Running Requests: Set the maximum number of sequences processed per iteration.
    • Additional Arguments: Add any extra engine parameters for advanced customization.

When you’re done, click Create to deploy your endpoint.
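As an illustration, the form above boils down to a configuration like the following. The field names and example values here are descriptive labels summarizing the console form, not an actual API schema, and the model and adapter IDs are hypothetical placeholders:

```python
endpoint_config = {
    "endpoint_name": "my-finetuned-llm",
    # Any Hugging Face repository: public, private, or gated
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    # Optional: one or more LoRA adapter Model IDs (hypothetical example)
    "lora_adapters": ["my-org/my-lora-adapter"],
    "instance_type": "H100",  # e.g. H100, H200, RTX4090
    "autoscaling": {
        "min_replicas": 0,        # 0 lets the endpoint sleep when idle (cost saving)
        "max_replicas": 10,       # upper bound for scale-out (up to 10)
        "cooldown_seconds": 300,  # delay before scaling down after a traffic drop
    },
    "engine": {
        "type": "vLLM",            # or "SGLang"
        "version": None,           # None -> default (latest)
        "context_length": None,    # None -> derived from the model config
        "max_running_requests": 256,
    },
}
```

Keeping minimum replicas at 0 minimizes cost for bursty workloads, while setting it to 1 or more avoids cold starts at the price of always-on GPU time.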

4. Endpoint Deployment Status

[Screenshot: Endpoint Deployment Status]

After creation, your endpoint will transition through several statuses:

  • Sleeping: The endpoint is idle, consuming no compute resources (if minimum replicas is set to 0).
  • Pending: The deployment is initializing.
  • Rolling: The model and infrastructure are being set up.
  • Running: The endpoint is active and ready to serve requests.

You can monitor this status on the Endpoints page in the console.

5. Test Your Endpoint in Playground

[Screenshot: Dedicated Endpoint Playground]
  • Once deployment is complete and the status is Running, click your endpoint and open the Playground tab.
  • In the Playground, you can:
    • Send test prompts to your base model and any attached LoRA adapters.
    • Instantly compare the output of different adapters versus the base model.

6. Next Steps

  • Multi-LoRA Endpoints: Deploy multiple LoRA adapters on a single endpoint for flexible model switching.
  • API Integration: Use the provided API endpoints to send requests and integrate your model into your own applications.
  • Optimize and Scale: Adjust autoscaling, engine configuration, and GPU quota as your needs grow.
  • Need More Resources? Contact our sales team for an enterprise solution if you need more than 8 GPUs or require enterprise-level features.

Code Examples (For Python users)

from openai import OpenAI

# Point the OpenAI-compatible client at your dedicated endpoint
client = OpenAI(
    base_url="https://api.novita.ai/dedicated/v1/openai",
    api_key="<Your API Key>",  # replace with your Novita AI API key
)

model = "deepseek-ai/DeepSeek-R1-0528"  # the base model deployed on your endpoint
stream = True  # set to False for a single, non-streamed response
max_tokens = 512

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a professional AI helper.",
        },
        {
            "role": "user",
            "content": "Where can the example of GPU provided by novita ai be adapted?",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
)

if stream:
    # Streamed responses arrive as incremental chunks
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)

Conclusion

Novita AI’s new LLM Dedicated Endpoint empowers you to deploy and scale custom Hugging Face models with confidence. Enjoy flexible LoRA adapter integration, straightforward autoscaling, competitive transparent pricing, and the assurance of a 99.5% SLA. Whether you’re launching your first fine-tuned model or managing production workloads, Novita AI makes it simple to go from prototype to production—quickly, securely, and efficiently.

Ready to experience seamless LLM deployment? Sign up now or contact our sales team for an enterprise demo and tailored plan.

Frequently Asked Questions

What models can I deploy on a Dedicated Endpoint?

You can deploy any model from Hugging Face, including public, private, fine-tuned, or proprietary models. Both base models and models with custom or LoRA adapters are supported.

How is a Dedicated Endpoint different from a Serverless Endpoint?

A Dedicated Endpoint provides you with reserved, isolated hardware for consistent performance, advanced customization, and higher throughput. In contrast, Serverless Endpoints run on shared infrastructure, are best for low or variable usage, and are ideal for rapid prototyping without hardware management.

Can I scale my Dedicated Endpoint as my workload grows?

Yes. Dedicated Endpoints support autoscaling based on real-time demand. You can start with one GPU and scale up to 8 GPUs per user (with enterprise options for more), ensuring your applications remain responsive even during peak traffic.

How do I monitor and manage my Dedicated Endpoint?

Each Dedicated Endpoint comes with detailed metrics and logs. You can track performance, monitor usage, and troubleshoot issues through the web console or API, making management and optimization straightforward.

What are the pricing options and how do I control costs?

Pricing is transparent and usage-based, starting from $1.86/hr for H100 GPUs and $2.99/hr for H200 GPUs. You only pay for what you use. Autoscaling and flexible management help you optimize utilization and keep costs predictable, especially for production workloads.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

