Open-source models like Deepseek V3 and Qwen3 Coder are not just catching up to their closed-source counterparts; they are delivering state-of-the-art performance at a 6-10x cost advantage. But this incredible potential comes with a hidden challenge: open-source models are rarely hosted, while closed models almost always are.
For most teams, deploying these models in-house is challenging for three primary reasons.
- Costly: To run a model like Llama 3.3 70B, you’ll likely need two H100 GPUs, a massive upfront expense. To make things worse, this expensive hardware often sits idle during periods of low demand, leading to poor utilization and wasted investment.
- Complex: Deploying and maintaining LLMs requires deep expertise in inference optimization and GPU operations, and hiring an entire MLOps team is not logical for most companies.
- Cumbersome: New models are released often, but in-house setups are rigid, which makes it slow and difficult to test new models or scale to meet sudden demand swings.
At Novita AI, we believe you shouldn’t have to choose between the power of open-source and the polish of a managed service. Our platform is engineered to deliver the stability, performance, and developer experience you expect from a premium closed model with the cost benefits of the open ecosystem. We deliver production-grade hosting for open-source LLMs.
Here’s a peek behind the scenes of what we do to make this possible.
Behind the Scenes of Model Hosting
When you host a custom model on Novita AI or call our open-source LLM API, a lot is happening under the hood. Hosting models at scale involves a complex process of orchestration, optimization, and ongoing monitoring to ensure that every request is fast and reliable.
Model Storage and Hardware
We maintain a warm library of popular open-source models (e.g. Llama, Qwen, DeepSeek), which involves storing these multi-billion parameter models. Since running these LLMs requires specialized hardware, we partner with data centers around the world to ensure fast and reliable service for users in every location to manage:
- Servers powerful enough to handle inference workloads
- Networking to move requests and responses quickly
- Power to keep it all running 24/7
We absorb the hardware costs and provide:
- Warm Model Library: We maintain hundreds of warm-started models. This allows you to instantly test and validate the latest LLMs for your use case.
- Pay-As-You-Go Serverless Endpoints: You only pay for the tokens you use. This token-based pricing model is perfect for applications with variable demand, such as chatbots and text generation, ensuring you never pay for idle capacity.
- On-Demand Custom Deployments: When you need more control, you can rent powerful GPUs like the NVIDIA H100 for as little as $1.85 per hour. This allows you to scale your resources with your needs, transforming heavy capital expenditure into a predictable operational cost.
- Developer-Friendly Integration: We’ve prepared a unified API that abstracts away the underlying complexity. These APIs are designed to be compatible with popular frameworks like the OpenAI API, making it easy for you to switch providers: just change the base URL and key, and you’ve got access to every open model in our library. We also integrate seamlessly with frameworks like LangChain, LiteLLM, and LlamaIndex, so switching or experimenting with new models won’t break your existing workflows.
Inference Optimization
Raw model execution is only the beginning. To provide the best performance at the lowest cost, we use several techniques to optimize inference:
- Quantization: reducing the precision of model weights, making them smaller and faster to run while maintaining performance
- Batching: processing multiple user requests simultaneously to maximize GPU usage
- Load balancing: distributing requests across several servers so no single server is overloaded, maintaining low latency
We handle the underlying complexity to provide a polished, developer-friendly experience that makes open-source AI accessible to everyone.
- We provide built-in support for critical features like Function Calling, Structured Outputs, and Batch Inference. This eliminates the need for you to build these complex systems yourself, accelerating your time to market.
- Elastic Scaling for Any Workload: Our infrastructure is designed to be fully elastic. Serverless Endpoints auto-scale to handle high concurrency with a Time to First Token (TTFT) under 300ms. Custom and Enterprise deployments offer GPU auto-scaling to meet any demand while ensuring performance and data isolation.
For mission-critical applications, we offer a “Zero-Ops” solution. Submit your requirements (model name, I/O length, performance SLA), and our LLM Optimizer Engine will custom design the most cost-effective solution for you. Our expert team will also deploy and manage the model for you, backed by a 99.5% SLA, guaranteed performance, and direct technical support.
Self Hosting vs Using Hosted Models
Some developers prefer to host their own models for maximum control. If that’s you, we’re here to support: rent GPUs by the hour through Novita AI and tune your stack exactly how you like.
However, self-hosting comes with significant trade-offs: setup and maintenance require time and expertise, scaling can be tricky, and balancing cost and performance tradeoffs can be an ongoing challenge.
Using hosted open-source LLM APIs like Novita eliminates that overhead, giving you a production-ready solution with predictable performance and minimal operational burden. We’ve optimized Novita AI’s infrastructure to give you the best experience at the lowest cost. By running models at scale, we can offer lower prices than what an individual or small company can achieve by self-hosting. We charge by the number of tokens processed, so you only pay for what you use.
We designed three service tiers to provide a perfect fit for every stage of your AI journey.
| Serverless Endpoints | Custom Deployments | Enterprise Deployments | |
| Model Support | Up-to-date LLMs like Qwen3, DeepSeek, LLaMA3 | Hundreds of Warm-Started Models + Custom Model Upload | Hundreds of Warm-Started Models + Custom Model Upload |
| Pricing | Pay-As-You-Go Token-Based | On-Demand GPU/Hour | Performance-Based Token Pricing |
| Integration | Self-Service, One-Line Integration | Self-Service GPU Deployment, One-Line Integration | Expert Deployment & Enterprise Services |
| Elastic Scaling | Elastic Scaling Within Rate Limits | Dedicated Endpoints: Auto-Scaling GPUs Based on Usage | Performance-Based Elastic Scaling |
| Best Use Case | Fast access to new models without managing infrastructure | Need for greater model control and custom setups | Fully managed deployments with guaranteed performance |
Note: The maximum GPU for Dedicated Endpoints is 8. If you need more GPUs, Contact Sales for enterprise service.
Final Thoughts
Whether you’re running a fine-tuned model for a niche use case or experimenting with the latest open-source LLM, Novita AI gives you closed-model convenience at open-source prices. If you’re interested in a custom solution or want to talk through your setup, schedule a chat with our engineers here.
Acknowledgement: Special thanks to Charles, Novita’s LLM Project Manager, for his contributions and insights to this article.
