Developers exploring powerful open-weight language models face a common question: how do I actually start using this model? Qwen3.5-397B-A17B offers three distinct access paths: instant web chat for testing, managed APIs for production applications, and self-hosted deployment for full control. Each method suits different scenarios — from quick prototyping to enterprise-scale inference.
This guide walks through all access methods with setup instructions, real pricing data, and hardware requirements. You’ll learn which path fits your use case and how to get started in minutes.
What is Qwen3.5-397B-A17B?
Qwen3.5-397B-A17B is Alibaba Cloud’s flagship open-weight Mixture-of-Experts (MoE) language model with 397 billion total parameters and 17 billion active parameters per token. The model handles 262,144 tokens of context (a 256K context window) and supports native multimodal inputs including text and images. According to Artificial Analysis benchmarks, Qwen3.5-397B-A17B achieves a GDPval-AA ELO score of 1,221, a 361-point increase over the previous Qwen3 235B model (860). The model demonstrates particular strength in coding, reasoning, and agent tasks while maintaining cost efficiency through its MoE architecture.

Qwen3.5-397B-A17B Benchmark Overview
| Category | Benchmark | Score | Leading Model |
|---|---|---|---|
| Instruction Following | IFBench | 76.5 | Qwen3.5 |
| Complex Tasks | MultiChallenge | 67.6 | Qwen3.5 |
| Agent / Browsing | BrowseComp | 78.6 | Qwen3.5 |
| Scientific Reasoning | GPQA Diamond | 88.4 | Qwen3.5 (open models) |
| Knowledge | MMLU-Pro | 87.8 | Gemini |
| Knowledge | MMLU-Redux | 94.9 | Gemini |
| Knowledge | C-Eval | 93.0 | Competitive |
| Coding | LiveCodeBench v6 | 83.6 | Gemini / GPT |
| Multimodal | MMMU | 85.0 | Competitive |
| Multimodal | MathVision | 88.6 | Competitive |
| Multimodal | OCRBench | 93.1 | Competitive |
| Multimodal | Video-MME | 87.5 | Competitive |
Qwen3.5-397B achieves its strongest results on instruction-following and agent-oriented benchmarks, including IFBench, MultiChallenge, and BrowseComp, where it leads competing models. It also reaches state-of-the-art among open models on GPQA Diamond, indicating strong scientific reasoning ability.
On broader knowledge benchmarks such as MMLU-Pro and MMLU-Redux, performance is high but typically slightly behind leading proprietary models. Coding benchmarks show competitive results without leading the field.
Overall, the benchmark profile suggests that Qwen3.5 is optimized for complex instructions, tool use, and agent workflows, rather than purely maximizing traditional academic benchmarks like coding or knowledge recall.
Method 1: Web Chat Access (Fastest)
Best for: Quick testing, experimentation, demos, and non-production use cases where you need immediate access without API keys or infrastructure.

Setup Time: Less than 1 minute
The Novita AI LLM Playground provides instant access to Qwen3.5-397B-A17B through your browser:
- Navigate to the Novita AI LLM Playground
- Select Qwen3.5-397B-A17B from the model dropdown menu
- Choose between "Thinking" mode for deep reasoning tasks and the default fast mode for quicker responses
- Start chatting immediately, with no account creation or API keys required
Limitations
- No programmatic access — web UI only, no API integration
- Rate limits apply — designed for interactive use, not batch processing
- No fine-tuning — you use the base model as-is
- Limited context persistence — conversation history managed by the interface
Method 2: API Access via Novita AI (Production)
Best for: Production applications, custom integrations, programmatic access, scalable inference, and applications requiring OpenAI-compatible API format.
Setup Time: 5 minutes
Novita AI provides managed API access to Qwen3.5-397B-A17B with competitive pricing among major providers: $0.60 per 1M input tokens and $3.60 per 1M output tokens. The service offers OpenAI-compatible endpoints, making integration straightforward for developers already familiar with the OpenAI SDK.
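At these rates you can sanity-check budgets before committing to an integration. The helper below simply hard-codes the per-token prices quoted above; actual billing is determined by the provider and may change.

```python
# Rough cost estimator for Qwen3.5-397B-A17B on Novita AI,
# using the rates quoted above ($0.60 / $3.60 per 1M tokens).
INPUT_PRICE_PER_M = 0.60   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 3.60  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 4,000-token prompt with a 1,000-token completion
print(f"${estimate_cost(4_000, 1_000):.4f}")  # → $0.0060
```

Note how output tokens dominate the bill at a 6:1 price ratio, so capping `max_tokens` is the most direct cost lever.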

Step-by-Step Setup
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model
Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial
Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key
To authenticate with the API, Novita AI issues you an API key. Open the "Settings" page and copy the key from there.

Step 5: Install the SDK
Install the OpenAI-compatible SDK using the package manager for your programming language (for Python, `pip install openai`). You can manage your API keys from the Novita AI Settings page.
After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with the Novita AI LLM API. The following example uses the chat completions API in Python:
```python
from openai import OpenAI

client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.novita.ai/openai",
)

response = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    max_tokens=64000,
    temperature=0.7,
)

print(response.choices[0].message.content)
```
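The same endpoint also supports streaming: pass `stream=True` to `client.chat.completions.create(...)` and iterate over the returned chunks. The sketch below shows the chunk shape the OpenAI SDK uses; the `SimpleNamespace` objects stand in for a real stream so the helper can be demonstrated without a network call.

```python
from types import SimpleNamespace

def join_stream(chunks) -> str:
    """Concatenate the text deltas from an OpenAI-style chat stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is often None
            parts.append(delta)
    return "".join(parts)

# In real use, pass the iterator returned by
# client.chat.completions.create(..., stream=True).
# Here we simulate three chunks to show the data shape:
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hello", ", world", None]
]
print(join_stream(fake))  # → Hello, world
```

For interactive applications, printing each delta as it arrives (rather than joining at the end) gives users visible progress during long generations.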
API Features
| Feature | Availability |
|---|---|
| OpenAI Compatibility | ✅ Full support |
| Streaming Responses | ✅ Supported |
| Function Calling | ✅ Supported |
| Context Window | 262,144 tokens |
| Multimodal Input | ✅ Text + Images |
| SLA/Uptime | Enterprise-grade infrastructure |
Novita AI’s pricing for Qwen3.5-397B-A17B is among the most competitive in the market. The OpenAI-compatible API means you can integrate it into existing applications by changing just the base URL and API key — no code refactoring required.
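Since the API supports function calling in the OpenAI tools format, you can let the model request local function execution. The sketch below is illustrative: `get_weather` is a made-up tool, and in a real integration you would pass `tools=TOOLS` to `client.chat.completions.create(...)` and feed the model's `tool_calls` into a dispatcher like this one.

```python
import json

# A single tool definition in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch_tool_call(name: str, arguments: str) -> str:
    """Route a model-requested tool call to a local function (stubbed here)."""
    args = json.loads(arguments)
    if name == "get_weather":
        return f"Weather in {args['city']}: sunny"  # stub; call a real API here
    raise ValueError(f"unknown tool: {name}")

# The model returns tool calls as (name, JSON-arguments) pairs, e.g.:
print(dispatch_tool_call("get_weather", '{"city": "Berlin"}'))
# → Weather in Berlin: sunny
```

The dispatcher's result is then sent back to the model as a `tool` role message so it can compose the final answer.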
Integration with Development Tools
Seamlessly connect Qwen3.5-397B-A17B to your applications, workflows, or chatbots with Novita AI’s unified REST API, with no need to manage model weights or infrastructure. Novita AI offers multi-language SDKs (Python, Node.js, cURL, and more) and advanced parameter controls for power users.
Claude Code Integration
Claude Code uses environment variables to route requests to custom model endpoints. Set these four variables before starting Claude Code:
For macOS/Linux:
```bash
# Set the Anthropic SDK compatible API endpoint provided by Novita.
export ANTHROPIC_BASE_URL="https://api.novita.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="<Novita API Key>"

# Set the model provided by Novita.
export ANTHROPIC_MODEL="qwen/qwen3.5-397b-a17b"
export ANTHROPIC_SMALL_FAST_MODEL="qwen/qwen3.5-397b-a17b"
```
For Windows (PowerShell):
```powershell
$env:ANTHROPIC_BASE_URL = "https://api.novita.ai/anthropic"
$env:ANTHROPIC_AUTH_TOKEN = "<Novita API Key>"
$env:ANTHROPIC_MODEL = "qwen/qwen3.5-397b-a17b"
$env:ANTHROPIC_SMALL_FAST_MODEL = "qwen/qwen3.5-397b-a17b"
```
Trae IDE Integration
- Open Trae and toggle the AI Side Bar
- Navigate to AI Management → Models
- Click Add Custom Model
- Select Novita AI as provider
- Enter your API key and select qwen/qwen3.5-397b-a17b
- Save configuration and start coding
OpenCode CLI Integration
```bash
# Launch OpenCode
opencode

# Connect to Novita AI
/connect
# Select Novita AI as provider, paste API key
# Choose qwen/qwen3.5-397b-a17b from model list
```
Method 3: Local Deployment (Full Control)
Best for: Data privacy requirements, offline inference, customized inference pipelines, research environments, or scenarios where you need complete control over model execution.
Setup Time: 1-2 hours
Local deployment gives you full control but requires significant hardware resources. The full model weights occupy approximately 807GB of disk space at full precision.
Hardware Requirements
| Precision Level | VRAM/RAM Required | Recommended Hardware |
|---|---|---|
| 8-bit quantization | About 420GB | 5× H100 80GB or equivalent |
| 4-bit quantization | About 200GB | M3 Ultra Mac (256GB unified memory) or 1×24GB GPU + 256GB system RAM |
According to Unsloth’s deployment guide, the 4-bit quantized version achieves 25+ tokens per second on a system with a 24GB GPU and 256GB system RAM using MoE offloading techniques. This makes 4-bit quantization the most practical option for high-end consumer or small business deployments.
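The VRAM figures in the table follow roughly from parameter count times bits per weight. A back-of-the-envelope estimator, ignoring KV cache, activations, and runtime overhead (which is why the table's figures run somewhat higher than raw weight size):

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Qwen3.5-397B-A17B: all 397B parameters must be resident even though
# only ~17B are active per token (MoE routing selects experts at runtime).
print(f"8-bit: ~{weight_memory_gb(397, 8):.1f} GB")  # → 8-bit: ~397.0 GB
print(f"4-bit: ~{weight_memory_gb(397, 4):.1f} GB")  # → 4-bit: ~198.5 GB
```

This is also why MoE offloading works well here: the inactive expert weights can live in system RAM while the GPU holds only what each token actually touches.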
Cloud GPU Rental for Local Deployment
If you lack the hardware but still want self-hosted deployment, cloud GPU instances offer a middle ground. Based on Novita AI GPU instance pricing:
| Configuration | Hourly Cost (On-Demand) | Hourly Cost (Spot) | Use Case |
|---|---|---|---|
| 5× H100 80GB | $12.95/hr | $6.5/hr | 8-bit quantization, production-grade |
| 1× RTX 4090 24GB | $0.73/hr | $0.37/hr | 4-bit quantization, cost-effective |
Novita AI’s Spot mode is a cost-optimized GPU rental system that leverages the platform’s idle or unused GPU capacity. Unlike on-demand instances, which reserve dedicated hardware for stable, continuous usage, Spot instances are interruptible: your job may be paused or terminated if the GPU is reclaimed by the system. Because Spot mode reallocates otherwise unused GPU resources, it is typically 40–60% cheaper than on-demand pricing.
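Using the quoted rates from the table above (which may change), the spot discount can be computed directly:

```python
def spot_savings_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by running on spot instead of on-demand."""
    return (on_demand - spot) / on_demand * 100

# Hourly rates from the table above (USD/hr)
print(f"5x H100:  {spot_savings_pct(12.95, 6.50):.1f}% cheaper")  # → 49.8%
print(f"RTX 4090: {spot_savings_pct(0.73, 0.37):.1f}% cheaper")   # → 49.3%
```

Both configurations land near the middle of the 40–60% range, so spot pricing roughly halves the bill for workloads that can tolerate interruption, such as batch inference with checkpointed progress.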
Method Comparison Table
| Method | Setup Time | Cost | Best For |
|---|---|---|---|
| Web Chat (Novita AI LLM Playground) | <1 minute | Free (with rate limits) | Quick testing, demos, experimentation |
| API via Novita AI | 5 minutes | $0.60/$3.60 per 1M tokens | Production apps, scalable inference, custom integrations |
| Local Deployment (INT4) | 1-2 hours | Hardware cost (256GB RAM system) | Data privacy, offline use, full control |
| Cloud GPU Rental (INT4) | 30 minutes | From $0.37/hr (spot) | High-volume inference |
Qwen3.5-397B-A17B offers flexible access paths for different deployment scenarios. For immediate testing, the Novita AI LLM Playground requires zero setup and provides instant access to both reasoning and fast modes. For production applications requiring programmatic access, Novita AI’s API delivers the best cost-performance balance at $0.60/$3.60 per 1M input/output tokens with OpenAI-compatible endpoints that integrate seamlessly into existing codebases.
Local deployment remains viable for teams with specific privacy requirements or extremely high-volume inference needs. The INT4 quantized version can run on high-end consumer hardware with 256GB RAM, achieving 25+ tokens per second. However, for most developers and small-to-medium businesses, managed API access eliminates infrastructure complexity while delivering enterprise-grade reliability.
Frequently Asked Questions

How much does the Qwen3.5-397B-A17B API cost?
Novita AI charges $0.60 per 1M input tokens and $3.60 per 1M output tokens for Qwen3.5-397B-A17B, among the most competitive rates available.

Can I run Qwen3.5-397B-A17B locally?
Yes. With INT4 quantization, Qwen3.5-397B-A17B runs on systems with 256GB RAM (such as an M3 Ultra Mac) at 25+ tokens/s, requiring ~214GB of disk space.

Does Qwen3.5-397B-A17B support function calling?
Yes. Qwen3.5-397B-A17B supports function calling when accessed via API providers like Novita AI using OpenAI-compatible endpoints.

What is Novita AI?
Novita AI is an AI & agent cloud platform helping developers and startups build, deploy, and scale models and agentic applications with high performance, reliability, and cost efficiency.
Recommended Reading
- Comparing Kimi K2 API Providers: Why Novita AI Stands Out
- Comparing Kimi K2-0905 API Providers: Why Novita AI Stands Out
- How to Use GLM-4.6 in Cursor to Boost Productivity for Small Teams