Qwen3.5-397B-A17B Access: Web, API, and Local Deployment

https://blogs.novita.ai/qwen3-5-397b-a17b-access-web-api-and-local-deployment/

Developers exploring powerful open-weight language models face a common question: how do I actually start using this model? Qwen3.5-397B-A17B offers three distinct access paths: instant web chat for testing, managed APIs for production applications, and self-hosted deployment for full control. Each method suits different scenarios — from quick prototyping to enterprise-scale inference.

This guide walks through all access methods with setup instructions, real pricing data, and hardware requirements. You’ll learn which path fits your use case and how to get started in minutes.

What is Qwen3.5-397B-A17B?

Qwen3.5-397B-A17B is Alibaba Cloud’s flagship open-weight Mixture-of-Experts (MoE) language model with 397 billion total parameters and 17 billion active parameters per token. The model handles a 262,144-token (256K) context window and supports native multimodal inputs including text and images. According to Artificial Analysis benchmarks, Qwen3.5-397B-A17B achieves a GDPval-AA ELO score of 1,221, a 361-point increase over the previous Qwen3 235B model (860). The model demonstrates particular strength in coding, reasoning, and agent tasks while maintaining cost efficiency through its MoE architecture.

Qwen3.5-397B-A17B’s benchmarks (from Artificial Analysis)

Qwen3.5-397B-A17B Benchmark Overview

| Category | Benchmark | Score | Leading Model |
|---|---|---|---|
| Instruction Following | IFBench | 76.5 | Qwen3.5 |
| Complex Tasks | MultiChallenge | 67.6 | Qwen3.5 |
| Agent / Browsing | BrowseComp | 78.6 | Qwen3.5 |
| Scientific Reasoning | GPQA Diamond | 88.4 | Qwen3.5 (open models) |
| Knowledge | MMLU-Pro | 87.8 | Gemini |
| Knowledge | MMLU-Redux | 94.9 | Gemini |
| Knowledge | C-Eval | 93.0 | Competitive |
| Coding | LiveCodeBench v6 | 83.6 | Gemini / GPT |
| Multimodal | MMMU | 85.0 | Competitive |
| Multimodal | MathVision | 88.6 | Competitive |
| Multimodal | OCRBench | 93.1 | Competitive |
| Multimodal | Video-MME | 87.5 | Competitive |

Qwen3.5-397B achieves its strongest results on instruction-following and agent-oriented benchmarks, including IFBench, MultiChallenge, and BrowseComp, where it leads competing models. It also reaches state-of-the-art among open models on GPQA Diamond, indicating strong scientific reasoning ability.

On broader knowledge benchmarks such as MMLU-Pro and MMLU-Redux, performance is high but typically slightly behind leading proprietary models. Coding benchmarks show competitive results without leading the field.

Overall, the benchmark profile suggests that Qwen3.5 is optimized for complex instructions, tool use, and agent workflows, rather than purely maximizing traditional academic benchmarks like coding or knowledge recall.

Method 1: Web Chat Access (Fastest)

Best for: Quick testing, experimentation, demos, and non-production use cases where you need immediate access without API keys or infrastructure.

Try Qwen3.5-397B-A17B on the web

Setup Time: Less than 1 minute

The Novita AI LLM Playground provides instant browser access to Qwen3.5-397B-A17B:

  1. Navigate to Novita AI
  2. Select Qwen3.5-397B-A17B from the model dropdown menu
  3. Choose between “Thinking” mode for deep reasoning tasks and the standard mode for fast responses
  4. Start chatting immediately — no account creation or API keys required

Limitations

  • No programmatic access — web UI only, no API integration
  • Rate limits apply — designed for interactive use, not batch processing
  • No fine-tuning — you use the base model as-is
  • Limited context persistence — conversation history managed by the interface

Method 2: API Access via Novita AI (Production)

Best for: Production applications, custom integrations, programmatic access, scalable inference, and applications requiring OpenAI-compatible API format.

Setup Time: 5 minutes

Novita AI provides managed API access to Qwen3.5-397B-A17B with competitive pricing among major providers: $0.60 per 1M input tokens and $3.60 per 1M output tokens. The service offers OpenAI-compatible endpoints, making integration straightforward for developers already familiar with the OpenAI SDK.
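As a quick sanity check on these rates, the short sketch below estimates per-request cost from token counts. The 8K-input/1K-output example is illustrative only, not a figure from Novita’s documentation:

```python
# Back-of-the-envelope cost estimator using the published rates above:
# $0.60 per 1M input tokens, $3.60 per 1M output tokens.
INPUT_PRICE_PER_M = 0.60
OUTPUT_PRICE_PER_M = 3.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a RAG-style request with a large prompt and a short answer.
print(f"${request_cost(8_000, 1_000):.4f}")
```

Because output tokens cost six times as much as input tokens, prompt-heavy workloads (retrieval, long-document summarization) are disproportionately cheap relative to generation-heavy ones.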

Qwen3.5-397B-A17B’s cheapest API providers (from Hugging Face)

Step-by-Step Setup

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.

Log In and Access the Model Library

Step 2: Choose Your Model

Browse through the available options and select the model that suits your needs.

Choose Your Model

Step 3: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.

Start a free trial of Qwen3.5-397B-A17B

Step 4: Get Your API Key

To authenticate with the API, you need an API key. Open the “Settings” page and copy your API key as shown in the image.

Get your API key

Step 5: Install the SDK

Install the OpenAI SDK with the package manager for your language (for Python: pip install openai). You can manage your API keys from the Novita AI Settings page.

After installation, import the library and initialize the client with your API key to start calling the Novita AI LLM endpoint. Here is an example using the chat completions API in Python:

from openai import OpenAI

# Point the OpenAI SDK at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<Your API Key>",  # from the Novita AI Settings page
    base_url="https://api.novita.ai/openai"
)

response = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=64000,   # upper bound on generated tokens
    temperature=0.7
)

print(response.choices[0].message.content)

API Features

| Feature | Availability |
|---|---|
| OpenAI Compatibility | ✅ Full support |
| Streaming Responses | ✅ Supported |
| Function Calling | ✅ Supported |
| Context Window | 262,144 tokens |
| Multimodal Input | ✅ Text + Images |
| SLA/Uptime | Enterprise-grade infrastructure |

Novita AI’s pricing for Qwen3.5-397B-A17B is among the most competitive in the market. The OpenAI-compatible API means you can integrate it into existing applications by changing just the base URL and API key — no code refactoring required.
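The streaming support listed above can be exercised with the OpenAI SDK’s standard `stream=True` flag. A minimal sketch, assuming the base URL and model ID shown earlier and an API key stored in a `NOVITA_API_KEY` environment variable (the variable name is our convention, not Novita’s):

```python
import os

def stream_reply(prompt: str) -> str:
    """Stream a chat completion token-by-token and return the full text."""
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI(
        api_key=os.environ["NOVITA_API_KEY"],  # your key from the Settings page
        base_url="https://api.novita.ai/openai",
    )
    parts = []
    stream = client.chat.completions.create(
        model="qwen/qwen3.5-397b-a17b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # deltas arrive as they are generated
    )
    for chunk in stream:
        # Some chunks carry no content delta (e.g. the final usage chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            delta = chunk.choices[0].delta.content
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    stream_reply("Explain Mixture-of-Experts routing in two sentences.")
```

Streaming is worth enabling for interactive applications: with a 397B MoE model, time-to-first-token is typically far shorter than time-to-full-response.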

Integration with Development Tools

Seamlessly connect Qwen3.5-397B-A17B to your applications, workflows, or chatbots with Novita AI’s unified REST API, with no need to manage model weights or infrastructure. Novita AI offers multi-language SDKs (Python, Node.js, cURL, and more) and advanced parameter controls for power users.

Claude Code Integration

Claude Code uses environment variables to route requests to custom model endpoints. Set these four variables before starting Claude Code:

For macOS/Linux:

# Set the Anthropic SDK compatible API endpoint provided by Novita.
export ANTHROPIC_BASE_URL="https://api.novita.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="<Novita API Key>"
# Set the model provided by Novita.
export ANTHROPIC_MODEL="qwen/qwen3.5-397b-a17b"
export ANTHROPIC_SMALL_FAST_MODEL="qwen/qwen3.5-397b-a17b"

For Windows (PowerShell):

$env:ANTHROPIC_BASE_URL = "https://api.novita.ai/anthropic"
$env:ANTHROPIC_AUTH_TOKEN = "<Novita API Key>"
$env:ANTHROPIC_MODEL = "qwen/qwen3.5-397b-a17b"
$env:ANTHROPIC_SMALL_FAST_MODEL = "qwen/qwen3.5-397b-a17b"

Trae IDE Integration

  1. Open Trae and toggle the AI Side Bar
  2. Navigate to AI Management → Models
  3. Click Add Custom Model
  4. Select Novita AI as provider
  5. Enter your API key and select qwen/qwen3.5-397b-a17b
  6. Save configuration and start coding

OpenCode CLI Integration

# Launch OpenCode
opencode

# Connect to Novita AI
/connect

# Select Novita AI as provider, paste API key
# Choose qwen/qwen3.5-397b-a17b from model list

Method 3: Local Deployment (Full Control)

Best for: Data privacy requirements, offline inference, customized inference pipelines, research environments, or scenarios where you need complete control over model execution.

Setup Time: 1-2 hours

Local deployment gives you full control but requires significant hardware resources. The full model weights occupy approximately 807GB of disk space at full precision.

Hardware Requirements

| Precision Level | VRAM/RAM Required | Recommended Hardware |
|---|---|---|
| 8-bit quantization | ~420GB | 5× H100 80GB or equivalent |
| 4-bit quantization | ~200GB | M3 Ultra Mac (256GB unified memory), or 1× 24GB GPU + 256GB system RAM |

According to Unsloth’s deployment guide, the 4-bit quantized version achieves 25+ tokens per second on a system with a 24GB GPU and 256GB system RAM using MoE offloading techniques. This makes 4-bit quantization the most practical option for high-end consumer or small business deployments.
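Assuming a llama.cpp-based setup of the kind Unsloth’s guide describes, a launch command might look like the sketch below. The GGUF filename and quantization tag are placeholders (check the actual artifact names on the model page), and the tensor-override pattern is one common way to pin MoE expert weights to CPU RAM so the dense layers fit on a single 24GB GPU:

```shell
# Hypothetical llama.cpp server launch for the 4-bit quantized model.
# --n-gpu-layers 99 puts all layers on the GPU by default; the -ot rule
# then overrides the MoE expert tensors (.ffn_*_exps.) back to CPU RAM.
./llama-server \
  --model Qwen3.5-397B-A17B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --port 8080
```

This configuration trades some throughput for a dramatically lower VRAM floor, which is what makes the 24GB-GPU-plus-256GB-RAM setup from the table above viable.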

Cloud GPU Rental for Local Deployment

If you lack the hardware but still want self-hosted deployment, cloud GPU instances offer a middle ground. Based on Novita AI GPU instance pricing:

| Configuration | Hourly Cost (On-Demand) | Hourly Cost (Spot) | Use Case |
|---|---|---|---|
| 5× H100 80GB | $12.95/hr | $6.50/hr | 8-bit quantization, production-grade |
| 1× RTX 4090 24GB | $0.73/hr | $0.37/hr | 4-bit quantization, cost-effective |

Novita AI’s Spot mode is a cost-optimized GPU rental system that leverages the platform’s idle or unused GPU capacity. Unlike on-demand instances, which reserve dedicated hardware for stable, continuous usage, Spot instances are interruptible—your job may be paused or terminated if the GPU is reclaimed by the system. Because Spot mode reallocates otherwise unused GPU resources, it is typically 40–60% cheaper than on-demand pricing.
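To see how those hourly rates compound, a short script comparing monthly on-demand versus Spot spend for the two configurations in the table above (730 hours/month is an average; real Spot workloads may be interrupted, so treat the Spot figure as a ceiling on savings, not a guarantee of uptime):

```python
# Monthly cost comparison from the table's published hourly rates.
HOURS_PER_MONTH = 730  # average hours in a calendar month

configs = {
    "5x H100 80GB": {"on_demand": 12.95, "spot": 6.50},
    "1x RTX 4090 24GB": {"on_demand": 0.73, "spot": 0.37},
}

for name, rates in configs.items():
    od = rates["on_demand"] * HOURS_PER_MONTH
    sp = rates["spot"] * HOURS_PER_MONTH
    saving = 1 - sp / od
    print(f"{name}: ${od:,.0f}/mo on-demand vs ${sp:,.0f}/mo Spot "
          f"({saving:.0%} cheaper)")
```

Both configurations land close to 50% savings, consistent with the 40-60% range quoted above.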

Method Comparison Table

| Method | Setup Time | Cost | Best For |
|---|---|---|---|
| Web Chat (Novita AI LLM Playground) | <1 minute | Free (with rate limits) | Quick testing, demos, experimentation |
| API via Novita AI | 5 minutes | $0.60/$3.60 per 1M tokens | Production apps, scalable inference, custom integrations |
| Local Deployment (INT4) | 1-2 hours | Hardware cost (GPU + 256GB RAM system) | Data privacy, offline use, full control |
| Cloud GPU Rental (INT4) | 30 minutes | From $0.37/hr (Spot) | High-volume inference |

Qwen3.5-397B-A17B offers flexible access paths for different deployment scenarios. For immediate testing, the Novita AI LLM Playground requires zero setup and provides instant access to both reasoning and fast modes. For production applications requiring programmatic access, Novita AI’s API delivers the best cost-performance balance at $0.60/$3.60 per 1M input/output tokens with OpenAI-compatible endpoints that integrate seamlessly into existing codebases.

Local deployment remains viable for teams with specific privacy requirements or extremely high-volume inference needs. The INT4 quantized version can run on high-end consumer hardware with 256GB RAM, achieving 25+ tokens per second. However, for most developers and small-to-medium businesses, managed API access eliminates infrastructure complexity while delivering enterprise-grade reliability.

Frequently Asked Questions

How much does Qwen3.5-397B-A17B cost via API?

Novita AI charges $0.60 per 1M input tokens and $3.60 per 1M output tokens for Qwen3.5-397B-A17B — among the most competitive rates available.

Can I run Qwen3.5-397B-A17B on consumer hardware?

Yes, with INT4 quantization Qwen3.5-397B-A17B runs on systems with 256GB RAM (like M3 Ultra Mac) at 25+ tokens/s, requiring ~214GB disk space.

Does Qwen3.5-397B-A17B support function calling?

Yes, Qwen3.5-397B-A17B supports function calling when accessed via API providers like Novita AI using OpenAI-compatible endpoints.
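A minimal function-calling sketch against the OpenAI-compatible endpoint follows. The `get_weather` tool schema is a made-up example for illustration, and the request assumes your key is in a `NOVITA_API_KEY` environment variable:

```python
import json
import os

def build_weather_tool() -> dict:
    """Return an OpenAI-format tool definition for a hypothetical weather API."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

def call_with_tools(prompt: str):
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI(api_key=os.environ["NOVITA_API_KEY"],
                    base_url="https://api.novita.ai/openai")
    response = client.chat.completions.create(
        model="qwen/qwen3.5-397b-a17b",
        messages=[{"role": "user", "content": prompt}],
        tools=[build_weather_tool()],
    )
    # If the model decides to call the tool, arguments arrive as a JSON string.
    call = response.choices[0].message.tool_calls[0]
    return call.function.name, json.loads(call.function.arguments)

if __name__ == "__main__":
    print(call_with_tools("What's the weather in Hangzhou?"))
```

In a real agent loop you would execute the returned tool call, append the result as a `tool` role message, and call the model again to produce the final answer.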

Novita AI is an AI & agent cloud platform helping developers and startups build, deploy, and scale models and agentic applications with high performance, reliability, and cost efficiency.
