How to Access Qwen 3 Locally or via API: A Complete Guide


Refer your friends to Novita AI and both of you will earn $10 in LLM API credits—up to $500 in total rewards.

To support the developer community, Qwen2.5-7B, Qwen3-0.6B, Qwen3-1.7B, and Qwen3-4B are currently available for free on Novita AI.


Qwen 3 is a versatile and powerful open-source language model family built by Alibaba. With cutting-edge architecture and dual-mode reasoning, it’s designed to serve both edge devices and large-scale enterprise needs. This article explores its capabilities, model types, and how to use it—either locally or through API.

What is Qwen 3?

Qwen 3 is Alibaba’s 2025 open-source large language model family, featuring switchable “thinking” and “non-thinking” modes for enhanced reasoning and multilingual performance across 119 languages. The lineup spans dense models from 0.6B to 32B parameters plus Mixture-of-Experts variants at 30B and 235B total parameters.

Qwen 3 – Shared Features

Open‑source & Commercial‑friendly

Apache 2.0 license, freely available weights for research and business use.

Efficient Transformer Core

A decoder-only Transformer with Grouped-Query Attention, which shrinks KV-cache memory for long contexts of up to 128K tokens.

Dual “Thinking / Non‑thinking” Modes

Detailed chain‑of‑thought when needed, snappy direct answers when speed matters.

Massive 36 T‑token Corpus

119 languages with expanded STEM & code data for stronger reasoning and programming skills.

Three‑Stage Pre‑training

Base skills → STEM enrichment → 32 K‑token long‑context adaptation.

Four‑Stage Post‑training

Long CoT SFT → reasoning RL → mode fusion → general RLHF alignment.

Multilingual Instruction Following

Strong in English & Chinese, robust across 100+ languages for global applications.

Tool / Agent Readiness

Built‑in function‑calling schema to decide and format external tool invocations.

Text‑in / Text‑out Modality

Optimized for language tasks today; vision variants planned for future releases.
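The dual-mode behavior described above can be controlled per turn: besides the chat template’s `enable_thinking` flag, the instruct models honor “/think” and “/no_think” soft switches placed inside a user message. A minimal sketch of building such prompts (message construction only; the helper `make_messages` is an illustrative name, not part of any library):

```python
# Sketch: steering Qwen 3's thinking mode with per-turn soft switches.
# "/think" requests chain-of-thought; "/no_think" requests a direct answer.
def make_messages(prompt, thinking=None):
    """Build an OpenAI-style message list, optionally appending a soft switch."""
    if thinking is True:
        prompt = f"{prompt} /think"
    elif thinking is False:
        prompt = f"{prompt} /no_think"
    return [{"role": "user", "content": prompt}]

# Quick factual question: skip the reasoning trace for a snappy reply.
fast = make_messages("What is 2 + 2?", thinking=False)
# Hard problem: ask for detailed chain-of-thought.
deep = make_messages("Prove there are infinitely many primes.", thinking=True)

print(fast[0]["content"])
print(deep[0]["content"])
```

Pass the resulting list to any Qwen 3 chat endpoint or to `tokenizer.apply_chat_template` when running locally.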

Qwen 3 Series Architecture


Qwen 3 Series Benchmark


Higher-parameter models such as Qwen3-32B and Qwen3-14B follow instructions consistently, with larger models and reasoning-enabled versions scoring higher. The weaker results from low-parameter models likely stem from limited reasoning capacity: they cannot fully exploit the thinking mechanism, which leads to suboptimal performance.

How to Access Qwen 3 Locally?

Hardware Requirements

Model             | Recommended GPU          | VRAM     | vCPUs | RAM    | Storage
Qwen3-0.6B        | RTX 3060 / T4            | 8 GB     | 4     | 8 GB   | 20 GB
Qwen3-1.7B        | RTX 3060 / A5000         | 12–24 GB | 6–8   | 16 GB  | 30 GB
Qwen3-4B          | A100 40GB / RTX 3090     | 24–40 GB | 12+   | 24 GB  | 40 GB
Qwen3-8B          | A100 80GB / H100         | 40–80 GB | 16+   | 48 GB  | 60 GB
Qwen3-14B         | 2× A100 80GB / 1× H100   | 80 GB+   | 24+   | 64 GB  | 80 GB
Qwen3-30B (MoE)   | 2× H100 / 4× A100        | 160 GB   | 48+   | 128 GB | 160 GB
Qwen3-32B         | 2× H100 / 4× A100        | 160 GB   | 64    | 160 GB | 200 GB
Qwen3-235B (MoE)  | 8× H100 / 8× A100        | 640 GB   | 128+  | 512 GB | 500+ GB
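As a rough cross-check of the VRAM column, weight memory for a dense model in bf16 is about 2 bytes per parameter; the sketch below adds a ballpark 20% overhead for activations and runtime buffers. Both factors are assumptions for estimation only, and real deployments need extra headroom for long-context KV caches:

```python
# Back-of-envelope VRAM estimate for dense Qwen 3 checkpoints.
# Assumes bf16 weights (2 bytes/param) plus ~20% runtime overhead;
# these are ballpark assumptions, not measured requirements.
def min_vram_gb(params_billions, bytes_per_param=2.0, overhead=1.2):
    """Approximate GPU memory (GB) needed just to hold the weights."""
    return params_billions * bytes_per_param * overhead

for name, b in [("Qwen3-0.6B", 0.6), ("Qwen3-8B", 8), ("Qwen3-32B", 32)]:
    print(f"{name}: ~{min_vram_gb(b):.0f} GB")
```

The table’s recommendations exceed these weight-only figures because serving also needs room for the KV cache, activations, and batching.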

Step-by-Step Installation Guide

# Step 1: Install Python and Create a Virtual Environment
# Ensure Python (>=3.9) is installed. Then create and activate a virtual environment.
python3 -m venv qwen_env
source qwen_env/bin/activate  # On Windows, use `qwen_env\Scripts\activate`

# Step 2: Install Required Libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # CUDA-enabled PyTorch
pip install "transformers>=4.51.0" accelerate  # Qwen 3 support requires transformers 4.51+
pip install bitsandbytes  # Optional: quantized loading for tighter GPU memory budgets

# Step 3: Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"
# Qwen 3 weights are openly available (Apache 2.0), so no access request or login is required.

# Step 4: Download the Model Files
huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen3-8B

# Step 5: Load the Model Locally
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local directory path (or pass the Hub ID "Qwen/Qwen3-8B" directly)
local_model_dir = "./Qwen3-8B"

# Load the model with GPU optimization
model = AutoModelForCausalLM.from_pretrained(
    local_model_dir,
    device_map="auto",          # Automatically map model layers to GPU(s)
    torch_dtype=torch.bfloat16  # Use bfloat16 for efficient memory usage
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(local_model_dir)

# Step 6: Run Inference
# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Explain the theory of relativity in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # Set True to enable chain-of-thought "thinking" mode
    return_tensors="pt",
).to(model.device)

# Generate a response
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=256,  # Cap the length of the generated reply
        do_sample=True,
        temperature=0.7,     # Lower = more deterministic, higher = more creative
        top_k=50,            # Top-k sampling for diversity
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print("Response:", response)

How to Access Qwen 3 via API

Novita AI offers an affordable, reliable, and simple inference platform with a scalable Qwen 3 API, empowering developers to build AI applications. Try the Novita AI Qwen 3 API demo today!

Option 1: Direct API Integration (Python Example)


Key Features:

  • Unified endpoint: /v3/openai supports OpenAI’s Chat Completions API format.
  • Flexible controls: Adjust temperature, top-p, penalties, and more for tailored results.
  • Streaming & batching: Choose your preferred response mode.
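As a sketch of the direct-integration path, the request below targets the OpenAI-compatible Chat Completions endpoint using only the Python standard library. The model ID `qwen/qwen3-8b` is illustrative; confirm the exact identifier in Novita AI’s model catalog:

```python
# Sketch: call Qwen 3 through Novita AI's OpenAI-compatible endpoint (stdlib only).
# The model ID "qwen/qwen3-8b" is an assumption -- check Novita's catalog for exact IDs.
import json
import os
import urllib.request

ENDPOINT = "https://api.novita.ai/v3/openai/chat/completions"

def build_payload(prompt, model="qwen/qwen3-8b"):
    """Assemble an OpenAI-style Chat Completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # Flexible controls: temperature, top_p, penalties, ...
        "max_tokens": 512,
    }

def chat(prompt):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires a NOVITA_API_KEY environment variable):
# reply = chat("Explain the theory of relativity in simple terms.")
```

Because the endpoint follows the OpenAI format, the official `openai` Python SDK works too: point its `base_url` at https://api.novita.ai/v3/openai and pass your Novita key as `api_key`.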

Option 2: Multi-Agent Workflows with OpenAI Agents SDK

Build advanced multi-agent systems by integrating Novita AI with the OpenAI Agents SDK:

  • Plug-and-play: Use Novita AI’s LLMs in any OpenAI Agents workflow.
  • Supports handoffs, routing, and tool use: Design agents that can delegate, triage, or run functions, all powered by Novita AI’s models.
  • Python integration: Simply point the SDK to Novita’s endpoint (https://api.novita.ai/v3/openai) and use your API key.

Connect Qwen 3 API on Third-Party Platforms

  • Hugging Face: Use Qwen 3 in Spaces, pipelines, or with the Transformers library via Novita AI endpoints.
  • Agent & Orchestration Frameworks: Easily connect Novita AI with partner platforms like Continue, AnythingLLM, LangChain, Dify, and Langflow through official connectors and step-by-step integration guides.
  • OpenAI-Compatible API: Enjoy hassle-free migration and integration with tools such as Cline and Cursor, designed for the OpenAI API standard.

Which Methods Are Suitable for You?

Comparison of Local vs. API Access

Aspect        | Local Access                                                  | API Access
Scalability   | Limited; requires manual upgrades.                            | Scales automatically and efficiently.
Flexibility   | High flexibility; full control over settings.                 | Less flexible; depends on provider’s configurations.
Usability     | Requires technical expertise.                                 | Easier to use; no complex setup needed.
Affordability | High initial cost, low ongoing costs; best for long-term use. | Pay-per-use; ideal for small-scale or occasional use.

Recommendations for Different User Groups

  • Researchers → Prefer local access for full control and experiment flexibility.
  • Developers → Use API for fast testing and building apps; go local for custom training.
  • Businesses → API is great for easy integration; local suits teams with stable needs.
  • Small Teams & Individuals → API is more budget-friendly and easier to start with.
  • Non-technical Users → Definitely go with API—no complex setup required.

Whether you’re a researcher, developer, or business team, Qwen 3 adapts to your needs. Local access provides control and customization, while APIs offer instant scalability and low-barrier entry. Qwen 3’s design ensures strong multilingual, reasoning, and tool-augmented capabilities for real-world tasks.

Frequently Asked Questions

What makes Qwen 3 different from other LLMs?

It supports dual thinking modes, strong multilingual instruction, and long context (128k tokens), with open weights and commercial-friendly licensing.

Can I run Qwen 3 on my PC?

Only the smallest models (e.g., 0.6B) are suitable for consumer GPUs. Larger models require A100/H100 setups.

Is API access easier?

Yes! Novita AI and Hugging Face offer low-cost, plug-and-play Qwen 3 APIs—perfect for quick integration and low-latency use.

Novita AI is an all-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless deployment, and GPU instances provide the cost-effective tools you need. Eliminate infrastructure overhead, start for free, and make your AI vision a reality.
