Qwen 3 for RAG (LLM, embedding, and reranking) is an open-source AI stack designed for Retrieval-Augmented Generation. It combines three model families: embedding models to find relevant documents, reranking models to sort the best results, and a capable LLM to generate clear, accurate answers. Qwen 3 supports long context and multiple languages and is easy to use, making it well suited to building smart search and question-answering systems.
How Do the LLM, Embedding, and Reranking Models Work Together?
1. Embedding Models: Understanding Retrieval
Purpose:
Find relevant information from a large collection of documents.
How it works:
- Each document (or chunk of text) is converted into a vector (an array of numbers) using an embedding model (e.g., OpenAI’s Ada, Sentence Transformers).
- The user’s query is also embedded into a vector.
- The system searches for document vectors that are most similar to the query vector (using similarity metrics like cosine similarity).
- The top-N most similar documents are retrieved.
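The retrieval steps above can be sketched in plain Python. The 4-dimensional vectors here are toy stand-ins for real embedding-model outputs (e.g. from a Qwen3 Embedding model, which produces 1024 to 4096 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_vectors = [
    [0.9, 0.1, 0.0, 0.0],  # doc 0: mostly about topic A
    [0.0, 0.8, 0.2, 0.0],  # doc 1: about topic B
    [0.7, 0.2, 0.1, 0.0],  # doc 2: also about topic A
]
query_vector = [1.0, 0.0, 0.0, 0.0]  # query about topic A

# Score every document against the query, then take the top-N indices.
scores = [cosine_similarity(query_vector, d) for d in doc_vectors]
top_n = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top_n)  # → [0, 2]
```

In production the linear scan is replaced by an approximate nearest-neighbor index (e.g. a vector database), but the similarity logic is the same.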
2. Reranking Models: Improving Relevance
Purpose:
Refine the results from the embedding retrieval step by ranking them more precisely based on their relevance to the query.
How it works:
- The initial set of retrieved documents (say, top 20) is further evaluated using a reranker.
- Rerankers often use cross-encoder models (like BERT, RoBERTa) that take both the query and each document as input and output a relevance score.
- The top-ranked documents are selected for the next step.
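The rerank step can be sketched as follows. A real reranker such as Qwen3-Reranker runs the query and each candidate through a cross-encoder jointly; the keyword-overlap score below is only a cheap stand-in for that model call:

```python
def relevance_score(query: str, doc: str) -> float:
    # Stand-in scorer: fraction of query terms that appear in the document.
    # A real cross-encoder would read the (query, doc) pair jointly instead.
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / max(len(query_terms), 1)

def rerank(query, docs, top_k=3):
    # Score every candidate against the query, then keep the top_k.
    scored = [(relevance_score(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

candidates = [
    "Qwen3 supports a 32K token context window.",
    "The weather is sunny today.",
    "Qwen3 reranker models score query-document pairs.",
]
print(rerank("Qwen3 context window", candidates, top_k=2))
```

The shape of the flow is the point: score each (query, document) pair, sort, and pass only the best few forward to the LLM.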
3. LLM (Large Language Model): Generating Answers
Purpose:
Generate a coherent and informative answer based on the retrieved context.
How it works:
- The top-ranked documents are concatenated or summarized as “context.”
- The LLM is prompted with the user’s question and the retrieved context.
- The LLM generates a response, ideally citing or using the retrieved information.
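Assembling the context and prompt for the generation step might look like this (the system-prompt wording and numbered-citation scheme are illustrative, not a fixed convention):

```python
def build_rag_messages(question, context_docs):
    # Number each retrieved document so the model can cite it as [n].
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))
    return [
        {
            "role": "system",
            "content": "Answer using only the context below. Cite sources as [n].\n\n" + context,
        },
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "What context length does Qwen3 support?",
    ["Qwen3 supports a 32K context window, extendable to 128K with YaRN."],
)
print(messages[0]["content"])
```

The resulting `messages` list can be passed directly to any OpenAI-style chat completions API.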
How They All Work Together (RAG Pipeline)
- User submits query.
- Embedding model retrieves relevant documents.
- Reranker sorts these documents by relevance.
- LLM uses the top documents to generate an answer.
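The four steps chain together as a single function; here `retrieve`, `rerank`, and `generate` are placeholders for the embedding model, the reranker, and the LLM respectively:

```python
def retrieve(query, corpus, top_n=20):
    # Placeholder: a real system ranks by embedding similarity, not keyword match.
    words = query.lower().split()
    return [d for d in corpus if any(w in d.lower() for w in words)][:top_n]

def rerank(query, docs, top_k=3):
    # Placeholder: a real system scores each (query, doc) pair with a cross-encoder.
    return sorted(docs, key=len)[:top_k]

def generate(query, context_docs):
    # Placeholder: a real system prompts an LLM with the retrieved context.
    return f"Answer to {query!r} based on {len(context_docs)} document(s)."

corpus = [
    "Qwen3 has a 32K context window.",
    "Unrelated note about GPUs.",
    "Qwen3 is Apache-2.0 licensed.",
]
docs = retrieve("qwen3 context", corpus)
answer = generate("qwen3 context", rerank("qwen3 context", docs))
print(answer)
```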
What Are the Qwen 3 Models for RAG?
Qwen 3 Embedding Model
| Model | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruct Aware |
|---|---|---|---|---|---|---|
| Qwen3 Embedding 0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Qwen3 Embedding 4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Qwen3 Embedding 8B | 8B | 36 | 32K | 4096 | Yes | Yes |
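The "MRL Support" column means the embeddings use Matryoshka Representation Learning: a vector can be truncated to a shorter leading prefix and re-normalized, trading some accuracy for storage and search speed. A sketch of that truncation (the 8-dim vector stands in for a real 1024-dim Qwen3 embedding):

```python
import math

def truncate_embedding(vec, dim):
    # Keep the leading `dim` components, then re-normalize to unit length
    # so cosine similarity still behaves as expected.
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
small = truncate_embedding(full, 4)
print(len(small))  # → 4
```

Whether a given truncated dimension preserves enough quality should be validated against your own retrieval benchmark.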
Qwen 3 Reranker Model
| Model | Size | Layers | Sequence Length | Instruct Aware |
|---|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | 32 | 32K | Yes |
| Qwen3-Reranker-4B | 4B | 36 | 32K | Yes |
| Qwen3-Reranker-8B | 8B | 36 | 32K | Yes |
Qwen 3 LLM Model
| Model | Architecture | Parameters (Total / Activated) | Layers | Attention Heads (Q / KV) | Experts (Total / Active) | Context Window (tokens) |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B / 22B | 94 | 64 / 4 | 128 / 8 | 32,768 (131,072 w/ YaRN) |
| Qwen3-30B-A3B | MoE | 30.5B / 3.3B | 48 | 32 / 4 | 128 / 8 | 32,768 (131,072 w/ YaRN) |
| Qwen3-32B | Dense | 32.8B | 64 | 64 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-14B | Dense | 14.8B | 40 | 40 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-8B | Dense | 8.2B | 36 | 32 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-4B | Dense | 4.0B | 36 | 32 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-1.7B | Dense | 1.7B | 28 | 16 / 8 | – | 32,768 |
| Qwen3-0.6B | Dense | 0.6B | 28 | 16 / 8 | – | 32,768 |
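The 131,072-token figure in the table is reached by enabling YaRN rope scaling. Per the Qwen3 model cards, this is typically a config fragment added to the model's `config.json` or passed to the serving engine; treat the exact keys below as assumptions to verify against the current documentation:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

A factor of 4.0 over the native 32,768 tokens yields the 131,072-token window; smaller factors are recommended when your inputs are shorter.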
Why Are Developers Switching to Qwen3 for RAG?
| Feature | Qwen 3 |
|---|---|
| Long Context Window | 32,000 tokens |
| Multiple Model Sizes | 0.6B / 4B / 8B |
| Multilingual Support | 100+ languages |
| Advanced Architectures | Bi-encoder embedding models; cross-encoder reranker models |
| Open-Sourced | Apache-2.0 |
| Instruction Awareness | Models can understand and follow task-specific instructions supplied with the query |
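"Instruction aware" means the embedding model accepts a task description prefixed to the query (documents are embedded as-is). The template below follows the `Instruct: ...\nQuery: ...` pattern described in the Qwen3-Embedding model card; treat the exact wording as an assumption and check the current docs:

```python
def format_query(task: str, query: str) -> str:
    # Prefix the task instruction to the query; documents get no prefix.
    return f"Instruct: {task}\nQuery: {query}"

text = format_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "What is the Qwen3 context window?",
)
print(text)
```

Using a task-specific instruction at query time is reported to improve retrieval quality versus embedding the bare query.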
The Performance of Qwen 3 Models

You can check the evaluation of embedding models on this leaderboard!
How to Access Qwen 3 Models?
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, along with affordable, reliable GPU cloud infrastructure for building and scaling.
In addition to Qwen 3 Reranker 8B and Embedding 8B, Novita AI also provides bge-m3 for free to support the open-source community.
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model and Start a Free Trial
Browse through the available options and select the model that suits your needs.


Step 3: Get Your API Key
To authenticate with the API, you will need an API key. On the "Settings" page, copy the API key as indicated in the image.

Step 4: Install the API (Example: Qwen 3 Reranker Model)
Install the API client using the package manager for your programming language.
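For Python, the Novita endpoint is OpenAI-compatible, so the standard OpenAI client is all that is needed:

```shell
# Install the OpenAI Python client used in the example below.
pip install openai
```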

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with Novita AI models. Here is an example of using the chat completions API for Python users.
```python
from openai import OpenAI

base_url = "https://api.novita.ai/v3/openai"
api_key = "<Your API Key>"
model = "qwen/qwen3-reranker-8b"

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

stream = True  # or False
max_tokens = 1000
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    response_format=response_format,
    extra_body={},
)

if stream:
    # In streaming mode, print each token delta as it arrives.
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
As AI applications demand more precise understanding of user intent, reranking models have become essential tools for delivering smarter search results. Acting as a second layer of intelligence after initial retrieval, rerankers fine-tune document rankings using deeper contextual analysis. The Qwen 3 Reranker series sets a new benchmark in this space, offering strong performance across languages, long documents, and even code retrieval tasks. With deployment made simple through Novita AI, developers can use these models without heavy infrastructure, making high-accuracy retrieval more accessible than ever.
Frequently Asked Questions
What does a reranker do?
A reranker reorders a list of retrieved documents by scoring their relevance to a query, improving precision in AI search systems.
What is the difference between an embedding model and a reranker?
- Embedding model: converts each text into a vector and compares vectors by similarity.
- Reranker model: reads the query and document together and outputs a relevance score.
How does Qwen3-Reranker-8B perform?
Qwen3-Reranker-8B achieves top-tier scores: MTEB-R 69.02, CMTEB-R 77.45, and MTEB-Code 81.22. It outperforms popular models like BGE and GTE in multiple categories.
Novita AI is an all-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU instances: the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
Recommended Reading
- How many H100 GPUs are needed to Fine-tune DeepSeek R1?
- Choose Between Qwen 3 and Qwen 2.5: Lightweight Efficiency or Advanced Reasoning Power?
- Qwen 2.5 7B VRAM Tips Every Dev Should Know