Qwen 3 in RAG Pipelines: All-in-One LLM, Embedding, and Reranking Solution


Qwen 3 for RAG (LLM, embedding, and reranking) is an open-source AI stack designed for Retrieval-Augmented Generation. It combines three kinds of models: embedding models to find relevant documents, reranking models to sort the best results, and a powerful LLM to generate clear, accurate answers. Qwen 3 supports long context and multiple languages and is easy to use, making it ideal for building smart search and question-answering systems.

How Do LLM, Embedding, and Reranking Models Work Together?

1. Embedding Models: Retrieving Relevant Documents

Purpose:
Find relevant information from a large collection of documents.

How it works (a code sketch follows the list):

  • Each document (or chunk of text) is converted into a vector (an array of numbers) using an embedding model (e.g., OpenAI’s Ada, Sentence Transformers).
  • The user’s query is also embedded into a vector.
  • The system searches for document vectors that are most similar to the query vector (using similarity metrics like cosine similarity).
  • The top-N most similar documents are retrieved.
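
The sketch below illustrates this step. The embed() function is a toy stand-in (a hypothetical bag-of-words hash) so the example runs on its own; in practice you would replace it with a real embedding model such as Qwen3 Embedding.

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashes words into a fixed-size
    vector. Swap in Qwen3 Embedding (or any embedding model) in practice."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

documents = [
    "Qwen 3 provides embedding, reranking, and LLM models.",
    "RAG retrieves documents before generating an answer.",
    "Bananas are rich in potassium.",
]

# 1. Embed every document (normally done once, offline, and stored in a vector database).
doc_vectors = np.stack([embed(d) for d in documents])

# 2. Embed the user's query with the same model.
query_vector = embed("What models does Qwen 3 offer for RAG?")

# 3. Cosine similarity between the query vector and every document vector.
similarities = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector) + 1e-9
)

# 4. Keep the top-N most similar documents.
top_n = 2
top_docs = [documents[i] for i in np.argsort(-similarities)[:top_n]]
print(top_docs)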

2. Reranking Models: Improving Relevance

Purpose:
Refine the results from the embedding retrieval step by ranking them more precisely based on their relevance to the query.

How it works (a code sketch follows the list):

  • The initial set of retrieved documents (say, top 20) is further evaluated using a reranker.
  • Rerankers often use cross-encoder models (like BERT, RoBERTa) that take both the query and each document as input and output a relevance score.
  • The top-ranked documents are selected for the next step.
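
Below is a minimal sketch of this step using the CrossEncoder class from the sentence-transformers library. The checkpoint name is simply a small, widely used public cross-encoder chosen so the example is self-contained; any reranker, including Qwen 3's, plays the same role.

from sentence_transformers import CrossEncoder

# A small public cross-encoder checkpoint; substitute your reranker of choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What models does Qwen 3 offer for RAG?"
candidates = [
    "Qwen 3 provides embedding, reranking, and LLM models.",
    "RAG retrieves documents before generating an answer.",
    "Bananas are rich in potassium.",
]

# The cross-encoder reads each (query, document) pair together and outputs a relevance score.
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep only the highest-scoring documents for the generation step.
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])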

3. LLM (Large Language Model): Generating Answers

Purpose:
Generate a coherent and informative answer based on the retrieved context.

How it works (a code sketch follows the list):

  • The top-ranked documents are concatenated or summarized as “context.”
  • The LLM is prompted with the user’s question and the retrieved context.
  • The LLM generates a response, ideally citing or using the retrieved information.
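
The sketch below shows the generation step against any OpenAI-compatible chat endpoint. The base URL and model name are placeholders rather than specific recommendations; point them at whichever provider and Qwen 3 chat model you use.

from openai import OpenAI

# Placeholder endpoint, key, and model name for an OpenAI-compatible provider.
client = OpenAI(base_url="<provider base url>", api_key="<Your API Key>")

top_docs = [
    "Qwen 3 provides embedding, reranking, and LLM models.",
    "RAG retrieves documents before generating an answer.",
]

# Concatenate the top-ranked documents into a single context string.
context = "\n\n".join(top_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What models does Qwen 3 offer for RAG?"
)

response = client.chat.completions.create(
    model="<your Qwen 3 chat model>",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)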

How They All Work Together (RAG Pipeline)

  1. User submits query.
  2. Embedding model retrieves relevant documents.
  3. Reranker sorts these documents by relevance.
  4. LLM uses the top documents to generate an answer.
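
Put together, the whole pipeline is just a chain of these three stages. The sketch below wires them up as interchangeable callables; the retrieve, rerank, and generate arguments correspond to the embedding, reranking, and LLM steps sketched above.

from typing import Callable, Sequence

def rag_answer(
    query: str,
    documents: Sequence[str],
    retrieve: Callable[[str, Sequence[str], int], list[str]],
    rerank: Callable[[str, Sequence[str], int], list[str]],
    generate: Callable[[str, Sequence[str]], str],
    top_n: int = 20,
    top_k: int = 3,
) -> str:
    """Chain the three RAG stages: coarse retrieval, precise reranking, grounded generation."""
    candidates = retrieve(query, documents, top_n)   # 1. embedding retrieval
    best = rerank(query, candidates, top_k)          # 2. reranking
    return generate(query, best)                     # 3. answer generation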

What are Qwen 3 Models for RAG?

Qwen 3 Embedding Model

| Model | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruct Aware |
|---|---|---|---|---|---|---|
| Qwen3 Embedding 0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Qwen3 Embedding 4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Qwen3 Embedding 8B | 8B | 36 | 32K | 4096 | Yes | Yes |
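
As a quick illustration of the table above, the sketch below loads the smallest embedding model through the sentence-transformers library. The Hugging Face model ID is an assumption based on the model name; check the official model card for the exact identifier and the recommended query-instruction format.

from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face ID for the smallest variant listed above.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["Which Qwen 3 model sizes are available for embedding?"]
documents = [
    "Qwen3 Embedding comes in 0.6B, 4B, and 8B sizes.",
    "Reranking refines an initial candidate list.",
]

# The model is instruct-aware: in practice a short task instruction is usually
# prepended to queries (see the model card for the recommended format).
query_emb = model.encode(queries)      # 1024-dimensional vectors for the 0.6B model
doc_emb = model.encode(documents)

# Cosine similarity between the query and each document.
print(util.cos_sim(query_emb, doc_emb))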

Qwen 3 Reranker Model

| Model | Size | Layers | Sequence Length | Instruct Aware |
|---|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | 32 | 32K | Yes |
| Qwen3-Reranker-4B | 4B | 36 | 32K | Yes |
| Qwen3-Reranker-8B | 8B | 36 | 32K | Yes |
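
The Qwen 3 rerankers are instruction-following language models that judge a query-document pair. The sketch below shows the general pattern of scoring a pair locally with transformers; the model ID, prompt wording, and scoring recipe are simplified assumptions, so consult the official model card for the exact template.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Reranker-0.6B"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# The reranker is trained to answer with "yes" or "no" for a query-document pair.
yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")

def relevance_score(query: str, document: str) -> float:
    """Return the probability of 'yes', used as the relevance score."""
    prompt = (
        "Judge whether the Document answers the Query. Answer only 'yes' or 'no'.\n"
        f"<Query>: {query}\n<Document>: {document}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]            # next-token logits
    probs = torch.softmax(logits[[no_id, yes_id]], dim=0)
    return probs[1].item()

print(relevance_score("what is RAG?", "RAG combines retrieval with generation."))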

Qwen 3 LLM Model

| Model | Architecture | Parameters (Total / Activated) | Layers | Attention Heads (Q / KV) | Experts (Total / Active) | Context Window (tokens) |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B / 22B | 94 | 64 / 4 | 128 / 8 | 32,768 (131,072 w/ YaRN) |
| Qwen3-30B-A3B | MoE | 30.5B / 3.3B | 48 | 32 / 4 | 128 / 8 | 32,768 (131,072 w/ YaRN) |
| Qwen3-32B | Dense | 32.8B | 64 | 64 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-14B | Dense | 14.8B | 40 | 40 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-8B | Dense | 8.2B | 36 | 32 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-4B | Dense | 4.0B | 36 | 32 / 8 | – | 32,768 (131,072 w/ YaRN) |
| Qwen3-1.7B | Dense | 1.7B | 28 | 16 / 8 | – | 32,768 |
| Qwen3-0.6B | Dense | 0.6B | 28 | 16 / 8 | – | 32,768 |
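
For completeness, here is a minimal sketch of running one of the small dense variants locally with Hugging Face transformers. The model ID is an assumption based on the table; the same pattern applies to the larger checkpoints, and the extended 131,072-token context requires enabling YaRN rope scaling as described in the Qwen documentation.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # assumed ID of the smallest dense variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize what a RAG pipeline does."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))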

Why Are Developers Switching to Qwen 3 for RAG?

| Feature | Qwen 3 |
|---|---|
| Long Context Window | 32,000 tokens |
| Multiple Model Sizes | 0.6B / 4B / 8B |
| Multilingual Support | 100+ languages |
| Advanced Architectures | Reranker models use a cross-encoder setup; embedding models use a bi-encoder setup |
| Open-Sourced | Apache-2.0 |
| Instruction Awareness | Instruction-aware: supports understanding and following specific instructions |

The Performance of Qwen 3 Models

You can check the evaluation of Qwen 3 embedding models on the MTEB leaderboard.

How to Access Qwen 3 Models?

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

In addition to Qwen 3 Reranker 8B and Qwen 3 Embedding 8B, Novita AI also provides bge-m3 for free to support the open-source community.

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.


Step 2: Choose Your Model and Start a Free Trial

Browse through the available options and select the model that suits your needs.


Step 3: Get Your API Key

To authenticate with the API, you will need an API key. Open the “Settings” page and copy your API key from there.


Step 4: Install the API (Example: Qwen 3 Reranker Model)

Install the API client using the package manager for your programming language (the Python example below uses the openai package, installed with pip install openai).


After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with Novita AI models. Below is an example of using the chat completions API in Python.

from openai import OpenAI

base_url = "https://api.novita.ai/v3/openai"
api_key = "<Your API Key>"
model = "qwen/qwen3-reranker-8b"

# Initialize the OpenAI-compatible client against the Novita AI endpoint.
client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

stream = True  # or False
max_tokens = 1000
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    response_format=response_format,
    extra_body={},  # Novita-specific parameters can be passed here
)

# Streaming responses arrive as chunks; otherwise read the full message.
if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
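
In a full RAG setup you would also call the embedding model for retrieval. The sketch below assumes the embeddings endpoint is exposed through the same OpenAI-compatible base URL and that the model identifier follows the same naming scheme; check Novita AI's API reference for the exact endpoint and model names.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<Your API Key>",
)

# Assumed model identifier for Qwen 3 Embedding 8B.
result = client.embeddings.create(
    model="qwen/qwen3-embedding-8b",
    input=["What is retrieval-augmented generation?"],
)

# The 8B embedding model outputs 4096-dimensional vectors (see the table above).
print(len(result.data[0].embedding))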
  
    

As AI applications demand more precise understanding of user intent, reranking models have become essential tools for delivering smarter search results. Acting as a second layer of intelligence after initial retrieval, rerankers fine-tune document rankings using deeper contextual analysis. The Qwen 3 Reranker series sets a new benchmark in this space, offering impressive performance across languages, long documents, and even code retrieval tasks. With deployment made simple through Novita AI, developers can harness these advanced models without heavy infrastructure—making high-accuracy retrieval more accessible than ever.

Frequently Asked Questions

What is a reranker model?

A reranker reorders a list of retrieved documents by scoring their relevance to a query, improving precision in AI search systems.

How is a reranker different from an embedding model?

Embedding Model: Converts each text into a vector and compares them using similarity.
Reranker Model: Reads both query and document together and gives a smart score for relevance.

How does Qwen 3 Reranker perform?

Qwen3-Reranker-8B achieves top-tier scores:
MTEB-R: 69.02,
CMTEB-R: 77.45,
MTEB-Code: 81.22
It outperforms popular models like BGE and GTE in multiple categories.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
