Key Highlights
Multimodal + Long Video Understanding
- Supports images, documents, and long videos
- Suitable for education, media, and surveillance
Accurate Localization + Structured Output
- Detects and localizes objects precisely
- Extracts structured data from invoices, forms, and charts
- Useful in finance, law, and logistics
You can start a free trial of Qwen2.5-VL-72B on the Novita AI API in just a few steps!
What is Qwen2.5-VL-72B?
Qwen2.5-VL-72B-Instruct is a powerful 72B-parameter large vision-language model (LVLM) fine-tuned for instruction-following tasks. It accepts both textual and visual inputs (images and videos), making it well suited to multimodal reasoning, document understanding, video analysis, and agentic interaction.
An Example to Show Qwen2.5-VL-72B’s Ability
Input: Given the query "the user is experiencing the image generation feature", when does the described content occur in the video? Use seconds for the time format.
Output: The described content occurs from 28 seconds to 50 seconds in the video. During this segment, the user interacts with the image generation feature, requesting and receiving an artistic double scene painting of a mountain during day and night. The user then adds a bird to the generated image, demonstrating the functionality of the image generation tool.
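If you want to reproduce this kind of temporal-grounding query yourself, here is a minimal sketch of how the prompt can be phrased as a chat message, using the same messages format as the Transformers examples later in this post. The video path and exact wording are illustrative placeholders, not part of the model's API.

```python
# Illustrative only: a temporal-grounding prompt in the standard chat-message format.
# The video path is a placeholder; see the full loading/inference code later in this post.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path_or_url_to_video.mp4"},
        {"type": "text", "text": (
            "Given the query 'the user is experiencing the image generation feature', "
            "when does the described content occur in the video? "
            "Use seconds for the time format."
        )},
    ],
}]
```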
Qwen2.5-VL-72B Overview
| Category | Item | Details |
|---|---|---|
| Basic Info | Release Date | January 28, 2025 |
| Basic Info | Model Size | 73.4B parameters |
| Basic Info | Open Source | Yes (released by Qwen) |
| Architecture | Core Components | Dynamic resolution & frame-rate training; SwiGLU + RMSNorm + window attention; dynamic FPS sampling |
| Language Support | Supported Languages | Excels at multilingual document and scene-text recognition |
| Multimodal | Capability | Visual (images & videos) and textual inputs |
| Context | Context Window | Configurable up to 64K tokens for long videos |
| Precision | Tensor Type | BF16 |
| Benchmarks | MMMU (Image) | 70.2 (Qwen2.5-VL-72B) vs 70.3 (GPT-4o) |
| Benchmarks | MVBench (Video) | 70.4 (Qwen2.5-VL-72B) vs 64.6 (GPT-4o) |
| Benchmarks | AITZ_EM (Agent) | 83.2 (Qwen2.5-VL-72B) vs 35.3 (GPT-4o) |
How to Access Qwen2.5-VL-72B Locally?
Qwen2.5-VL-72B Hardware Requirements
| GPU | Recommended Configuration |
|---|---|
| Nvidia A100 (80 GB) | 8 GPUs × 80 GB = 640 GB total VRAM |
| Nvidia H100 (80 GB) | 8 GPUs × 80 GB = 640 GB total VRAM |
| Nvidia RTX 4090 (24 GB) | 24 GPUs × 24 GB = 576 GB total VRAM |
| Nvidia L40S (48 GB) | 8 GPUs × 48 GB = 384 GB total VRAM |
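To see why multi-GPU setups of this size are needed, a rough back-of-the-envelope calculation helps: at BF16 precision (2 bytes per parameter), the 73.4B weights alone occupy roughly 147 GB, before any KV cache, vision-encoder activations, or long-video inputs. The sketch below is a rule-of-thumb estimate only; the configurations in the table above include substantial headroom beyond this floor.

```python
import math

# Back-of-the-envelope VRAM math for Qwen2.5-VL-72B weights in BF16.
# Real deployments (see the table above) add large headroom for the KV cache,
# activations, and long-video inputs, so treat this as a lower bound only.
params = 73.4e9                # parameter count from the model card
bytes_per_param = 2            # BF16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")          # ~147 GB

# Minimum number of cards needed just to hold the weights, per GPU type.
for name, vram in [("A100/H100 80 GB", 80), ("L40S 48 GB", 48), ("RTX 4090 24 GB", 24)]:
    print(f"{name}: >= {math.ceil(weights_gb / vram)} GPUs for weights only")
```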
Install Qwen2.5-VL-72B locally
1. Install Dependencies

```bash
# Install the latest Hugging Face Transformers from source (required for Qwen2.5-VL)
pip install git+https://github.com/huggingface/transformers accelerate

# Install the vision utility toolkit (the decord extra is recommended for fast video loading)
pip install 'qwen-vl-utils[decord]==0.0.8'
```
2. Using Qwen2.5-VL for Visual Question Answering

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Model name (a local path also works). The 72B checkpoint must be sharded across
# multiple GPUs; device_map="auto" spreads it over the available cards.
model_name = "Qwen/Qwen2.5-VL-72B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_name)

# Build a chat message with an image (local path, URL, or base64) and a question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
        {"type": "text", "text": "What is happening in the image?"},
    ],
}]

# Convert the messages into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens and print the answer.
response = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True)[0]
print("Answer:", response)
```
3. Video Input Example

```python
# Reuse the model and processor loaded above; the video can be a local path or URL.
messages = [{"role": "user", "content": [
    {"type": "video", "video": "path_or_url_to_video.mp4"},
    {"type": "text", "text": "Summarize the video content."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print("Answer:", processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                        skip_special_tokens=True)[0])
```
How to Access Qwen2.5-VL-72B via Novita API?
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Start Your Free Trial
Begin your free trial to explore the capabilities of the selected model.

Step 3: Get Your API Key
To authenticate with the API, you need an API key. Go to the “Settings” page and copy your key as shown in the image.

Step 4: Install the SDK
Install the SDK using the package manager for your programming language.
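For Python, the endpoint used in the example below is OpenAI-compatible (note the base_url), so the standard openai package is all you need; this assumes a recent Python environment with pip available.

```bash
# The Novita endpoint used below is OpenAI-compatible, so the official
# openai Python package is sufficient for the example in this section.
pip install openai
```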

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with Novita AI LLMs. Below is an example of using the Chat Completions API in Python.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "qwen/qwen2.5-vl-72b-instruct"
stream = True  # or False
max_tokens = 2048
system_content = """Be a helpful assistant"""
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        },
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
```
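Since Qwen2.5-VL is a vision model, you will usually want to send an image along with your prompt. OpenAI-compatible chat endpoints generally express this as a content list containing an image_url part; whether Novita AI accepts exactly this multimodal format for the model should be confirmed against their API docs, so treat the following as a hedged sketch rather than guaranteed syntax.

```python
# Hedged sketch: sending an image with the prompt via the OpenAI-style
# "image_url" content part. Confirm the exact multimodal format in Novita's docs.
vision_res = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
        ],
    }],
    max_tokens=512,
)
print(vision_res.choices[0].message.content)
```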
Using Qwen2.5-VL-72B via Cloud GPU
Step 1: Register an Account
If you’re new to Novita AI, begin by creating an account on our website. Once you’re registered, head to the “GPUs” tab to explore available resources and start your journey.

Step 2: Explore Templates and GPU Servers
Start by selecting a template that matches your project needs, such as PyTorch, TensorFlow, or CUDA. Choose the version that fits your requirements, like PyTorch 2.2.1 or CUDA 11.8.0. Then, select the A100 GPU server configuration, which offers powerful performance to handle demanding workloads with ample VRAM, RAM, and disk capacity.

Step 3: Tailor Your Deployment
After selecting a template and GPU, customize your deployment settings by adjusting options such as the image version (e.g., CUDA 11.8). You can also tweak other configurations to tailor the environment to your project’s specific requirements.

Step 4: Launch an Instance
Once you’ve finalized the template and deployment settings, click “Launch Instance” to set up your GPU instance. This will start the environment setup, enabling you to begin using the GPU resources for your AI tasks.

Qwen2.5-VL-72B-Instruct delivers cutting-edge performance across a wide range of vision-language tasks. Whether you’re automating workflows in finance or analyzing videos in real time, it combines depth, scale, and flexibility. With open-source access and multiple deployment paths—local GPU, cloud instances, or API—Qwen2.5-VL empowers developers and enterprises to build smarter, more capable AI systems.
Frequently Asked Questions
Can I run Qwen2.5-VL-72B locally?
Yes. You can run it on machines with sufficient VRAM (e.g., 8×A100 or 24×RTX 4090 GPUs).
How can I access Qwen2.5-VL-72B via an API?
You can access Qwen2.5-VL-72B-Instruct via Novita AI’s Model Library, start a free trial, and get an API key for fast integration.
What is the difference between the base model and the Instruct version?
The base model handles general vision-language tasks; the “Instruct” version is fine-tuned to follow user instructions more accurately.
What is Novita AI?
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
Recommended Reading
- Qwen2.5-VL: Powerful but RAM-Hungry Vision-Language Model
- Qwen 2.5 72b vs Llama 3.3 70b: Which Model Suits Your Needs?
- Qwen 2.5 vs Llama 3.2 90B: A Comparative Analysis of Coding and Image Reasoning Capabilities