Tutorial: How to Access Qwen2.5-VL-72B Locally, via API, or on Cloud GPU


Key Highlights

Multimodal + Long Video Understanding
- Supports images, documents, and long videos
- Suitable for education, media, and surveillance

Accurate Localization + Structured Output
- Detects objects precisely
- Extracts structured data from invoices, forms, and charts
- Useful in finance, law, and logistics

You can start a free trial on the Novita AI API in just a few clicks!

What is Qwen2.5-VL-72B?

Qwen2.5-VL-72B-Instruct is a powerful 72B-parameter large vision-language model (LVLM) fine-tuned for instruction-following tasks. It supports both textual and visual inputs (images and videos), making it ideal for multimodal reasoning, document understanding, video analysis, and agentic interaction.

An Example to Show Qwen2.5-VL-72B's Ability

Input: Give the query: "the user is experiencing the image generation feature", when does the described content occur in the video? Use seconds for the time format.

Output: The described content occurs from 28 seconds to 50 seconds in the video. During this segment, the user interacts with the image generation feature, requesting and receiving an artistic double scene painting of a mountain during day and night. The user then adds a bird to the generated image, demonstrating the functionality of the image generation tool.

Qwen2.5-VL-72B Overview

| Category | Item | Details |
| --- | --- | --- |
| Basic Info | Release Date | January 28, 2025 |
| Basic Info | Model Size | 73.4B parameters |
| Basic Info | Open Source | Yes (released by Qwen) |
| Architecture | Core Components | Dynamic resolution & frame-rate training; SwiGLU + RMSNorm + window attention; dynamic FPS sampling |
| Language Support | Supported Languages | Excels at multilingual document and scene-text recognition |
| Multimodal | Capability | Visual (images & videos) and textual inputs |
| Context | Context Window | Configurable up to 64K tokens for long videos |
| Precision | Tensor Type | BF16 |
| Benchmarks | MMMU (Image) | 70.2 (Qwen2.5-VL-72B) vs. 70.3 (GPT-4o) |
| Benchmarks | MVBench (Video) | 70.4 (Qwen2.5-VL-72B) vs. 64.6 (GPT-4o) |
| Benchmarks | AITZ_EM (Agent) | 83.2 (Qwen2.5-VL-72B) vs. 35.3 (GPT-4o) |

How to Access Qwen2.5-VL-72B Locally?

Qwen2.5-VL-72B Hardware Requirements

| GPU Model | VRAM per GPU | Configuration | Total VRAM |
| --- | --- | --- | --- |
| Nvidia A100 | 80 GB | 8 GPUs × 80 GB | 640 GB |
| Nvidia H100 | 80 GB | 8 GPUs × 80 GB | 640 GB |
| Nvidia RTX 4090 | 24 GB | 24 GPUs × 24 GB | 576 GB |
| Nvidia L40S | 48 GB | 8 GPUs × 48 GB | 384 GB |
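To see why so much VRAM is needed, consider the weights alone: 73.4B parameters stored in BF16 (2 bytes each) take roughly 147 GB before accounting for the KV cache and activations. A back-of-the-envelope sketch, where the 1.3× runtime overhead factor is an illustrative assumption rather than a measured value:

# Back-of-the-envelope VRAM estimate for serving Qwen2.5-VL-72B in BF16.
# The 1.3x overhead factor (KV cache, activations, CUDA buffers) is an
# illustrative assumption, not a measured value.
params = 73.4e9          # parameter count from the overview table
bytes_per_param = 2      # BF16 stores each weight in 2 bytes
weights_gb = params * bytes_per_param / 1e9

print(f"Weights alone:       {weights_gb:.0f} GB")        # ~147 GB
print(f"With ~1.3x overhead: {weights_gb * 1.3:.0f} GB")  # ~191 GB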

Install Qwen2.5-VL-72B Locally

1. Install Dependencies

# Install the latest Hugging Face Transformers from source (required for Qwen2.5-VL)
pip install git+https://github.com/huggingface/transformers accelerate

# Install the vision utility toolkit (recommended with decord for fast video loading)
pip install 'qwen-vl-utils[decord]==0.0.8'
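To confirm the source install picked up the Qwen2.5-VL model classes (older PyPI releases of Transformers lack them), a quick import check such as the following should succeed:

# Sanity check: these classes only exist in recent Transformers builds,
# so a successful import confirms the source install worked.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

print("Qwen2.5-VL support is available.")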

2. Using Qwen2.5-VL for Visual Question Answering

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Model name (can also be a local path); use the 72B Instruct checkpoint
model_name = "Qwen/Qwen2.5-VL-72B-Instruct"

# Load the model in BF16 and shard it across the available GPUs
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Build a chat message with an image (local path, URL, or base64) and a question
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
        {"type": "text", "text": "What is happening in the image?"},
    ],
}]

# Prepare model inputs from the chat template and the extracted vision data
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens, then decode and print the response
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print("Answer:", response)

3. Video Input Example

# Reuse the model and processor loaded above; only the message changes
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path_or_url_to_video.mp4"},
        {"type": "text", "text": "Summarize the video content."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128)

trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print("Answer:", response)

How to Access Qwen2.5-VL-72B via Novita API?

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.


Step 2: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.

Step 3: Get Your API Key

To authenticate with the API, you need an API key. Go to the "Settings" page and copy your API key as shown in the image.


Step 4: Install the Client and Call the API

Install the client library using the package manager for your programming language.
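For Python, Novita's endpoint is OpenAI-compatible (note the base_url below), so the standard OpenAI client package works:

# Install the OpenAI Python client, used to call Novita's OpenAI-compatible API
pip install openai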

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with the Novita AI LLM service. Below is an example of using the Chat Completions API in Python.

from openai import OpenAI
  
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "qwen/qwen2.5-vl-72b-instruct"
stream = True # or False
max_tokens = 2048
system_content = """Be a helpful assistant"""
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = { "type": "text" }

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
      "top_k": top_k,
      "repetition_penalty": repetition_penalty,
      "min_p": min_p
    }
  )

if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
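The snippet above sends text only. Since the endpoint is OpenAI-compatible and the model is a vision-language model, images can likely be passed as OpenAI-style image_url content parts in the same call; the following is a hedged sketch, and the exact multimodal payload Novita accepts should be confirmed against their API documentation:

# Sketch: sending an image through the OpenAI-compatible API.
# Assumes the endpoint accepts OpenAI-style "image_url" content parts.
vision_res = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in the image?"},
            {"type": "image_url", "image_url": {
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            }},
        ],
    }],
    max_tokens=512,
)
print(vision_res.choices[0].message.content)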
  
  

How to Access Qwen2.5-VL-72B via Cloud GPU?

Step 1: Register an Account

If you’re new to Novita AI, begin by creating an account on our website. Once you’re registered, head to the “GPUs” tab to explore available resources and start your journey.


Step 2: Explore Templates and GPU Servers

Start by selecting a template that matches your project needs, such as PyTorch, TensorFlow, or CUDA. Choose the version that fits your requirements, like PyTorch 2.2.1 or CUDA 11.8.0. Then, select the A100 GPU server configuration, which offers powerful performance to handle demanding workloads with ample VRAM, RAM, and disk capacity.


Step 3: Tailor Your Deployment

After selecting a template and GPU, customize your deployment settings by adjusting parameters like the operating system version (e.g., CUDA 11.8). You can also tweak other configurations to tailor the environment to your project’s specific requirements.


Step 4: Launch an Instance

Once you’ve finalized the template and deployment settings, click “Launch Instance” to set up your GPU instance. This will start the environment setup, enabling you to begin using the GPU resources for your AI tasks.

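Once the instance is running, one common way to serve the model is with vLLM, which supports Qwen2.5-VL and can shard the weights across the instance's GPUs with tensor parallelism. A minimal sketch, assuming an 8-GPU instance and a recent vLLM build; adjust the parallel size and context length to your hardware:

# Install vLLM, then expose an OpenAI-compatible server sharded across 8 GPUs
pip install vllm

vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 32768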

Qwen2.5-VL-72B-Instruct delivers cutting-edge performance across a wide range of vision-language tasks. Whether you’re automating workflows in finance or analyzing videos in real time, it combines depth, scale, and flexibility. With open-source access and multiple deployment paths—local GPU, cloud instances, or API—Qwen2.5-VL empowers developers and enterprises to build smarter, more capable AI systems.

Frequently Asked Questions

Can I deploy Qwen2.5-VL-72B-Instruct locally?

Yes. You can run it on machines with sufficient VRAM (e.g., 8×A100 or 24×4090 GPUs).

How do I use Qwen2.5-VL-72B-Instruct via API?

You can access Qwen2.5-VL-72B-Instruct via Novita AI’s Model Library, start a free trial, and get an API key for fast integration.

What is Qwen2.5-VL-72B vs Qwen2.5-VL-72B-Instruct?

The base model handles general visual-language tasks; the “Instruct” version is fine-tuned to follow user instructions more accurately.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.
