Qwen2.5-VL-72B: Powerful but RAM-Hungry Vision-Language Model

Key Highlights

Text and Layout Analysis: Qwen2.5-VL excels at extracting and analyzing text, charts, icons, and layouts from images.

Agent Capabilities: Acts as a visual agent, supporting tasks like managing smartphones and computers.

Video Understanding: Processes long videos (1+ hour) with precise event pinpointing.

Efficient Access: Novita AI provides an affordable API option to avoid high hardware costs.

Qwen2.5-VL is the latest flagship vision-language model in the Qwen series, representing a significant leap forward from its predecessor, Qwen2-VL. The Qwen2.5-VL-72B-Instruct model is an instruction-tuned version with 72 billion parameters, designed to be a more effective and practical vision-language model based on valuable feedback from developers.

What is Qwen2.5-VL-72B-Instruct?

Qwen2.5-VL-72B-Instruct is a large vision-language model (LVLM) with 72 billion parameters, fine-tuned for instruction-based tasks. It is capable of understanding and analyzing both visual (images/videos) and textual inputs to perform a wide range of tasks. Key enhancements over Qwen2-VL include:

  • Enhanced Visual Understanding: Proficient at recognizing common objects, analyzing text, charts, icons, graphics, and layouts within images.
Product identification via Qwen2.5-VL (image from Qwen)
  • Agentic Capabilities: Acts as a visual agent capable of reasoning and dynamically directing tools for computer and phone use.
Mobile use via Qwen2.5-VL (image from Qwen)
  • Improved Video Understanding: Can comprehend videos longer than an hour, pinpoint relevant video segments, and supports dynamic FPS training and absolute time encoding for better temporal understanding.
Improved video understanding via Qwen2.5-VL (image from Qwen)
  • Precise Object Localization: Accurately detects objects in an image using bounding boxes/points and provides stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: Supports structured outputs for scanned invoices and tables, benefiting applications in finance and commerce.
Object grounding via Qwen2.5-VL (image from Qwen)
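To illustrate the stable JSON output mentioned above, a grounding response might look like the following. The exact schema depends on how you prompt the model; the `label`/`bbox_2d` fields here are an assumption for the sketch, with bounding boxes in absolute pixel coordinates:

```python
import json

# Hypothetical grounding output from Qwen2.5-VL: one entry per detected
# object, each with a label and a [x1, y1, x2, y2] bounding box.
raw_response = """
[
  {"label": "dog", "bbox_2d": [132, 80, 410, 395]},
  {"label": "ball", "bbox_2d": [450, 300, 520, 368]}
]
"""

objects = json.loads(raw_response)
for obj in objects:
    x1, y1, x2, y2 = obj["bbox_2d"]
    print(f'{obj["label"]}: ({x1}, {y1}) to ({x2}, {y2})')
```

Because the output is plain JSON, it can be parsed directly into downstream pipelines (drawing boxes, counting objects, filtering by attribute) without brittle text scraping.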

Qwen2.5-VL-72B-Instruct Benchmarks

Qwen2.5-VL benchmark results (image from Qwen)

Qwen2.5-VL-72B-Instruct has demonstrated competitive or superior performance across numerous benchmarks, excelling in image tasks (e.g., MMMU_val, MathVista_MINI, DocVQA_VAL), video tasks (e.g., VideoMME, MVBench, EgoSchema), and agent-based tasks (e.g., ScreenSpot, Android Control, MobileMiniWob++_SR). It often surpasses Qwen2-VL-72B and rivals leading models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0. Additionally, Qwen2.5-VL-7B outperforms GPT-4o-mini in several tasks, while Qwen2.5-VL-3B surpasses the 7B version of Qwen2-VL.

Qwen2.5-VL-72B-Instruct Hardware Requirements

GPU Model              VRAM per GPU   GPUs Used   Total VRAM   Estimated Cost
Nvidia A100 (80 GB)    80 GB          8           640 GB       Approximately $205,496
Nvidia H100 (80 GB)    80 GB          8           640 GB       Approximately $200,000 – $320,000
Nvidia RTX 4090        24 GB          24          576 GB       Approximately $57,600 – $66,120
Nvidia L40S            48 GB          8           384 GB       Approximately $46,799.60 – $83,712.80

For running Qwen2.5-VL-72B-Instruct locally, the Nvidia A100 (80 GB) is the best all-around option, balancing cost, VRAM, and performance. The H100 (80 GB) suits organizations prioritizing raw performance, while the RTX 4090 and L40S are viable for budget-conscious setups, though they trade off deployment complexity and total VRAM.
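As a back-of-the-envelope check on the table above, the memory needed just to hold the weights follows directly from the parameter count and the precision; real deployments then add KV cache, activations, and framework overhead on top, which is why multi-GPU totals of 384–640 GB are quoted:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    """Estimate the memory (in GiB) needed to hold the model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# 72B parameters at BF16/FP16 precision (2 bytes per parameter) is about
# 134 GiB of weights before any KV cache or activation memory.
fp16 = weight_memory_gib(72, 2)
int8 = weight_memory_gib(72, 1)  # 8-bit quantization roughly halves it

print(f"FP16 weights: ~{fp16:.0f} GiB, INT8 weights: ~{int8:.0f} GiB")
```

This is why even an 8-bit quantized 72B model does not fit on a single consumer GPU, and why the API route below is attractive for most developers.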

Qwen2.5-VL-72B-Instruct Test

Input: Give the query: 'the user is experiencing the image generation feature', when does the described content occur in the video? Use seconds for time format.

Output: The described content occurs from 28 seconds to 50 seconds in the video. During this segment, the user interacts with the image generation feature, requesting and receiving an artistic double scene painting of a mountain during day and night. The user then adds a bird to the generated image, demonstrating the functionality of the image generation tool.

How to Access Qwen2.5-VL-72B-Instruct?

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.


Step 2: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.

Step 3: Get Your API Key

To authenticate with the API, you will need an API key. Go to the “Settings“ page and copy the API key, as indicated in the image.


Step 4: Install the Client Library

Install the client library using the package manager for your programming language (for Python: pip install openai).

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with Novita AI's LLM service. Below is an example of calling the Chat Completions API in Python.

from openai import OpenAI

# Point the OpenAI client at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "qwen/qwen2.5-vl-72b-instruct"
stream = True  # set to False for a single, non-streamed response
max_tokens = 2048
system_content = """Be a helpful assistant"""
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = { "type": "text" }

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
      "top_k": top_k,
      "repetition_penalty": repetition_penalty,
      "min_p": min_p
    }
  )

if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)
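The sample above sends only text. Since Qwen2.5-VL is a vision-language model, you will usually attach an image as well. With OpenAI-compatible endpoints this is typically done via an `image_url` content part; the sketch below assumes Novita AI's endpoint follows that convention, and the image URL is a placeholder:

```python
import os

# Multimodal message: a text instruction plus an image passed by URL.
# Base64 data URLs ("data:image/png;base64,...") are usually accepted too.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/photo.jpg"},
            },
        ],
    }
]

# Sending the request requires the openai package and a Novita AI API key.
if os.environ.get("NOVITA_API_KEY"):
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.novita.ai/v3/openai",
        api_key=os.environ["NOVITA_API_KEY"],
    )
    completion = client.chat.completions.create(
        model="qwen/qwen2.5-vl-72b-instruct",
        messages=messages,
        max_tokens=256,
    )
    print(completion.choices[0].message.content)
```

The same content-part structure also carries the video and document inputs described earlier, subject to the endpoint's supported media types.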
  
  

Qwen2.5-VL-72B-Instruct represents a major advancement in vision-language models, enhancing visual understanding, video comprehension, agent capabilities, localization, and structured output generation. With dynamic resolution training and an efficient visual encoder, it achieves superior benchmark performance across image, video, and agent tasks.

Frequently Asked Questions

How does Qwen2.5-VL-Instruct improve visual understanding?

It analyzes text (multilingual, vertical), charts, icons, and layouts while extracting key information and converting documents into structured formats like HTML.

What are Qwen2.5-VL-Instruct's new video capabilities?

Processes videos over 1 hour, pinpoints events to the second, performs temporal grounding, generates structured captions, and summarizes content.

What is the recommended hardware for running Qwen2.5-VL-Instruct?

For local use, a multi-GPU setup with at least 384 GB of total VRAM is recommended. Alternatively, you can use a cost-effective API such as Novita AI's.

