How to Integrate ERNIE-4.5-VL-A3B Into Tool-Augmented Workflows

Modern developers increasingly struggle to integrate vision-heavy inputs such as diagrams, screenshots, and technical documents into code workflows, while still maintaining low latency and controllable costs. Traditional VLMs are either too slow to sit inside tool loops or too weak at structured reasoning to guide real engineering decisions.

This article explains how ERNIE-4.5-VL-28B-A3B-Thinking addresses this gap by combining strong visual–language reasoning benchmark results with an A3B architecture that enables fast, repeated inference, and shows why these properties make it suitable for tool-augmented code workflows.

Architecture of ERNIE-4.5-VL-28B-A3B

By activating only about 3 billion of its 28 billion parameters per token, the model aims to deliver reasoning quality close to much larger flagship models at an inference cost closer to that of a small model.

The “A3B” in the model name stands for Active 3B, signaling a Mixture-of-Experts (MoE) architecture designed for extreme efficiency.

  • Total Parameters: about 28 Billion (Sparse MoE)

  • Active Parameters: 3 Billion (per token inference)

  • Context Window: 128k tokens

  • Core Enhancements:

    • Thinking with Images: Unlike standard VLMs that process images as static tokens, this model can iteratively “zoom” and “search” within an image to resolve fine-grained details.
    • GSPO & IcePop RL: Uses reinforcement learning with GSPO (Group Sequence Policy Optimization) and IcePop to stabilize MoE training, so that expert routing stays reliable on complex, multi-step logic.
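To make the active-versus-total distinction concrete, here is a toy sketch of top-k expert routing in Python. It is not ERNIE's actual router: the function names, the gate scores, and the choice of k are illustrative assumptions. It only shows why a sparse MoE touches a small share of its weights for each token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k):
    """Pick the top_k experts for one token and renormalise their weights.

    Toy illustration only: production routers add load balancing,
    capacity limits, and other details omitted here.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def active_fraction(active_billion, total_billion):
    """Share of parameters used per token."""
    return active_billion / total_billion


if __name__ == "__main__":
    # Hypothetical gate scores for 4 experts; only 2 experts run for this token.
    print(route_token([0.1, 2.0, -1.0, 1.5], top_k=2))
    # 3B active out of 28B total, the figures quoted in this section.
    print(f"{active_fraction(3, 28):.1%}")
```

With the figures quoted in this section, roughly 11% of the parameters take part in any single token's forward pass.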

Case 1: Thinking with Images in ERNIE-4.5-VL-28B-A3B-Thinking

What’s the text of the sign with a blue background on the wall next to the sidewalk?

Figure: ERNIE-4.5-VL-28B-A3B-Thinking "Thinking with Images" example. Source: Baidu.

Case 2: Solving a Bridge Circuit to Compute Equivalent Resistance

In this example, the model is presented with a non-trivial bridge circuit and asked to calculate the equivalent resistance between nodes A and B.

Figure: ERNIE-4.5-VL-28B-A3B-Thinking bridge circuit example. Source: Baidu.
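In a tool-augmented workflow, an answer like this can be checked numerically instead of taken on trust. The sketch below solves a five-resistor bridge by nodal analysis. The topology labels (R1 to R5, internal nodes C and D) and any resistor values you pass in are hypothetical, since the circuit from Baidu's example is not reproduced here.

```python
def bridge_equivalent_resistance(r1, r2, r3, r4, r5):
    """Equivalent resistance between A and B of a five-resistor bridge.

    Assumed topology (hypothetical labels):
      r1: A-C, r2: A-D, r3: C-B, r4: D-B, r5: C-D (the bridge arm).
    Method: hold A at 1 V and B at 0 V, solve Kirchhoff's current law
    at nodes C and D, then divide 1 V by the current leaving A.
    """
    g1, g2, g3, g4, g5 = (1.0 / r for r in (r1, r2, r3, r4, r5))

    # KCL at C: (g1 + g3 + g5) * Vc - g5 * Vd = g1
    # KCL at D: -g5 * Vc + (g2 + g4 + g5) * Vd = g2
    a11, a12, b1 = g1 + g3 + g5, -g5, g1
    a21, a22, b2 = -g5, g2 + g4 + g5, g2

    # Solve the 2x2 system with Cramer's rule.
    det = a11 * a22 - a12 * a21
    vc = (b1 * a22 - a12 * b2) / det
    vd = (a11 * b2 - a21 * b1) / det

    current_from_a = g1 * (1.0 - vc) + g2 * (1.0 - vd)
    return 1.0 / current_from_a


if __name__ == "__main__":
    # Unbalanced example with made-up values in ohms.
    print(bridge_equivalent_resistance(1, 2, 2, 1, 1))
```

For the made-up values in the usage line the result is 7/5 Ω, which gives a tool a concrete number to compare against the model's answer.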

Why ERNIE-4.5-VL-28B-A3B-Thinking can Improve Tool-Augmented Code Workflows

The benchmark scores show consistent strength in STEM reasoning, document understanding, and visual grounding, which directly correspond to the hardest cognitive steps in real-world code workflows.

Across document understanding and structured reasoning benchmarks, ERNIE-4.5-VL-A3B scores within about 5% of the stronger of Gemini-2.5-Pro and GPT-5-High on most of the benchmarks listed here and leads both on ChartQA, while activating only 3B parameters per token. MMMU is the clear exception, where it trails by roughly ten points.

| Benchmark | ERNIE-4.5-VL-A3B | Gemini-2.5-Pro | GPT-5-High | What This Means for Developers |
| --- | --- | --- | --- | --- |
| MathVista | 82.5 | 82.7 | 81.3 | Reliable multi-step symbolic reasoning |
| MathVerse | 81.0 | 82.9 | 84.1 | Strong abstraction under constraints |
| MMMU | 72.2 | 81.7 | 84.2 | Multimodal problem decomposition |
| ChartQA | 87.1 | 78.3 | 78.2 | Structured data extraction |
| DocVQA (val) | 93.6 | 91.2 | 94.2 | Precise document grounding |
| OCRBench | 85.8 | 86.4 | 81.0 | Robust text recognition from visuals |
| CharXiv-DQ | 90.3 | 91.2 | 93.5 | Long-form technical reasoning |
| CV-Bench | 83.8 | 84.8 | 85.0 | Visual logic consistency |
| Average (All) | 73.1 | 75.4 | 76.6 | Compact model, near-flagship reasoning |
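The comparison can be made explicit with a few lines of arithmetic. The sketch below copies the per-benchmark scores from the table (leaving out the "Average (All)" row) and divides the ERNIE score by the stronger of the two baselines. The function names and the 0.95 threshold are choices made for this illustration.

```python
# Scores copied from the table: (ERNIE-4.5-VL-A3B, Gemini-2.5-Pro, GPT-5-High)
SCORES = {
    "MathVista": (82.5, 82.7, 81.3),
    "MathVerse": (81.0, 82.9, 84.1),
    "MMMU": (72.2, 81.7, 84.2),
    "ChartQA": (87.1, 78.3, 78.2),
    "DocVQA (val)": (93.6, 91.2, 94.2),
    "OCRBench": (85.8, 86.4, 81.0),
    "CharXiv-DQ": (90.3, 91.2, 93.5),
    "CV-Bench": (83.8, 84.8, 85.0),
}


def relative_to_best(scores):
    """Map each benchmark to ERNIE's score divided by the stronger baseline."""
    return {
        name: ernie / max(gemini, gpt5)
        for name, (ernie, gemini, gpt5) in scores.items()
    }


def at_or_above(ratios, threshold=0.95):
    """Benchmarks where the ratio meets the threshold, sorted by name."""
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


if __name__ == "__main__":
    ratios = relative_to_best(SCORES)
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:14s} {ratio:.3f}")
    print(len(at_or_above(ratios)), "of", len(ratios), "benchmarks at 95% or better")
```

On these numbers, seven of the eight listed benchmarks clear the 95% mark, ChartQA exceeds both baselines, and MMMU is the one that falls short.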

Although the model has 28B parameters, only 3B are active per token, enabling fast, low-latency reasoning suitable for repeated calls inside tool loops.

Key characteristics relevant to users:

  • Active parameters: 3B per token
  • Effective latency: Comparable to small and mid-size models
  • Context length: Up to 128k tokens, supporting system-level reasoning

The A3B design enables:

  • Frequent reasoning passes without prohibitive cost
  • Stable latency in agentic workflows
  • Practical deployment as an always-on reasoning API

Try ERNIE-4.5-VL-28B-A3B-Thinking Now!

What ERNIE-4.5-VL-28B-A3B-Thinking Actually Does Inside a Code Tool Workflow

ERNIE-4.5-VL-28B-A3B-Thinking treats vision as a reasoning input, not just a feature extractor, enabling developers to integrate screenshots, diagrams, and documents directly into code workflows. This is not OCR-plus-text generation: the model reasons over visual structure and aligns it with intent.

1. Diagram and Architecture Understanding

The model can interpret system diagrams and convert visual structure into logical relationships relevant for code decisions.

What the VL capability provides

  • Identifies components, boundaries, and data flow from diagrams
  • Aligns visual elements with textual descriptions
  • Preserves structural relationships in reasoning

Example

  • Input: Microservice architecture diagram + short design note
  • Output: Explanation of service dependencies and communication paths
  • Impact: Code tools are guided to the correct modules instead of scanning the entire codebase
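One way to turn such an explanation into something a code tool can act on is to ask the model for dependency edges and then walk the graph. The sketch below assumes the model's answer has already been parsed into (caller, callee) pairs. The service names are made up for illustration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Turn (caller, callee) pairs into an adjacency map."""
    graph = defaultdict(set)
    for caller, callee in edges:
        graph[caller].add(callee)
    return graph


def reachable_from(graph, start):
    """Breadth-first walk: every service `start` depends on, directly or transitively."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    # Hypothetical edges, as if extracted from an architecture diagram.
    edges = [
        ("gateway", "orders"),
        ("gateway", "auth"),
        ("orders", "payments"),
        ("orders", "inventory"),
    ]
    graph = build_graph(edges)
    # Only these modules need to be searched when debugging "orders".
    print(sorted(reachable_from(graph, "orders")))
```

A code tool can then restrict its search to the returned services rather than the whole repository.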

2. Screenshot-Based Code Context Understanding

The model can reason over UI or IDE screenshots to infer underlying logic and intent.

What the VL capability provides

  • Reads UI layouts, logs, and error states from screenshots
  • Connects visual states to likely code paths
  • Handles incomplete or partial textual information

Example

  • Input: Screenshot of a failing dashboard with partial error messages
  • Output: Hypothesis about frontend-backend mismatch and relevant API layer
  • Impact: Faster debugging without requiring full log reproduction
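If the prompt asks the model to answer in JSON, the hypothesis can be handed to downstream tools as data. The sketch below is one way to parse such a reply. The field names (summary, suspected_layer, confidence) and the sample reply are hypothetical, not a format the model is known to emit by default.

```python
import json
from dataclasses import dataclass


@dataclass
class DebugHypothesis:
    summary: str
    suspected_layer: str
    confidence: float


def parse_hypothesis(reply_text):
    """Extract and validate a JSON hypothesis from a model reply.

    Replies may wrap the JSON in prose or code fences, so this takes
    the span from the first '{' to the last '}' and parses that.
    """
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply_text[start:end + 1])
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return DebugHypothesis(
        summary=str(data["summary"]),
        suspected_layer=str(data["suspected_layer"]),
        confidence=confidence,
    )


if __name__ == "__main__":
    # Made-up reply text standing in for a real model response.
    sample = (
        "Here is my analysis:\n"
        '{"summary": "Chart expects field total, API returns amount", '
        '"suspected_layer": "api", "confidence": 0.7}'
    )
    print(parse_hypothesis(sample))
```

Rejecting malformed replies early keeps a bad guess from silently steering the rest of the tool chain.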

3. Document-Centric Code Reasoning

The model excels at extracting actionable logic from technical documents that mix text, tables, and visuals.

What the VL capability provides

  • Parses specs, PDFs, and research-style documents
  • Links figures and tables to implementation logic
  • Maintains alignment across long documents

Example

  • Input: API specification PDF with tables and flowcharts
  • Output: Structured summary of endpoints, constraints, and edge cases
  • Impact: Code generation tools start from a correct, grounded understanding
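A structured summary is easier to consume when its shape is fixed up front. The sketch below builds a JSON Schema in the OpenAI-style `response_format` layout and checks a parsed reply against the required fields. Whether the hosted endpoint accepts this exact layout for this model is an assumption, and the schema name and field names are invented for this example.

```python
REQUIRED_ENDPOINT_FIELDS = ["method", "path", "constraints", "edge_cases"]


def endpoint_summary_format():
    """Build an OpenAI-style `response_format` value for an API spec summary."""
    endpoint = {
        "type": "object",
        "properties": {
            "method": {"type": "string"},
            "path": {"type": "string"},
            "constraints": {"type": "array", "items": {"type": "string"}},
            "edge_cases": {"type": "array", "items": {"type": "string"}},
        },
        "required": list(REQUIRED_ENDPOINT_FIELDS),
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "api_spec_summary",  # hypothetical schema name
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"endpoints": {"type": "array", "items": endpoint}},
                "required": ["endpoints"],
                "additionalProperties": False,
            },
        },
    }


def missing_fields(summary):
    """Return (index, field) pairs for endpoint entries lacking a required field."""
    problems = []
    for index, entry in enumerate(summary.get("endpoints", [])):
        for field in REQUIRED_ENDPOINT_FIELDS:
            if field not in entry:
                problems.append((index, field))
    return problems


if __name__ == "__main__":
    # Made-up parsed reply with one incomplete entry.
    parsed = {"endpoints": [{"method": "GET", "path": "/orders"}]}
    print(missing_fields(parsed))
```

The returned dictionary would be passed as the `response_format` argument of a chat completion request. The local check is a cheap safeguard in case the server does not enforce the schema.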

4. Visual Reasoning for Problem Decomposition

Visual inputs are used to drive multi-step reasoning, not just recognition.

What the VL capability provides

  • Converts visual problems into symbolic representations
  • Maintains consistency across reasoning steps
  • Supports abstraction before implementation

Example

  • Input: Data pipeline flowchart
  • Output: Stepwise breakdown of processing stages and failure points
  • Impact: Enables targeted tool calls instead of broad debugging
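A targeted tool call can be sketched as a small dispatcher. The tool specification below follows the OpenAI-style function-calling layout. The tool itself (check_stage_logs) is a stub invented for this example, and the call dictionary mirrors the name and JSON-string arguments that such APIs return for a function call.

```python
import json


def check_stage_logs(stage):
    """Stand-in tool: a real version would query your logging system."""
    return {"stage": stage, "status": "stub: no logs fetched"}


# Registry mapping tool names to Python callables.
TOOLS = {"check_stage_logs": check_stage_logs}

# Specification that would be sent in the `tools` field of a request.
TOOL_SPECS = [
    {
        "type": "function",
        "function": {
            "name": "check_stage_logs",
            "description": "Fetch recent logs for one pipeline stage.",
            "parameters": {
                "type": "object",
                "properties": {"stage": {"type": "string"}},
                "required": ["stage"],
            },
        },
    }
]


def dispatch(tool_call):
    """Run one tool call given as {"name": str, "arguments": JSON string}."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(tool_call["arguments"])
    return TOOLS[name](**kwargs)


if __name__ == "__main__":
    # Made-up call, as if the model had singled out the "transform" stage.
    print(dispatch({"name": "check_stage_logs", "arguments": '{"stage": "transform"}'}))
```

Because the model names one stage, the tool runs once against that stage instead of sweeping every step of the pipeline.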

How to Access ERNIE-4.5-VL-28B-A3B-Thinking at a Good Price

Novita AI offers the ERNIE-4.5-VL-28B-A3B-Thinking API with a 30K context window (smaller than the model's native 128k), priced at $0.112 per million input tokens and $0.448 per million output tokens, with support for structured outputs and function calling.
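For budgeting repeated calls inside a tool loop, a rough estimate is enough. The sketch below assumes the listed prices are in US dollars per million tokens, which is an assumption about the unit. The token counts in the usage line are made up.

```python
# Assumed unit: USD per 1,000,000 tokens.
INPUT_USD_PER_MILLION = 0.112
OUTPUT_USD_PER_MILLION = 0.448


def estimate_cost(input_tokens, output_tokens, calls=1):
    """Estimated USD cost for `calls` requests of the given size."""
    per_call = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000
    return per_call * calls


if __name__ == "__main__":
    # Hypothetical loop: 100 calls, 4,000 input and 500 output tokens each.
    print(f"${estimate_cost(4_000, 500, calls=100):.4f}")
```

Under that assumption, the hypothetical loop in the usage line costs about seven cents.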

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.

Step 2: Choose Your Model

Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key

Novita AI issues you a new API key for authenticating with the API. Open the “Settings” page and copy the key from there.

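With the key copied, you can call the model from the openai Python package by pointing it at Novita AI's endpoint:

from openai import OpenAI

# Point the OpenAI SDK at Novita AI's endpoint.
client = OpenAI(
    api_key="<Your API Key>",  # paste the key copied from the Settings page
    base_url="https://api.novita.ai/openai"
)

# Send a simple text-only chat request to confirm that the key works.
response = client.chat.completions.create(
    model="baidu/ernie-4.5-vl-28b-a3b",  # use the exact model ID shown in the Model Library
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=8000,   # upper bound on the length of the reply
    temperature=0.7    # lower values give more deterministic answers
)

# Print the text of the first (and only) returned choice.
print(response.choices[0].message.content)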
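The request above sends text only. For the screenshot and diagram use cases discussed earlier, the message also needs an image part. The sketch below builds one in the OpenAI-style content-parts layout (`text` plus `image_url`). That this endpoint accepts this layout for this model is an assumption, and the helper names are invented here.

```python
import base64


def image_to_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data URL for inline upload."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_messages(question, image_url):
    """One user message holding a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def ask_about_image(client, model, question, image_url, max_tokens=2000):
    """Send one image question through an OpenAI-compatible client object."""
    response = client.chat.completions.create(
        model=model,
        messages=build_vision_messages(question, image_url),
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

With the `client` from the previous snippet, a call would look like `ask_about_image(client, "baidu/ernie-4.5-vl-28b-a3b", "Which service fails in this screenshot?", image_to_data_url(open("dashboard.png", "rb").read()))`, where the file name and the question are placeholders.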

ERNIE-4.5-VL-28B-A3B-Thinking achieves near-flagship visual–language reasoning performance while activating only 3B parameters per token, enabling low-latency, high-frequency reasoning inside tool workflows. Its benchmark-proven strengths in document understanding, visual grounding, and STEM reasoning allow it to act as a reasoning coordinator rather than a syntax engine. As a result, it is well suited for developers who need to integrate diagrams, screenshots, and technical documents into code tools without sacrificing speed or cost efficiency.

Frequently Asked Questions

What kind of reasoning tasks is ERNIE-4.5-VL-28B-A3B-Thinking best suited for?

ERNIE-4.5-VL-28B-A3B-Thinking is best suited for visual–language reasoning tasks such as diagram interpretation, document understanding, and structured problem decomposition, rather than pure syntax-level code generation.

Can ERNIE-4.5-VL-28B-A3B-Thinking replace a code-specialized LLM?

No. ERNIE-4.5-VL-28B-A3B-Thinking is designed to complement code-specialized models by handling visual understanding, planning, and validation, not low-level code execution.

What makes the visual–language capability of ERNIE-4.5-VL-28B-A3B-Thinking different from OCR-based models?

ERNIE-4.5-VL-28B-A3B-Thinking reasons over visual structure and intent, enabling tasks such as diagram-based system understanding and screenshot-driven debugging rather than simple text extraction.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.