How to Integrate ERNIE-4.5-VL-28B-A3B-Thinking into Tool-Augmented Workflows

Table Of Contents

Architecture of ERNIE-4.5-VL-28B-A3B
Why ERNIE-4.5-VL-28B-A3B-Thinking can Improve Tool-Augmented Code Workflows
What ERNIE-4.5-VL-28B-A3B-Thinking Actually Does Inside a Code Tool Workflow
How to Access ERNIE-4.5-VL-28B-A3B-Thinking at a Good Price?

Modern developers increasingly struggle to integrate vision-heavy inputs such as diagrams, screenshots, and technical documents into code workflows, while still maintaining low latency and controllable costs. Traditional VLMs are either too slow to sit inside tool loops or too weak at structured reasoning to guide real engineering decisions.

This article explains how ERNIE-4.5-VL-28B-A3B-Thinking addresses this gap by combining strong visual–language reasoning benchmarks with an A3B architecture that enables fast, repeated inference, and demonstrates how these properties make it suitable for tool-augmented code workflows.

Architecture of ERNIE-4.5-VL-28B-A3B

By activating only about 3 billion parameters per token out of roughly 28 billion in total, the model aims for near-flagship reasoning quality at a per-token inference cost closer to that of a small model.

The “A3B” in the model name stands for Active 3B, signaling a Mixture-of-Experts (MoE) architecture designed for extreme efficiency.

Total Parameters: 28-30 Billion (Sparse MoE)
Active Parameters: 3 Billion (per token inference)
Context Window: 128k tokens
Core Enhancements:
- Thinking with Images: Unlike standard VLMs that process images as static tokens, this model can iteratively “zoom” and “search” within an image to resolve fine-grained details.
- GSPO & IcePop RL: Uses reinforcement learning with GSPO (Group Sequence Policy Optimization) and IcePop, two techniques intended to stabilize MoE training on complex reasoning tasks.
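
To make the "active parameters" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration of the general MoE technique, not ERNIE's actual router: the expert count (8), the number of active experts (2), and the scalar "experts" are all assumptions chosen for readability.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}

def moe_forward(x, experts, router_scores, top_k):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    weights = route_token(router_scores, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy setup: 8 scalar "experts", 2 of them active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route_token(scores, TOP_K)
output = moe_forward(1.0, experts, scores, TOP_K)
print(f"active experts: {sorted(weights)} of {NUM_EXPERTS}")
print(f"output: {output:.3f}")
```

Only the selected experts are evaluated for a given token, which is why per-token compute tracks the active parameter count rather than the total.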

Case 1: Thinking with Images in ERNIE-4.5-VL-28B-A3B-Thinking

What’s the text of the sign with a blue background on the wall next to the sidewalk?

From Baidu

Case 2: Solving a Bridge Circuit to Compute Equivalent Resistance

In this example, the model is presented with a non-trivial bridge circuit and asked to calculate the equivalent resistance between nodes A and B.

From Baidu

Why ERNIE-4.5-VL-28B-A3B-Thinking can Improve Tool-Augmented Code Workflows

The benchmark scores show consistent strength in STEM reasoning, document understanding, and visual grounding, which directly correspond to the hardest cognitive steps in real-world code workflows.

Across document understanding and structured reasoning benchmarks, ERNIE-4.5-VL-A3B often lands within a few points of Gemini-2.5-Pro and GPT-5-High, and it leads both on ChartQA, despite activating far fewer parameters per token. The main exception in the table below is MMMU, where it trails by roughly ten points or more.

Benchmark	ERNIE-4.5-VL-A3B	Gemini-2.5-Pro	GPT-5-High	What This Means for Developers
MathVista	82.5	82.7	81.3	Reliable multi-step symbolic reasoning
MathVerse	81.0	82.9	84.1	Strong abstraction under constraints
MMMU	72.2	81.7	84.2	Multimodal problem decomposition
ChartQA	87.1	78.3	78.2	Structured data extraction
DocVQA (val)	93.6	91.2	94.2	Precise document grounding
OCRBench	85.8	86.4	81.0	Robust text recognition from visuals
CharXiv-DQ	90.3	91.2	93.5	Long-form technical reasoning
CV-Bench	83.8	84.8	85.0	Visual logic consistency
Average (All)	73.1	75.4	76.6	Compact model, near-flagship reasoning
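
One way to read the table is to express each ERNIE score as a fraction of the better of the two comparison models. The short script below does that arithmetic using only the numbers from the table; the 0.95 threshold is an arbitrary cut-off chosen for illustration.

```python
# Scores copied from the table above, in the order
# (ERNIE-4.5-VL-A3B, Gemini-2.5-Pro, GPT-5-High).
SCORES = {
    "MathVista": (82.5, 82.7, 81.3),
    "MathVerse": (81.0, 82.9, 84.1),
    "MMMU": (72.2, 81.7, 84.2),
    "ChartQA": (87.1, 78.3, 78.2),
    "DocVQA (val)": (93.6, 91.2, 94.2),
    "OCRBench": (85.8, 86.4, 81.0),
    "CharXiv-DQ": (90.3, 91.2, 93.5),
    "CV-Bench": (83.8, 84.8, 85.0),
}

def relative_to_best(ernie, gemini, gpt5):
    """ERNIE's score divided by the better of the two comparison scores."""
    return ernie / max(gemini, gpt5)

ratios = {name: relative_to_best(*row) for name, row in SCORES.items()}

THRESHOLD = 0.95  # illustrative cut-off, not a figure from the benchmarks
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    marker = "within threshold" if ratio >= THRESHOLD else "below threshold"
    print(f"{name:<14} {ratio:6.1%}  {marker}")

close_count = sum(r >= THRESHOLD for r in ratios.values())
print(f"{close_count} of {len(ratios)} benchmarks at or above {THRESHOLD:.0%}")
```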

Although the model has 28B parameters, only 3B are active per token, enabling fast, low-latency reasoning suitable for repeated calls inside tool loops.

Key characteristics relevant to users:

Active parameters: 3B per token
Effective latency: Comparable to small and mid-size models
Context length: Up to 128k tokens, supporting system-level reasoning

The A3B design enables:

Frequent reasoning passes without prohibitive cost
Stable latency in agentic workflows
Practical deployment as an always-on reasoning API
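
As a rough back-of-the-envelope check on the cost argument, the sketch below applies the common rule of thumb that a transformer forward pass costs about 2 FLOPs per active parameter per token. The rule is an approximation that ignores attention cost and memory bandwidth, and the 28B dense model is a hypothetical comparison point, not a real system.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active_params

ACTIVE_PARAMS = 3e9   # parameters used per token, from the article
TOTAL_PARAMS = 28e9   # total parameters, from the article

sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # hypothetical dense model
ratio = dense_flops / sparse_flops

print(f"sparse: {sparse_flops:.1e} FLOPs/token")
print(f"dense:  {dense_flops:.1e} FLOPs/token")
print(f"estimated compute saving: {ratio:.1f}x")
```

The saving applies to compute only. All expert weights still have to be held in memory, so the hardware footprint is set by the total parameter count.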

Try ERNIE-4.5-VL-28B-A3B-Thinking Now!

What ERNIE-4.5-VL-28B-A3B-Thinking Actually Does Inside a Code Tool Workflow

ERNIE-4.5-VL-28B-A3B-Thinking treats vision as a reasoning input, not just a feature extractor, enabling developers to integrate screenshots, diagrams, and documents directly into code workflows. This is not OCR-plus-text generation. The model reasons over visual structure and aligns it with intent.

1. Diagram and Architecture Understanding

The model can interpret system diagrams and convert visual structure into logical relationships relevant for code decisions.

What the VL capability provides

Identifies components, boundaries, and data flow from diagrams
Aligns visual elements with textual descriptions
Preserves structural relationships in reasoning

Example

Input: Microservice architecture diagram + short design note
Output: Explanation of service dependencies and communication paths
Impact: Code tools are guided to the correct modules instead of scanning the entire codebase
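
A request for this kind of task could be assembled as follows. This is a sketch that assumes the endpoint accepts the OpenAI-style multimodal message format (a `content` list mixing `text` and `image_url` parts); the function name, the prompt wording, and the example URL are hypothetical.

```python
def build_diagram_messages(image_url, design_note):
    """Build chat messages that pair an architecture diagram with a design note."""
    instruction = (
        "List each service in the diagram, the services it depends on, "
        "and the communication path between them.\n\n"
        f"Design note: {design_note}"
    )
    return [
        {"role": "system", "content": "You analyze software architecture diagrams."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                # OpenAI-style image part; assumed to be supported by the endpoint.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]

messages = build_diagram_messages(
    "https://example.com/architecture.png",  # placeholder URL
    "Orders service publishes events consumed by Billing.",
)
print(messages[1]["content"][0]["text"])
```

The resulting list can be passed as the `messages` argument of a chat completion call, and the model's answer then tells downstream code tools which modules to open.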

2. Screenshot-Based Code Context Understanding

The model can reason over UI or IDE screenshots to infer underlying logic and intent.

What the VL capability provides

Reads UI layouts, logs, and error states from screenshots
Connects visual states to likely code paths
Handles incomplete or partial textual information

Example

Input: Screenshot of a failing dashboard with partial error messages
Output: Hypothesis about frontend-backend mismatch and relevant API layer
Impact: Faster debugging without requiring full log reproduction
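
Screenshots usually live on disk rather than at a public URL, so a common approach is to inline them as base64 data URLs. The helper below is a minimal sketch using only the standard library; whether a given endpoint accepts data URLs in `image_url` parts is an assumption, and the function names are illustrative.

```python
import base64
import mimetypes

def image_bytes_to_data_url(data, filename):
    """Encode raw image bytes as a data URL, inferring the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_screenshot_message(data, filename, partial_log):
    """User message combining a screenshot with whatever log text is available."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Partial error output:\n{partial_log}\n"
                                     "Which API layer is the likely cause?"},
            {"type": "image_url",
             "image_url": {"url": image_bytes_to_data_url(data, filename)}},
        ],
    }

# Placeholder bytes stand in for a real file read with open(path, "rb").read().
fake_png = b"\x89PNG\r\n\x1a\n"
message = build_screenshot_message(fake_png, "dashboard.png", "TypeError: undefined")
print(message["content"][1]["image_url"]["url"][:30])
```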

3. Document-Centric Code Reasoning

The model excels at extracting actionable logic from technical documents that mix text, tables, and visuals.

What the VL capability provides

Parses specs, PDFs, and research-style documents
Links figures and tables to implementation logic
Maintains alignment across long documents

Example

Input: API specification PDF with tables and flowcharts
Output: Structured summary of endpoints, constraints, and edge cases
Impact: Code generation tools start from a correct, grounded understanding
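
For the summary to be useful to a code generation tool, it helps to request it as JSON and validate it before use. The sketch below parses such a reply into typed records. The schema (`endpoints`, `method`, `path`, `constraints`) is a hypothetical one invented for this example, and `sample_reply` is a hand-written stand-in, not real model output.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    method: str
    path: str
    constraints: list = field(default_factory=list)

def parse_endpoint_summary(reply_text):
    """Parse a JSON reply into Endpoint records.

    Raises json.JSONDecodeError or KeyError if the reply does not match
    the expected (hypothetical) schema.
    """
    data = json.loads(reply_text)
    return [
        Endpoint(
            method=item["method"].upper(),
            path=item["path"],
            constraints=list(item.get("constraints", [])),
        )
        for item in data["endpoints"]
    ]

# Hand-written stand-in for a model reply that follows the schema.
sample_reply = """
{"endpoints": [
  {"method": "get", "path": "/orders/{id}", "constraints": ["id must be a UUID"]},
  {"method": "post", "path": "/orders"}
]}
"""

endpoints = parse_endpoint_summary(sample_reply)
for endpoint in endpoints:
    print(endpoint.method, endpoint.path, endpoint.constraints)
```

Failing loudly on a malformed reply keeps an ungrounded summary from reaching the code generator.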

4. Visual Reasoning for Problem Decomposition

Visual inputs are used to drive multi-step reasoning, not just recognition.

What the VL capability provides

Converts visual problems into symbolic representations
Maintains consistency across reasoning steps
Supports abstraction before implementation

Example

Input: Data pipeline flowchart
Output: Stepwise breakdown of processing stages and failure points
Impact: Enables targeted tool calls instead of broad debugging
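
Targeted tool calls are typically wired up through function calling. The sketch below shows an OpenAI-style tool definition and a small dispatcher that runs the function the model asks for. The tool name `inspect_stage`, its parameters, and its stubbed result are hypothetical; a real tool would query the pipeline.

```python
import json

# OpenAI-style tool schema describing one hypothetical tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "inspect_stage",
        "description": "Check the health of one stage in the data pipeline.",
        "parameters": {
            "type": "object",
            "properties": {"stage": {"type": "string"}},
            "required": ["stage"],
        },
    },
}]

def inspect_stage(stage):
    """Stub implementation; a real tool would query logs or metrics."""
    return {"stage": stage, "status": "checked"}

REGISTRY = {"inspect_stage": inspect_stage}

def dispatch_tool_call(name, arguments_json):
    """Run the local function named by a tool call.

    In the OpenAI-style response format, arguments arrive as a JSON string.
    """
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return REGISTRY[name](**kwargs)

# Simulated tool call, standing in for what a model response would contain.
result = dispatch_tool_call("inspect_stage", '{"stage": "deduplicate"}')
print(result)
```

The `TOOLS` list would be passed as the `tools` argument of the chat completion request, and each returned tool call routed through `dispatch_tool_call`.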

How to Access ERNIE-4.5-VL-28B-A3B-Thinking at a Good Price?

Novita AI offers ERNIE-4.5-VL-28B-A3B-Thinking through its API with a 30K context window, priced at $0.112 per million input tokens and $0.448 per million output tokens, with support for structured outputs and function calling.
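
To see what these prices mean for a tool loop, the estimate below assumes the quoted figures are per one million tokens, which is an assumption about the billing unit. The token counts per call and the number of calls are made-up values for illustration.

```python
INPUT_PRICE_USD = 0.112    # quoted input price
OUTPUT_PRICE_USD = 0.448   # quoted output price
TOKENS_PER_PRICE_UNIT = 1_000_000  # assumption: prices are per million tokens

def estimate_cost(input_tokens, output_tokens):
    """Estimated cost in USD for one request."""
    total = input_tokens * INPUT_PRICE_USD + output_tokens * OUTPUT_PRICE_USD
    return total / TOKENS_PER_PRICE_UNIT

# Hypothetical workload: 50 calls in a loop, each with a screenshot-heavy
# prompt of 20,000 tokens and a 2,000-token reply.
per_call = estimate_cost(20_000, 2_000)
loop_total = 50 * per_call

print(f"per call:  ${per_call:.6f}")
print(f"50 calls:  ${loop_total:.4f}")
```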

Step 1: Log In and Access the Model Library

Step 2: Choose Your Model

Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial

Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key

To authenticate with the API, you need an API key, which Novita AI generates for you. Open the “Settings” page and copy the API key as indicated in the image.

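from openai import OpenAI

# Novita AI exposes an OpenAI-compatible endpoint, so the official OpenAI
# client works once it is pointed at Novita's base URL.
client = OpenAI(
    api_key="<Your API Key>",  # the key copied from the Settings page
    base_url="https://api.novita.ai/openai"
)

response = client.chat.completions.create(
    model="baidu/ernie-4.5-vl-28b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=8000,  # upper bound on the number of generated tokens
    temperature=0.7
)

# Print the assistant's reply from the first returned choice.
print(response.choices[0].message.content)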

ERNIE-4.5-VL-28B-A3B-Thinking achieves near-flagship visual–language reasoning performance while activating only 3B parameters per token, enabling low-latency, high-frequency reasoning inside tool workflows. Its benchmark-proven strengths in document understanding, visual grounding, and STEM reasoning allow it to act as a reasoning coordinator rather than a syntax engine. As a result, it is well suited for developers who need to integrate diagrams, screenshots, and technical documents into code tools without sacrificing speed or cost efficiency.

Frequently Asked Questions

What kind of reasoning tasks is ERNIE-4.5-VL-28B-A3B-Thinking best suited for?

ERNIE-4.5-VL-28B-A3B-Thinking is best suited for visual–language reasoning tasks such as diagram interpretation, document understanding, and structured problem decomposition, rather than pure syntax-level code generation.

Can ERNIE-4.5-VL-28B-A3B-Thinking replace a code-specialized LLM?

No. ERNIE-4.5-VL-28B-A3B-Thinking is designed to complement code-specialized models by handling visual understanding, planning, and validation, not low-level code execution.

What makes the visual–language capability of ERNIE-4.5-VL-28B-A3B-Thinking different from OCR-based models?

ERNIE-4.5-VL-28B-A3B-Thinking reasons over visual structure and intent, enabling tasks such as diagram-based system understanding and screenshot-driven debugging rather than simple text extraction.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
