PaddleOCR on Novita AI: Ultra-Compact 0.9B Vision-Language Model for Document Parsing

PaddleOCR-VL is now available on the Novita AI platform, bringing state-of-the-art multilingual document parsing capabilities through an ultra-compact 0.9B vision-language model. This innovative solution integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition across 109 languages.

What is PaddleOCR-VL?

PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition.

The model supports 109 languages and recognizes complex elements including text, tables, formulas, and charts, while keeping resource consumption low. In evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition.

It significantly outperforms existing solutions, is competitive with top-tier VLMs, and delivers fast inference, which makes it well suited to practical deployment in real-world scenarios.

Core Features

Compact yet Powerful VLM Architecture

PaddleOCR-VL introduces a vision-language model designed for resource-efficient inference that still achieves strong element recognition. It pairs a NaViT-style dynamic high-resolution visual encoder with the lightweight ERNIE-4.5-0.3B language model, which improves both recognition capability and decoding efficiency. The combination keeps accuracy high while reducing computational demands, making it well suited to practical document processing applications.

SOTA Performance on Document Parsing

PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions and is competitive with leading vision-language models in document parsing. It also excels at recognizing complex document elements such as text, tables, formulas, and charts, and it handles challenging content such as handwritten text and historical documents, which makes it versatile across a wide range of document types and scenarios.

Multilingual Support

PaddleOCR-VL supports 109 languages, covering major global languages, including but not limited to Chinese, English, Japanese, Latin, and Korean. It also supports languages with different scripts and structures, such as Russian (Cyrillic script), Arabic, Hindi (Devanagari script), and Thai.

This broad language coverage substantially enhances the applicability of the system to multilingual and globalized document processing scenarios.

Model Architecture

[Figure: Model architecture of PaddleOCR-VL]

The NaViT-style dynamic high-resolution visual encoder enables the model to process documents of varying resolutions efficiently, maintaining high-quality feature extraction across different document types and layouts. The lightweight ERNIE-4.5-0.3B language model provides robust language understanding and generation capabilities, processing the visual features to generate structured outputs.

This architectural design achieves an optimal balance between model size, inference speed, and recognition accuracy, making PaddleOCR-VL-0.9B ideal for practical deployment where both performance and efficiency are critical requirements.
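
To make the dynamic-resolution idea concrete, here is a minimal sketch of how a NaViT-style encoder might turn images of different sizes into token sequences of different lengths, rather than resizing every page to one fixed square. The patch size of 14 pixels and the helper names `patch_grid` and `token_count` are illustrative assumptions, not values or functions taken from PaddleOCR-VL.

```python
import math

PATCH_SIZE = 14  # assumed patch edge in pixels; not confirmed for PaddleOCR-VL


def patch_grid(width: int, height: int, patch: int = PATCH_SIZE) -> tuple[int, int]:
    """Return (columns, rows) of patches needed to cover an image at its native size."""
    return math.ceil(width / patch), math.ceil(height / patch)


def token_count(width: int, height: int, patch: int = PATCH_SIZE) -> int:
    """Number of visual tokens the encoder would see for this image."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


if __name__ == "__main__":
    # A narrow receipt and a full-page scan yield very different sequence lengths.
    for name, (w, h) in {"receipt": (420, 1400), "page scan": (1240, 1754)}.items():
        print(name, patch_grid(w, h), token_count(w, h))
```

Under these assumptions the narrow receipt needs far fewer tokens than the full page, which is the intuition behind processing each document at a resolution that fits its layout.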

Performance Benchmarks

[Figure: Model benchmark results]

PaddleOCR-VL demonstrates exceptional performance across multiple evaluation dimensions, establishing itself as a state-of-the-art solution for document parsing and element recognition.

Page-Level Document Parsing

OmniDocBench v1.5: PaddleOCR-VL achieves SOTA performance on the overall, text, formula, table, and reading-order metrics of OmniDocBench v1.5.

The model consistently outperforms competing solutions across all evaluated categories, demonstrating its comprehensive document understanding capabilities.

OmniDocBench v1.0: PaddleOCR-VL achieves SOTA performance on almost all of the overall, text, formula, table, and reading-order metrics of OmniDocBench v1.0.

These results validate the model’s robust capabilities across diverse document types and complexity levels.

Note: The metrics are from MinerU, OmniDocBench, and internal evaluations.

Element-Level Recognition

Text Recognition: PaddleOCR-VL handles diverse document types robustly and is the leading method in the OmniDocBench-OCR-block evaluation.

The in-house OCR evaluation provides an assessment of performance across multiple languages and text types. PaddleOCR-VL demonstrates outstanding accuracy with the lowest edit distances in all evaluated scripts.
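
Edit distance counts the character insertions, deletions, and substitutions that separate the recognized text from the ground truth, so lower is better. The sketch below computes a normalized Levenshtein distance. Dividing by the longer string's length is one common convention and is an assumption here, since the post does not say which variant the in-house evaluation uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted: str, truth: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    if not predicted and not truth:
        return 0.0
    return levenshtein(predicted, truth) / max(len(predicted), len(truth))


if __name__ == "__main__":
    # Two typical OCR confusions: "I" read as "l" and "0" read as "O".
    print(normalized_edit_distance("lnvoice 2O24", "Invoice 2024"))
```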

Table Recognition: The self-built evaluation set contains diverse types of table images, such as Chinese, English, and mixed Chinese-English tables, tables with full, partial, or no borders, book/manual formats, lists, academic papers, tables with merged cells, as well as low-quality and watermarked tables.

PaddleOCR-VL achieves remarkable performance across all categories.

Formula Recognition: The evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas.

PaddleOCR-VL demonstrates the best performance in every category.

Chart Recognition: The evaluation set is broadly categorized into 11 chart categories, including bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar.

PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.

Use Cases and Applications

Document Digitization

Transform paper documents into searchable digital formats with PaddleOCR-VL’s powerful text recognition across 109 languages. Process invoices, receipts, contracts, and business documents efficiently while maintaining high accuracy even with low-quality scans or watermarked content.

Academic Research

Extract mathematical formulas, tables, and text from research papers and scientific publications. PaddleOCR-VL’s exceptional formula recognition handles both simple and complex mathematical expressions, making it ideal for literature review and data extraction from academic content.

Financial Document Processing

Automate the extraction of data from financial statements, balance sheets, and reports. The model’s advanced table recognition accurately parses complex tables with merged cells, multiple languages, and various formatting styles commonly found in financial documents.

Historical Archive Digitization

Preserve historical documents and manuscripts with PaddleOCR-VL’s robust handling of challenging content including handwritten text, old fonts, faded ink, and aged paper. The model maintains accuracy even with historical documents in various scripts and languages.

Chart and Data Analysis

Extract insights from visual data representations across 11 chart types including bar charts, pie charts, line graphs, and complex hybrid visualizations. Perfect for business intelligence applications and automated reporting systems.

Getting Started with PaddleOCR on Novita AI Platform

Accessing PaddleOCR-VL through Novita AI offers multiple pathways tailored to different technical expertise levels and use cases. Whether you’re a business user exploring AI capabilities or a developer building production applications, Novita AI provides the tools you need.

Use the Playground (Available Now – No Coding Required)

  • Instant Access: Sign up and start experimenting with PaddleOCR-VL in seconds
  • Interactive Interface: Test document parsing and visualize outputs in real-time
  • Model Comparison: Compare PaddleOCR-VL with other leading models for your specific use case

The playground enables you to test various document types and see immediate results without any technical setup. Perfect for prototyping, testing ideas, and understanding model capabilities before full implementation.

Integrate via API (Live and Ready – For Developers)

Connect PaddleOCR-VL to your applications with Novita AI’s unified REST API.

Option 1: Direct API Integration (Python Example)

from openai import OpenAI

# Point the OpenAI SDK at Novita AI's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.novita.ai/openai",
    api_key="",  # paste your Novita AI API key here
)

# Request settings.
model = "paddlepaddle/paddleocr-vl"
stream = True  # or False
max_tokens = 8192
system_content = "Be a helpful assistant"
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        },
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    # Sampling options outside the standard OpenAI parameters go in extra_body.
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    # Print tokens as they arrive; skip chunks that carry no choices.
    for chunk in chat_completion_res:
        if chunk.choices:
            print(chunk.choices[0].delta.content or "", end="")
    print()
else:
    print(chat_completion_res.choices[0].message.content)
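
The example above sends only text. For document parsing you would normally attach a page image. The sketch below builds such a request using the OpenAI-style `image_url` content part with a base64 data URL. Whether Novita AI's endpoint accepts this exact payload for this model is an assumption, as are the `"OCR:"` task prompt, the `NOVITA_API_KEY` environment variable, and the helper names `build_ocr_messages` and `parse_document`.

```python
import base64
import os


def build_ocr_messages(image_bytes: bytes, mime_type: str = "image/png",
                       prompt: str = "OCR:") -> list:
    """Build an OpenAI-style chat payload with one inline image and a task prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},  # assumed task prompt
            ],
        }
    ]


def parse_document(path: str) -> str:
    """Send one local image to the model and return the recognized text."""
    # Imported lazily so build_ocr_messages works without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.novita.ai/openai",
        api_key=os.environ["NOVITA_API_KEY"],  # hypothetical variable name
    )
    with open(path, "rb") as f:
        messages = build_ocr_messages(f.read())
    response = client.chat.completions.create(
        model="paddlepaddle/paddleocr-vl",
        messages=messages,
        max_tokens=8192,
    )
    return response.choices[0].message.content
```

With a key exported in the environment, calling `parse_document("page.png")` on a local scan (a hypothetical file name) would return the model's text output for that page.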
  
  

Option 2: Multi-Agent Workflows with OpenAI Agents SDK

Build sophisticated multi-agent systems leveraging PaddleOCR-VL’s advanced document parsing capabilities:

  • Plug-and-Play Integration: Use PaddleOCR-VL in any OpenAI Agents workflow
  • Advanced Agent Capabilities: Support for handoffs, routing, and tool integration with document understanding
  • Scalable Architecture: Design agents that leverage PaddleOCR-VL’s multilingual OCR and element recognition capabilities

Option 3: Connect with Third-Party Platforms

Development Tools: Integrate with popular IDEs and development environments such as Cursor, Trae, and Cline through OpenAI-compatible and Anthropic-compatible APIs.

Orchestration Frameworks: Connect with LangChain, Dify, CrewAI, Langflow, and other AI orchestration platforms using official connectors.

Hugging Face Integration: Novita AI is an official inference provider for Hugging Face, ensuring broad ecosystem compatibility.

Conclusion

PaddleOCR on Novita AI delivers state-of-the-art multilingual document parsing through an ultra-compact 0.9B vision-language model that combines high accuracy with low resource use. With support for 109 languages, SOTA performance on OmniDocBench benchmarks, and strong recognition of complex document elements including text, tables, formulas, and charts, PaddleOCR-VL is a strong choice for modern document processing applications.

The model’s compact architecture, fast inference speeds, and resource efficiency make it highly suitable for practical deployment in real-world scenarios. Whether you’re processing multilingual documents, extracting data from complex tables, recognizing mathematical formulas, or analyzing charts, PaddleOCR-VL on Novita AI provides the performance and reliability you need.

Start exploring PaddleOCR-VL’s document parsing capabilities on Novita AI today and experience intelligent document processing with our developer-friendly platform and seamless integration options.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

