Close Menu
    Facebook X (Twitter) Instagram
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Facebook X (Twitter) Instagram
    Automation Glance
    • Home
    • Crypto News
      • Bitcoin
      • Ethereum
      • Altcoins
      • Blockchain
      • DeFi
    • AI News
    • Stock News
    • Learn
      • AI for Beginners
      • AI Tips
      • Make Money with AI
    • Reviews
    • Tools
      • Best AI Tools
      • Crypto Market Cap List
      • Stock Market Overview
      • Market Heatmap
    • Contact
    Automation Glance
    Home»AI News»Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-style + ERNIE-4.5-0.3B VLM Targeting End-to-End Multilingual Document Parsing
    AI News

    Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-style + ERNIE-4.5-0.3B VLM Targeting End-to-End Multilingual Document Parsing

    October 17, 20253 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email
    kraken


    How do you convert complex, multilingual documents—dense layouts, small scripts, formulas, charts, and handwriting—into faithful structured Markdown/JSON with state-of-the-art accuracy while keeping inference latency and memory low enough for real deployments?Baidu’s PaddlePaddle group has released PaddleOCR-VL, a 0.9B-parameter vision-language model designed for end-to-end document parsing across text, tables, formulas, charts, and handwriting. The core model combines a NaViT-style (Native-resolution ViT) dynamic-resolution vision encoder with the ERNIE-4.5-0.3B decoder. It supports 109 languages.

    https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf

    Understanding the system design

    PaddleOCR-VL is deployed as a two-stage pipeline. Stage one (PP-DocLayoutV2) performs page-level layout analysis: an RT-DETR detector localizes and classifies regions; a pointer network predicts reading order. Stage two (PaddleOCR-VL-0.9B) conducts element-level recognition conditioned on the detected layout. Final outputs are aggregated to Markdown and JSON for downstream consumption. This decoupling mitigates long-sequence decoding latency and instability that end-to-end VLMs face on dense, multi-column, mixed text–graphic pages.

    At the model level, PaddleOCR-VL-0.9B integrates a NaViT-style dynamic high-resolution encoder (native-resolution sequence packing) with a 2-layer MLP projector and the ERNIE-4.5-0.3B language model; 3D-RoPE is used for positional representation. The technical report attributes lower hallucinations and better text-dense performance to native-resolution processing relative to fixed-resize or tiling approaches. The NaViT idea—patch-and-pack variable-resolution inputs without destructive resizing—originates from prior work showing improved efficiency and robustness; PaddleOCR-VL adopts this encoder style directly.

    Benchmarks

    PaddleOCR-VL achieves state-of-the-art results on OmniDocBench v1.5 and competitive or leading scores on v1.0, covering overall quality as well as sub-tasks (text edit distances, Formula-CDM, Table-TEDS/TEDS-S, and reading-order edit), with complementary strength on olmOCR-Bench and in-house handwriting, table, formula, and chart evaluations.

    binance
    https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf

    Key Takeaways

    • 0.9B-parameter PaddleOCR-VL integrates a NaViT-style dynamic-resolution encoder with ERNIE-4.5-0.3B for document parsing.
    • Targets end-to-end extraction across text, tables, formulas, charts, and handwriting with structured Markdown/JSON outputs.
    • Claims SOTA performance on public document benchmarks with fast inference suitable for deployment.
    • Supports 109 languages, including small scripts and complex page layouts.

    This release is meaningful because it joins a NaViT-style dynamic-resolution visual encoder with the lightweight ERNIE-4.5-0.3B decoder to deliver SOTA page-level document parsing and element-level recognition at practical inference cost. The two-stage PP-DocLayoutV2 → PaddleOCR-VL-0.9B design stabilizes reading order and preserves native typography cues, which matter for small scripts, formulas, charts, and handwriting across 109 languages. Structured Markdown/JSON outputs and optional vLLM/SGLang acceleration make the system operationally clean for production document intelligence.

    Check out the Technical Paper, Model on HF, and Technical details . Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

    Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

    🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.



    Source link

    aistudios
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    CryptoExpert
    • Website

    Related Posts

    MIT Energy Initiative launches Data Center Power Forum | MIT News

    November 8, 2025

    Moonshot's Kimi K2 Thinking emerges as leading open source AI, outperforming GPT-5, Claude Sonnet 4.5 on key benchmarks

    November 7, 2025

    Is AI in a bubble? Succeed despite a market correction

    November 6, 2025

    Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and Jailbreak Style Prompts

    November 5, 2025
    Add A Comment

    Comments are closed.

    binance
    Latest Posts

    Rising Liquidity Pushes Bitcoin Into Bullish Consolidation

    November 9, 2025

    Investors Eye Storage Coins as Next Market Movers

    November 8, 2025

    Dogecoin Price Prediction: Bitwise ETF Filing Hints at November Launch

    November 8, 2025

    Ethereum Struggles to Reclaim $3,900 as Weak Demand and Fear Persist

    November 8, 2025

    Investors: How to Turn $20K Into a Cash Flow Machine

    November 8, 2025
    10web
    LEGAL INFORMATION
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Top Insights

    AI Glasses: The Next Big Thing in Wearables | Google Glass, Meta Rayban, Lenskart & More

    November 9, 2025

    GROK AI STEP BY STEP GUIDE 2025 l HOW TO USE GROK AI FOR BEGINNERS l MASTER GROK AI IN 2025

    November 9, 2025
    aistudios
    Facebook X (Twitter) Instagram Pinterest
    © 2025 AutomationGlance.com - All rights reserved.

    Type above and press Enter to search. Press Esc to cancel.