Building a Local Medical Document Intelligence System with PaddleOCR-VL

Traditional OCR extracts text, but vision-language models understand meaning, turning unstructured medical documents into queryable knowledge entirely on your own infrastructure.

Arindam Ghosh

Nov 19, 2025

In healthcare, data is rarely the problem. The format of that data is.

Medical discharge summaries, lab reports, and insurance claims are often trapped in PDFs or scanned images. For years, hospital technical teams have relied on Optical Character Recognition (OCR) to unlock this data. While traditional OCR is effective at recognizing characters, it fails at recognizing context.

This guide shows you how to build a medical document intelligence system that runs entirely on-premise, ensuring data security and compliance, using PaddleOCR-VL and retrieval-augmented generation (RAG) to transform static documents into queryable knowledge.

The Problem: Layout Determines Meaning

Traditional OCR engines treat documents as streams of text. They excel at detecting text boxes and converting pixels into strings, but they miss what matters most in medical documents: structure.

In a discharge summary, layout determines meaning. A number implies a dosage only because it sits in a specific column of a medication table. A date signifies a follow-up appointment only because it appears under a specific header.

Traditional OCR flattens this multi-dimensional document into a "bag of words." You might get the text "50mg," but without the visual structure, the system cannot associate it with "Metoprolol" located three lines above and two inches to the left.

Traditional OCR outputs unstructured text where spatial relationships are lost. PaddleOCR-VL preserves document structure, maintaining tables, hierarchies, and formatting.

This structural loss is devastating for intelligent systems. Without proper document understanding, downstream RAG systems cannot accurately retrieve relevant information or generate meaningful answers.

The Solution: Vision-Language Models

PaddleOCR-VL represents a paradigm shift. Instead of treating text processing and image processing as separate pipelines, it processes documents holistically, combining visual understanding with language comprehension.

Figure: PaddleOCR-VL pipeline transforming unstructured medical documents into structured data formats.

What makes this different:

Unlike traditional OCR that only "sees" text, PaddleOCR-VL uses a vision encoder to understand document layout (tables, headers, indentation) while simultaneously decoding textual content. It includes a dedicated layout analysis component that identifies different content regions: text blocks, tables, formulas, and charts.

The model accepts task-specific instructions. Rather than generic OCR, you can direct it to focus on table recognition, formula extraction, or chart interpretation. This allows it to handle the content types most relevant to your use case.

Figure: The multimodal architecture, combining vision encoding (NaViT) with language decoding (ERNIE) to process complex documents.

For medical discharge summaries, this means:

Medication lists are extracted as structured records with dosages, frequencies, and routes
Lab results maintain their tabular format with proper value-to-parameter associations
Section headings correctly delineate content boundaries
Clinical notes preserve their logical flow and hierarchy

Instead of returning raw text, PaddleOCR-VL returns structured data (JSON or Markdown) that preserves the hierarchy and spatial relationships of the original document. This structure is critical when the goal is not just to archive a file, but to query it.
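To make the value of structured output concrete, here is a minimal sketch of turning a Markdown medication table into queryable records. The sample table, its column names, and the helper `parse_markdown_table` are illustrative assumptions, not actual PaddleOCR-VL output.

```python
def _is_separator(cells: list[str]) -> bool:
    """True for the |---|---| row that follows a Markdown table header."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the first pipe-delimited table in `markdown` as a list of row dicts."""
    rows: list[list[str]] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if rows:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if not _is_separator(cells):
            rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical Markdown of the kind a layout-aware model might emit.
sample = """
## Discharge Medications

| Medication | Dose | Frequency | Route |
|---|---|---|---|
| Metoprolol | 50mg | twice daily | oral |
| Atorvastatin | 20mg | nightly | oral |
"""

records = parse_markdown_table(sample)
print(records[0]["Medication"], records[0]["Dose"])  # prints: Metoprolol 50mg
```

Because each row keeps its column headers, "50mg" stays attached to "Metoprolol", which is exactly the association a flattened text stream loses.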

Building Your Document Querying System

A system that allows clinicians to ask questions like "What medications is the patient taking?" or "Summarize the primary diagnosis" requires three main components:

1. Intelligent Document Processing

Process medical PDFs through PaddleOCR-VL to extract structured data. The model analyzes layout and converts the visual document into Markdown format, ensuring tables remain tables and headers remain headers. This runs entirely on your infrastructure, keeping sensitive patient data secure.
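A minimal sketch of this step is shown below. The call pattern (`PaddleOCRVL`, `predict`, `save_to_json`, `save_to_markdown`) follows the usage example on the PaddleOCR-VL model card. The wrapper `process_document`, the `output` directory name, and the assumption that Markdown files are written directly into that directory are illustrative, so verify them against your installed PaddleOCR version.

```python
from pathlib import Path


def process_document(input_path: str, output_dir: str = "output") -> list[Path]:
    """Run PaddleOCR-VL on one file and return the Markdown files found afterwards."""
    # Imported lazily so this module still loads where paddleocr is not installed.
    from paddleocr import PaddleOCRVL

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    pipeline = PaddleOCRVL()  # loads the model locally; no document data leaves the machine
    for result in pipeline.predict(input_path):
        result.save_to_json(save_path=str(out))      # structured layout and content
        result.save_to_markdown(save_path=str(out))  # tables and headers as Markdown

    # Assumption: the Markdown files land directly under output_dir.
    return sorted(out.glob("*.md"))


# Example call (hypothetical file name):
# markdown_files = process_document("discharge_summary.pdf")
```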

2. Semantic Search Infrastructure

Convert the structured document into searchable knowledge:

Split documents into semantically meaningful chunks (by medical sections or logical content blocks)
Generate vector embeddings using models like sentence-transformers
Store embeddings in a vector database (ChromaDB or similar) for efficient similarity search
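The embed-store-search loop can be sketched without any external dependencies. Note the loud assumption: `embed` below is a word-count stand-in, not a neural embedding, and `TinyVectorStore` is a hypothetical in-memory substitute for a vector database such as ChromaDB. It matches on shared words only, so it illustrates the mechanics, not true semantic search.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Swap in a real embedding model in practice."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory index: stores (id, text, vector) and ranks by cosine."""

    def __init__(self) -> None:
        self._items = []

    def add(self, chunk_id: str, text: str) -> None:
        self._items.append((chunk_id, text, embed(text)))

    def query(self, question: str, top_k: int = 3) -> list:
        """Return up to top_k (chunk_id, score) pairs, best match first."""
        q = embed(question)
        scored = [(chunk_id, cosine(q, vec)) for chunk_id, _, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("medications", "Metoprolol 50mg oral twice daily. Atorvastatin 20mg oral nightly.")
store.add("diagnosis", "Primary diagnosis: acute myocardial infarction.")
store.add("followup", "Follow-up appointment with cardiology in two weeks.")

print(store.query("When is the follow-up appointment?", top_k=1)[0][0])  # prints: followup
```

In a real deployment the same `add` and `query` shape maps onto a proper embedding model and a persistent vector database. The interface stays the same while the quality of matching improves.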

The chunking strategy is critical. For discharge summaries, section-aware chunking that respects document structure yields the best retrieval results.
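One simple way to do section-aware chunking, assuming the extracted document uses Markdown `#` headings for its sections, is to split at each heading and keep everything beneath it together. The function name `chunk_by_section` and the sample document are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _emit(chunks: list, title: str, lines: list) -> None:
    """Append a chunk if the buffered lines contain any non-blank text."""
    body = "\n".join(lines).strip()
    if body:
        chunks.append({"section": title, "text": body})


def chunk_by_section(markdown: str) -> list:
    """Split Markdown into one chunk per heading, keeping tables intact."""
    chunks: list = []
    title, buffer = "Preamble", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _emit(chunks, title, buffer)  # close the previous section
            title, buffer = match.group(2), []
        else:
            buffer.append(line)
    _emit(chunks, title, buffer)  # close the final section
    return chunks


document = """# Discharge Summary
Patient admitted with chest pain.

## Medications
| Medication | Dose |
|---|---|
| Metoprolol | 50mg |

## Follow-up
Cardiology clinic in two weeks.
"""

for chunk in chunk_by_section(document):
    print(chunk["section"])
```

Splitting on headings rather than on a fixed character count means a medication table is never cut in half, and every chunk carries a section name that can later serve as a citation.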

3. Retrieval-Augmented Generation

Enable natural language querying:

Convert user questions into embeddings
Perform semantic search to find relevant document chunks
Send retrieved context to an LLM (Qwen 2.5 or similar) alongside the question
Generate answers grounded in actual document content
Include citations linking back to specific document sections
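The prompt-assembly part of these steps can be sketched as follows. The prompt wording, the `[n]` citation format, and the function `build_grounded_prompt` are assumptions for illustration. The resulting string would be passed to whichever LLM you run locally.

```python
def build_grounded_prompt(question: str, retrieved: list) -> str:
    """Assemble an LLM prompt from retrieved chunks, numbering each for citation.

    `retrieved` is a list of dicts with "section" and "text" keys.
    """
    blocks = [
        f"[{index}] Section: {chunk['section']}\n{chunk['text']}"
        for index, chunk in enumerate(retrieved, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the sources you used as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved_chunks = [
    {"section": "Medications", "text": "Metoprolol 50mg oral twice daily."},
    {"section": "Follow-up", "text": "Cardiology clinic in two weeks."},
]
prompt = build_grounded_prompt("What medications is the patient taking?", retrieved_chunks)
print(prompt)
```

Numbering the chunks lets the model's `[1]`-style citations be mapped back to section names, and the explicit "say so" instruction gives the model a safe way out when retrieval misses.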

This RAG architecture grounds answers in extracted document content and makes them traceable to a source, which reduces the risk of hallucination, a critical requirement for medical applications.

The System in Action

Here's how the end-to-end system works:

A discharge summary PDF is processed through PaddleOCR-VL, extracting structured content with preserved layout. The structured document is chunked intelligently, embedded, and stored in a vector database. When a clinician asks "What medications is the patient taking?" the system searches for relevant chunks, retrieves the medications section, and an LLM generates a structured response with dosages, frequencies, and routes. The answer includes citations pointing to specific sections in the original document.

Figure: A complete medical document querying interface built on PaddleOCR-VL and RAG architecture.

The result is a conversational interface where technical teams or medical staff can interact with static files as if they were databases. This shift from OCR to VLM enables several critical improvements:

Data Interoperability: Converting unstructured PDFs into structured JSON/Markdown makes it easier to map data to FHIR standards or integrate with existing EHR systems.

Semantic Search: Embedding-based retrieval finds concepts like "cardiac issues" even when the exact keywords are missing, which plain keyword search cannot do.

Clinical Efficiency: Automating extraction of medication lists and diagnoses reduces manual data entry burden on administrative staff.

Security and Compliance: Running entirely on-premise keeps patient data inside your infrastructure, which simplifies meeting HIPAA and other regulatory requirements.
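As a rough illustration of the data interoperability point, an extracted medication record could be mapped to a FHIR-style resource like this. The field layout loosely follows the FHIR R4 MedicationStatement resource. The input record keys, the `active` status, and the function name are assumptions, and a real integration would need coded terminologies and validation against the FHIR specification.

```python
import re


def to_fhir_medication_statement(record: dict, patient_id: str) -> dict:
    """Map one extracted medication row to a MedicationStatement-shaped dict."""
    dosage = {
        "text": f"{record['Dose']} {record['Route']} {record['Frequency']}",
        "route": {"text": record["Route"]},
    }
    # Split a simple dose such as "50mg" into a numeric value and a unit.
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*", record["Dose"])
    if match:
        dosage["doseAndRate"] = [
            {"doseQuantity": {"value": float(match.group(1)), "unit": match.group(2)}}
        ]
    return {
        "resourceType": "MedicationStatement",
        "status": "active",  # assumption: discharge medications are current
        "medicationCodeableConcept": {"text": record["Medication"]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "dosage": [dosage],
    }


row = {"Medication": "Metoprolol", "Dose": "50mg", "Frequency": "twice daily", "Route": "oral"}
resource = to_fhir_medication_statement(row, patient_id="example-123")
print(resource["dosage"][0]["doseAndRate"][0]["doseQuantity"])  # prints: {'value': 50.0, 'unit': 'mg'}
```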

Implementation Considerations

When building your own system, several factors impact success:

Document Quality: While PaddleOCR-VL is robust, aim for 300 DPI minimum resolution, apply orientation correction for scans, and ensure proper page separation for multi-page PDFs.

Model Configuration: The system works well out-of-the-box, but you may fine-tune on your specific document types or adjust instruction prompts for your content.

Retrieval Strategy: Experiment with embedding models (general vs. medical-specific), chunk sizes, and the number of chunks to retrieve (top-k parameter).

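LLM Selection: Choose between open-weight models you can host yourself (Qwen 2.5, Llama) and proprietary models, which may offer stronger medical reasoning but typically require sending data to an external API. Weigh cost, quality, and data-residency tradeoffs for production deployment.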

Compliance: For medical applications, implement confidence scoring, require source citations, add appropriate disclaimers, maintain audit logs, and consider human-in-the-loop validation for critical decisions.
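The compliance safeguards above can be combined into a small gate that runs before any answer is shown. This is a sketch under stated assumptions: the 0.8 threshold, the origin of the `confidence` score, the log format (one JSON object per line), and both function names are hypothetical choices.

```python
import json
from datetime import datetime, timezone


def review_answer(question: str, answer: str, citations: list,
                  confidence: float, threshold: float = 0.8) -> dict:
    """Build an audit entry and flag answers that need a human reviewer."""
    # Route to a human when confidence is low or no source is cited.
    needs_review = confidence < threshold or not citations
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "citations": citations,
        "confidence": confidence,
        "needs_human_review": needs_review,
    }


def append_audit_log(entry: dict, path: str) -> None:
    """Append one entry to a JSON Lines audit log."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


entry = review_answer(
    question="What medications is the patient taking?",
    answer="Metoprolol 50mg oral twice daily [1].",
    citations=["Medications"],
    confidence=0.95,
)
print(entry["needs_human_review"])  # prints: False
```

Keep in mind that the audit log itself contains patient information, so it needs the same access controls as the source documents.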

Build It Yourself

We have created a comprehensive Colab notebook that implements the complete system, from document processing with PaddleOCR-VL to building a queryable RAG interface.

The notebook includes:

Full PaddleOCR-VL setup and configuration
Document processing with real discharge summaries
Vector database creation and indexing
RAG pipeline implementation with Qwen 2.5
Interactive querying interface built with Gradio

Access the Complete Tutorial Notebook (link to your Colab notebook)

Resources

Technical Documentation:

PaddleOCR-VL: https://huggingface.co/PaddlePaddle/PaddleOCR-VL
PaddleOCR GitHub: https://github.com/PaddlePaddle/PaddleOCR
Qwen 2.5: https://github.com/QwenLM/Qwen2.5

From Prototype to Production with Rasyn AI

While this tutorial demonstrates the technical foundations of document intelligence, moving from prototype to production at scale requires robust infrastructure, workflow orchestration, and enterprise-grade reliability.

At Rasyn AI, we have built an agentic document platform that takes this approach further. Our system lets healthcare organizations build custom agentic document workflows, experiment with their documents at scale, and deploy on-premise or in the cloud while maintaining full data security and compliance.

If you are looking to implement document intelligence across your healthcare organization without building and maintaining complex infrastructure, we would be happy to discuss how Rasyn AI can help.

Contact us: [email protected] | Learn more: https://www.rasyn.tech

This technical guide is provided for educational purposes. When implementing document intelligence systems in healthcare settings, ensure compliance with all relevant regulations including HIPAA and institutional requirements.
