
Enhancement: Integrate IBM Docling for Advanced Document Processing #255


Summary

Replace current document processors with IBM Docling to significantly enhance document ingestion capabilities, including superior table extraction, layout analysis, reading order detection, and support for additional file formats.

Background

Current Implementation

RAG Modulo currently uses custom processors for document ingestion:

  • PDF Processing: PyMuPDF (pymupdf) - backend/rag_solution/data_ingestion/pdf_processor.py
  • Word Processing: python-docx - backend/rag_solution/data_ingestion/word_processor.py
  • Excel Processing: openpyxl - backend/rag_solution/data_ingestion/excel_processor.py
  • Text Processing: Custom text processor - backend/rag_solution/data_ingestion/txt_processor.py

Current Limitations:

  • Limited to 4 file formats (.pdf, .docx, .xlsx, .txt)
  • Basic table extraction using PyMuPDF's built-in methods
  • No layout analysis or reading order detection
  • No formula/code detection in PDFs
  • Basic metadata extraction
  • No PowerPoint, HTML, or image format support
  • Manual processor management per file type

IBM Docling Overview

Docling is IBM's open-source, MIT-licensed document processing toolkit with:

Key Features:

  • Support for PDF, DOCX, PPTX, XLSX, HTML, images, audio (WAV/MP3)
  • AI-powered layout analysis using DocLayNet model
  • Advanced table structure recognition with TableFormer model
  • Reading order detection for multi-column documents
  • Formula and code extraction from PDFs
  • Image classification
  • Unified DoclingDocument representation
  • Export to Markdown, HTML, JSON
  • Pre-built integrations with LangChain, LlamaIndex
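
A minimal sketch of the basic Docling workflow described above (the file path is illustrative):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample_report.pdf")  # also accepts DOCX, PPTX, HTML, images

# Unified DoclingDocument representation, exportable to Markdown/HTML/JSON
doc = result.document
markdown_text = doc.export_to_markdown()
print(markdown_text[:500])
```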

Recent Developments (2025):

  • Granite-Docling-258M: Ultra-compact VLM (258M parameters) for one-shot document processing
  • 37,000+ GitHub stars
  • Hosted by LF AI & Data Foundation
  • Active development by IBM Research

Benefits of Integration

High Impact Improvements

  1. Enhanced Table Extraction

    • TableFormer AI model significantly outperforms PyMuPDF for complex tables
    • Better handling of merged cells, nested tables, and irregular structures
    • Impact: Critical for enterprise documents with financial data and reports (see the table-export sketch after this list)
  2. Reading Order Detection

    • AI-powered layout analysis determines correct reading flow
    • Essential for multi-column documents, scientific papers, magazines
    • Impact: Improves RAG search quality by preserving document semantics
  3. Format Expansion

    • Add PPTX support (presentations)
    • Add HTML support (web content)
    • Add image format support (PNG, JPEG, TIFF with OCR)
    • Impact: Expands RAG Modulo's document ingestion capabilities without custom processors
  4. Reduced Maintenance

    • Single library replaces 4+ custom processors
    • IBM-maintained with active development
    • Impact: Reduced technical debt, faster feature adoption
  5. Better Structure Preservation

    • Layout-aware extraction maintains document hierarchy
    • Preserves headings, sections, lists, code blocks
    • Impact: Improved context for RAG retrieval
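
To illustrate the table-extraction improvement in item 1, a minimal sketch using Docling's table export (the file name is illustrative; `export_to_dataframe()` returns a pandas DataFrame):

```python
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("financial_report.pdf")

# TableFormer-recognized tables are exposed on the DoclingDocument
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()  # pandas DataFrame with resolved cell structure
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
```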

Moderate Impact Improvements

  • Formula/code detection useful for technical/scientific documents
  • Image classification for better image chunk metadata
  • Markdown export for document preview/debugging
  • Alignment with existing WatsonX integration strategy

Implementation Approach

Option 1: Full Replacement (Recommended)

Create a unified Docling adapter that handles all document types:

New File: `backend/rag_solution/data_ingestion/docling_processor.py`

```python
from collections.abc import AsyncIterator

from docling.document_converter import DocumentConverter

from rag_solution.data_ingestion.base_processor import BaseProcessor

# Settings and Document come from the existing RAG Modulo modules (imports omitted here).


class DoclingProcessor(BaseProcessor):
    """Unified document processor using IBM Docling."""

    def __init__(self, settings: Settings):
        super().__init__(settings)
        self.converter = DocumentConverter()

    async def process(self, file_path: str, document_id: str) -> AsyncIterator[Document]:
        """Process any supported document type using Docling."""
        result = self.converter.convert(file_path)

        # Convert the DoclingDocument to RAG Modulo's Document format
        for chunk in self._convert_to_chunks(result, document_id):
            yield chunk

    def _convert_to_chunks(self, docling_doc, document_id: str) -> list[Document]:
        """Convert Docling's DoclingDocument to RAG Modulo Document format."""
        # Preserve metadata, layout information, and table structures
        # Apply existing chunking strategies
        # Maintain compatibility with the embedding pipeline
        ...
```

Update: `backend/rag_solution/data_ingestion/document_processor.py`

```python
self.processors: dict[str, BaseProcessor] = {
    ".pdf": DoclingProcessor(settings),   # Replace PyMuPDF
    ".docx": DoclingProcessor(settings),  # Replace python-docx
    ".pptx": DoclingProcessor(settings),  # NEW FORMAT
    ".html": DoclingProcessor(settings),  # NEW FORMAT
    ".png": DoclingProcessor(settings),   # NEW FORMAT
    ".jpg": DoclingProcessor(settings),   # NEW FORMAT
    ".txt": TxtProcessor(settings),       # Keep for simplicity
    ".xlsx": ExcelProcessor(settings),    # Keep for simplicity
}
```

Option 2: Hybrid Approach (Lower Risk)

  • Keep current processors for simple formats (.txt, .xlsx)
  • Use Docling for complex formats (.pdf, .docx, .pptx, .html)
  • Gradual migration with feature flag
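
A sketch of what this hybrid routing could look like; the `use_docling` flag and the `PdfProcessor`/`WordProcessor` class names are assumptions about the existing code, not confirmed identifiers:

```python
def build_processors(settings: Settings) -> dict[str, BaseProcessor]:
    """Route complex formats to Docling behind a feature flag (sketch)."""
    processors: dict[str, BaseProcessor] = {
        ".txt": TxtProcessor(settings),    # simple formats keep their current processors
        ".xlsx": ExcelProcessor(settings),
    }
    if getattr(settings, "use_docling", False):  # assumed feature flag name
        docling = DoclingProcessor(settings)
        processors.update({ext: docling for ext in (".pdf", ".docx", ".pptx", ".html")})
    else:
        processors[".pdf"] = PdfProcessor(settings)    # assumed legacy class names
        processors[".docx"] = WordProcessor(settings)
    return processors
```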

Implementation Plan

Phase 1: Setup & Infrastructure

  • Add `docling` dependency to `backend/pyproject.toml`
  • Create `DoclingProcessor` class in `backend/rag_solution/data_ingestion/`
  • Implement DoclingDocument → Document adapter (see the sketch after this list)
  • Add feature flag for Docling vs legacy processors
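
A hedged sketch of the adapter item above; the returned dicts stand in for RAG Modulo's Document/DocumentChunk schema (actual fields to be confirmed), and the item types come from `docling_core`:

```python
from docling_core.types.doc import DoclingDocument, TableItem, TextItem


def docling_to_chunks(doc: DoclingDocument, document_id: str) -> list[dict]:
    """Flatten a DoclingDocument into chunk dicts, preserving headings and tables (sketch)."""
    chunks: list[dict] = []
    for item, _level in doc.iterate_items():
        if isinstance(item, TableItem):
            # Keep the recognized table structure; CSV via the exported DataFrame
            text = item.export_to_dataframe().to_csv(index=False)
            kind = "table"
        elif isinstance(item, TextItem):
            text = item.text
            kind = str(item.label)  # label enum, e.g. section_header / paragraph / code
        else:
            continue
        if text.strip():
            chunks.append({"document_id": document_id, "text": text, "type": kind})
    return chunks
```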

Phase 2: Core Integration

  • Integrate DoclingProcessor with PDF files
  • Update `DocumentProcessor` to route to appropriate processor
  • Ensure compatibility with existing embedding pipeline (`DocumentStore._embed_documents_batch()`)
  • Preserve existing chunking strategies integration

Phase 3: Testing & Validation

  • Unit tests: Compare Docling output vs current processors
  • Integration tests: Full ingestion pipeline with Docling
  • Performance benchmarks: Processing speed, memory usage
  • Quality validation: Table extraction accuracy, reading order correctness
  • Test on representative document corpus
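
A hedged sketch of one such validation test; the fixture path is hypothetical:

```python
import pytest
from docling.document_converter import DocumentConverter

SAMPLE_PDF = "tests/fixtures/multi_column_with_tables.pdf"  # hypothetical fixture


@pytest.mark.integration
def test_docling_finds_tables_in_known_fixture() -> None:
    """Docling should recognize at least one table and produce non-empty Markdown."""
    result = DocumentConverter().convert(SAMPLE_PDF)
    assert len(result.document.tables) >= 1
    assert result.document.export_to_markdown().strip()
```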

Phase 4: Format Expansion

  • Add PPTX support
  • Add HTML support
  • Add image format support (PNG, JPEG)
  • Update API documentation with new supported formats
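
A sketch of declaring the expanded format set on the converter; `allowed_formats` and `InputFormat` are part of Docling's public API, though the exact list here is an assumption for this rollout:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.DOCX,
        InputFormat.PPTX,   # new
        InputFormat.HTML,   # new
        InputFormat.IMAGE,  # PNG/JPEG/TIFF, OCR applied where needed
    ]
)
```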

Phase 5: Migration & Rollout

  • Gradual rollout with feature flag
  • Monitor performance metrics
  • Deprecate old processors
  • Update documentation

Technical Considerations

Dependencies

```toml
# backend/pyproject.toml
[tool.poetry.dependencies]
docling = "^2.0.0"
```

Performance

  • Docling runs efficiently on commodity hardware (no GPU required)
  • AI models (DocLayNet, TableFormer) have a small resource footprint
  • Compatible with existing async/batch processing architecture

Compatibility

  • Maintains compatibility with existing `Document`/`DocumentChunk` schema
  • Works with current embedding generation pipeline
  • Preserves chunking strategies (simple, semantic, token-based)

Risks & Mitigations

Risks:

  1. New dependency introduces potential instability
    • Mitigation: MIT license, IBM-backed, active development, 37K+ stars
  2. Performance impact of AI models
    • Mitigation: Benchmark before rollout, feature flag for rollback
  3. Breaking changes to document processing behavior
    • Mitigation: Side-by-side testing, gradual migration
  4. Migration effort for existing ingested documents
    • Mitigation: Optional re-ingestion, version document processor in metadata

Mitigation Strategy:

  • Start with hybrid approach (PDF only via Docling)
  • Feature flag to toggle between processors
  • Comprehensive benchmarking on production document corpus
  • Fallback to legacy processors if Docling fails
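
A sketch of the fallback wrapper; Settings, Document, BaseProcessor, and DoclingProcessor are the project types used earlier in this issue, and the legacy processor instance is passed in by the caller:

```python
import logging
from collections.abc import AsyncIterator

logger = logging.getLogger(__name__)


class FallbackDoclingProcessor(BaseProcessor):
    """Try Docling first; fall back to the legacy processor if conversion fails (sketch)."""

    def __init__(self, settings: Settings, fallback: BaseProcessor):
        super().__init__(settings)
        self.docling = DoclingProcessor(settings)
        self.fallback = fallback  # e.g. the current PyMuPDF-based processor

    async def process(self, file_path: str, document_id: str) -> AsyncIterator[Document]:
        chunks: list[Document] = []
        try:
            # Buffer Docling output so a mid-stream failure doesn't emit partial results
            async for chunk in self.docling.process(file_path, document_id):
                chunks.append(chunk)
        except Exception:  # noqa: BLE001 - deliberately broad: any Docling failure triggers fallback
            logger.warning("Docling failed for %s; falling back to legacy processor", file_path)
            async for chunk in self.fallback.process(file_path, document_id):
                yield chunk
            return
        for chunk in chunks:
            yield chunk
```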

Success Metrics

Quantitative

  • Table extraction accuracy improvement (target: >30%)
  • Reading order correctness for multi-column documents (target: >90%)
  • Support for 3+ new file formats (PPTX, HTML, images)
  • Performance: Processing time within 20% of current implementation

Qualitative

  • Improved RAG search quality for documents with complex layouts
  • Reduced maintenance burden (fewer custom processors)
  • Better document structure preservation in chunks

References

Related Files

Current Implementation:

  • `backend/rag_solution/data_ingestion/document_processor.py` - Main processor orchestrator
  • `backend/rag_solution/data_ingestion/pdf_processor.py` - PDF processing (to be replaced)
  • `backend/rag_solution/data_ingestion/word_processor.py` - Word processing (to be replaced)
  • `backend/rag_solution/data_ingestion/excel_processor.py` - Excel processing (keep)
  • `backend/rag_solution/data_ingestion/txt_processor.py` - Text processing (keep)
  • `backend/rag_solution/data_ingestion/ingestion.py` - Ingestion pipeline
  • `backend/rag_solution/data_ingestion/chunking.py` - Chunking strategies

New Files to Create:

  • `backend/rag_solution/data_ingestion/docling_processor.py` - Docling adapter

Tests to Create/Update:

  • `backend/tests/unit/test_docling_processor.py` - Unit tests
  • `backend/tests/integration/test_docling_integration.py` - Integration tests
  • `backend/tests/performance/test_docling_performance.py` - Performance benchmarks

Estimated Effort

  • Setup & Core Integration: 2-3 days
  • Testing & Validation: 2-3 days
  • Format Expansion: 1-2 days
  • Documentation & Rollout: 1 day

Total: 6-9 days of development effort

Priority

High Priority - This enhancement provides significant improvements to core document processing capabilities, aligns with IBM ecosystem strategy (WatsonX), and reduces technical debt.
