Summary
Replace current document processors with IBM Docling to significantly enhance document ingestion capabilities, including superior table extraction, layout analysis, reading order detection, and support for additional file formats.
Background
Current Implementation
RAG Modulo currently uses custom processors for document ingestion:
- PDF Processing: PyMuPDF (`pymupdf`) - `backend/rag_solution/data_ingestion/pdf_processor.py`
- Word Processing: python-docx - `backend/rag_solution/data_ingestion/word_processor.py`
- Excel Processing: openpyxl - `backend/rag_solution/data_ingestion/excel_processor.py`
- Text Processing: Custom text processor - `backend/rag_solution/data_ingestion/txt_processor.py`
Current Limitations:
- Limited to 4 file formats (.pdf, .docx, .xlsx, .txt)
- Basic table extraction using PyMuPDF's built-in methods
- No layout analysis or reading order detection
- No formula/code detection in PDFs
- Basic metadata extraction
- No PowerPoint, HTML, or image format support
- Manual processor management per file type
IBM Docling Overview
Docling is IBM's open-source, MIT-licensed document processing toolkit.
Key Features:
- Support for PDF, DOCX, PPTX, XLSX, HTML, images, audio (WAV/MP3)
- AI-powered layout analysis using DocLayNet model
- Advanced table structure recognition with TableFormer model
- Reading order detection for multi-column documents
- Formula and code extraction from PDFs
- Image classification
- Unified DoclingDocument representation
- Export to Markdown, HTML, JSON
- Pre-built integrations with LangChain, LlamaIndex
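In practice the unified converter listed above is a few lines of code. A minimal sketch, assuming Docling 2.x; the sample path is illustrative:
```python
from docling.document_converter import DocumentConverter

# One converter instance handles every supported input format.
converter = DocumentConverter()

# "samples/quarterly_report.pdf" is an illustrative path, not a project file.
result = converter.convert("samples/quarterly_report.pdf")

# The unified DoclingDocument can be exported to Markdown (also HTML/JSON).
print(result.document.export_to_markdown())
```
The same `DocumentConverter.convert()` call covers PDF, DOCX, PPTX, HTML, and image inputs, which is what makes a single adapter in RAG Modulo feasible.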
Recent Developments (2025):
- Granite-Docling-258M: Ultra-compact VLM (258M parameters) for one-shot document processing
- 37,000+ GitHub stars
- Hosted by LF AI & Data Foundation
- Active development by IBM Research
Benefits of Integration
High Impact Improvements
- Enhanced Table Extraction
  - TableFormer AI model significantly outperforms PyMuPDF for complex tables
  - Better handling of merged cells, nested tables, and irregular structures
  - Impact: Critical for enterprise documents with financial data, reports
- Reading Order Detection
  - AI-powered layout analysis determines correct reading flow
  - Essential for multi-column documents, scientific papers, magazines
  - Impact: Improves RAG search quality by preserving document semantics
- Format Expansion
  - Add PPTX support (presentations)
  - Add HTML support (web content)
  - Add image format support (PNG, JPEG, TIFF with OCR)
  - Impact: Expands RAG Modulo's document ingestion capabilities without custom processors
- Reduced Maintenance
  - Single library replaces 4+ custom processors
  - IBM-maintained with active development
  - Impact: Reduced technical debt, faster feature adoption
- Better Structure Preservation
  - Layout-aware extraction maintains document hierarchy
  - Preserves headings, sections, lists, code blocks
  - Impact: Improved context for RAG retrieval
Moderate Impact Improvements
- Formula/code detection useful for technical/scientific documents
- Image classification for better image chunk metadata
- Markdown export for document preview/debugging
- Alignment with existing WatsonX integration strategy
Implementation Approach
Option 1: Full Replacement (Recommended)
Create a unified Docling adapter that handles all document types:
New File: `backend/rag_solution/data_ingestion/docling_processor.py`
```python
from collections.abc import AsyncIterator

from docling.document_converter import DocumentConverter

from rag_solution.data_ingestion.base_processor import BaseProcessor
# Settings and Document are RAG Modulo's existing config and schema types.


class DoclingProcessor(BaseProcessor):
    """Unified document processor using IBM Docling."""

    def __init__(self, settings: Settings):
        super().__init__(settings)
        self.converter = DocumentConverter()

    async def process(self, file_path: str, document_id: str) -> AsyncIterator[Document]:
        """Process any document type using Docling."""
        result = self.converter.convert(file_path)
        # Convert the resulting DoclingDocument to RAG Modulo's Document format
        for chunk in self._convert_to_chunks(result, document_id):
            yield chunk

    def _convert_to_chunks(self, result, document_id: str) -> list[Document]:
        """Convert Docling's DoclingDocument to RAG Modulo Document format."""
        # Preserve metadata, layout information, table structures
        # Apply existing chunking strategies
        # Maintain compatibility with embedding pipeline
        pass
```
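One possible shape for the adapter body, shown purely as a sketch: it takes the `ConversionResult` from `DocumentConverter.convert()`, reuses the existing chunking strategy via a hypothetical `self.chunking_strategy` handle, and wraps each piece with a hypothetical `build_chunk` factory for the current Document schema. Neither helper name exists in the codebase today.
```python
# Sketch only: `self.chunking_strategy` and `build_chunk` are hypothetical
# stand-ins for RAG Modulo's existing chunking code and Document factory.
def _convert_to_chunks(self, result, document_id: str) -> list[Document]:
    """Map a Docling ConversionResult onto RAG Modulo document chunks."""
    docling_doc = result.document  # unified DoclingDocument

    # Structure-preserving export keeps headings, lists, and tables intact.
    text = docling_doc.export_to_markdown()

    chunks: list[Document] = []
    for index, piece in enumerate(self.chunking_strategy.split(text)):  # hypothetical handle
        chunks.append(
            build_chunk(  # hypothetical factory for the existing Document schema
                document_id=document_id,
                chunk_index=index,
                text=piece,
                metadata={"processor": "docling"},
            )
        )
    return chunks
```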
Update: `backend/rag_solution/data_ingestion/document_processor.py`
```python
self.processors: dict[str, BaseProcessor] = {
    ".pdf": DoclingProcessor(settings),   # Replace PyMuPDF
    ".docx": DoclingProcessor(settings),  # Replace python-docx
    ".pptx": DoclingProcessor(settings),  # NEW FORMAT
    ".html": DoclingProcessor(settings),  # NEW FORMAT
    ".png": DoclingProcessor(settings),   # NEW FORMAT
    ".jpg": DoclingProcessor(settings),   # NEW FORMAT
    ".txt": TxtProcessor(settings),       # Keep for simplicity
    ".xlsx": ExcelProcessor(settings),    # Keep for simplicity
}
```
Option 2: Hybrid Approach (Lower Risk)
- Keep current processors for simple formats (.txt, .xlsx)
- Use Docling for complex formats (.pdf, .docx, .pptx, .html)
- Gradual migration with feature flag
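A sketch of what the flag-driven routing could look like, assuming a boolean setting named `enable_docling` (an invented name) and legacy processor class names inferred from the existing module names:
```python
# `settings.enable_docling` is an assumed flag name; legacy class names are
# inferred from the existing module names and may differ in the codebase.
DOCLING_FORMATS = {".pdf", ".docx", ".pptx", ".html"}


def build_processors(settings) -> dict[str, BaseProcessor]:
    """Route complex formats to Docling when the flag is on; keep legacy paths otherwise."""
    legacy: dict[str, BaseProcessor] = {
        ".pdf": PdfProcessor(settings),
        ".docx": WordProcessor(settings),
        ".xlsx": ExcelProcessor(settings),
        ".txt": TxtProcessor(settings),
    }
    if not getattr(settings, "enable_docling", False):
        return legacy

    docling = DoclingProcessor(settings)
    return {**legacy, **{ext: docling for ext in DOCLING_FORMATS}}
```
Turning the flag off restores the current behavior exactly, which is what makes this the lower-risk option.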
Implementation Plan
Phase 1: Setup & Infrastructure
- Add `docling` dependency to `backend/pyproject.toml`
- Create `DoclingProcessor` class in `backend/rag_solution/data_ingestion/`
- Implement DoclingDocument → Document adapter
- Add feature flag for Docling vs legacy processors
Phase 2: Core Integration
- Integrate DoclingProcessor with PDF files
- Update `DocumentProcessor` to route to appropriate processor
- Ensure compatibility with existing embedding pipeline (`DocumentStore._embed_documents_batch()`)
- Preserve existing chunking strategies integration
Phase 3: Testing & Validation
- Unit tests: Compare Docling output vs current processors
- Integration tests: Full ingestion pipeline with Docling
- Performance benchmarks: Processing speed, memory usage
- Quality validation: Table extraction accuracy, reading order correctness
- Test on representative document corpus
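One way the side-by-side unit test might look, as a sketch: it assumes pytest-asyncio, a `settings` fixture, a fixture PDF path, chunks exposing a `.text` attribute, and a legacy class name inferred from the module name; the asserted tokens are placeholders for content known to appear in the fixture.
```python
import pytest

from rag_solution.data_ingestion.docling_processor import DoclingProcessor
from rag_solution.data_ingestion.pdf_processor import PdfProcessor  # assumed legacy class name


@pytest.mark.asyncio
async def test_docling_keeps_table_content(settings):
    """Docling output should retain content the legacy processor already extracts."""
    sample = "tests/fixtures/multi_column_with_tables.pdf"  # hypothetical fixture

    legacy_chunks = [c.text async for c in PdfProcessor(settings).process(sample, "doc-1")]
    docling_chunks = [c.text async for c in DoclingProcessor(settings).process(sample, "doc-1")]

    legacy_text = " ".join(legacy_chunks)
    docling_text = " ".join(docling_chunks)

    # Placeholder tokens: replace with strings known to appear in the fixture's tables.
    for token in ("Revenue", "Operating margin"):
        assert token in legacy_text
        assert token in docling_text
```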
Phase 4: Format Expansion
- Add PPTX support
- Add HTML support
- Add image format support (PNG, JPEG)
- Update API documentation with new supported formats
Phase 5: Migration & Rollout
- Gradual rollout with feature flag
- Monitor performance metrics
- Deprecate old processors
- Update documentation
Technical Considerations
Dependencies
```toml
# backend/pyproject.toml
[tool.poetry.dependencies]
docling = "^2.0.0"
```
Performance
- Docling runs efficiently on commodity hardware (no GPU required)
- AI models (DocLayNet, TableFormer) have a small resource footprint
- Compatible with existing async/batch processing architecture
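A minimal timing harness for the speed benchmark, assuming a list of representative file paths; memory profiling and the legacy-processor comparison would sit alongside it:
```python
import time

from docling.document_converter import DocumentConverter


def average_seconds_per_document(paths: list[str]) -> float:
    """Time a plain Docling conversion pass over a benchmark corpus."""
    converter = DocumentConverter()
    start = time.perf_counter()
    for path in paths:
        converter.convert(path)
    return (time.perf_counter() - start) / len(paths)
```
Running the same corpus through the legacy processors gives the baseline for the "within 20% of current processing time" target in the success metrics.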
Compatibility
- Maintains compatibility with existing `Document`/`DocumentChunk` schema
- Works with current embedding generation pipeline
- Preserves chunking strategies (simple, semantic, token-based)
Risks & Mitigations
Risks:
- New dependency introduces potential instability
  - Mitigation: MIT license, IBM-backed, active development, 37K+ stars
- Performance impact of AI models
  - Mitigation: Benchmark before rollout, feature flag for rollback
- Breaking changes to document processing behavior
  - Mitigation: Side-by-side testing, gradual migration
- Migration effort for existing ingested documents
  - Mitigation: Optional re-ingestion, version document processor in metadata
Mitigation Strategy:
- Start with hybrid approach (PDF only via Docling)
- Feature flag to toggle between processors
- Comprehensive benchmarking on production document corpus
- Fallback to legacy processors if Docling fails
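The fallback could be as simple as wrapping the two processors, sketched here with a broad `except` that a real implementation would narrow to Docling's conversion errors:
```python
async def process_with_fallback(docling, legacy, file_path: str, document_id: str):
    """Try Docling first; on failure, re-run the file through the legacy processor."""
    try:
        # Materialize Docling's output before yielding so a mid-stream failure
        # does not leave partially ingested chunks in the pipeline.
        chunks = [chunk async for chunk in docling.process(file_path, document_id)]
    except Exception:  # narrow to Docling conversion errors in a real implementation
        chunks = [chunk async for chunk in legacy.process(file_path, document_id)]
    for chunk in chunks:
        yield chunk
```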
Success Metrics
Quantitative
- Table extraction accuracy improvement (target: >30%)
- Reading order correctness for multi-column documents (target: >90%)
- Support for 3+ new file formats (PPTX, HTML, images)
- Performance: Processing time within 20% of current implementation
Qualitative
- Improved RAG search quality for documents with complex layouts
- Reduced maintenance burden (fewer custom processors)
- Better document structure preservation in chunks
References
- Docling GitHub Repository (37K+ stars)
- Docling Documentation
- IBM Research: Docling Announcement
- Granite-Docling-258M Model
- IBM Announcement: Granite-Docling End-to-End
Related Files
Current Implementation:
- `backend/rag_solution/data_ingestion/document_processor.py` - Main processor orchestrator
- `backend/rag_solution/data_ingestion/pdf_processor.py` - PDF processing (to be replaced)
- `backend/rag_solution/data_ingestion/word_processor.py` - Word processing (to be replaced)
- `backend/rag_solution/data_ingestion/excel_processor.py` - Excel processing (keep)
- `backend/rag_solution/data_ingestion/txt_processor.py` - Text processing (keep)
- `backend/rag_solution/data_ingestion/ingestion.py` - Ingestion pipeline
- `backend/rag_solution/data_ingestion/chunking.py` - Chunking strategies
New Files to Create:
- `backend/rag_solution/data_ingestion/docling_processor.py` - Docling adapter
Tests to Create/Update:
- `backend/tests/unit/test_docling_processor.py` - Unit tests
- `backend/tests/integration/test_docling_integration.py` - Integration tests
- `backend/tests/performance/test_docling_performance.py` - Performance benchmarks
Estimated Effort
- Setup & Core Integration: 2-3 days
- Testing & Validation: 2-3 days
- Format Expansion: 1-2 days
- Documentation & Rollout: 1 day
Total: 6-9 days of development effort
Priority
High Priority - This enhancement provides significant improvements to core document processing capabilities, aligns with IBM ecosystem strategy (WatsonX), and reduces technical debt.