Summary
Replace current document processors with IBM Docling to significantly enhance document ingestion capabilities, including superior table extraction, layout analysis, reading order detection, and support for additional file formats.
Background
Current Implementation
RAG Modulo currently uses custom processors for document ingestion:
- PDF Processing: PyMuPDF (`pymupdf`) - `backend/rag_solution/data_ingestion/pdf_processor.py`
- Word Processing: python-docx - `backend/rag_solution/data_ingestion/word_processor.py`
- Excel Processing: openpyxl - `backend/rag_solution/data_ingestion/excel_processor.py`
- Text Processing: Custom text processor - `backend/rag_solution/data_ingestion/txt_processor.py`
Current Limitations:
- Limited to 4 file formats (.pdf, .docx, .xlsx, .txt)
- Basic table extraction using PyMuPDF's built-in methods
- No layout analysis or reading order detection
- No formula/code detection in PDFs
- Basic metadata extraction
- No PowerPoint, HTML, or image format support
- Manual processor management per file type
IBM Docling Overview
Docling is IBM's open-source, MIT-licensed document processing toolkit.
Key Features:
- Support for PDF, DOCX, PPTX, XLSX, HTML, images, audio (WAV/MP3)
- AI-powered layout analysis using DocLayNet model
- Advanced table structure recognition with TableFormer model
- Reading order detection for multi-column documents
- Formula and code extraction from PDFs
- Image classification
- Unified DoclingDocument representation
- Export to Markdown, HTML, JSON
- Pre-built integrations with LangChain, LlamaIndex
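In practice the unified converter listed above is a few lines of code. A minimal sketch, assuming Docling 2.x; the sample path is illustrative:
```python
from docling.document_converter import DocumentConverter

# One converter instance handles every supported input format.
converter = DocumentConverter()

# "samples/quarterly_report.pdf" is an illustrative path, not a project file.
result = converter.convert("samples/quarterly_report.pdf")

# The unified DoclingDocument can be exported to Markdown (also HTML/JSON).
print(result.document.export_to_markdown())
```
The same `DocumentConverter.convert()` call covers PDF, DOCX, PPTX, HTML, and image inputs, which is what makes a single adapter in RAG Modulo feasible.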
Recent Developments (2025):
- Granite-Docling-258M: Ultra-compact VLM (258M parameters) for one-shot document processing
- 37,000+ GitHub stars
- Hosted by LF AI & Data Foundation
- Active development by IBM Research
Benefits of Integration
High Impact Improvements
- Enhanced Table Extraction
  - TableFormer AI model significantly outperforms PyMuPDF for complex tables
  - Better handling of merged cells, nested tables, and irregular structures
  - Impact: Critical for enterprise documents with financial data, reports
- Reading Order Detection
  - AI-powered layout analysis determines correct reading flow
  - Essential for multi-column documents, scientific papers, magazines
  - Impact: Improves RAG search quality by preserving document semantics
- Format Expansion
  - Add PPTX support (presentations)
  - Add HTML support (web content)
  - Add image format support (PNG, JPEG, TIFF with OCR)
  - Impact: Expands RAG Modulo's document ingestion capabilities without custom processors
- Reduced Maintenance
  - Single library replaces 4+ custom processors
  - IBM-maintained with active development
  - Impact: Reduced technical debt, faster feature adoption
- Better Structure Preservation
  - Layout-aware extraction maintains document hierarchy
  - Preserves headings, sections, lists, code blocks
  - Impact: Improved context for RAG retrieval
Moderate Impact Improvements
- Formula/code detection useful for technical/scientific documents
- Image classification for better image chunk metadata
- Markdown export for document preview/debugging
- Alignment with existing WatsonX integration strategy
Implementation Approach
Option 1: Full Replacement (Recommended)
Create a unified Docling adapter that handles all document types:
New File: `backend/rag_solution/data_ingestion/docling_processor.py`
```python
from collections.abc import AsyncIterator

from docling.document_converter import DocumentConverter

from rag_solution.data_ingestion.base_processor import BaseProcessor
# Settings and Document are RAG Modulo's existing config and schema types.


class DoclingProcessor(BaseProcessor):
    """Unified document processor using IBM Docling."""

    def __init__(self, settings: Settings):
        super().__init__(settings)
        self.converter = DocumentConverter()

    async def process(self, file_path: str, document_id: str) -> AsyncIterator[Document]:
        """Process any document type using Docling."""
        result = self.converter.convert(file_path)
        # Convert the resulting DoclingDocument to RAG Modulo's Document format
        for chunk in self._convert_to_chunks(result, document_id):
            yield chunk

    def _convert_to_chunks(self, result, document_id: str) -> list[Document]:
        """Convert Docling's DoclingDocument to RAG Modulo Document format."""
        # Preserve metadata, layout information, table structures
        # Apply existing chunking strategies
        # Maintain compatibility with embedding pipeline
        pass
```
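One possible shape for the adapter body, shown purely as a sketch: it takes the `ConversionResult` from `DocumentConverter.convert()`, reuses the existing chunking strategy via a hypothetical `self.chunking_strategy` handle, and wraps each piece with a hypothetical `build_chunk` factory for the current Document schema. Neither helper name exists in the codebase today.
```python
# Sketch only: `self.chunking_strategy` and `build_chunk` are hypothetical
# stand-ins for RAG Modulo's existing chunking code and Document factory.
def _convert_to_chunks(self, result, document_id: str) -> list[Document]:
    """Map a Docling ConversionResult onto RAG Modulo document chunks."""
    docling_doc = result.document  # unified DoclingDocument

    # Structure-preserving export keeps headings, lists, and tables intact.
    text = docling_doc.export_to_markdown()

    chunks: list[Document] = []
    for index, piece in enumerate(self.chunking_strategy.split(text)):  # hypothetical handle
        chunks.append(
            build_chunk(  # hypothetical factory for the existing Document schema
                document_id=document_id,
                chunk_index=index,
                text=piece,
                metadata={"processor": "docling"},
            )
        )
    return chunks
```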
Update: `backend/rag_solution/data_ingestion/document_processor.py`
```python
self.processors: dict[str, BaseProcessor] = {
    ".pdf": DoclingProcessor(settings),   # Replace PyMuPDF
    ".docx": DoclingProcessor(settings),  # Replace python-docx
    ".pptx": DoclingProcessor(settings),  # NEW FORMAT
    ".html": DoclingProcessor(settings),  # NEW FORMAT
    ".png": DoclingProcessor(settings),   # NEW FORMAT
    ".jpg": DoclingProcessor(settings),   # NEW FORMAT
    ".txt": TxtProcessor(settings),       # Keep for simplicity
    ".xlsx": ExcelProcessor(settings),    # Keep for simplicity
}
```
Option 2: Hybrid Approach (Lower Risk)
- Keep current processors for simple formats (.txt, .xlsx)
- Use Docling for complex formats (.pdf, .docx, .pptx, .html)
- Gradual migration with feature flag
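A sketch of what the flag-driven routing could look like, assuming a boolean setting named `enable_docling` (an invented name) and legacy processor class names inferred from the existing module names:
```python
# `settings.enable_docling` is an assumed flag name; legacy class names are
# inferred from the existing module names and may differ in the codebase.
DOCLING_FORMATS = {".pdf", ".docx", ".pptx", ".html"}


def build_processors(settings) -> dict[str, BaseProcessor]:
    """Route complex formats to Docling when the flag is on; keep legacy paths otherwise."""
    legacy: dict[str, BaseProcessor] = {
        ".pdf": PdfProcessor(settings),
        ".docx": WordProcessor(settings),
        ".xlsx": ExcelProcessor(settings),
        ".txt": TxtProcessor(settings),
    }
    if not getattr(settings, "enable_docling", False):
        return legacy

    docling = DoclingProcessor(settings)
    return {**legacy, **{ext: docling for ext in DOCLING_FORMATS}}
```
Turning the flag off restores the current behavior exactly, which is what makes this the lower-risk option.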
Implementation Plan
Phase 1: Setup & Infrastructure
- Add `docling` dependency to `backend/pyproject.toml`
- Create `DoclingProcessor` class in `backend/rag_solution/data_ingestion/`
- Implement DoclingDocument → Document adapter
- Add feature flag for Docling vs legacy processors
Phase 2: Core Integration
- Integrate DoclingProcessor with PDF files
- Update `DocumentProcessor` to route to appropriate processor
- Ensure compatibility with existing embedding pipeline (`DocumentStore._embed_documents_batch()`)
- Preserve existing chunking strategies integration
Phase 3: Testing & Validation
- Unit tests: Compare Docling output vs current processors
- Integration tests: Full ingestion pipeline with Docling
- Performance benchmarks: Processing speed, memory usage
- Quality validation: Table extraction accuracy, reading order correctness
- Test on representative document corpus
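One way the side-by-side unit test might look, as a sketch: it assumes pytest-asyncio, a `settings` fixture, a fixture PDF path, chunks exposing a `.text` attribute, and a legacy class name inferred from the module name; the asserted tokens are placeholders for content known to appear in the fixture.
```python
import pytest

from rag_solution.data_ingestion.docling_processor import DoclingProcessor
from rag_solution.data_ingestion.pdf_processor import PdfProcessor  # assumed legacy class name


@pytest.mark.asyncio
async def test_docling_keeps_table_content(settings):
    """Docling output should retain content the legacy processor already extracts."""
    sample = "tests/fixtures/multi_column_with_tables.pdf"  # hypothetical fixture

    legacy_chunks = [c.text async for c in PdfProcessor(settings).process(sample, "doc-1")]
    docling_chunks = [c.text async for c in DoclingProcessor(settings).process(sample, "doc-1")]

    legacy_text = " ".join(legacy_chunks)
    docling_text = " ".join(docling_chunks)

    # Placeholder tokens: replace with strings known to appear in the fixture's tables.
    for token in ("Revenue", "Operating margin"):
        assert token in legacy_text
        assert token in docling_text
```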
Phase 4: Format Expansion
- Add PPTX support
- Add HTML support
- Add image format support (PNG, JPEG)
- Update API documentation with new supported formats
Phase 5: Migration & Rollout
- Gradual rollout with feature flag
- Monitor performance metrics
- Deprecate old processors
- Update documentation
Technical Considerations
Dependencies
```toml
# backend/pyproject.toml
[tool.poetry.dependencies]
docling = "^2.0.0"
```
Performance
- Docling runs efficiently on commodity hardware (no GPU required)
- AI models (DocLayNet, TableFormer) have a small resource footprint
- Compatible with existing async/batch processing architecture
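A minimal timing harness for the speed benchmark, assuming a list of representative file paths; memory profiling and the legacy-processor comparison would sit alongside it:
```python
import time

from docling.document_converter import DocumentConverter


def average_seconds_per_document(paths: list[str]) -> float:
    """Time a plain Docling conversion pass over a benchmark corpus."""
    converter = DocumentConverter()
    start = time.perf_counter()
    for path in paths:
        converter.convert(path)
    return (time.perf_counter() - start) / len(paths)
```
Running the same corpus through the legacy processors gives the baseline for the "within 20% of current processing time" target in the success metrics.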
Compatibility
- Maintains compatibility with existing `Document`/`DocumentChunk` schema
- Works with current embedding generation pipeline
- Preserves chunking strategies (simple, semantic, token-based)
Risks & Mitigations
Risks:
- New dependency introduces potential instability
  - Mitigation: MIT license, IBM-backed, active development, 37K+ stars
- Performance impact of AI models
  - Mitigation: Benchmark before rollout, feature flag for rollback
- Breaking changes to document processing behavior
  - Mitigation: Side-by-side testing, gradual migration
- Migration effort for existing ingested documents
  - Mitigation: Optional re-ingestion, version document processor in metadata
Mitigation Strategy:
- Start with hybrid approach (PDF only via Docling)
- Feature flag to toggle between processors
- Comprehensive benchmarking on production document corpus
- Fallback to legacy processors if Docling fails
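The fallback could be as simple as wrapping the two processors, sketched here with a broad `except` that a real implementation would narrow to Docling's conversion errors:
```python
async def process_with_fallback(docling, legacy, file_path: str, document_id: str):
    """Try Docling first; on failure, re-run the file through the legacy processor."""
    try:
        # Materialize Docling's output before yielding so a mid-stream failure
        # does not leave partially ingested chunks in the pipeline.
        chunks = [chunk async for chunk in docling.process(file_path, document_id)]
    except Exception:  # narrow to Docling conversion errors in a real implementation
        chunks = [chunk async for chunk in legacy.process(file_path, document_id)]
    for chunk in chunks:
        yield chunk
```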
Success Metrics
Quantitative
- Table extraction accuracy improvement (target: >30%)
- Reading order correctness for multi-column documents (target: >90%)
- Support for 3+ new file formats (PPTX, HTML, images)
- Performance: Processing time within 20% of current implementation
Qualitative
- Improved RAG search quality for documents with complex layouts
- Reduced maintenance burden (fewer custom processors)
- Better document structure preservation in chunks
References
- Docling GitHub Repository (37K+ stars)
- Docling Documentation
- IBM Research: Docling Announcement
- Granite-Docling-258M Model
- IBM Announcement: Granite-Docling End-to-End
Related Files
Current Implementation:
- `backend/rag_solution/data_ingestion/document_processor.py` - Main processor orchestrator
- `backend/rag_solution/data_ingestion/pdf_processor.py` - PDF processing (to be replaced)
- `backend/rag_solution/data_ingestion/word_processor.py` - Word processing (to be replaced)
- `backend/rag_solution/data_ingestion/excel_processor.py` - Excel processing (keep)
- `backend/rag_solution/data_ingestion/txt_processor.py` - Text processing (keep)
- `backend/rag_solution/data_ingestion/ingestion.py` - Ingestion pipeline
- `backend/rag_solution/data_ingestion/chunking.py` - Chunking strategies
New Files to Create:
- `backend/rag_solution/data_ingestion/docling_processor.py` - Docling adapter
Tests to Create/Update:
- `backend/tests/unit/test_docling_processor.py` - Unit tests
- `backend/tests/integration/test_docling_integration.py` - Integration tests
- `backend/tests/performance/test_docling_performance.py` - Performance benchmarks
Estimated Effort
- Setup & Core Integration: 2-3 days
- Testing & Validation: 2-3 days
- Format Expansion: 1-2 days
- Documentation & Rollout: 1 day
Total: 6-9 days of development effort
Priority
High Priority - This enhancement provides significant improvements to core document processing capabilities, aligns with IBM ecosystem strategy (WatsonX), and reduces technical debt.