🔴 [EPIC] Production-Grade Document Ingestion Error Handling #451

@manavgup

Epic Overview

Implement comprehensive error handling for the document ingestion pipeline to prevent silent failures, provide actionable error messages, and ensure production reliability.

Problem Statement

The current ingestion pipeline fails silently when:

  1. Chunks exceed embedding model token limits (512 tokens for IBM Slate)
  2. Milvus connections are not established
  3. Background tasks fail without user notification

Impact

  • Production blocker: Cannot reindex collections reliably
  • User experience: Complete breakdown of trust (users assume success)
  • Debugging: Impossible without backend log access
  • Data quality: Inconsistent state due to partial failures

Solution Architecture

3-Layer Defense Strategy

Layer 1: Prevention (Validation)
└─ Pre-embed token validation with auto-splitting

Layer 2: Observability (Tracking)
└─ Background job status with progress updates

Layer 3: Communication (UI)
└─ Real-time error notifications with remediation

Implementation Phases

✅ Phase 1: Embedding Token Validation [4 hours]

Issue: #448
Status: Open
Owner: TBD

Deliverables:

  • .env configuration for model token limits
  • Pre-validation before embedding API calls
  • Automatic chunk splitting for oversized chunks (see the sketch at the end of this phase)
  • Milvus connection lifecycle management

Files Modified:

  • backend/core/config.py
  • backend/rag_solution/data_ingestion/ingestion.py
  • backend/vectordbs/milvus_store.py
  • .env.example
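
The pre-validation and auto-splitting deliverables might look roughly like the sketch below. The environment variable names come from the Migration Notes in this epic; the function names, the recursive halving strategy, and the injected `count_tokens` callable are assumptions for illustration, not the actual implementation in ingestion.py.

```python
import os
from typing import Callable

# Token limits read from .env (variable names per this epic; 512 matches IBM Slate).
MAX_TOKENS = int(os.getenv("EMBEDDING_MODEL_MAX_TOKENS", "512"))
SAFE_MAX_TOKENS = int(os.getenv("EMBEDDING_MODEL_SAFE_MAX_TOKENS", "480"))


def split_oversized_chunk(text: str, count_tokens: Callable[[str], int]) -> list[str]:
    """Recursively halve a chunk until every piece fits within the safe token limit."""
    if count_tokens(text) <= SAFE_MAX_TOKENS or len(text) <= 1:
        return [text]
    midpoint = len(text) // 2
    return (
        split_oversized_chunk(text[:midpoint], count_tokens)
        + split_oversized_chunk(text[midpoint:], count_tokens)
    )


def validate_chunks(chunks: list[str], count_tokens: Callable[[str], int]) -> list[str]:
    """Validate every chunk before any embedding API call, splitting any that exceed the limit."""
    validated: list[str] = []
    for chunk in chunks:
        validated.extend(split_oversized_chunk(chunk, count_tokens))
    return validated
```

Passing the real embedding-model tokenizer in as `count_tokens` keeps the validation logic independent of any specific provider SDK.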

✅ Phase 2: Background Job Tracking [8 hours]

Issue: #449
Status: Open
Owner: TBD

Deliverables:

  • background_jobs database table (see the model sketch at the end of this phase)
  • Job status REST API
  • WebSocket real-time updates
  • Progress tracking for long-running tasks

Files Created:

  • backend/rag_solution/models/background_job.py
  • backend/rag_solution/repository/job_repository.py
  • backend/rag_solution/services/job_service.py
  • backend/rag_solution/router/job_router.py
  • backend/rag_solution/websocket/job_notifications.py

Files Modified:

  • backend/rag_solution/services/collection_service.py
  • backend/alembic/versions/XXX_add_background_jobs.py
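
A rough sketch of what the background_jobs model and its status state machine could look like. The table name matches the deliverable above; the column names, status values, and allowed transitions are assumptions, not the final schema produced by the Alembic migration.

```python
import enum
import uuid
from datetime import datetime

from sqlalchemy import Column, DateTime, Enum, Float, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobStatus(str, enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions for the job status state machine exercised by the unit tests.
VALID_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


class BackgroundJob(Base):
    """Minimal background_jobs row: one record per long-running ingestion task."""

    __tablename__ = "background_jobs"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    job_type = Column(String, nullable=False)          # e.g. "reindex_collection"
    status = Column(Enum(JobStatus), default=JobStatus.PENDING, nullable=False)
    progress = Column(Float, default=0.0)              # 0.0 .. 1.0
    error_message = Column(Text, nullable=True)        # remediation text on failure
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
```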

✅ Phase 3: UI Error Notifications [6 hours]

Issue: #450
Status: Open
Owner: TBD

Deliverables:

  • ErrorToast component with remediation messages
  • ReindexProgressModal with WebSocket integration (see the payload sketch at the end of this phase)
  • Error message catalog
  • Success/failure notifications

Files Created:

  • frontend/src/components/common/ErrorToast.tsx
  • frontend/src/components/collections/ReindexProgressModal.tsx
  • frontend/src/utils/errorMessages.ts
  • frontend/src/hooks/useJobStatus.ts

Files Modified:

  • frontend/src/pages/CollectionDetailPage.tsx
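
To keep backend and frontend in sync, the WebSocket message consumed by useJobStatus.ts and ReindexProgressModal.tsx might carry roughly the payload sketched below. All field names are assumptions for illustration; errorMessages.ts would map error_code to the user-facing remediation text shown in ErrorToast.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class JobStatusMessage:
    """Illustrative WebSocket payload; field names are assumptions, not the final contract."""

    job_id: str
    status: str                          # "pending" | "running" | "completed" | "failed"
    progress: float                      # 0.0 .. 1.0, drives ReindexProgressModal
    error_code: Optional[str] = None     # key into the frontend error message catalog
    error_message: Optional[str] = None  # human-readable remediation text for ErrorToast


# Example payload for a reindex that failed because the Milvus connection was lost:
payload = asdict(
    JobStatusMessage(
        job_id="a1b2c3",
        status="failed",
        progress=0.42,
        error_code="MILVUS_CONNECTION_LOST",
        error_message="Could not reach Milvus. Check that the vector database is running, then retry the reindex.",
    )
)
```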

Total Effort

  • Phase 1: 4 hours
  • Phase 2: 8 hours
  • Phase 3: 6 hours
  • Testing & Integration: 4 hours
  • Documentation: 2 hours

Total: ~24 hours (3 days)

Dependencies

Phase 1 (Validation)
    ↓
Phase 2 (Tracking) ←─── Can start in parallel
    ↓
Phase 3 (UI)

Success Metrics

Before

  • ❌ Reindexing success rate: ~60% (40% silent failures)
  • ❌ User error visibility: 0%
  • ❌ Mean time to diagnosis: 30+ minutes
  • ❌ User trust: Low

After (Target)

  • ✅ Reindexing success rate: ~99% (validation prevents failures)
  • ✅ User error visibility: 100%
  • ✅ Mean time to diagnosis: <1 minute (error messages include fix)
  • ✅ User trust: High

Testing Strategy

Unit Tests

  • Token validation logic
  • Chunk auto-splitting (example test below)
  • Job status state machine
  • WebSocket message handling
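
As one example from the list above, chunk auto-splitting could be unit-tested along these lines, reusing the hypothetical helpers from the Phase 1 sketch and a whitespace counter as a stand-in for the real tokenizer:

```python
# Assumes the hypothetical helpers from the Phase 1 sketch live in ingestion.py.
from rag_solution.data_ingestion.ingestion import SAFE_MAX_TOKENS, split_oversized_chunk


def count_tokens(text: str) -> int:
    # Whitespace stand-in for the real embedding-model tokenizer.
    return len(text.split())


def test_oversized_chunk_is_auto_split():
    chunk = ("word " * 1000).strip()  # well past the 512-token limit
    pieces = split_oversized_chunk(chunk, count_tokens)

    assert len(pieces) > 1
    assert all(count_tokens(p) <= SAFE_MAX_TOKENS for p in pieces)
    assert "".join(pieces) == chunk  # no characters are lost by splitting
```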

Integration Tests

  • End-to-end reindexing with validation
  • Background job progress tracking
  • WebSocket real-time updates
  • Error notification delivery

Manual Testing

  1. Attempt reindex with oversized chunks → should auto-split
  2. Kill Milvus mid-reindex → should show connection error
  3. Watch progress modal → should show real-time updates
  4. Trigger validation failure → should show remediation message

Rollout Plan

Week 1

  • Phase 1 implementation (validation)
  • Unit tests for validation
  • Deploy to staging

Week 2

  • Phase 2 implementation (tracking)
  • Phase 3 implementation (UI)
  • Integration tests

Week 3

  • End-to-end testing
  • Documentation
  • Deploy to production
  • Monitor metrics

Related Issues

  • #448 - Phase 1: Embedding Token Validation
  • #449 - Phase 2: Background Job Tracking
  • #450 - Phase 3: UI Error Notifications

Documentation Updates Needed

  • Add "Error Handling" section to CLAUDE.md
  • Update API documentation for job endpoints
  • Create troubleshooting guide for common errors
  • Update deployment guide with new .env variables

Breaking Changes

None - all changes are backwards compatible

Migration Notes

  • Database migration required for background_jobs table
  • .env updates required (EMBEDDING_MODEL_MAX_TOKENS, EMBEDDING_MODEL_SAFE_MAX_TOKENS; example below)
  • No API breaking changes
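
For reference, the .env additions might look like this; the 512 value matches the IBM Slate limit cited above, and the safe-margin value is illustrative:

```
# Embedding token limits for ingestion validation
EMBEDDING_MODEL_MAX_TOKENS=512
EMBEDDING_MODEL_SAFE_MAX_TOKENS=480
```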

Priority: 🔴 P0 - Critical
Labels: epic, backend, frontend, reliability, production-blocker
Milestone: v1.0 - Production Ready

🤖 Generated with Claude Code
