Epic Overview
Implement comprehensive error handling for the document ingestion pipeline to prevent silent failures, provide actionable error messages, and ensure production reliability.
Problem Statement
Current ingestion pipeline fails silently when:
- Chunks exceed embedding model token limits (512 tokens for IBM Slate)
- Milvus connections are not established
- Background tasks fail without user notification
Impact
- Production blocker: Cannot reindex collections reliably
- User experience: Complete breakdown of trust (users assume success)
- Debugging: Impossible without backend log access
- Data quality: Inconsistent state due to partial failures
Solution Architecture
3-Layer Defense Strategy
Layer 1: Prevention (Validation)
└─ Pre-embed token validation with auto-splitting
Layer 2: Observability (Tracking)
└─ Background job status with progress updates
Layer 3: Communication (UI)
└─ Real-time error notifications with remediation
Implementation Phases
✅ Phase 1: Embedding Token Validation [4 hours]
Issue: #448
Status: Open
Owner: TBD
Deliverables:
- .env configuration for model token limits
- Pre-validation before embedding API calls
- Automatic chunk splitting for oversized chunks
- Milvus connection lifecycle management
Files Modified:
- backend/core/config.py
- backend/rag_solution/data_ingestion/ingestion.py
- backend/vectordbs/milvus_store.py
- .env.example
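The validation deliverable above can be sketched as a pre-embed pass that checks each chunk against the token limit and auto-splits anything oversized. This is a minimal illustration, not the actual implementation: a whitespace token count stands in for the real embedding-model tokenizer, and the limit names mirror the proposed .env variables.

```python
# Minimal sketch of pre-embed token validation with auto-splitting.
# ASSUMPTIONS: whitespace word count stands in for the real tokenizer;
# limit names mirror the proposed .env variables.

EMBEDDING_MODEL_MAX_TOKENS = 512       # hard limit for IBM Slate
EMBEDDING_MODEL_SAFE_MAX_TOKENS = 480  # headroom below the hard limit

def count_tokens(text: str) -> int:
    """Placeholder token counter; real code would call the model tokenizer."""
    return len(text.split())

def split_oversized_chunk(text: str, limit: int = EMBEDDING_MODEL_SAFE_MAX_TOKENS) -> list[str]:
    """Split one chunk into pieces that each fit under the token limit."""
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]

def validate_chunks(chunks: list[str]) -> list[str]:
    """Return chunks guaranteed to fit, auto-splitting oversized ones."""
    safe: list[str] = []
    for chunk in chunks:
        if count_tokens(chunk) <= EMBEDDING_MODEL_SAFE_MAX_TOKENS:
            safe.append(chunk)
        else:
            safe.extend(split_oversized_chunk(chunk))
    return safe
```

Running validation before every embedding API call is what turns the current silent 512-token failure into a non-event: oversized chunks never reach the model.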
✅ Phase 2: Background Job Tracking [8 hours]
Issue: #449
Status: Open
Owner: TBD
Deliverables:
- background_jobs database table
- Job status REST API
- WebSocket real-time updates
- Progress tracking for long-running tasks
Files Created:
- backend/rag_solution/models/background_job.py
- backend/rag_solution/repository/job_repository.py
- backend/rag_solution/services/job_service.py
- backend/rag_solution/router/job_router.py
- backend/rag_solution/websocket/job_notifications.py
Files Modified:
- backend/rag_solution/services/collection_service.py
- backend/alembic/versions/XXX_add_background_jobs.py
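The job status lifecycle referenced in the testing strategy could be modeled as a small state machine backing the background_jobs table. The sketch below is illustrative only; the status names and allowed transitions are assumptions, not the final schema.

```python
# Illustrative job status state machine for background_jobs rows.
# ASSUMPTION: status names and transitions are placeholders, not the schema.
from enum import Enum

class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

# Terminal states (completed, failed) allow no further transitions.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}

def transition(current: JobStatus, new: JobStatus) -> JobStatus:
    """Validate and apply a status transition, rejecting illegal moves."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Enforcing transitions in one place keeps the REST API and WebSocket updates consistent: a job can never silently jump from failed back to running.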
✅ Phase 3: UI Error Notifications [6 hours]
Issue: #450
Status: Open
Owner: TBD
Deliverables:
- ErrorToast component with remediation messages
- ReindexProgressModal with WebSocket integration
- Error message catalog
- Success/failure notifications
Files Created:
- frontend/src/components/common/ErrorToast.tsx
- frontend/src/components/collections/ReindexProgressModal.tsx
- frontend/src/utils/errorMessages.ts
- frontend/src/hooks/useJobStatus.ts
Files Modified:
- frontend/src/pages/CollectionDetailPage.tsx
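The error message catalog pairs each failure mode with a remediation hint so the toast tells users how to fix the problem, not just that one occurred. Sketched in Python for brevity (the actual catalog lives in frontend/src/utils/errorMessages.ts); the error codes and wording are placeholders, not the final catalog.

```python
# Hypothetical error catalog mapping error codes to user-facing messages
# with remediation hints. ASSUMPTION: codes and wording are placeholders.
ERROR_MESSAGES: dict[str, dict[str, str]] = {
    "CHUNK_TOO_LARGE": {
        "message": "A document chunk exceeded the embedding token limit.",
        "remediation": "Re-run reindexing to auto-split oversized chunks.",
    },
    "MILVUS_UNAVAILABLE": {
        "message": "Could not connect to the vector database.",
        "remediation": "Check that Milvus is running and reachable, then retry.",
    },
    "JOB_FAILED": {
        "message": "A background job failed before completing.",
        "remediation": "Open the job details for the underlying error, then retry.",
    },
}

def format_error(code: str) -> str:
    """Render a toast-ready string; unknown codes get a generic fallback."""
    entry = ERROR_MESSAGES.get(code)
    if entry is None:
        return "An unexpected error occurred. Please retry or contact support."
    return f"{entry['message']} Fix: {entry['remediation']}"
```

Bundling the remediation text with the message is what drives the "mean time to diagnosis < 1 minute" target below.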
Total Effort
- Phase 1: 4 hours
- Phase 2: 8 hours
- Phase 3: 6 hours
- Testing & Integration: 4 hours
- Documentation: 2 hours
Total: ~24 hours (3 days)
Dependencies
Phase 1 (Validation)
↓
Phase 2 (Tracking) ←─── Can start in parallel
↓
Phase 3 (UI)
Success Metrics
Before
- ❌ Reindexing success rate: ~60% (40% silent failures)
- ❌ User error visibility: 0%
- ❌ Mean time to diagnosis: 30+ minutes
- ❌ User trust: Low
After (Target)
- ✅ Reindexing success rate: ~99% (validation prevents failures)
- ✅ User error visibility: 100%
- ✅ Mean time to diagnosis: <1 minute (error messages include fix)
- ✅ User trust: High
Testing Strategy
Unit Tests
- Token validation logic
- Chunk auto-splitting
- Job status state machine
- WebSocket message handling
Integration Tests
- End-to-end reindexing with validation
- Background job progress tracking
- WebSocket real-time updates
- Error notification delivery
Manual Testing
- Attempt reindex with oversized chunks → should auto-split
- Kill Milvus mid-reindex → should show connection error
- Watch progress modal → should show real-time updates
- Trigger validation failure → should show remediation message
Rollout Plan
Week 1
- Phase 1 implementation (validation)
- Unit tests for validation
- Deploy to staging
Week 2
- Phase 2 implementation (tracking)
- Phase 3 implementation (UI)
- Integration tests
Week 3
- End-to-end testing
- Documentation
- Deploy to production
- Monitor metrics
Related Issues
- 🔴 [P0] Add embedding token limit validation to prevent ingestion failures #448 - Embedding token validation (Phase 1)
- 🟡 [P0] Implement background job status tracking for async operations #449 - Background job tracking (Phase 2)
- 🟡 [P0] Add real-time UI error notifications for background task failures #450 - UI error notifications (Phase 3)
Documentation Updates Needed
- Add "Error Handling" section to CLAUDE.md
- Update API documentation for job endpoints
- Create troubleshooting guide for common errors
- Update deployment guide with new .env variables
Breaking Changes
None - all changes are backwards compatible
Migration Notes
- Database migration required for background_jobs table
- .env updates required (EMBEDDING_MODEL_MAX_TOKENS, EMBEDDING_MODEL_SAFE_MAX_TOKENS)
- No API breaking changes
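The .env additions might look like the fragment below; the variable names come from this epic, while the hard limit matches the 512-token IBM Slate limit noted above and the safe-limit value is an illustrative choice, not a confirmed default.

```
# Embedding model token limits (safe-limit value illustrative)
EMBEDDING_MODEL_MAX_TOKENS=512
EMBEDDING_MODEL_SAFE_MAX_TOKENS=480
```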
Priority: 🔴 P0 - Critical
Labels: epic, backend, frontend, reliability, production-blocker
Milestone: v1.0 - Production Ready
🤖 Generated with Claude Code