Description
Priority
🔴 P0-3 - Critical Performance Optimization
Problem
The RAG pipeline's reranking stage processes document batches sequentially, causing unnecessary latency:
Current Performance (20 documents, batch_size=10)
- Batch 1 (10 docs): ~4-6 seconds
- Batch 2 (10 docs): ~4-6 seconds
- Total reranking time: ~8-12 seconds
Impact
- Adds 8-12 seconds to every query with reranking enabled
- Underutilizes LLM provider concurrent processing capabilities
- Blocks other pipeline stages unnecessarily
Root Cause
Sequential batch processing in `LLMReranker._score_documents()` (lines 140-179):
```python
# Current implementation - sequential
for i in range(0, len(results), self.batch_size):
    batch = results[i : i + self.batch_size]
    responses = self.llm_provider.generate_text(...)  # Sequential
```
Provider capabilities being underutilized:
- WatsonX: Native concurrency_limit=8
- OpenAI/Anthropic: Async with semaphore=10
- All providers support concurrent batch processing
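The cost of the sequential loop can be illustrated with a toy benchmark; `score_batch` below is a hypothetical stub standing in for one reranker LLM call (~0.1 s here versus ~4-6 s in production):

```python
import asyncio
import time

async def score_batch(batch_id: int) -> int:
    """Stand-in for one reranker LLM call (~0.1 s here, ~4-6 s in production)."""
    await asyncio.sleep(0.1)
    return batch_id

async def sequential(n_batches: int) -> float:
    start = time.perf_counter()
    for i in range(n_batches):
        await score_batch(i)  # each batch waits for the previous one
    return time.perf_counter() - start

async def concurrent(n_batches: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(score_batch(i) for i in range(n_batches)))
    return time.perf_counter() - start

seq = asyncio.run(sequential(2))
conc = asyncio.run(concurrent(2))
print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

With two batches, sequential wall time is roughly the sum of the per-batch latencies while concurrent wall time is roughly the slowest single batch — the same ratio that turns 8-12 s into 4-6 s.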
Solution - Phase 1: Concurrent Batch Processing
Convert reranking to async concurrent processing:
```python
async def _score_documents_async(self, query: str, results: list[QueryResult]):
    """Score documents with concurrent batch processing."""
    # Split into batches
    batches = [results[i : i + self.batch_size]
               for i in range(0, len(results), self.batch_size)]
    # Process all batches concurrently
    tasks = [self._score_batch_async(query, batch) for batch in batches]
    batch_results = await asyncio.gather(*tasks)
    # Flatten and return
    return [item for batch in batch_results for item in batch]
```
Expected Impact
Performance Improvements
- Reranking time: 8-12s → 4-6s (50-60% reduction)
- Overall query time: 52-56s → 48-50s (8-12% improvement)
- Best case (small queries <10 docs): 2-3s reranking
User Experience
- Faster query responses
- Better utilization of LLM provider resources
- Reduced timeout risk
Implementation Plan
Phase 1: Async Reranking (High Priority - This Issue)
- Convert `_score_documents()` to an async method
- Implement concurrent batch processing with `asyncio.gather()`
- Add provider-aware concurrency limits
- Add performance telemetry/logging
- Update `PipelineService._apply_reranking()` for async compatibility
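The provider-aware limit can be combined with `asyncio.gather()` via a semaphore. A minimal self-contained sketch (the `Reranker` class, `score_batch`, and `max_concurrency` names are illustrative, not the existing API):

```python
import asyncio

class Reranker:
    """Minimal sketch; score_batch stands in for the real provider LLM call."""

    def __init__(self, batch_size: int = 10, max_concurrency: int = 8):
        self.batch_size = batch_size          # docs per LLM call
        self.max_concurrency = max_concurrency  # e.g. 8 for WatsonX, 10 for OpenAI/Anthropic

    async def score_batch(self, query: str, batch: list) -> list:
        await asyncio.sleep(0)  # real code would call the provider here
        return [(query, doc) for doc in batch]

    async def score_documents(self, query: str, results: list) -> list:
        semaphore = asyncio.Semaphore(self.max_concurrency)

        async def bounded(batch: list) -> list:
            async with semaphore:  # at most max_concurrency batches in flight
                return await self.score_batch(query, batch)

        batches = [results[i : i + self.batch_size]
                   for i in range(0, len(results), self.batch_size)]
        scored = await asyncio.gather(*(bounded(b) for b in batches))
        # gather preserves batch order, so flattening keeps document order
        return [item for batch in scored for item in batch]

scored = asyncio.run(Reranker(batch_size=10).score_documents("q", list(range(20))))
print(len(scored))
```

The semaphore keeps the number of in-flight batches within whatever limit the active provider supports, so the same code path works for WatsonX, OpenAI, and Anthropic.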
Expected Impact: 50-60% reranking improvement
Effort: 1-2 days
Phase 2: Provider Optimization (Future)
- Add provider-specific batch size configuration
- Make concurrency limits configurable per provider
- Update config schema
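The Phase 2 settings could take a shape like the following sketch (all field names are hypothetical; the real schema lives in `backend/core/config.py`):

```python
from dataclasses import dataclass, field

@dataclass
class RerankerSettings:
    """Sketch of provider-tunable reranker knobs (names illustrative)."""
    batch_size: int = 10
    enable_telemetry: bool = True
    # Per-provider concurrency caps; fall back to default_concurrency if absent.
    default_concurrency: int = 8
    provider_concurrency: dict = field(default_factory=lambda: {
        "watsonx": 8,     # native concurrency_limit
        "openai": 10,     # async semaphore
        "anthropic": 10,
    })

    def concurrency_for(self, provider: str) -> int:
        return self.provider_concurrency.get(provider, self.default_concurrency)

settings = RerankerSettings()
print(settings.concurrency_for("watsonx"))
print(settings.concurrency_for("unknown"))
```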
Expected Impact: +10-20% additional improvement
Effort: 1 day
Phase 3: Adaptive Optimization (Future)
- Implement adaptive batch sizing based on document count
- Add smart caching for repeated queries (optional)
- Performance monitoring dashboard
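Adaptive batch sizing could be as simple as scaling the batch size to the document count so every batch fits into one concurrent wave; the thresholds below are illustrative, not a committed design:

```python
def adaptive_batch_size(n_docs: int, max_batch: int = 10, max_concurrency: int = 8) -> int:
    """Pick a batch size so all batches fit in one concurrent wave when possible."""
    if n_docs <= 0:
        return max_batch
    # Smallest batch size that needs at most max_concurrency batches (ceil division).
    needed = -(-n_docs // max_concurrency)
    return min(max_batch, max(1, needed))

print(adaptive_batch_size(5))    # small query: single-doc batches, all concurrent
print(adaptive_batch_size(20))   # 20 docs: batches of 3 -> 7 batches, one wave
print(adaptive_batch_size(200))  # large query: capped at max_batch
```

This is what drives the "+20-30% for small queries" estimate: a 5-document query no longer waits on a 10-document batch pipeline.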
Expected Impact: +20-30% for small queries
Effort: 1-2 days
Files to Modify
Core Implementation
- `backend/rag_solution/retrieval/reranker.py` (lines 128-179)
  - Convert `_score_documents` to async
  - Implement concurrent batch processing
  - Add performance telemetry
- `backend/rag_solution/services/pipeline_service.py` (lines 222-264)
  - Update `_apply_reranking()` for async compatibility
  - Add performance logging
  - Add timeout handling
- `backend/core/config.py` (lines 154-162)
  - Add reranker concurrency limit settings
  - Add telemetry flags
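One way `_apply_reranking()` could stay callable from synchronous code while the reranker goes async is a small sync/async bridge with a timeout; this is a sketch with placeholder names, not the existing method body:

```python
import asyncio

async def rerank_async(query: str, results: list) -> list:
    """Placeholder for the async reranker entry point."""
    await asyncio.sleep(0)
    return sorted(results, reverse=True)

def apply_reranking(query: str, results: list, timeout_s: float = 30.0) -> list:
    """Bridge: run the async reranker from sync code, with a timeout."""
    async def run() -> list:
        return await asyncio.wait_for(rerank_async(query, results), timeout=timeout_s)
    try:
        return asyncio.run(run())
    except asyncio.TimeoutError:
        # Fall back to the un-reranked order rather than failing the whole query.
        return results

print(apply_reranking("q", [1, 3, 2]))
```

The timeout plus fallback addresses the "reduced timeout risk" goal: a slow reranking wave degrades gracefully instead of stalling the pipeline.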
Testing
- `tests/unit/services/test_pipeline_reranking_order.py`
  - Add async reranking tests
  - Add concurrent processing tests
- New file: `tests/unit/retrieval/test_reranker_performance.py`
  - Performance regression tests
  - Concurrency limit tests
  - Provider-specific tests
Acceptance Criteria
- Reranking uses concurrent batch processing (asyncio.gather)
- Reranking time reduced by 50% (12s → 6s for 20 docs)
- Works with all LLM providers (WatsonX, OpenAI, Anthropic)
- No degradation in reranking accuracy
- All existing tests pass
- New tests added for async code paths
- Performance telemetry logs show improvements
- Documentation updated
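A hypothetical unit test for the async code path might assert that concurrent batching drops nothing and preserves order (the function under test here is a self-contained stand-in, not the real module):

```python
import asyncio

async def score_documents_async(query: str, results: list, batch_size: int = 10) -> list:
    """Stand-in for the real async reranker under test."""
    batches = [results[i : i + batch_size] for i in range(0, len(results), batch_size)]
    scored = await asyncio.gather(*(asyncio.sleep(0, result=b) for b in batches))
    return [item for b in scored for item in b]

def test_concurrent_reranking_preserves_all_documents():
    docs = list(range(20))
    scored = asyncio.run(score_documents_async("q", docs, batch_size=10))
    assert len(scored) == len(docs)  # nothing dropped across batches
    assert scored == docs            # gather preserves batch order

test_concurrent_reranking_preserves_all_documents()
print("ok")
```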
Testing Strategy
Unit Tests
```shell
# Test async reranking
poetry run pytest tests/unit/retrieval/test_reranker.py -v
# Test concurrent processing
poetry run pytest tests/unit/retrieval/test_reranker_performance.py -v
# Regression tests
poetry run pytest tests/unit/services/test_pipeline_reranking_order.py -v
```
Performance Benchmarks
```shell
# Before optimization (baseline)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark
# After optimization (validation)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark
```
Related Issues
Completed (Building Blocks)
- #543 ✅ [P0-2] Fix Pipeline Ordering Bug - Reranking After LLM Generation (MERGED)
- #541 ✅ [P0-1] Fix UI Display Issue - REST API Timeout Too Short (MERGED)
- #510 ✅ [P0 CRITICAL] Fix Reranking System Failure - Missing Template Parameter (CLOSED)
Future Work
- #124 - Performance Optimization and Production Readiness (Epic)
- #531 - perf: Add performance monitoring and metrics collection
References
- Code: `backend/rag_solution/retrieval/reranker.py` lines 128-179
- Provider implementations:
  - WatsonX: `backend/rag_solution/generation/providers/watsonx.py` lines 208-231
  - Anthropic: `backend/rag_solution/generation/providers/anthropic.py` lines 128-155
  - OpenAI: `backend/rag_solution/generation/providers/openai.py` lines 141-163
- P0-2 Documentation: `docs/fixes/PIPELINE_RERANKING_ORDER_FIX.md` line 109
Success Metrics
Before (Baseline)
- Reranking time: 8-12s (sequential batches)
- Overall query time: 52-56s
- LLM provider utilization: Low (sequential calls)
After (Target)
- Reranking time: 4-6s (concurrent batches) ✅ 50% reduction
- Overall query time: 48-50s ✅ 8-12% improvement
- LLM provider utilization: High (concurrent calls)
Priority: P0 - Critical Performance Optimization
Effort: 1-2 days (Phase 1)
Risk: Low (async refactoring, existing provider support)
ROI: High (50-60% reranking improvement with minimal effort)
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com