
[P0-3] Performance Optimization - Concurrent Batch Reranking #545

@manavgup

Description

Priority

🔴 P0-3 - Critical Performance Optimization

Problem

The RAG pipeline's reranking stage processes document batches sequentially, causing unnecessary latency:

Current Performance (20 documents, batch_size=10)

  • Batch 1 (10 docs): ~4-6 seconds
  • Batch 2 (10 docs): ~4-6 seconds
  • Total reranking time: ~8-12 seconds

Impact

  • Adds 8-12 seconds to every query with reranking enabled
  • Underutilizes LLM provider concurrent processing capabilities
  • Blocks other pipeline stages unnecessarily

Root Cause

Sequential batch processing in LLMReranker._score_documents() (lines 140-179):

# Current implementation - sequential
for i in range(0, len(results), self.batch_size):
    batch = results[i : i + self.batch_size]
    responses = self.llm_provider.generate_text(...)  # blocks; the next batch waits

Provider capabilities being underutilized:

  • WatsonX: Native concurrency_limit=8
  • OpenAI/Anthropic: Async with semaphore=10
  • All providers support concurrent batch processing
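
One way to keep those limits in one place is a small provider-to-fan-out lookup; the values mirror the numbers above, while the lookup itself and its conservative default are assumptions for this sketch:

PROVIDER_CONCURRENCY = {"watsonx": 8, "openai": 10, "anthropic": 10}

def concurrency_for(provider_name: str) -> int:
    """Return a safe concurrent-call budget for a provider."""
    return PROVIDER_CONCURRENCY.get(provider_name.lower(), 4)  # unknown providers stay cautious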

Solution - Phase 1: Concurrent Batch Processing

Convert reranking to async concurrent processing:

import asyncio

async def _score_documents_async(self, query: str, results: list[QueryResult]) -> list[QueryResult]:
    """Score documents with concurrent batch processing."""
    
    # Split into batches
    batches = [results[i:i+self.batch_size] 
               for i in range(0, len(results), self.batch_size)]
    
    # Process all batches concurrently
    tasks = [self._score_batch_async(query, batch) for batch in batches]
    batch_results = await asyncio.gather(*tasks)
    
    # Flatten and return
    return [item for batch in batch_results for item in batch]
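
The gather() call above fans out to a per-batch helper. A minimal sketch of that helper, assuming the provider exposes only the synchronous generate_text() shown earlier; _semaphore, _build_prompt(), and _parse_scores() are illustrative names, not existing code:

async def _score_batch_async(self, query: str, batch: list[QueryResult]) -> list[QueryResult]:
    """Score one batch without blocking the event loop."""
    async with self._semaphore:  # e.g. asyncio.Semaphore(8) for WatsonX
        # to_thread() runs the blocking provider call in a worker thread,
        # so other batches keep making progress while this one waits.
        responses = await asyncio.to_thread(
            self.llm_provider.generate_text, self._build_prompt(query, batch)
        )
    return self._parse_scores(batch, responses)

The semaphore keeps total in-flight calls within the provider limits listed under Root Cause, even when the document count produces many batches.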

Expected Impact

Performance Improvements

  • Reranking time: 8-12s → 4-6s (50-60% reduction)
  • Overall query time: 52-56s → 48-50s (~8-11% improvement)
  • Best case (small queries <10 docs): 2-3s reranking

User Experience

  • Faster query responses
  • Better utilization of LLM provider resources
  • Reduced timeout risk

Implementation Plan

Phase 1: Async Reranking (High Priority - This Issue)

  • Convert _score_documents() to async method
  • Implement concurrent batch processing with asyncio.gather()
  • Add provider-aware concurrency limits
  • Add performance telemetry/logging
  • Update PipelineService._apply_reranking() for async compatibility (integration sketch below)

Expected Impact: 50-60% reranking improvement
Effort: 1-2 days
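
One possible shape for the pipeline-side change, combining the awaited call, a timeout, and timing telemetry (the 30-second budget and the reranker attribute are placeholders, not settled design):

import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def _apply_reranking(self, query: str, results: list[QueryResult]) -> list[QueryResult]:
    start = time.perf_counter()
    try:
        reranked = await asyncio.wait_for(
            self.reranker._score_documents_async(query, results),
            timeout=30.0,  # placeholder budget: degrade gracefully, don't hang
        )
    except asyncio.TimeoutError:
        logger.warning("Reranking timed out; returning retrieval order")
        return results
    logger.info("Reranked %d docs in %.2fs", len(results), time.perf_counter() - start)
    return reranked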

Phase 2: Provider Optimization (Future)

  • Add provider-specific batch size configuration
  • Make concurrency limits configurable per provider
  • Update config schema (illustrative sketch below)

Expected Impact: +10-20% additional improvement
Effort: 1 day
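
If backend/core/config.py follows a pydantic-settings pattern (an assumption for this sketch, as are all field names and defaults), the Phase 2 knobs could look like:

from pydantic_settings import BaseSettings

class RerankerSettings(BaseSettings):
    reranker_batch_size: int = 10            # documents per LLM call
    reranker_concurrency_watsonx: int = 8    # mirrors WatsonX concurrency_limit
    reranker_concurrency_default: int = 10   # semaphore size for OpenAI/Anthropic
    reranker_telemetry_enabled: bool = True  # emit per-query timing logs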

Phase 3: Adaptive Optimization (Future)

  • Implement adaptive batch sizing based on document count (heuristic sketch below)
  • Add smart caching for repeated queries (optional)
  • Performance monitoring dashboard

Expected Impact: +20-30% for small queries
Effort: 1-2 days
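
A possible heuristic for the adaptive sizing, shrinking batches for small result sets so even a 5-document query fans out across several concurrent calls (function name and defaults are illustrative):

import math

def adaptive_batch_size(doc_count: int, max_concurrency: int = 8, default: int = 10) -> int:
    """Pick a batch size whose batch count roughly fills the concurrency budget."""
    # 20 docs with concurrency 8 -> size 3 (7 concurrent batches);
    # 5 docs -> size 1 (5 concurrent batches instead of one sequential call).
    return max(1, min(default, math.ceil(doc_count / max_concurrency)))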

Files to Modify

Core Implementation

  1. backend/rag_solution/retrieval/reranker.py (Lines 128-179)

    • Convert _score_documents to async
    • Implement concurrent batch processing
    • Add performance telemetry
  2. backend/rag_solution/services/pipeline_service.py (Lines 222-264)

    • Update _apply_reranking() for async compatibility
    • Add performance logging
    • Add timeout handling
  3. backend/core/config.py (Lines 154-162)

    • Add reranker concurrency limit settings
    • Add telemetry flags

Testing

  1. tests/unit/services/test_pipeline_reranking_order.py

    • Add async reranking tests
    • Add concurrent processing tests
  2. New file: tests/unit/retrieval/test_reranker_performance.py

    • Performance regression tests
    • Concurrency limit tests
    • Provider-specific tests

Acceptance Criteria

  • Reranking uses concurrent batch processing (asyncio.gather)
  • Reranking time reduced by 50% (12s → 6s for 20 docs)
  • Works with all LLM providers (WatsonX, OpenAI, Anthropic)
  • No degradation in reranking accuracy
  • All existing tests pass
  • New tests added for async code paths
  • Performance telemetry logs show improvements
  • Documentation updated

Testing Strategy

Unit Tests

# Test async reranking
poetry run pytest tests/unit/retrieval/test_reranker.py -v

# Test concurrent processing
poetry run pytest tests/unit/retrieval/test_reranker_performance.py -v

# Regression tests
poetry run pytest tests/unit/services/test_pipeline_reranking_order.py -v
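
A shape the concurrency test could take, assuming pytest-asyncio, a reranker fixture built with batch_size=10, and monkeypatching the per-batch helper (all of which are guesses about the test harness):

import asyncio
import time
import pytest

@pytest.mark.asyncio
async def test_batches_overlap(monkeypatch, reranker):
    async def fake_score_batch(query, batch):
        await asyncio.sleep(0.1)  # stand-in for one LLM round-trip
        return batch

    monkeypatch.setattr(reranker, "_score_batch_async", fake_score_batch)

    start = time.perf_counter()
    scored = await reranker._score_documents_async("q", list(range(20)))

    assert len(scored) == 20                   # no documents dropped
    assert time.perf_counter() - start < 0.15  # 2 batches overlapped, not summed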

Performance Benchmarks

# Before optimization (baseline)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark

# After optimization (validation)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark
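
Until a dedicated benchmark harness lands, a coarse regression guard could simply assert the concurrent path beats the sequential baseline (fixtures and threshold below are assumptions):

import asyncio
import time

def test_rerank_20_docs_beats_sequential_baseline(reranker, twenty_docs):
    start = time.perf_counter()
    asyncio.run(reranker._score_documents_async("query", twenty_docs))
    elapsed = time.perf_counter() - start
    assert elapsed < 8.0  # sequential baseline was 8-12s for 20 docs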

Related Issues

Completed (Building Blocks)

Future Work

References

  • Code: backend/rag_solution/retrieval/reranker.py lines 128-179
  • Provider implementations:
    • WatsonX: backend/rag_solution/generation/providers/watsonx.py lines 208-231
    • Anthropic: backend/rag_solution/generation/providers/anthropic.py lines 128-155
    • OpenAI: backend/rag_solution/generation/providers/openai.py lines 141-163
  • P0-2 Documentation: docs/fixes/PIPELINE_RERANKING_ORDER_FIX.md line 109

Success Metrics

Before (Baseline)

  • Reranking time: 8-12s (sequential batches)
  • Overall query time: 52-56s
  • LLM provider utilization: Low (sequential calls)

After (Target)

  • Reranking time: 4-6s (concurrent batches) ✅ 50% reduction
  • Overall query time: 48-50s ✅ 8-12% improvement
  • LLM provider utilization: High (concurrent calls)

Priority: P0 - Critical Performance Optimization
Effort: 1-2 days (Phase 1)
Risk: Low (async refactoring, existing provider support)
ROI: High (50-60% reranking improvement with minimal effort)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Labels

P0 (Critical priority - highest user impact), enhancement (New feature or request), performance (Performance optimization), rag (RAG pipeline and search)
