
[P0-3] Performance Optimization - Concurrent Batch Reranking #545

@manavgup

Description

Priority

🔴 P0-3 - Critical Performance Optimization

Problem

The RAG pipeline's reranking stage processes document batches sequentially, causing unnecessary latency:

Current Performance (20 documents, batch_size=10)

  • Batch 1 (10 docs): ~4-6 seconds
  • Batch 2 (10 docs): ~4-6 seconds
  • Total reranking time: ~8-12 seconds

Impact

  • Adds 8-12 seconds to every query with reranking enabled
  • Underutilizes LLM provider concurrent processing capabilities
  • Blocks other pipeline stages unnecessarily

Root Cause

Sequential batch processing in LLMReranker._score_documents() (lines 140-179):

# Current implementation - sequential
for i in range(0, len(results), self.batch_size):
    batch = results[i : i + self.batch_size]
    responses = self.llm_provider.generate_text(...)  # blocks; the next batch waits

Provider capabilities being underutilized:

  • WatsonX: Native concurrency_limit=8
  • OpenAI/Anthropic: Async with semaphore=10
  • All providers support concurrent batch processing
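
One way to keep those limits in one place is a small provider-to-fan-out lookup; the values mirror the numbers above, while the lookup itself and its conservative default are assumptions for this sketch:

PROVIDER_CONCURRENCY = {"watsonx": 8, "openai": 10, "anthropic": 10}

def concurrency_for(provider_name: str) -> int:
    """Return a safe concurrent-call budget for a provider."""
    return PROVIDER_CONCURRENCY.get(provider_name.lower(), 4)  # unknown providers stay cautious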

Solution - Phase 1: Concurrent Batch Processing

Convert reranking to async concurrent processing:

import asyncio

async def _score_documents_async(self, query: str, results: list[QueryResult]) -> list[QueryResult]:
    """Score documents with concurrent batch processing."""
    
    # Split into batches
    batches = [results[i:i+self.batch_size] 
               for i in range(0, len(results), self.batch_size)]
    
    # Process all batches concurrently
    tasks = [self._score_batch_async(query, batch) for batch in batches]
    batch_results = await asyncio.gather(*tasks)
    
    # Flatten and return
    return [item for batch in batch_results for item in batch]
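
The gather() call above fans out to a per-batch helper. A minimal sketch of that helper, assuming the provider exposes only the synchronous generate_text() shown earlier; _semaphore, _build_prompt(), and _parse_scores() are illustrative names, not existing code:

async def _score_batch_async(self, query: str, batch: list[QueryResult]) -> list[QueryResult]:
    """Score one batch without blocking the event loop."""
    async with self._semaphore:  # e.g. asyncio.Semaphore(8) for WatsonX
        # to_thread() runs the blocking provider call in a worker thread,
        # so other batches keep making progress while this one waits.
        responses = await asyncio.to_thread(
            self.llm_provider.generate_text, self._build_prompt(query, batch)
        )
    return self._parse_scores(batch, responses)

The semaphore keeps total in-flight calls within the provider limits listed under Root Cause, even when the document count produces many batches.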

Expected Impact

Performance Improvements

  • Reranking time: 8-12s → 4-6s (50-60% reduction)
  • Overall query time: 52-56s → 48-50s (~8-11% improvement)
  • Best case (small queries <10 docs): 2-3s reranking

User Experience

  • Faster query responses
  • Better utilization of LLM provider resources
  • Reduced timeout risk

Implementation Plan

Phase 1: Async Reranking (High Priority - This Issue)

  • Convert _score_documents() to async method
  • Implement concurrent batch processing with asyncio.gather()
  • Add provider-aware concurrency limits
  • Add performance telemetry/logging
  • Update PipelineService._apply_reranking() for async compatibility (integration sketch below)

Expected Impact: 50-60% reranking improvement
Effort: 1-2 days
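
One possible shape for the pipeline-side change, combining the awaited call, a timeout, and timing telemetry (the 30-second budget and the reranker attribute are placeholders, not settled design):

import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def _apply_reranking(self, query: str, results: list[QueryResult]) -> list[QueryResult]:
    start = time.perf_counter()
    try:
        reranked = await asyncio.wait_for(
            self.reranker._score_documents_async(query, results),
            timeout=30.0,  # placeholder budget: degrade gracefully, don't hang
        )
    except asyncio.TimeoutError:
        logger.warning("Reranking timed out; returning retrieval order")
        return results
    logger.info("Reranked %d docs in %.2fs", len(results), time.perf_counter() - start)
    return reranked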

Phase 2: Provider Optimization (Future)

  • Add provider-specific batch size configuration
  • Make concurrency limits configurable per provider
  • Update config schema (illustrative sketch below)

Expected Impact: +10-20% additional improvement
Effort: 1 day
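
If backend/core/config.py follows a pydantic-settings pattern (an assumption for this sketch, as are all field names and defaults), the Phase 2 knobs could look like:

from pydantic_settings import BaseSettings

class RerankerSettings(BaseSettings):
    reranker_batch_size: int = 10            # documents per LLM call
    reranker_concurrency_watsonx: int = 8    # mirrors WatsonX concurrency_limit
    reranker_concurrency_default: int = 10   # semaphore size for OpenAI/Anthropic
    reranker_telemetry_enabled: bool = True  # emit per-query timing logs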

Phase 3: Adaptive Optimization (Future)

  • Implement adaptive batch sizing based on document count (heuristic sketch below)
  • Add smart caching for repeated queries (optional)
  • Performance monitoring dashboard

Expected Impact: +20-30% for small queries
Effort: 1-2 days
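
A possible heuristic for the adaptive sizing, shrinking batches for small result sets so even a 5-document query fans out across several concurrent calls (function name and defaults are illustrative):

import math

def adaptive_batch_size(doc_count: int, max_concurrency: int = 8, default: int = 10) -> int:
    """Pick a batch size whose batch count roughly fills the concurrency budget."""
    # 20 docs with concurrency 8 -> size 3 (7 concurrent batches);
    # 5 docs -> size 1 (5 concurrent batches instead of one sequential call).
    return max(1, min(default, math.ceil(doc_count / max_concurrency)))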

Files to Modify

Core Implementation

  1. backend/rag_solution/retrieval/reranker.py (Lines 128-179)

    • Convert _score_documents to async
    • Implement concurrent batch processing
    • Add performance telemetry
  2. backend/rag_solution/services/pipeline_service.py (Lines 222-264)

    • Update _apply_reranking() for async compatibility
    • Add performance logging
    • Add timeout handling
  3. backend/core/config.py (Lines 154-162)

    • Add reranker concurrency limit settings
    • Add telemetry flags

Testing

  1. tests/unit/services/test_pipeline_reranking_order.py

    • Add async reranking tests
    • Add concurrent processing tests
  2. New file: tests/unit/retrieval/test_reranker_performance.py

    • Performance regression tests
    • Concurrency limit tests
    • Provider-specific tests

Acceptance Criteria

  • Reranking uses concurrent batch processing (asyncio.gather)
  • Reranking time reduced by 50% (12s → 6s for 20 docs)
  • Works with all LLM providers (WatsonX, OpenAI, Anthropic)
  • No degradation in reranking accuracy
  • All existing tests pass
  • New tests added for async code paths
  • Performance telemetry logs show improvements
  • Documentation updated

Testing Strategy

Unit Tests

# Test async reranking
poetry run pytest tests/unit/retrieval/test_reranker.py -v

# Test concurrent processing
poetry run pytest tests/unit/retrieval/test_reranker_performance.py -v

# Regression tests
poetry run pytest tests/unit/services/test_pipeline_reranking_order.py -v
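
A shape the concurrency test could take, assuming pytest-asyncio, a reranker fixture built with batch_size=10, and monkeypatching the per-batch helper (all of which are guesses about the test harness):

import asyncio
import time
import pytest

@pytest.mark.asyncio
async def test_batches_overlap(monkeypatch, reranker):
    async def fake_score_batch(query, batch):
        await asyncio.sleep(0.1)  # stand-in for one LLM round-trip
        return batch

    monkeypatch.setattr(reranker, "_score_batch_async", fake_score_batch)

    start = time.perf_counter()
    scored = await reranker._score_documents_async("q", list(range(20)))

    assert len(scored) == 20                   # no documents dropped
    assert time.perf_counter() - start < 0.15  # 2 batches overlapped, not summed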

Performance Benchmarks

# Before optimization (baseline)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark

# After optimization (validation)
poetry run pytest tests/performance/test_reranking_performance.py --benchmark
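
Until a dedicated benchmark harness lands, a coarse regression guard could simply assert the concurrent path beats the sequential baseline (fixtures and threshold below are assumptions):

import asyncio
import time

def test_rerank_20_docs_beats_sequential_baseline(reranker, twenty_docs):
    start = time.perf_counter()
    asyncio.run(reranker._score_documents_async("query", twenty_docs))
    elapsed = time.perf_counter() - start
    assert elapsed < 8.0  # sequential baseline was 8-12s for 20 docs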

Related Issues

Completed (Building Blocks)

Future Work

References

  • Code: backend/rag_solution/retrieval/reranker.py lines 128-179
  • Provider implementations:
    • WatsonX: backend/rag_solution/generation/providers/watsonx.py lines 208-231
    • Anthropic: backend/rag_solution/generation/providers/anthropic.py lines 128-155
    • OpenAI: backend/rag_solution/generation/providers/openai.py lines 141-163
  • P0-2 Documentation: docs/fixes/PIPELINE_RERANKING_ORDER_FIX.md line 109

Success Metrics

Before (Baseline)

  • Reranking time: 8-12s (sequential batches)
  • Overall query time: 52-56s
  • LLM provider utilization: Low (sequential calls)

After (Target)

  • Reranking time: 4-6s (concurrent batches) ✅ 50% reduction
  • Overall query time: 48-50s ✅ 8-12% improvement
  • LLM provider utilization: High (concurrent calls)

Priority: P0 - Critical Performance Optimization
Effort: 1-2 days (Phase 1)
Risk: Low (async refactoring, existing provider support)
ROI: High (50-60% reranking improvement with minimal effort)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Labels

P0 (Critical priority - highest user impact), enhancement (New feature or request), performance (Performance optimization), rag (RAG pipeline and search)
