Summary
The current token counting implementation in the RAG system has several accuracy and consistency issues that need to be addressed for production reliability and proper billing/resource management.
Current Status Analysis
Token Counting Methods Found:
- Conversation Service (`conversation_service.py:271`):
  ```python
  user_token_count = max(5, int(len(message_input.content.split()) * 1.3))  # Rough estimation
  ```
- Search Service (`search_service.py:870`):
  ```python
  estimated_tokens = len(total_text) // 4  # ~4 characters per token
  estimated_tokens += 50  # Add overhead
  return max(50, estimated_tokens)  # Minimum 50 tokens
  ```
- Conversation Summarization Service (`conversation_summarization_service.py:453`):
  ```python
  async def _estimate_tokens(self, text: str) -> int:
      # Simple estimation: ~4 characters per token for English text
  ```
- Data Ingestion/Chunking (`chunking.py`): Uses actual tokenization but inconsistent with other services
Issues Identified:
1. Inconsistent Estimation Methods
- Word-based estimation (`len(text.split()) * 1.3`)
- Character-based estimation (`len(text) // 4`)
- Hardcoded minimums and overheads
- No standardized approach across services
2. Inaccurate Estimations
- Simple heuristics don't account for:
  - Special tokens (system prompts, formatting)
  - Different tokenizers (GPT vs IBM vs Anthropic)
  - Code vs natural language text
  - Multilingual content
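The error also varies by content type. A quick way to see how far the character heuristic drifts from a real tokenizer on plain English, code, and non-English text (cl100k_base is the encoding used by GPT-3.5/GPT-4-era models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = ["Hello, how are you today?", "def f(x): return x * 2", "数据检索与生成系统"]
for sample in samples:
    heuristic = len(sample) // 4          # current character-based estimate
    actual = len(enc.encode(sample))      # real tokenizer count
    print(f"{sample!r}: heuristic={heuristic}, tokenizer={actual}")
```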
3. Provider-Specific Token Counting Not Implemented
- Code exists to call `provider.client.tokenize()` but falls back to rough estimates
- Different LLM providers use different tokenizers
- No model-specific token counting
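The fallback pattern described above presumably looks something like the sketch below. The `provider.client.tokenize()` call is taken from this issue, but the signature, return type, and exception handling are assumptions, not the repository's actual code:

```python
def count_tokens_via_provider(text: str, provider) -> int:
    """Sketch only: prefer the provider's own tokenizer, fall back to the rough heuristic."""
    try:
        # Assumed to return a sequence of tokens; the real return type may differ per provider.
        return len(provider.client.tokenize(text))
    except (AttributeError, NotImplementedError):
        # Rough character-based estimate mirroring the current search_service fallback
        return max(50, len(text) // 4 + 50)
```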
4. Missing Real Token Usage Tracking
- `TokenUsageStats` returns hardcoded zeros for `total_tokens` and `total_calls`
- No actual accumulation of real token usage from LLM responses
- Token warnings based on estimates, not real usage
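For reference, most provider SDKs already return exact usage on every response, so no estimation is needed once it is captured. A minimal sketch with the OpenAI Python SDK (v1.x); the model name is illustrative, and IBM watsonx.ai generation responses expose comparable token counts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exact token usage is attached to every chat completion - no estimation required.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "How do I count tokens accurately?"}],
)
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```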
5. Test Failures Related to Token Counting
- Mock objects causing validation errors in tests
- Token count fields expecting integers but receiving Mock objects
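A likely fix pattern for the Mock-related failures, shown as a hedged sketch (the `usage` attribute names are assumptions about the response schema): give mocked LLM responses concrete integer token fields so schema validation of the token counts passes.

```python
from unittest.mock import MagicMock

# Pin real integers on the mocked response instead of letting attribute access
# return nested Mock objects, which fail integer validation downstream.
mock_response = MagicMock()
mock_response.usage.prompt_tokens = 42
mock_response.usage.completion_tokens = 17
mock_response.usage.total_tokens = 59
```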
Success Criteria
Phase 1: Foundation
- Implement standardized token estimation utility
- Replace all inconsistent estimation methods
- Add proper model-specific tokenizers (tiktoken for OpenAI, etc.)
- Fix failing unit tests related to token counting
Phase 2: Provider Integration
- Implement real token counting from LLM provider responses
- Extract actual token usage from provider APIs (OpenAI, IBM Watson, etc.)
- Store real token usage in database
- Update token warning system to use real data
Phase 3: Advanced Features
- Model-specific context window limits
- Accurate billing/usage reporting
- Token optimization suggestions
- Historical usage analytics
Proposed Implementation
1. Create Token Utility Service
```python
class TokenCountingService:
    def estimate_tokens(self, text: str, model_name: str = "gpt-3.5-turbo") -> int:
        """Accurate token estimation using appropriate tokenizer"""

    def count_tokens_with_provider(self, text: str, provider: LLMBase) -> int:
        """Get exact token count from LLM provider"""

    def get_context_limit(self, model_name: str) -> int:
        """Get context window size for model"""
```
2. Update Services
- Replace all estimation methods with standardized service
- Implement real token tracking from LLM responses
- Update schemas to handle proper token data
3. Database Updates
- Store real token usage per message/session
- Add token usage history tables
- Implement efficient queries for analytics
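One way the per-message/session storage could look, as a hedged SQLAlchemy sketch; the table and column names are hypothetical and not taken from the existing schema:

```python
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TokenUsageRecord(Base):
    """Hypothetical per-call usage row; names are illustrative only."""
    __tablename__ = "token_usage_records"

    id = Column(Integer, primary_key=True)
    session_id = Column(String, index=True, nullable=False)
    model_name = Column(String, nullable=False)
    prompt_tokens = Column(Integer, nullable=False)
    completion_tokens = Column(Integer, nullable=False)
    total_tokens = Column(Integer, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
```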
Test Validation Plan
Unit Tests
```python
def test_token_estimation_accuracy():
    """Test estimation within 10% of actual count"""

def test_provider_specific_counting():
    """Test different providers return correct counts"""

def test_context_limit_validation():
    """Test warnings trigger at correct thresholds"""
```
Integration Tests
```python
def test_conversation_token_tracking():
    """Test full conversation flow tracks tokens correctly"""

def test_search_token_accuracy():
    """Test search service token counting"""
```
Performance Tests
```python
def test_token_counting_performance():
    """Ensure token counting doesn't add significant latency"""
```
Files to Modify
Core Services
- `rag_solution/services/conversation_service.py` - Fix estimation method
- `rag_solution/services/search_service.py` - Standardize token counting
- `rag_solution/services/token_tracking_service.py` - Implement real tracking
- `rag_solution/services/conversation_summarization_service.py` - Fix estimation
New Files
- `rag_solution/services/token_counting_service.py` - Centralized token utilities
- `rag_solution/utils/tokenizers.py` - Model-specific tokenizer support
Schemas
- `rag_solution/schemas/llm_usage_schema.py` - Enhanced usage tracking
- `rag_solution/schemas/conversation_schema.py` - Fix token validation
Tests
- `tests/unit/test_token_counting_service.py` - Comprehensive token testing
- `tests/integration/test_token_tracking_integration.py` - End-to-end validation
Dependencies
Required Packages
```
tiktoken>=0.5.0        # OpenAI tokenizer
transformers>=4.30.0   # HuggingFace tokenizers for IBM models
sentencepiece>=0.1.99  # For various model tokenizers
```
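For IBM/HuggingFace-hosted models, counting would go through transformers rather than tiktoken; a minimal sketch, where the Granite model id is illustrative and should be swapped for whichever models the deployment actually serves:

```python
from transformers import AutoTokenizer

# Illustrative model id - use the actual model(s) configured for the deployment.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")
token_count = len(tokenizer.encode("How many tokens is this sentence?"))
```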
Acceptance Criteria
- Accuracy: Token estimates within 5-10% of actual counts
- Consistency: All services use same counting method
- Performance: <50ms overhead for token counting
- Coverage: Support for all integrated LLM providers
- Testing: >90% test coverage for token-related functionality
- Documentation: Clear usage examples and model support matrix
Priority: High
This issue affects billing accuracy, resource management, and user experience with context limits.
Labels
enhancement, token-tracking, accuracy, production-ready