🔧 ENHANCEMENT: Improve Token Counting Accuracy and Consistency #236

Summary

The token counting implementation in the RAG system relies on several ad-hoc heuristics with accuracy and consistency problems that must be addressed for production reliability and for correct billing and resource management.

Current Status Analysis

Token Counting Methods Found:

  1. Conversation Service (conversation_service.py:271):

    user_token_count = max(5, int(len(message_input.content.split()) * 1.3))  # Rough estimation
  2. Search Service (search_service.py:870):

    estimated_tokens = len(total_text) // 4  # ~4 characters per token
    estimated_tokens += 50  # Add overhead
    return max(50, estimated_tokens)  # Minimum 50 tokens
  3. Conversation Summarization Service (conversation_summarization_service.py:453):

    async def _estimate_tokens(self, text: str) -> int:
        # Simple estimation: ~4 characters per token for English text
        return len(text) // 4
  4. Data Ingestion/Chunking (chunking.py):
    Uses actual tokenization, but its counts are inconsistent with the estimates produced by the other services

Issues Identified:

1. Inconsistent Estimation Methods

  • Word-based estimation (len(text.split()) * 1.3)
  • Character-based estimation (len(text) // 4)
  • Hardcoded minimums and overheads
  • No standardized approach across services
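
The spread between these heuristics is easy to demonstrate (illustrative snippet, not from the codebase):

text = "def tokenize(s): return s.split()  # short code snippet"

word_based = max(5, int(len(text.split()) * 1.3))  # conversation_service approach
char_based = max(50, len(text) // 4 + 50)          # search_service approach

print(word_based, char_based)  # 10 vs. 63: the two estimates disagree by ~6x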

2. Inaccurate Estimations

  • Simple heuristics don't account for:
    • Special tokens (system prompts, formatting)
    • Different tokenizers (GPT vs IBM vs Anthropic)
    • Code vs natural language text
    • Multilingual content
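
These tokenizer differences are measurable: two tiktoken encodings already disagree on the same code snippet (illustrative):

import tiktoken

code = "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    # The older GPT-2 encoding and the newer cl100k_base encoding
    # typically produce different counts, especially for code.
    print(name, len(enc.encode(code)))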

3. Provider-Specific Token Counting Not Implemented

  • Code exists to call provider.client.tokenize() but falls back to rough estimates (see the wrapper sketch after this list)
  • Different LLM providers use different tokenizers
  • No model-specific token counting
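
One way to honor the provider tokenizer while keeping a fallback is a defensive wrapper; this is a sketch, and the tokenize() return shape is an assumption that varies by SDK:

def count_tokens_with_provider(text: str, provider) -> int:
    """Prefer the provider's own tokenizer; fall back to a rough heuristic."""
    try:
        # provider.client.tokenize() is referenced in the codebase, but its
        # return type is SDK-specific, so a list of tokens is assumed here.
        return len(provider.client.tokenize(text))
    except (AttributeError, NotImplementedError):
        # Last resort: the existing ~4 characters per token heuristic.
        return max(1, len(text) // 4)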

4. Missing Real Token Usage Tracking

  • TokenUsageStats returns hardcoded zeros for total_tokens and total_calls
  • No actual accumulation of real token usage from LLM responses
  • Token warnings based on estimates, not real usage

5. Test Failures Related to Token Counting

  • Mock objects causing validation errors in tests
  • Token count fields expecting integers but receiving Mock objects
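
The standard fix is to pin concrete integers on the mock (or construct it with spec=) so schema validation sees real values; field names below are illustrative:

from unittest.mock import Mock

# A bare Mock attribute is itself a Mock, which fails integer validation:
bad = Mock()
assert not isinstance(bad.token_count, int)

# Pinning a concrete value lets downstream schema validation pass:
good = Mock()
good.token_count = 42
assert isinstance(good.token_count, int)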

Success Criteria

Phase 1: Foundation

  • Implement standardized token estimation utility
  • Replace all inconsistent estimation methods
  • Add proper model-specific tokenizers (tiktoken for OpenAI, etc.)
  • Fix failing unit tests related to token counting

Phase 2: Provider Integration

  • Implement real token counting from LLM provider responses
  • Extract actual token usage from provider APIs (OpenAI, IBM Watson, etc.; see the sketch after this list)
  • Store real token usage in database
  • Update token warning system to use real data
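
For OpenAI-compatible APIs, exact usage is already attached to every response; a sketch with the openai v1 client (other providers expose similar fields):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

# Provider-reported counts, suitable for persisting instead of estimates:
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)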

Phase 3: Advanced Features

  • Model-specific context window limits
  • Accurate billing/usage reporting
  • Token optimization suggestions
  • Historical usage analytics

Proposed Implementation

1. Create Token Utility Service

class TokenCountingService:
    def estimate_tokens(self, text: str, model_name: str = "gpt-3.5-turbo") -> int:
        """Accurate token estimation using appropriate tokenizer"""
        
    def count_tokens_with_provider(self, text: str, provider: LLMBase) -> int:
        """Get exact token count from LLM provider"""
        
    def get_context_limit(self, model_name: str) -> int:
        """Get context window size for model"""

2. Update Services

  • Replace all estimation methods with standardized service
  • Implement real token tracking from LLM responses
  • Update schemas to handle proper token data

3. Database Updates

  • Store real token usage per message/session
  • Add token usage history tables
  • Implement efficient queries for analytics
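
A hypothetical SQLAlchemy model for the history table (names and columns are assumptions, not the project's actual schema):

from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TokenUsageRecord(Base):
    """One row per LLM call; aggregate per session or user for analytics."""
    __tablename__ = "token_usage_history"

    id = Column(Integer, primary_key=True)
    session_id = Column(String, index=True, nullable=False)
    model_name = Column(String, nullable=False)
    prompt_tokens = Column(Integer, nullable=False)
    completion_tokens = Column(Integer, nullable=False)
    total_tokens = Column(Integer, nullable=False)
    created_at = Column(DateTime, server_default=func.now())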

Test Validation Plan

Unit Tests

def test_token_estimation_accuracy():
    """Test estimation within 10% of actual count"""
    
def test_provider_specific_counting():
    """Test different providers return correct counts"""
    
def test_context_limit_validation():
    """Test warnings trigger at correct thresholds"""

Integration Tests

def test_conversation_token_tracking():
    """Test full conversation flow tracks tokens correctly"""
    
def test_search_token_accuracy():
    """Test search service token counting"""

Performance Tests

def test_token_counting_performance():
    """Ensure token counting doesn't add significant latency"""

Files to Modify

Core Services

  • rag_solution/services/conversation_service.py - Fix estimation method
  • rag_solution/services/search_service.py - Standardize token counting
  • rag_solution/services/token_tracking_service.py - Implement real tracking
  • rag_solution/services/conversation_summarization_service.py - Fix estimation

New Files

  • rag_solution/services/token_counting_service.py - Centralized token utilities
  • rag_solution/utils/tokenizers.py - Model-specific tokenizer support

Schemas

  • rag_solution/schemas/llm_usage_schema.py - Enhanced usage tracking
  • rag_solution/schemas/conversation_schema.py - Fix token validation

Tests

  • tests/unit/test_token_counting_service.py - Comprehensive token testing
  • tests/integration/test_token_tracking_integration.py - End-to-end validation

Dependencies

Required Packages

tiktoken>=0.5.0        # OpenAI tokenizer
transformers>=4.30.0   # HuggingFace tokenizers for IBM models
sentencepiece>=0.1.99  # For various model tokenizers

Acceptance Criteria

  1. Accuracy: Token estimates within 5-10% of actual counts
  2. Consistency: All services use same counting method
  3. Performance: <50ms overhead for token counting
  4. Coverage: Support for all integrated LLM providers
  5. Testing: >90% test coverage for token-related functionality
  6. Documentation: Clear usage examples and model support matrix

Priority: High

This issue affects billing accuracy, resource management, and user experience with context limits.

Labels

enhancement, token-tracking, accuracy, production-ready
