Summary
The current token counting implementation in the RAG system has several accuracy and consistency issues that need to be addressed for production reliability and proper billing/resource management.
Current Status Analysis
Token Counting Methods Found:
- Conversation Service (`conversation_service.py:271`):
  ```python
  user_token_count = max(5, int(len(message_input.content.split()) * 1.3))  # Rough estimation
  ```
- Search Service (`search_service.py:870`):
  ```python
  estimated_tokens = len(total_text) // 4  # ~4 characters per token
  estimated_tokens += 50  # Add overhead
  return max(50, estimated_tokens)  # Minimum 50 tokens
  ```
- Conversation Summarization Service (`conversation_summarization_service.py:453`):
  ```python
  async def _estimate_tokens(self, text: str) -> int:
      # Simple estimation: ~4 characters per token for English text
  ```
- Data Ingestion/Chunking (`chunking.py`): Uses actual tokenization but inconsistent with other services
Issues Identified:
1. Inconsistent Estimation Methods
- Word-based estimation (`len(text.split()) * 1.3`)
- Character-based estimation (`len(text) // 4`)
- Hardcoded minimums and overheads
- No standardized approach across services
2. Inaccurate Estimations
- Simple heuristics don't account for:
  - Special tokens (system prompts, formatting)
  - Different tokenizers (GPT vs IBM vs Anthropic)
  - Code vs natural language text
  - Multilingual content
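The error also varies by content type. A quick way to see how far the character heuristic drifts from a real tokenizer on plain English, code, and non-English text (cl100k_base is the encoding used by GPT-3.5/GPT-4-era models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = ["Hello, how are you today?", "def f(x): return x * 2", "数据检索与生成系统"]
for sample in samples:
    heuristic = len(sample) // 4          # current character-based estimate
    actual = len(enc.encode(sample))      # real tokenizer count
    print(f"{sample!r}: heuristic={heuristic}, tokenizer={actual}")
```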
3. Provider-Specific Token Counting Not Implemented
- Code exists to call `provider.client.tokenize()` but falls back to rough estimates
- Different LLM providers use different tokenizers
- No model-specific token counting
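The fallback pattern described above presumably looks something like the sketch below. The `provider.client.tokenize()` call is taken from this issue, but the signature, return type, and exception handling are assumptions, not the repository's actual code:

```python
def count_tokens_via_provider(text: str, provider) -> int:
    """Sketch only: prefer the provider's own tokenizer, fall back to the rough heuristic."""
    try:
        # Assumed to return a sequence of tokens; the real return type may differ per provider.
        return len(provider.client.tokenize(text))
    except (AttributeError, NotImplementedError):
        # Rough character-based estimate mirroring the current search_service fallback
        return max(50, len(text) // 4 + 50)
```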
4. Missing Real Token Usage Tracking
- `TokenUsageStats` returns hardcoded zeros for `total_tokens` and `total_calls`
- No actual accumulation of real token usage from LLM responses
- Token warnings based on estimates, not real usage
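For reference, most provider SDKs already return exact usage on every response, so no estimation is needed once it is captured. A minimal sketch with the OpenAI Python SDK (v1.x); the model name is illustrative, and IBM watsonx.ai generation responses expose comparable token counts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exact token usage is attached to every chat completion - no estimation required.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "How do I count tokens accurately?"}],
)
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```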
5. Test Failures Related to Token Counting
- Mock objects causing validation errors in tests
- Token count fields expecting integers but receiving Mock objects
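A likely fix pattern for the Mock-related failures, shown as a hedged sketch (the `usage` attribute names are assumptions about the response schema): give mocked LLM responses concrete integer token fields so schema validation of the token counts passes.

```python
from unittest.mock import MagicMock

# Pin real integers on the mocked response instead of letting attribute access
# return nested Mock objects, which fail integer validation downstream.
mock_response = MagicMock()
mock_response.usage.prompt_tokens = 42
mock_response.usage.completion_tokens = 17
mock_response.usage.total_tokens = 59
```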
Success Criteria
Phase 1: Foundation
- Implement standardized token estimation utility
- Replace all inconsistent estimation methods
- Add proper model-specific tokenizers (tiktoken for OpenAI, etc.)
- Fix failing unit tests related to token counting
Phase 2: Provider Integration
- Implement real token counting from LLM provider responses
- Extract actual token usage from provider APIs (OpenAI, IBM Watson, etc.)
- Store real token usage in database
- Update token warning system to use real data
Phase 3: Advanced Features
- Model-specific context window limits
- Accurate billing/usage reporting
- Token optimization suggestions
- Historical usage analytics
Proposed Implementation
1. Create Token Utility Service
```python
class TokenCountingService:
    def estimate_tokens(self, text: str, model_name: str = "gpt-3.5-turbo") -> int:
        """Accurate token estimation using appropriate tokenizer"""

    def count_tokens_with_provider(self, text: str, provider: LLMBase) -> int:
        """Get exact token count from LLM provider"""

    def get_context_limit(self, model_name: str) -> int:
        """Get context window size for model"""
```
2. Update Services
- Replace all estimation methods with standardized service
- Implement real token tracking from LLM responses
- Update schemas to handle proper token data
3. Database Updates
- Store real token usage per message/session
- Add token usage history tables
- Implement efficient queries for analytics
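One way the per-message/session storage could look, as a hedged SQLAlchemy sketch; the table and column names are hypothetical and not taken from the existing schema:

```python
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TokenUsageRecord(Base):
    """Hypothetical per-call usage row; names are illustrative only."""
    __tablename__ = "token_usage_records"

    id = Column(Integer, primary_key=True)
    session_id = Column(String, index=True, nullable=False)
    model_name = Column(String, nullable=False)
    prompt_tokens = Column(Integer, nullable=False)
    completion_tokens = Column(Integer, nullable=False)
    total_tokens = Column(Integer, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
```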
Test Validation Plan
Unit Tests
```python
def test_token_estimation_accuracy():
    """Test estimation within 10% of actual count"""

def test_provider_specific_counting():
    """Test different providers return correct counts"""

def test_context_limit_validation():
    """Test warnings trigger at correct thresholds"""
```
Integration Tests
```python
def test_conversation_token_tracking():
    """Test full conversation flow tracks tokens correctly"""

def test_search_token_accuracy():
    """Test search service token counting"""
```
Performance Tests
```python
def test_token_counting_performance():
    """Ensure token counting doesn't add significant latency"""
```
Files to Modify
Core Services
- `rag_solution/services/conversation_service.py` - Fix estimation method
- `rag_solution/services/search_service.py` - Standardize token counting
- `rag_solution/services/token_tracking_service.py` - Implement real tracking
- `rag_solution/services/conversation_summarization_service.py` - Fix estimation
New Files
- `rag_solution/services/token_counting_service.py` - Centralized token utilities
- `rag_solution/utils/tokenizers.py` - Model-specific tokenizer support
Schemas
- `rag_solution/schemas/llm_usage_schema.py` - Enhanced usage tracking
- `rag_solution/schemas/conversation_schema.py` - Fix token validation
Tests
- `tests/unit/test_token_counting_service.py` - Comprehensive token testing
- `tests/integration/test_token_tracking_integration.py` - End-to-end validation
Dependencies
Required Packages
```
tiktoken>=0.5.0        # OpenAI tokenizer
transformers>=4.30.0   # HuggingFace tokenizers for IBM models
sentencepiece>=0.1.99  # For various model tokenizers
```
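For IBM/HuggingFace-hosted models, counting would go through transformers rather than tiktoken; a minimal sketch, where the Granite model id is illustrative and should be swapped for whichever models the deployment actually serves:

```python
from transformers import AutoTokenizer

# Illustrative model id - use the actual model(s) configured for the deployment.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")
token_count = len(tokenizer.encode("How many tokens is this sentence?"))
```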
Acceptance Criteria
- Accuracy: Token estimates within 5-10% of actual counts
- Consistency: All services use same counting method
- Performance: <50ms overhead for token counting
- Coverage: Support for all integrated LLM providers
- Testing: >90% test coverage for token-related functionality
- Documentation: Clear usage examples and model support matrix
Priority: High
This issue affects billing accuracy, resource management, and user experience with context limits.
Labels
enhancement, token-tracking, accuracy, production-ready