Count the number of tokens in documents [COG-1071] #476

Merged: 5 commits merged into dev on Jan 29, 2025

Conversation

@alekszievr (Contributor) commented Jan 28, 2025

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features

    • Added token count tracking for documents and document chunks.
    • Enhanced document processing with token count information.
    • Introduced support for multiple tokenizers based on provider selection.
  • Improvements

    • Expanded data models to include token count attributes.
    • Implemented token count accumulation during document chunk extraction.
    • Updated configuration options for LLM and embedding functionalities.
    • Improved method signatures for better clarity in parameter usage.

coderabbitai bot (Contributor) commented Jan 28, 2025

Walkthrough

The changes introduce a new token_count attribute across multiple modules in the document processing system. This enhancement allows tracking of token counts at different levels: individual document chunks, entire documents, and during chunk extraction. The modification extends the existing data models to capture more granular information about document tokenization, providing additional metadata about the processed text. Additionally, the parameter max_tokens has been renamed to max_chunk_tokens in various methods, improving clarity in the context of chunk processing.
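
In concrete terms, the rename changes the read call shape. A minimal sketch, with the 1024 values taken from the updated tests and the surrounding setup assumed (document stands for any Document subclass touched by this PR):

    # Before: document.read(chunk_size=1024, chunker="text_chunker", max_tokens=1024)
    # After: the limit is passed as max_chunk_tokens, and every yielded
    # DocumentChunk now carries its own token_count.
    total_tokens = 0
    for chunk in document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=1024):
        total_tokens += chunk.token_count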

Changes

  • cognee/modules/chunking/TextChunker.py: Renamed max_tokens to max_chunk_tokens, updated methods to use the new parameter, and included token_count in DocumentChunk instances.
  • cognee/modules/chunking/models/DocumentChunk.py: Added a token_count: int attribute to the DocumentChunk class.
  • cognee/modules/data/processing/document_types/Document.py: Removed the max_tokens parameter from the read method.
  • cognee/tasks/documents/extract_chunks_from_documents.py: Introduced the update_document_token_count function and updated extract_chunks_from_documents to use max_chunk_tokens.
  • cognee/api/v1/cognify/code_graph_pipeline.py: Updated the extract_chunks_from_documents task to use max_chunk_tokens from get_max_chunk_tokens() (see the wiring sketch after this list).
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py: Added a provider attribute and a get_tokenizer method to support different tokenizers.
  • cognee/infrastructure/databases/vector/embeddings/config.py: Updated embedding_model to "openai/text-embedding-3-large" and added embedding_provider and embedding_max_tokens.
  • cognee/infrastructure/llm/config.py: Added an llm_max_tokens attribute to the LLMConfig class.
  • cognee/infrastructure/llm/generic_llm_api/adapter.py: Added a max_tokens parameter to the GenericAPIAdapter constructor.
  • cognee/infrastructure/llm/openai/adapter.py: Added a max_tokens parameter to the OpenAIAdapter constructor and updated the transcribe_image method.
  • cognee/modules/data/models/Data.py: Added a token_count column to the Data class.
  • cognee/modules/data/processing/document_types/AudioDocument.py: Renamed max_tokens to max_chunk_tokens in the read method.
  • cognee/modules/data/processing/document_types/ImageDocument.py: Renamed max_tokens to max_chunk_tokens in the read method.
  • cognee/modules/data/processing/document_types/PdfDocument.py: Renamed max_tokens to max_chunk_tokens in the read method.
  • cognee/modules/data/processing/document_types/TextDocument.py: Renamed max_tokens to max_chunk_tokens in the read method.
  • cognee/modules/data/processing/document_types/UnstructuredDocument.py: Renamed max_tokens to max_chunk_tokens in the read method.
  • cognee/tests/integration/documents/*.py: Updated tests to pass max_chunk_tokens in read method calls for the respective document types.
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py: Updated test functions to include the max_chunk_tokens parameter.
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py: Updated the test function call to include max_chunk_tokens.
  • cognee/shared/utils.py: Removed the token management functions num_tokens_from_string and trim_text_to_max_tokens.
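
For the code_graph_pipeline.py entry, a hedged sketch of how the limit might be computed and threaded into the task, assuming these import paths and that the task is consumed as an async generator; the actual pipeline wiring may differ:

    from cognee.infrastructure.llm.utils import get_max_chunk_tokens
    from cognee.tasks.documents.extract_chunks_from_documents import extract_chunks_from_documents

    # get_max_chunk_tokens() returns min(embedding engine max tokens, llm_max_tokens // 2),
    # so chunks fit the embedder while leaving half the LLM context for prompts and completions.
    async def chunk_all(documents):
        async for chunk in extract_chunks_from_documents(
            documents, max_chunk_tokens=get_max_chunk_tokens()
        ):
            yield chunk  # downstream tasks receive chunks that already carry token_count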

Sequence Diagram

sequenceDiagram
    participant Doc as Document
    participant Chunker as TextChunker
    participant Extractor as ChunkExtractor

    Doc->>Chunker: read()
    Chunker-->>Extractor: Yield DocumentChunk
    Extractor->>Extractor: Accumulate token_count
    Extractor->>Doc: Set total token_count
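
Read alongside the diagram, the accumulation step might look roughly like the sketch below. The helper name update_document_token_count and the defaults come from the change summary above (it is introduced in the same module); its exact signature and the async-generator shape are assumptions:

    async def extract_chunks_from_documents(
        documents, max_chunk_tokens: int, chunk_size: int = 1024, chunker: str = "text_chunker"
    ):
        for document in documents:
            document_token_count = 0
            for document_chunk in document.read(
                chunk_size=chunk_size, chunker=chunker, max_chunk_tokens=max_chunk_tokens
            ):
                document_token_count += document_chunk.token_count
                yield document_chunk
            # Persist the per-document total (it ends up in Data.token_count).
            await update_document_token_count(document.id, document_token_count)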

Poem

🐰 Tokens dancing, chunk by chunk,
Counting whispers, byte by byte,
A rabbit's tale of text unsung,
Metrics weaving data's delight,
Knowledge measured, wisdom bright! 📊

@alekszievr alekszievr self-assigned this Jan 28, 2025
@alekszievr alekszievr requested review from lxobr and dexters1 January 28, 2025 11:12

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
cognee/modules/chunking/TextChunker.py (1)

Line range hint 79-81: Improve error handling for DocumentChunk creation.

The current error handling only prints exceptions but continues execution, which could lead to invalid state or data loss.

Consider proper error handling:

         try:
             yield DocumentChunk(...)
         except Exception as e:
-            print(e)
+            logger.error(f"Failed to create DocumentChunk: {e}")
+            raise DocumentChunkCreationError(f"Failed to create chunk: {e}") from e

Also applies to: 101-103

🧹 Nitpick comments (1)
cognee/modules/chunking/TextChunker.py (1)

Line range hint 23-24: Add validation for token counts.

The code doesn't validate token count values before using them in calculations or assignments.

Consider adding validation in check_word_count_and_token_count:

     def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data):
+        if "token_count" not in chunk_data or not isinstance(chunk_data["token_count"], (int, float)):
+            raise ValueError(f"Invalid token count in chunk data: {chunk_data}")
         word_count_fits = word_count_before + chunk_data["word_count"] <= self.max_chunk_size
         token_count_fits = token_count_before + chunk_data["token_count"] <= self.max_tokens
         return word_count_fits and token_count_fits

Also applies to: 35-37

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8bde54 and 458eeac.

📒 Files selected for processing (4)
  • cognee/modules/chunking/TextChunker.py (3 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (1 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: profiler
  • GitHub Check: docker-compose-test
🔇 Additional comments (1)
cognee/modules/data/processing/document_types/Document.py (1)

12-12: LGTM! Clean addition of token count tracking.

The optional token count field is well-defined with appropriate type annotation and default value.
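
The reviewed line itself is not quoted in this thread; based on the comment above, the field presumably looks something like the sketch below. The name, annotation, default, and base class here are assumptions, not copied from the diff:

    from typing import Optional
    from pydantic import BaseModel

    class Document(BaseModel):  # stand-in base class; the real one in cognee is not shown here
        token_count: Optional[int] = None  # assumed shape: optional token count with a default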

Resolved review threads:
  • cognee/tasks/documents/extract_chunks_from_documents.py (outdated)
  • cognee/modules/chunking/TextChunker.py
@alekszievr (Contributor, Author) commented Jan 28, 2025

[image attachment]

@alekszievr force-pushed the feat/cog-1071-input-token-counting branch from 55fff11 to f6663ab on January 28, 2025 16:08
@alekszievr alekszievr changed the base branch from dev to COG-970-refactor-tokenizing January 28, 2025 16:09

coderabbitai bot (Contributor) left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 18

🔭 Outside diff range comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (1)

Line range hint 35-48: Optimize tokenizer initialization.

The tokenizer is recreated for each sentence, which is inefficient. Consider moving the initialization outside the loop:

     embedding_model = vector_engine.embedding_engine.model
     embedding_model = embedding_model.split("/")[-1]
+    tokenizer = tiktoken.encoding_for_model(embedding_model)
 
     for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(
         data, maximum_length=paragraph_length
     ):
         # Check if this sentence would exceed length limit
-        tokenizer = tiktoken.encoding_for_model(embedding_model)
         token_count = len(tokenizer.encode(sentence))
cognee/infrastructure/llm/openai/adapter.py (1)

Line range hint 144-146: Replace hardcoded max_tokens value with instance variable.

The transcribe_image method uses a hardcoded value of 300 for max_tokens instead of using the instance variable self.max_tokens.

        return litellm.completion(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "What's in this image?",
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{encoded_image}",
                            },
                        },
                    ],
                }
            ],
            api_key=self.api_key,
            api_base=self.endpoint,
            api_version=self.api_version,
-           max_tokens=300,
+           max_tokens=self.max_tokens,
            max_retries=5,
        )
♻️ Duplicate comments (1)
cognee/tasks/documents/extract_chunks_from_documents.py (1)

43-43: ⚠️ Potential issue

Add error handling for missing token counts.

While the token counting logic is correct, there's no handling for cases where document_chunk.token_count might be None or missing. This could lead to runtime errors.

-            document_token_count += document_chunk.token_count
+            if hasattr(document_chunk, 'token_count') and document_chunk.token_count is not None:
+                document_token_count += document_chunk.token_count
+            else:
+                logger.warning(f"Missing token count for chunk {document_chunk.id}")
🧹 Nitpick comments (16)
cognee/infrastructure/llm/tokenizer/tokenizer_interface.py (3)

5-7: Add comprehensive docstring for TokenizerInterface.

The interface would benefit from detailed documentation explaining its purpose, contract, and usage examples.

Add this docstring:

 class TokenizerInterface(Protocol):
-    """Tokenizer interface"""
+    """Protocol defining the contract for tokenizer implementations.
+    
+    This interface standardizes how different tokenizer implementations should:
+    - Extract tokens from text
+    - Count tokens in text
+    - Trim text to fit within token limits
+    
+    Implementations should handle different tokenization strategies while
+    conforming to this interface.
+    """

8-10: Add method documentation and consider using a more specific type.

The method lacks documentation and uses a very generic List[Any] return type.

Consider this improvement:

     @abstractmethod
     def extract_tokens(self, text: str) -> List[Any]:
-        raise NotImplementedError
+        """Extract tokens from the input text.
+        
+        Args:
+            text: The input text to tokenize
+            
+        Returns:
+            List of tokens extracted from the text
+        """
+        pass

Also consider creating a Token type (e.g., Token = NewType('Token', Any)) to make the interface more type-safe.


12-18: Add documentation for remaining methods.

The count_tokens and trim_text_to_max_tokens methods need documentation for completeness.

Add docstrings:

     @abstractmethod
     def count_tokens(self, text: str) -> int:
+        """Count the number of tokens in the input text.
+        
+        Args:
+            text: The input text to analyze
+            
+        Returns:
+            Number of tokens in the text
+        """
-        raise NotImplementedError
+        pass

     @abstractmethod
     def trim_text_to_max_tokens(self, text: str) -> str:
+        """Trim the input text to fit within a maximum token limit.
+        
+        Args:
+            text: The input text to trim
+            
+        Returns:
+            Trimmed text that fits within the token limit
+        """
-        raise NotImplementedError
+        pass
cognee/infrastructure/databases/vector/embeddings/config.py (2)

7-8: Decouple provider from model name.

Having the provider prefix in the model name creates tight coupling. Consider separating these concerns.

-    embedding_model: Optional[str] = "openai/text-embedding-3-large"
+    embedding_model: Optional[str] = "text-embedding-3-large"

13-13: Document the magic number for max_tokens.

The value 8191 appears to be a specific limit but lacks documentation explaining its significance.

-    embedding_max_tokens: Optional[int] = 8191
+    # OpenAI's text-embedding-3-large model has a context window of 8191 tokens
+    embedding_max_tokens: Optional[int] = 8191
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (2)

19-21: Consider memory usage for large texts

The extract_tokens method loads all tokens into memory. For very large texts, this could cause memory issues.

Consider implementing a generator version or adding a warning for large texts:

     def extract_tokens(self, text: str) -> List[Any]:
+        if len(text) > 1_000_000:  # 1MB threshold
+            import warnings
+            warnings.warn("Processing large text may consume significant memory")
         tokens = self.tokenizer.tokenize(text)
         return tokens

23-33: Improve docstring consistency and completeness

The docstring for count_tokens could be more detailed and consistent with Python standards.

     def count_tokens(self, text: str) -> int:
         """
-        Returns the number of tokens in the given text.
-        Args:
-            text: str
+        Count the number of tokens in the given text using the HuggingFace tokenizer.
+        
+        Args:
+            text (str): The input text to tokenize
 
-        Returns:
-            number of tokens in the given text
+        Returns:
+            int: The number of tokens in the text
 
+        Raises:
+            ValueError: If the text is empty or None
         """
cognee/infrastructure/llm/config.py (1)

28-28: Maintain consistent naming in dictionary keys.

The dictionary key "max_tokens" differs from the attribute name "llm_max_tokens". Consider using consistent naming to avoid confusion.

-            "max_tokens": self.llm_max_tokens,
+            "llm_max_tokens": self.llm_max_tokens,
cognee/tests/integration/documents/AudioDocument_test.py (1)

Line range hint 13-31: Consider using more realistic audio transcript format.

The mock transcript text uses quotation marks and dialogue formatting which might not match real audio transcription output. Consider using a format that better represents actual transcription results.

Example:

TEST_TEXT = """
Speaker 1: Mike, we need to talk about the payment processing service.
Speaker 2: Good timing. The board wants one-click checkout by end of quarter.
Speaker 1: That's exactly the problem. The service is held together with duct tape. One wrong move and—
# ... rest of the dialogue with speaker labels
"""
cognee/infrastructure/llm/get_llm_client.py (1)

46-50: Consider refactoring repeated GenericAPIAdapter instantiation.

The GenericAPIAdapter instantiation is duplicated for both Ollama and Custom providers with identical parameters. Consider extracting this to a helper method to reduce code duplication.

+def _create_generic_adapter(llm_config, provider_name):
+    return GenericAPIAdapter(
+        llm_config.llm_endpoint,
+        llm_config.llm_api_key,
+        llm_config.llm_model,
+        provider_name,
+        max_tokens=llm_config.llm_max_tokens,
+    )

 elif provider == LLMProvider.OLLAMA:
     if llm_config.llm_api_key is None:
         raise InvalidValueError(message="LLM API key is not set.")
     from .generic_llm_api.adapter import GenericAPIAdapter
-    return GenericAPIAdapter(
-        llm_config.llm_endpoint,
-        llm_config.llm_api_key,
-        llm_config.llm_model,
-        "Ollama",
-        max_tokens=llm_config.llm_max_tokens,
-    )
+    return _create_generic_adapter(llm_config, "Ollama")

 elif provider == LLMProvider.CUSTOM:
     if llm_config.llm_api_key is None:
         raise InvalidValueError(message="LLM API key is not set.")
     from .generic_llm_api.adapter import GenericAPIAdapter
-    return GenericAPIAdapter(
-        llm_config.llm_endpoint,
-        llm_config.llm_api_key,
-        llm_config.llm_model,
-        "Custom",
-        max_tokens=llm_config.llm_max_tokens,
-    )
+    return _create_generic_adapter(llm_config, "Custom")

Also applies to: 65-69

cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (2)

63-68: Improve text trimming to respect word boundaries.

The current implementation might cut words in half. Consider implementing a more sophisticated trimming strategy that respects word boundaries:

-        # This is a simple trim, it may cut words in half; consider using word boundaries for a cleaner cut
-        encoded_text = self.tokenizer.encode(text)
-        trimmed_encoded_text = encoded_text[: self.max_tokens]
-        # Decoding the trimmed text
-        trimmed_text = self.tokenizer.decode(trimmed_encoded_text)
+        # Trim at word boundaries
+        words = text.split()
+        trimmed_words = []
+        current_tokens = 0
+        
+        for word in words:
+            word_tokens = len(self.tokenizer.encode(word + ' '))
+            if current_tokens + word_tokens > self.max_tokens:
+                break
+            trimmed_words.append(word)
+            current_tokens += word_tokens
+        
+        trimmed_text = ' '.join(trimmed_words)

16-16: Document the rationale for max_tokens default value.

The default value of 8191 tokens should be documented to explain its significance (e.g., if it's based on a specific model's context window).

     def __init__(
         self,
         model: str,
-        max_tokens: int = 8191,
+        max_tokens: int = 8191,  # Default based on GPT-4's typical context window
     ):
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)

11-11: LGTM! Good test coverage with realistic token limits.

The test parameterization with values 512, 1024, and 4096 tokens provides good coverage of common use cases.

Consider adding edge cases to max_chunk_tokens_vals:

-max_chunk_tokens_vals = [512, 1024, 4096]
+max_chunk_tokens_vals = [1, 512, 1024, 4096, 8192]

Also applies to: 23-23

cognee/tests/integration/documents/UnstructuredDocument_test.py (1)

Line range hint 71-105: Add assertions for token counts and parameterize the token limit.

While the changes consistently add the max_chunk_tokens parameter across all document types, consider these improvements:

  1. Parameterize the token limit instead of hardcoding:
+MAX_CHUNK_TOKENS = 1024  # Define at module level
...
-        chunk_size=1024, chunker="text_chunker", max_chunk_tokens=1024
+        chunk_size=1024, chunker="text_chunker", max_chunk_tokens=MAX_CHUNK_TOKENS
  2. Add token count assertions for each document type:
     for paragraph_data in pptx_document.read(...):
         assert 19 == paragraph_data.word_count
+        assert paragraph_data.token_count <= MAX_CHUNK_TOKENS
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)

28-28: Consider adding validation for the max_tokens parameter.

While the addition of provider and max_tokens parameters improves flexibility, the max_tokens parameter should be validated to ensure it's within acceptable bounds for the chosen provider.

    def __init__(
        self,
        provider: str = "openai",
        model: Optional[str] = "text-embedding-3-large",
        dimensions: Optional[int] = 3072,
        api_key: str = None,
        endpoint: str = None,
        api_version: str = None,
        max_tokens: int = 512,
    ):
+       if max_tokens <= 0:
+           raise ValueError("max_tokens must be positive")
        self.api_key = api_key
        self.endpoint = endpoint
        self.api_version = api_version
        self.provider = provider
        self.model = model
        self.dimensions = dimensions
        self.max_tokens = max_tokens
        self.tokenizer = self.get_tokenizer()

Also applies to: 34-34, 39-43

cognee/tasks/documents/extract_chunks_from_documents.py (1)

Line range hint 32-36: Update docstring to include max_chunk_tokens parameter.

The docstring should be updated to reflect the new required parameter.

 """
 Extracts chunks of data from a list of documents based on the specified chunking parameters.
 
+Args:
+    documents: List of documents to process
+    max_chunk_tokens: Maximum number of tokens per chunk
+    chunk_size: Size of each chunk in bytes (default: 1024)
+    chunker: Chunking strategy to use (default: "text_chunker")
+
 Notes:
     - The `read` method of the `Document` class must be implemented to support the chunking operation.
     - The `chunker` parameter determines the chunking logic and should align with the document type.
 """
🛑 Comments failed to post (18)
cognee/modules/data/processing/document_types/PdfDocument.py (1)

12-12: ⚠️ Potential issue

Breaking change: Parameter max_chunk_tokens is now required

The parameter type has changed from Optional[int] to int, making it a required parameter. This could break existing callers that don't provide this parameter.

Consider keeping it optional with a sensible default value:

-    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: int):
+    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
cognee/tests/integration/documents/PdfDocument_test.py (1)

28-28: 🛠️ Refactor suggestion

Add test coverage for token counting and document the magic number

A few suggestions to improve the test:

  1. The value 2048 appears to be a magic number. Consider defining it as a constant with a descriptive name.
  2. Add assertions to verify the token count functionality since that's the main change in this PR.

Example improvement:

+# At the top of the file with other constants
+MAX_CHUNK_TOKENS = 2048  # Maximum tokens per chunk for GPT-3 compatibility
+
+# Update ground truth to include token counts
 GROUND_TRUTH = [
-    {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
+    {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end", "token_count": 1024},
     # ... update other entries
 ]

 # In the test
-    GROUND_TRUTH, document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=2048)
+    GROUND_TRUTH, document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=MAX_CHUNK_TOKENS)

 # Add token count assertion
+        assert ground_truth["token_count"] == paragraph_data.token_count, (
+            f'{ground_truth["token_count"] = } != {paragraph_data.token_count = }'
+        )

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/llm/utils.py (1)

5-15: 🛠️ Refactor suggestion

Add error handling and improve documentation.

The function has several areas for improvement:

  1. No error handling for client creation
  2. Magic number in division
  3. Limited documentation

Consider this improved version:

 def get_max_chunk_tokens():
+    """Calculate the maximum allowed tokens for a text chunk.
+    
+    This function determines the maximum chunk size based on two constraints:
+    1. The embedding engine's maximum token limit
+    2. Half of the LLM's maximum context window (to leave room for system
+       prompts and completion tokens)
+    
+    Returns:
+        int: The maximum number of tokens allowed per chunk
+        
+    Raises:
+        RuntimeError: If unable to initialize LLM or embedding clients
+    """
-    # Calculate max chunk size based on the following formula
-    embedding_engine = get_vector_engine().embedding_engine
-    llm_client = get_llm_client()
+    try:
+        embedding_engine = get_vector_engine().embedding_engine
+        llm_client = get_llm_client()
+    except Exception as e:
+        raise RuntimeError(f"Failed to initialize clients: {str(e)}")

-    # We need to make sure chunk size won't take more than half of LLM max context token size
-    # but it also can't be bigger than the embedding engine max token size
-    llm_cutoff_point = llm_client.max_tokens // 2  # Round down the division
+    # Use half of LLM's context to leave room for system prompts and completion
+    llm_cutoff_point = llm_client.max_tokens // 2
     max_chunk_tokens = min(embedding_engine.max_tokens, llm_cutoff_point)

     return max_chunk_tokens
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def get_max_chunk_tokens():
    """Calculate the maximum allowed tokens for a text chunk.
    
    This function determines the maximum chunk size based on two constraints:
    1. The embedding engine's maximum token limit
    2. Half of the LLM's maximum context window (to leave room for system
       prompts and completion tokens)
    
    Returns:
        int: The maximum number of tokens allowed per chunk
        
    Raises:
        RuntimeError: If unable to initialize LLM or embedding clients
    """
    try:
        embedding_engine = get_vector_engine().embedding_engine
        llm_client = get_llm_client()
    except Exception as e:
        raise RuntimeError(f"Failed to initialize clients: {str(e)}")

    # Use half of LLM's context to leave room for system prompts and completion
    llm_cutoff_point = llm_client.max_tokens // 2
    max_chunk_tokens = min(embedding_engine.max_tokens, llm_cutoff_point)

    return max_chunk_tokens
cognee/modules/data/processing/document_types/TextDocument.py (2)

10-10: 🛠️ Refactor suggestion

Add method documentation and consider backward compatibility.

The method lacks documentation and the removal of Optional[int] could break existing code.

Consider this improvement:

-    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: int):
+    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
+        """Read and chunk the text document.
+        
+        Args:
+            chunk_size: The target size for each chunk in characters
+            chunker: The chunking strategy to use
+            max_chunk_tokens: Maximum number of tokens per chunk. If None,
+                            uses the value from get_max_chunk_tokens()
+        
+        Yields:
+            Iterator of text chunks
+        """

Committable suggestion skipped: line range outside the PR's diff.


24-24: 🛠️ Refactor suggestion

Handle optional max_chunk_tokens parameter.

Since max_chunk_tokens should be optional for backward compatibility, add handling for None value.

-            self, chunk_size=chunk_size, get_text=get_text, max_chunk_tokens=max_chunk_tokens
+            self,
+            chunk_size=chunk_size,
+            get_text=get_text,
+            max_chunk_tokens=max_chunk_tokens or get_max_chunk_tokens()

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (2)

35-36: ⚠️ Potential issue

Implement trim_text_to_max_tokens method

The trim_text_to_max_tokens method is not implemented. This could be important for handling texts that exceed token limits.

Here's a suggested implementation:

     def trim_text_to_max_tokens(self, text: str) -> str:
-        raise NotImplementedError
+        tokens = self.tokenizer.tokenize(text)
+        if len(tokens) <= self.max_tokens:
+            return text
+        trimmed_tokens = tokens[:self.max_tokens]
+        return self.tokenizer.convert_tokens_to_string(trimmed_tokens)

Committable suggestion skipped: line range outside the PR's diff.


17-17: ⚠️ Potential issue

Add error handling for tokenizer initialization

The AutoTokenizer.from_pretrained() call could fail if the model doesn't exist or network issues occur. Add appropriate error handling.

-        self.tokenizer = AutoTokenizer.from_pretrained(model)
+        try:
+            self.tokenizer = AutoTokenizer.from_pretrained(model)
+        except Exception as e:
+            raise ValueError(f"Failed to load tokenizer for model {model}: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model)
        except Exception as e:
            raise ValueError(f"Failed to load tokenizer for model {model}: {str(e)}")
cognee/infrastructure/llm/tokenizer/Gemini/adapter.py (3)

26-27: ⚠️ Potential issue

Implement required interface methods.

The extract_tokens and trim_text_to_max_tokens methods are required by the interface but not implemented. This could cause runtime errors.

Consider implementing these methods or providing a clear reason why they're not supported:

def extract_tokens(self, text: str) -> List[Any]:
    """Not supported by Gemini API"""
    raise NotImplementedError("Token extraction is not supported by Gemini API")

def trim_text_to_max_tokens(self, text: str) -> str:
    """Trim text to max_tokens"""
    current_tokens = self.count_tokens(text)
    if current_tokens <= self.max_tokens:
        return text
    
    # Implement binary search or similar algorithm to find the right cut-off point
    raise NotImplementedError("Text trimming is not yet implemented for Gemini API")

Also applies to: 43-44


22-24: 🛠️ Refactor suggestion

Move Google AI import and configuration to class initialization.

The Google AI library import and configuration should be in __init__ to fail fast if the library is missing.

     def __init__(
         self,
         model: str,
         max_tokens: int = 3072,
     ):
         self.model = model
         self.max_tokens = max_tokens
 
         # Get LLM API key from config
         from cognee.infrastructure.databases.vector.embeddings.config import get_embedding_config
         from cognee.infrastructure.llm.config import get_llm_config
 
         config = get_embedding_config()
         llm_config = get_llm_config()
 
+        try:
+            import google.generativeai as genai
+            self.genai = genai
+            genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
+        except ImportError:
+            raise ImportError("google-generativeai package is required for GeminiTokenizer")
-        import google.generativeai as genai
-
-        genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def __init__(
        self,
        model: str,
        max_tokens: int = 3072,
    ):
        self.model = model
        self.max_tokens = max_tokens

        # Get LLM API key from config
        from cognee.infrastructure.databases.vector.embeddings.config import get_embedding_config
        from cognee.infrastructure.llm.config import get_llm_config

        config = get_embedding_config()
        llm_config = get_llm_config()

        try:
            import google.generativeai as genai
            self.genai = genai
            genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
        except ImportError:
            raise ImportError("google-generativeai package is required for GeminiTokenizer")

29-41: 🛠️ Refactor suggestion

Optimize token counting implementation.

Using embed_content for token counting is inefficient as it performs full embedding. Consider using a dedicated tokenization method if available.

Also, add error handling for API calls:

     def count_tokens(self, text: str) -> int:
-        import google.generativeai as genai
-
-        return len(genai.embed_content(model=f"models/{self.model}", content=text))
+        try:
+            return len(self.genai.embed_content(
+                model=f"models/{self.model}",
+                content=text
+            ))
+        except Exception as e:
+            raise ValueError(f"Failed to count tokens: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/data/models/Data.py (1)

23-23: 🛠️ Refactor suggestion

Add default value for token_count column.

The new token_count column needs a default value for existing rows. Also, consider including it in the to_json() method.

-    token_count = Column(Integer)
+    token_count = Column(Integer, default=-1)  # -1 indicates not calculated yet

And update the to_json() method:

     def to_json(self) -> dict:
         return {
             "id": str(self.id),
             "name": self.name,
             "extension": self.extension,
             "mimeType": self.mime_type,
             "rawDataLocation": self.raw_data_location,
+            "tokenCount": self.token_count,
             "createdAt": self.created_at.isoformat(),
             "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    token_count = Column(Integer, default=-1)  # -1 indicates not calculated yet
cognee/infrastructure/llm/generic_llm_api/adapter.py (1)

20-26: ⚠️ Potential issue

Utilize the max_tokens parameter in API calls.

The max_tokens parameter is stored but never used. Consider using it in the acreate_structured_output method to ensure consistent token limits across the system.

Apply this change:

     async def acreate_structured_output(
         self, text_input: str, system_prompt: str, response_model: Type[BaseModel]
     ) -> BaseModel:
         """Generate a response from a user query."""
 
         return await self.aclient.chat.completions.create(
             model=self.model,
+            max_tokens=self.max_tokens,
             messages=[
                 {
                     "role": "user",

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/llm/anthropic/adapter.py (1)

17-22: ⚠️ Potential issue

Replace hardcoded max_tokens value with instance variable.

The constructor accepts max_tokens but the acreate_structured_output method uses a hardcoded value of 4096. This creates an inconsistency with the configured value.

Apply this change:

     async def acreate_structured_output(
         self, text_input: str, system_prompt: str, response_model: Type[BaseModel]
     ) -> BaseModel:
         """Generate a response from a user query."""
 
         return await self.aclient(
             model=self.model,
-            max_tokens=4096,
+            max_tokens=self.max_tokens,
             max_retries=5,

Committable suggestion skipped: line range outside the PR's diff.

cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)

28-30: 🛠️ Refactor suggestion

Optimize token extraction performance.

The current implementation decodes each token individually, which is inefficient. Consider decoding all tokens at once:

-        for token_id in token_ids:
-            token = self.tokenizer.decode([token_id])
-            tokens.append(token)
+        tokens = [self.tokenizer.decode([token_id]) for token_id in token_ids]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        tokens = [self.tokenizer.decode([token_id]) for token_id in token_ids]
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)

117-130: 🛠️ Refactor suggestion

Consider adding error handling for unsupported providers.

The get_tokenizer method should explicitly handle unsupported providers rather than defaulting to HuggingFace.

    def get_tokenizer(self):
        logger.debug(f"Loading tokenizer for model {self.model}...")
        # If model also contains provider information, extract only model information
        model = self.model.split("/")[-1]

        if "openai" in self.provider.lower():
            tokenizer = TikTokenTokenizer(model=model, max_tokens=self.max_tokens)
        elif "gemini" in self.provider.lower():
            tokenizer = GeminiTokenizer(model=model, max_tokens=self.max_tokens)
+       elif "huggingface" in self.provider.lower():
+           tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
        else:
-           tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
+           raise ValueError(f"Unsupported provider: {self.provider}")

        logger.debug(f"Tokenizer loaded for model: {self.model}")
        return tokenizer
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def get_tokenizer(self):
        logger.debug(f"Loading tokenizer for model {self.model}...")
        # If model also contains provider information, extract only model information
        model = self.model.split("/")[-1]

        if "openai" in self.provider.lower():
            tokenizer = TikTokenTokenizer(model=model, max_tokens=self.max_tokens)
        elif "gemini" in self.provider.lower():
            tokenizer = GeminiTokenizer(model=model, max_tokens=self.max_tokens)
        elif "huggingface" in self.provider.lower():
            tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
        else:
            raise ValueError(f"Unsupported provider: {self.provider}")

        logger.debug(f"Tokenizer loaded for model: {self.model}")
        return tokenizer
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

102-103: 🛠️ Refactor suggestion

Add error handling for embedding engine initialization.

The code assumes the embedding engine will always be successfully initialized. Consider adding error handling to gracefully handle initialization failures.

-    embedding_engine = get_vector_engine().embedding_engine
+    try:
+        embedding_engine = get_vector_engine().embedding_engine
+    except Exception as e:
+        logger.error(f"Failed to initialize embedding engine: {e}")
+        raise

Also applies to: 107-107, 111-111


99-101: 🛠️ Refactor suggestion

Move import statement to the top of the file.

Moving the import statement inside the function could impact performance as it will be executed on every function call. Consider moving it back to the top of the file with other imports.

-    # Get embedding engine used in vector database
-    from cognee.infrastructure.databases.vector.get_vector_engine import get_vector_engine

Committable suggestion skipped: line range outside the PR's diff.

.env.template (1)

6-6: ⚠️ Potential issue

Fix typo in the LLM model name.

The model name "openai/gpt-4o-mini" appears to have a typo. OpenAI's model naming doesn't follow this format.

-LLM_MODEL="openai/gpt-4o-mini"
+LLM_MODEL="gpt-4"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

LLM_MODEL="gpt-4"

Base automatically changed from COG-970-refactor-tokenizing to dev January 29, 2025 08:02
@alekszievr alekszievr merged commit edae277 into dev Jan 29, 2025
28 checks passed
@alekszievr alekszievr deleted the feat/cog-1071-input-token-counting branch January 29, 2025 10:29
This was referenced Jan 29, 2025