Count the number of tokens in documents [COG-1071] #476
Conversation
Walkthrough

The changes introduce token counting across the document pipeline: a tokenizer abstraction with TikToken, HuggingFace, and Gemini adapters, a token_count field on documents and chunks, and a max_chunk_tokens limit threaded through the chunkers and document readers.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Doc as Document
    participant Chunker as TextChunker
    participant Extractor as ChunkExtractor
    Doc->>Chunker: read()
    Chunker-->>Extractor: Yield DocumentChunk
    Extractor->>Extractor: Accumulate token_count
    Extractor->>Doc: Set total token_count
```
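As a rough, non-authoritative illustration of this flow, the sketch below shows per-chunk token counts being accumulated into a document total. The class shapes are simplified stand-ins for the real cognee Document, TextChunker, and DocumentChunk types, not the actual implementations.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class DocumentChunk:
    text: str
    token_count: int

@dataclass
class Document:
    chunks: List[DocumentChunk] = field(default_factory=list)
    token_count: int = 0

    def read(self) -> Iterator[DocumentChunk]:
        # Stand-in for TextChunker: yields chunks that already carry their own token_count.
        yield from self.chunks

def extract_chunks_from_document(document: Document) -> List[DocumentChunk]:
    document_token_count = 0
    extracted = []
    for chunk in document.read():
        document_token_count += chunk.token_count  # accumulate per-chunk counts
        extracted.append(chunk)
    document.token_count = document_token_count  # set the total on the document
    return extracted

doc = Document(chunks=[DocumentChunk("hello world", 2), DocumentChunk("more text here", 3)])
extract_chunks_from_document(doc)
assert doc.token_count == 5
```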
Actionable comments posted: 3
🔭 Outside diff range comments (1)
cognee/modules/chunking/TextChunker.py (1)
Line range hint 79-81: Improve error handling for DocumentChunk creation.

The current error handling only prints exceptions but continues execution, which could lead to invalid state or data loss.
Consider proper error handling:
```diff
 try:
     yield DocumentChunk(...)
 except Exception as e:
-    print(e)
+    logger.error(f"Failed to create DocumentChunk: {e}")
+    raise DocumentChunkCreationError(f"Failed to create chunk: {e}") from e
```

Also applies to: 101-103
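Note that DocumentChunkCreationError in the suggestion above is not shown anywhere in this diff; a minimal sketch of such an exception class (the name and placement are assumptions) would be:

```python
class DocumentChunkCreationError(Exception):
    """Raised when a DocumentChunk cannot be constructed from chunk data."""
```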
🧹 Nitpick comments (1)
cognee/modules/chunking/TextChunker.py (1)
Line range hint 23-24: Add validation for token counts.

The code doesn't validate token count values before using them in calculations or assignments.
Consider adding validation in check_word_count_and_token_count:
```diff
 def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data):
+    if "token_count" not in chunk_data or not isinstance(chunk_data["token_count"], (int, float)):
+        raise ValueError(f"Invalid token count in chunk data: {chunk_data}")
     word_count_fits = word_count_before + chunk_data["word_count"] <= self.max_chunk_size
     token_count_fits = token_count_before + chunk_data["token_count"] <= self.max_tokens
     return word_count_fits and token_count_fits
```

Also applies to: 35-37
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/modules/chunking/TextChunker.py (3 hunks)
- cognee/modules/chunking/models/DocumentChunk.py (1 hunks)
- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (1)
cognee/modules/data/processing/document_types/Document.py (1)
12-12: LGTM! Clean addition of token count tracking.

The optional token count field is well-defined with appropriate type annotation and default value.
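For context, an optional token count field on a Pydantic-style model generally looks like the sketch below. The exact field definition in Document.py is assumed here, not quoted from the diff.

```python
from typing import Optional
from pydantic import BaseModel

class Document(BaseModel):
    name: str
    token_count: Optional[int] = None  # populated after chunk extraction
```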
55fff11 to f6663ab
Caution: Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 18
🔭 Outside diff range comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
Line range hint 35-48: Optimize tokenizer initialization.

The tokenizer is recreated for each sentence, which is inefficient. Consider moving the initialization outside the loop:
```diff
 embedding_model = vector_engine.embedding_engine.model
 embedding_model = embedding_model.split("/")[-1]
+tokenizer = tiktoken.encoding_for_model(embedding_model)

 for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(
     data, maximum_length=paragraph_length
 ):
     # Check if this sentence would exceed length limit
-    tokenizer = tiktoken.encoding_for_model(embedding_model)
     token_count = len(tokenizer.encode(sentence))
```

cognee/infrastructure/llm/openai/adapter.py (1)
Line range hint 144-146: Replace hardcoded max_tokens value with instance variable.

The transcribe_image method uses a hardcoded value of 300 for max_tokens instead of using the instance variable self.max_tokens.
```diff
 return litellm.completion(
     model=self.model,
     messages=[
         {
             "role": "user",
             "content": [
                 {
                     "type": "text",
                     "text": "What's in this image?",
                 },
                 {
                     "type": "image_url",
                     "image_url": {
                         "url": f"data:image/jpeg;base64,{encoded_image}",
                     },
                 },
             ],
         }
     ],
     api_key=self.api_key,
     api_base=self.endpoint,
     api_version=self.api_version,
-    max_tokens=300,
+    max_tokens=self.max_tokens,
     max_retries=5,
 )
```
♻️ Duplicate comments (1)
cognee/tasks/documents/extract_chunks_from_documents.py (1)
43-43: ⚠️ Potential issue: Add error handling for missing token counts.
While the token counting logic is correct, there's no handling for cases where document_chunk.token_count might be None or missing. This could lead to runtime errors.
```diff
-document_token_count += document_chunk.token_count
+if hasattr(document_chunk, 'token_count') and document_chunk.token_count is not None:
+    document_token_count += document_chunk.token_count
+else:
+    logger.warning(f"Missing token count for chunk {document_chunk.id}")
```
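A more compact defensive variant (a sketch, not the suggested fix above) folds the check into the accumulation itself and treats a missing or None count as zero:

```python
from types import SimpleNamespace

def add_chunk_tokens(total: int, chunk) -> int:
    # Hypothetical helper: a missing or None token_count contributes nothing.
    return total + (getattr(chunk, "token_count", None) or 0)

assert add_chunk_tokens(10, SimpleNamespace(token_count=5)) == 15
assert add_chunk_tokens(10, SimpleNamespace(token_count=None)) == 10
assert add_chunk_tokens(10, SimpleNamespace()) == 10
```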
🧹 Nitpick comments (16)
cognee/infrastructure/llm/tokenizer/tokenizer_interface.py (3)
5-7: Add comprehensive docstring for TokenizerInterface.

The interface would benefit from detailed documentation explaining its purpose, contract, and usage examples.
Add this docstring:
```diff
 class TokenizerInterface(Protocol):
-    """Tokenizer interface"""
+    """Protocol defining the contract for tokenizer implementations.
+
+    This interface standardizes how different tokenizer implementations should:
+    - Extract tokens from text
+    - Count tokens in text
+    - Trim text to fit within token limits
+
+    Implementations should handle different tokenization strategies while
+    conforming to this interface.
+    """
```
8-10: Add method documentation and consider using a more specific type.

The method lacks documentation and uses a very generic List[Any] return type.
Consider this improvement:
```diff
 @abstractmethod
 def extract_tokens(self, text: str) -> List[Any]:
-    raise NotImplementedError
+    """Extract tokens from the input text.
+
+    Args:
+        text: The input text to tokenize
+
+    Returns:
+        List of tokens extracted from the text
+    """
+    pass
```

Also consider creating a Token type (e.g., `Token = NewType('Token', Any)`) to make the interface more type-safe.
12-18: Add documentation for remaining methods.

The count_tokens and trim_text_to_max_tokens methods need documentation for completeness.
Add docstrings:
```diff
 @abstractmethod
 def count_tokens(self, text: str) -> int:
+    """Count the number of tokens in the input text.
+
+    Args:
+        text: The input text to analyze
+
+    Returns:
+        Number of tokens in the text
+    """
-    raise NotImplementedError
+    pass

 @abstractmethod
 def trim_text_to_max_tokens(self, text: str) -> str:
+    """Trim the input text to fit within a maximum token limit.
+
+    Args:
+        text: The input text to trim
+
+    Returns:
+        Trimmed text that fits within the token limit
+    """
-    raise NotImplementedError
+    pass
```

cognee/infrastructure/databases/vector/embeddings/config.py (2)
7-8: Decouple provider from model name.

Having the provider prefix in the model name creates tight coupling. Consider separating these concerns.
```diff
-embedding_model: Optional[str] = "openai/text-embedding-3-large"
+embedding_model: Optional[str] = "text-embedding-3-large"
```
13-13: Document the magic number for max_tokens.

The value 8191 appears to be a specific limit but lacks documentation explaining its significance.
```diff
-embedding_max_tokens: Optional[int] = 8191
+# OpenAI's text-embedding-3-large model has a context window of 8191 tokens
+embedding_max_tokens: Optional[int] = 8191
```

cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (2)
19-21: Consider memory usage for large texts.

The `extract_tokens` method loads all tokens into memory. For very large texts, this could cause memory issues.

Consider implementing a generator version or adding a warning for large texts:
```diff
 def extract_tokens(self, text: str) -> List[Any]:
+    if len(text) > 1_000_000:  # 1MB threshold
+        import warnings
+        warnings.warn("Processing large text may consume significant memory")
     tokens = self.tokenizer.tokenize(text)
     return tokens
```
23-33: Improve docstring consistency and completeness.

The docstring for `count_tokens` could be more detailed and consistent with Python standards.

```diff
 def count_tokens(self, text: str) -> int:
     """
-    Returns the number of tokens in the given text.
-    Args:
-        text: str
+    Count the number of tokens in the given text using the HuggingFace tokenizer.
+
+    Args:
+        text (str): The input text to tokenize
 
-    Returns:
-        number of tokens in the given text
+    Returns:
+        int: The number of tokens in the text
+    Raises:
+        ValueError: If the text is empty or None
     """
```

cognee/infrastructure/llm/config.py (1)
28-28: Maintain consistent naming in dictionary keys.

The dictionary key "max_tokens" differs from the attribute name "llm_max_tokens". Consider using consistent naming to avoid confusion.
- "max_tokens": self.llm_max_tokens, + "llm_max_tokens": self.llm_max_tokens,cognee/tests/integration/documents/AudioDocument_test.py (1)
Line range hint 13-31: Consider using more realistic audio transcript format.

The mock transcript text uses quotation marks and dialogue formatting which might not match real audio transcription output. Consider using a format that better represents actual transcription results.
Example:
```python
TEST_TEXT = """
Speaker 1: Mike, we need to talk about the payment processing service.
Speaker 2: Good timing. The board wants one-click checkout by end of quarter.
Speaker 1: That's exactly the problem. The service is held together with duct tape. One wrong move and—
# ... rest of the dialogue with speaker labels
"""
```

cognee/infrastructure/llm/get_llm_client.py (1)
46-50: Consider refactoring repeated GenericAPIAdapter instantiation.

The GenericAPIAdapter instantiation is duplicated for both Ollama and Custom providers with identical parameters. Consider extracting this to a helper method to reduce code duplication.
```diff
+def _create_generic_adapter(llm_config, provider_name):
+    return GenericAPIAdapter(
+        llm_config.llm_endpoint,
+        llm_config.llm_api_key,
+        llm_config.llm_model,
+        provider_name,
+        max_tokens=llm_config.llm_max_tokens,
+    )

 elif provider == LLMProvider.OLLAMA:
     if llm_config.llm_api_key is None:
         raise InvalidValueError(message="LLM API key is not set.")

     from .generic_llm_api.adapter import GenericAPIAdapter

-    return GenericAPIAdapter(
-        llm_config.llm_endpoint,
-        llm_config.llm_api_key,
-        llm_config.llm_model,
-        "Ollama",
-        max_tokens=llm_config.llm_max_tokens,
-    )
+    return _create_generic_adapter(llm_config, "Ollama")

 elif provider == LLMProvider.CUSTOM:
     if llm_config.llm_api_key is None:
         raise InvalidValueError(message="LLM API key is not set.")

     from .generic_llm_api.adapter import GenericAPIAdapter

-    return GenericAPIAdapter(
-        llm_config.llm_endpoint,
-        llm_config.llm_api_key,
-        llm_config.llm_model,
-        "Custom",
-        max_tokens=llm_config.llm_max_tokens,
-    )
+    return _create_generic_adapter(llm_config, "Custom")
```

Also applies to: 65-69
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (2)
63-68: Improve text trimming to respect word boundaries.

The current implementation might cut words in half. Consider implementing a more sophisticated trimming strategy that respects word boundaries:
```diff
-# This is a simple trim, it may cut words in half; consider using word boundaries for a cleaner cut
-encoded_text = self.tokenizer.encode(text)
-trimmed_encoded_text = encoded_text[: self.max_tokens]
-# Decoding the trimmed text
-trimmed_text = self.tokenizer.decode(trimmed_encoded_text)
+# Trim at word boundaries
+words = text.split()
+trimmed_words = []
+current_tokens = 0
+
+for word in words:
+    word_tokens = len(self.tokenizer.encode(word + ' '))
+    if current_tokens + word_tokens > self.max_tokens:
+        break
+    trimmed_words.append(word)
+    current_tokens += word_tokens
+
+trimmed_text = ' '.join(trimmed_words)
```
16-16: Document the rationale for max_tokens default value.

The default value of 8191 tokens should be documented to explain its significance (e.g., if it's based on a specific model's context window).
```diff
 def __init__(
     self,
     model: str,
-    max_tokens: int = 8191,
+    max_tokens: int = 8191,  # Default based on GPT-4's typical context window
 ):
```

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
11-11: LGTM! Good test coverage with realistic token limits.

The test parameterization with values 512, 1024, and 4096 tokens provides good coverage of common use cases.
Consider adding edge cases to `max_chunk_tokens_vals`:

```diff
-max_chunk_tokens_vals = [512, 1024, 4096]
+max_chunk_tokens_vals = [1, 512, 1024, 4096, 8192]
```

Also applies to: 23-23
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
Line range hint 71-105: Add assertions for token counts and parameterize the token limit.

While the changes consistently add the `max_chunk_tokens` parameter across all document types, consider these improvements:
- Parameterize the token limit instead of hardcoding:
```diff
+MAX_CHUNK_TOKENS = 1024  # Define at module level
 ...
-    chunk_size=1024, chunker="text_chunker", max_chunk_tokens=1024
+    chunk_size=1024, chunker="text_chunker", max_chunk_tokens=MAX_CHUNK_TOKENS
```
- Add token count assertions for each document type:
```diff
 for paragraph_data in pptx_document.read(...):
     assert 19 == paragraph_data.word_count
+    assert paragraph_data.token_count <= MAX_CHUNK_TOKENS
```
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
28-28: Consider adding validation for the max_tokens parameter.

While the addition of provider and max_tokens parameters improves flexibility, the max_tokens parameter should be validated to ensure it's within acceptable bounds for the chosen provider.
```diff
 def __init__(
     self,
     provider: str = "openai",
     model: Optional[str] = "text-embedding-3-large",
     dimensions: Optional[int] = 3072,
     api_key: str = None,
     endpoint: str = None,
     api_version: str = None,
     max_tokens: int = 512,
 ):
+    if max_tokens <= 0:
+        raise ValueError("max_tokens must be positive")
     self.api_key = api_key
     self.endpoint = endpoint
     self.api_version = api_version
     self.provider = provider
     self.model = model
     self.dimensions = dimensions
     self.max_tokens = max_tokens
     self.tokenizer = self.get_tokenizer()
```

Also applies to: 34-34, 39-43
cognee/tasks/documents/extract_chunks_from_documents.py (1)
Line range hint 32-36: Update docstring to include max_chunk_tokens parameter.

The docstring should be updated to reflect the new required parameter.
""" Extracts chunks of data from a list of documents based on the specified chunking parameters. +Args: + documents: List of documents to process + max_chunk_tokens: Maximum number of tokens per chunk + chunk_size: Size of each chunk in bytes (default: 1024) + chunker: Chunking strategy to use (default: "text_chunker") + Notes: - The `read` method of the `Document` class must be implemented to support the chunking operation. - The `chunker` parameter determines the chunking logic and should align with the document type. """
🛑 Comments failed to post (18)
cognee/modules/data/processing/document_types/PdfDocument.py (1)
12-12: ⚠️ Potential issue: Breaking change: Parameter max_chunk_tokens is now required.

The parameter type has changed from Optional[int] to int, making it a required parameter. This could break existing callers that don't provide this parameter.

Consider keeping it optional with a sensible default value:

```diff
-    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: int):
+    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
```
cognee/tests/integration/documents/PdfDocument_test.py (1)
28-28: 🛠️ Refactor suggestion: Add test coverage for token counting and document magic number.

A few suggestions to improve the test:

- The value 2048 appears to be a magic number. Consider defining it as a constant with a descriptive name.
- Add assertions to verify the token count functionality since that's the main change in this PR.

Example improvement:

```diff
+# At the top of the file with other constants
+MAX_CHUNK_TOKENS = 2048  # Maximum tokens per chunk for GPT-3 compatibility
+
+# Update ground truth to include token counts
 GROUND_TRUTH = [
-    {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
+    {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end", "token_count": 1024},
     # ... update other entries
 ]

 # In the test
-    GROUND_TRUTH, document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=2048)
+    GROUND_TRUTH, document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=MAX_CHUNK_TOKENS)

 # Add token count assertion
+    assert ground_truth["token_count"] == paragraph_data.token_count, (
+        f'{ground_truth["token_count"] = } != {paragraph_data.token_count = }'
+    )
```

Committable suggestion skipped: line range outside the PR's diff.
cognee/infrastructure/llm/utils.py (1)
5-15: 🛠️ Refactor suggestion: Add error handling and improve documentation.

The function has several areas for improvement:

- No error handling for client creation
- Magic number in division
- Limited documentation

Consider this improved version:

```diff
 def get_max_chunk_tokens():
+    """Calculate the maximum allowed tokens for a text chunk.
+
+    This function determines the maximum chunk size based on two constraints:
+    1. The embedding engine's maximum token limit
+    2. Half of the LLM's maximum context window (to leave room for system
+       prompts and completion tokens)
+
+    Returns:
+        int: The maximum number of tokens allowed per chunk
+
+    Raises:
+        RuntimeError: If unable to initialize LLM or embedding clients
+    """
-    # Calculate max chunk size based on the following formula
-    embedding_engine = get_vector_engine().embedding_engine
-    llm_client = get_llm_client()
+    try:
+        embedding_engine = get_vector_engine().embedding_engine
+        llm_client = get_llm_client()
+    except Exception as e:
+        raise RuntimeError(f"Failed to initialize clients: {str(e)}")

-    # We need to make sure chunk size won't take more than half of LLM max context token size
-    # but it also can't be bigger than the embedding engine max token size
-    llm_cutoff_point = llm_client.max_tokens // 2  # Round down the division
+    # Use half of LLM's context to leave room for system prompts and completion
+    llm_cutoff_point = llm_client.max_tokens // 2
     max_chunk_tokens = min(embedding_engine.max_tokens, llm_cutoff_point)

     return max_chunk_tokens
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def get_max_chunk_tokens():
    """Calculate the maximum allowed tokens for a text chunk.

    This function determines the maximum chunk size based on two constraints:
    1. The embedding engine's maximum token limit
    2. Half of the LLM's maximum context window (to leave room for system
       prompts and completion tokens)

    Returns:
        int: The maximum number of tokens allowed per chunk

    Raises:
        RuntimeError: If unable to initialize LLM or embedding clients
    """
    try:
        embedding_engine = get_vector_engine().embedding_engine
        llm_client = get_llm_client()
    except Exception as e:
        raise RuntimeError(f"Failed to initialize clients: {str(e)}")

    # Use half of LLM's context to leave room for system prompts and completion
    llm_cutoff_point = llm_client.max_tokens // 2
    max_chunk_tokens = min(embedding_engine.max_tokens, llm_cutoff_point)

    return max_chunk_tokens
```
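For context, the computed limit would presumably be derived once and then threaded through to each document read. A hypothetical call site (the loop structure and surrounding names are assumptions, not the actual cognee task code):

```python
# Hypothetical call site: compute the limit once, then pass it to every document read.
max_chunk_tokens = get_max_chunk_tokens()

for document in documents:  # `documents` comes from the surrounding task
    for chunk in document.read(chunk_size=1024, chunker="text_chunker", max_chunk_tokens=max_chunk_tokens):
        ...  # downstream processing of each DocumentChunk
```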
cognee/modules/data/processing/document_types/TextDocument.py (2)
10-10: 🛠️ Refactor suggestion: Add method documentation and consider backward compatibility.

The method lacks documentation and the removal of Optional[int] could break existing code.

Consider this improvement:

```diff
-    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: int):
+    def read(self, chunk_size: int, chunker: str, max_chunk_tokens: Optional[int] = None):
+        """Read and chunk the text document.
+
+        Args:
+            chunk_size: The target size for each chunk in characters
+            chunker: The chunking strategy to use
+            max_chunk_tokens: Maximum number of tokens per chunk. If None,
+                uses the value from get_max_chunk_tokens()
+
+        Yields:
+            Iterator of text chunks
+        """
```

Committable suggestion skipped: line range outside the PR's diff.

24-24: 🛠️ Refactor suggestion: Handle optional max_chunk_tokens parameter.

Since max_chunk_tokens should be optional for backward compatibility, add handling for None value.

```diff
-    self, chunk_size=chunk_size, get_text=get_text, max_chunk_tokens=max_chunk_tokens
+    self,
+    chunk_size=chunk_size,
+    get_text=get_text,
+    max_chunk_tokens=max_chunk_tokens or get_max_chunk_tokens()
```

Committable suggestion skipped: line range outside the PR's diff.
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (2)
35-36: ⚠️ Potential issue: Implement trim_text_to_max_tokens method.

The trim_text_to_max_tokens method is not implemented. This could be important for handling texts that exceed token limits.

Here's a suggested implementation:

```diff
 def trim_text_to_max_tokens(self, text: str) -> str:
-    raise NotImplementedError
+    tokens = self.tokenizer.tokenize(text)
+    if len(tokens) <= self.max_tokens:
+        return text
+    trimmed_tokens = tokens[:self.max_tokens]
+    return self.tokenizer.convert_tokens_to_string(trimmed_tokens)
```

Committable suggestion skipped: line range outside the PR's diff.

17-17: ⚠️ Potential issue: Add error handling for tokenizer initialization.

The AutoTokenizer.from_pretrained() call could fail if the model doesn't exist or network issues occur. Add appropriate error handling.

```diff
-    self.tokenizer = AutoTokenizer.from_pretrained(model)
+    try:
+        self.tokenizer = AutoTokenizer.from_pretrained(model)
+    except Exception as e:
+        raise ValueError(f"Failed to load tokenizer for model {model}: {str(e)}")
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
try:
    self.tokenizer = AutoTokenizer.from_pretrained(model)
except Exception as e:
    raise ValueError(f"Failed to load tokenizer for model {model}: {str(e)}")
```
cognee/infrastructure/llm/tokenizer/Gemini/adapter.py (3)
26-27: ⚠️ Potential issue: Implement required interface methods.

The extract_tokens and trim_text_to_max_tokens methods are required by the interface but not implemented. This could cause runtime errors.

Consider implementing these methods or providing a clear reason why they're not supported:

```python
def extract_tokens(self, text: str) -> List[Any]:
    """Not supported by Gemini API"""
    raise NotImplementedError("Token extraction is not supported by Gemini API")

def trim_text_to_max_tokens(self, text: str) -> str:
    """Trim text to max_tokens"""
    current_tokens = self.count_tokens(text)
    if current_tokens <= self.max_tokens:
        return text
    # Implement binary search or similar algorithm to find the right cut-off point
    raise NotImplementedError("Text trimming is not yet implemented for Gemini API")
```

Also applies to: 43-44

22-24: 🛠️ Refactor suggestion: Move Google AI import and configuration to class initialization.

The Google AI library import and configuration should be in __init__ to fail fast if the library is missing.

```diff
 def __init__(
     self,
     model: str,
     max_tokens: int = 3072,
 ):
     self.model = model
     self.max_tokens = max_tokens

     # Get LLM API key from config
     from cognee.infrastructure.databases.vector.embeddings.config import get_embedding_config
     from cognee.infrastructure.llm.config import get_llm_config

     config = get_embedding_config()
     llm_config = get_llm_config()

+    try:
+        import google.generativeai as genai
+        self.genai = genai
+        genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
+    except ImportError:
+        raise ImportError("google-generativeai package is required for GeminiTokenizer")
-    import google.generativeai as genai
-
-    genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def __init__(
    self,
    model: str,
    max_tokens: int = 3072,
):
    self.model = model
    self.max_tokens = max_tokens

    # Get LLM API key from config
    from cognee.infrastructure.databases.vector.embeddings.config import get_embedding_config
    from cognee.infrastructure.llm.config import get_llm_config

    config = get_embedding_config()
    llm_config = get_llm_config()

    try:
        import google.generativeai as genai
        self.genai = genai
        genai.configure(api_key=config.embedding_api_key or llm_config.llm_api_key)
    except ImportError:
        raise ImportError("google-generativeai package is required for GeminiTokenizer")
```

29-41: 🛠️ Refactor suggestion: Optimize token counting implementation.

Using embed_content for token counting is inefficient as it performs full embedding. Consider using a dedicated tokenization method if available.

Also, add error handling for API calls:

```diff
 def count_tokens(self, text: str) -> int:
-    import google.generativeai as genai
-
-    return len(genai.embed_content(model=f"models/{self.model}", content=text))
+    try:
+        return len(self.genai.embed_content(
+            model=f"models/{self.model}",
+            content=text
+        ))
+    except Exception as e:
+        raise ValueError(f"Failed to count tokens: {str(e)}")
```

Committable suggestion skipped: line range outside the PR's diff.
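If the installed google-generativeai version exposes a dedicated counting endpoint, the embedding round trip could be skipped entirely. A sketch under that assumption (the availability and exact response shape depend on the SDK version in use):

```python
import google.generativeai as genai

def count_tokens_via_api(model: str, text: str) -> int:
    # count_tokens tokenizes server-side without computing an embedding.
    response = genai.GenerativeModel(f"models/{model}").count_tokens(text)
    return response.total_tokens
```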
cognee/modules/data/models/Data.py (1)
23-23: 🛠️ Refactor suggestion: Add default value for token_count column.

The new token_count column needs a default value for existing rows. Also, consider including it in the to_json() method.

```diff
-    token_count = Column(Integer)
+    token_count = Column(Integer, default=-1)  # -1 indicates not calculated yet
```

And update the to_json() method:

```diff
 def to_json(self) -> dict:
     return {
         "id": str(self.id),
         "name": self.name,
         "extension": self.extension,
         "mimeType": self.mime_type,
         "rawDataLocation": self.raw_data_location,
+        "tokenCount": self.token_count,
         "createdAt": self.created_at.isoformat(),
         "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
     }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
token_count = Column(Integer, default=-1)  # -1 indicates not calculated yet
```
cognee/infrastructure/llm/generic_llm_api/adapter.py (1)
20-26: ⚠️ Potential issue: Utilize the max_tokens parameter in API calls.

The max_tokens parameter is stored but never used. Consider using it in the acreate_structured_output method to ensure consistent token limits across the system.

Apply this change:

```diff
 async def acreate_structured_output(
     self, text_input: str, system_prompt: str, response_model: Type[BaseModel]
 ) -> BaseModel:
     """Generate a response from a user query."""

     return await self.aclient.chat.completions.create(
         model=self.model,
+        max_tokens=self.max_tokens,
         messages=[
             {
                 "role": "user",
```

Committable suggestion skipped: line range outside the PR's diff.
cognee/infrastructure/llm/anthropic/adapter.py (1)
17-22: ⚠️ Potential issue: Replace hardcoded max_tokens value with instance variable.

The constructor accepts max_tokens but the acreate_structured_output method uses a hardcoded value of 4096. This creates an inconsistency with the configured value.

Apply this change:

```diff
 async def acreate_structured_output(
     self, text_input: str, system_prompt: str, response_model: Type[BaseModel]
 ) -> BaseModel:
     """Generate a response from a user query."""

     return await self.aclient(
         model=self.model,
-        max_tokens=4096,
+        max_tokens=self.max_tokens,
         max_retries=5,
```

Committable suggestion skipped: line range outside the PR's diff.
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)
28-30: 🛠️ Refactor suggestion: Optimize token extraction performance.

The current implementation decodes each token individually, which is inefficient. Consider decoding all tokens at once:

```diff
-for token_id in token_ids:
-    token = self.tokenizer.decode([token_id])
-    tokens.append(token)
+tokens = [self.tokenizer.decode([token_id]) for token_id in token_ids]
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
tokens = [self.tokenizer.decode([token_id]) for token_id in token_ids]
```
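If per-token strings are needed, tiktoken also provides a batch helper that avoids the per-token decode loop. A sketch, assuming the installed tiktoken version exposes decode_tokens_bytes:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("Count the number of tokens in documents")

# decode_tokens_bytes maps every token id to its raw byte sequence in a single call.
token_bytes = encoding.decode_tokens_bytes(token_ids)
tokens = [b.decode("utf-8", errors="replace") for b in token_bytes]
```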
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
117-130: 🛠️ Refactor suggestion: Consider adding error handling for unsupported providers.

The get_tokenizer method should explicitly handle unsupported providers rather than defaulting to HuggingFace.

```diff
 def get_tokenizer(self):
     logger.debug(f"Loading tokenizer for model {self.model}...")
     # If model also contains provider information, extract only model information
     model = self.model.split("/")[-1]

     if "openai" in self.provider.lower():
         tokenizer = TikTokenTokenizer(model=model, max_tokens=self.max_tokens)
     elif "gemini" in self.provider.lower():
         tokenizer = GeminiTokenizer(model=model, max_tokens=self.max_tokens)
+    elif "huggingface" in self.provider.lower():
+        tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
     else:
-        tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
+        raise ValueError(f"Unsupported provider: {self.provider}")

     logger.debug(f"Tokenizer loaded for model: {self.model}")
     return tokenizer
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def get_tokenizer(self):
    logger.debug(f"Loading tokenizer for model {self.model}...")
    # If model also contains provider information, extract only model information
    model = self.model.split("/")[-1]

    if "openai" in self.provider.lower():
        tokenizer = TikTokenTokenizer(model=model, max_tokens=self.max_tokens)
    elif "gemini" in self.provider.lower():
        tokenizer = GeminiTokenizer(model=model, max_tokens=self.max_tokens)
    elif "huggingface" in self.provider.lower():
        tokenizer = HuggingFaceTokenizer(model=self.model, max_tokens=self.max_tokens)
    else:
        raise ValueError(f"Unsupported provider: {self.provider}")

    logger.debug(f"Tokenizer loaded for model: {self.model}")
    return tokenizer
```
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
102-103: 🛠️ Refactor suggestion: Add error handling for embedding engine initialization.

The code assumes the embedding engine will always be successfully initialized. Consider adding error handling to gracefully handle initialization failures.

```diff
-embedding_engine = get_vector_engine().embedding_engine
+try:
+    embedding_engine = get_vector_engine().embedding_engine
+except Exception as e:
+    logger.error(f"Failed to initialize embedding engine: {e}")
+    raise
```

Also applies to: 107-107, 111-111

99-101: 🛠️ Refactor suggestion: Move import statement to the top of the file.

Moving the import statement inside the function could impact performance as it will be executed on every function call. Consider moving it back to the top of the file with other imports.

```diff
-# Get embedding engine used in vector database
-from cognee.infrastructure.databases.vector.get_vector_engine import get_vector_engine
```

Committable suggestion skipped: line range outside the PR's diff.
.env.template (1)
6-6: ⚠️ Potential issue: Fix typo in the LLM model name.

The model name "openai/gpt-4o-mini" appears to have a typo. OpenAI's model naming doesn't follow this format.

```diff
-LLM_MODEL="openai/gpt-4o-mini"
+LLM_MODEL="gpt-4"
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```
LLM_MODEL="gpt-4"
```
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Improvements