-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Code graph pipeline improvements and fixes #414
Conversation
WalkthroughThe changes introduce a comprehensive enhancement to token management across multiple modules in the Cognee project. The primary modification involves adding an optional Changes
Poem
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (5)
cognee/modules/data/processing/document_types/PdfDocument.py (1)
Fix inconsistent spacing around
=
operators in parameter assignmentsIn
chunker_func
call, maintain consistent spacing around=
operators:
chunk_size = chunk_size, get_text = get_text
→ Remove spaces- Or
embedding_model=embedding_model, max_tokens=max_tokens
→ Add spaces🔗 Analysis chain
Line range hint
1-21
: Fix formatting issues according to Ruff standardsThe file requires formatting according to the pipeline failures. Run the Ruff formatter to fix spacing around operators and other style issues.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check specific formatting issues in the file rg -A 1 "chunk_size = " "cognee/modules/data/processing/document_types/PdfDocument.py"Length of output: 226
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py (2)
Line range hint
71-81
: Improve exception handlingThe bare except and print statements should be replaced with proper error handling.
- except Exception as e: - print(e) + except Exception as e: + logger.error(f"Failed to create DocumentChunk: {e}") + raise🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
Fix formatting issues in TextChunker.py
The verification confirms formatting issues that need to be addressed:
- Several lines exceed the 88-character limit (e.g.,
__init__
method signature)- Inconsistent spacing around operators (e.g.,
id = uuid5
,self.chunk_size += chunk_data["word_count"]
)- Inconsistent spacing after commas in parameter lists
🔗 Analysis chain
Line range hint
1-100
: Fix formatting issuesThe file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Fix formatting issues using Ruff ruff format cognee/modules/chunking/TextChunker.pyLength of output: 147
Script:
#!/bin/bash # Check maximum line length and show lines exceeding 88 characters awk 'length($0) > 88 { print length($0) ": " $0 }' cognee/modules/chunking/TextChunker.py # Check for inconsistent spacing around operators and in parameter lists grep -n "=[^ ]\|[^ ]=" cognee/modules/chunking/TextChunker.py grep -n ", [^ ]" cognee/modules/chunking/TextChunker.py # Check indentation consistency grep -n "^[ ]\{1,\}[^ ]" cognee/modules/chunking/TextChunker.pyLength of output: 6870
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
Fix formatting issues in method signature and TextChunker instantiation
- Line 13: Break down the long method signature into multiple lines
- Line 32: Remove extra spaces around
=
in keyword arguments and consider line wrapping- Example fix for line 32:
chunker = TextChunker( self, chunk_size=chunk_size, get_text=get_text, embedding_model=embedding_model, max_tokens=max_tokens )🔗 Analysis chain
Line range hint
1-35
: Fix formatting issuesThe file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Fix formatting issues using Ruff ruff format cognee/modules/data/processing/document_types/UnstructuredDocument.pyLength of output: 178
Script:
#!/bin/bash # Check for inconsistent indentation ast-grep --pattern 'class $_($$$) { $$$ }' # Check import organization cat cognee/modules/data/processing/document_types/UnstructuredDocument.py | grep -n "^import\|^from" # Check line lengths awk 'length > 100 { print NR, length, $0 }' cognee/modules/data/processing/document_types/UnstructuredDocument.pyLength of output: 760
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/repo_processor/get_non_code_files.py (1)
Fix formatting issues: trailing whitespaces and long lines
The following formatting issues need to be addressed:
- Remove trailing whitespaces from lines 29-43 and 52
- Wrap lines that exceed 79 characters (lines 48, 53, and 60)
🔗 Analysis chain
Line range hint
1-70
: Fix formatting issuesThe file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Fix formatting issues using Ruff ruff format cognee/tasks/repo_processor/get_non_code_files.pyLength of output: 158
Script:
#!/bin/bash # Check for basic formatting issues echo "=== Checking line lengths ===" awk 'length > 79 { print NR ":", $0 }' cognee/tasks/repo_processor/get_non_code_files.py echo -e "\n=== Checking for trailing whitespaces ===" grep -n '[[:space:]]$' cognee/tasks/repo_processor/get_non_code_files.py echo -e "\n=== Checking for tabs ===" grep -n $'\t' cognee/tasks/repo_processor/get_non_code_files.py echo -e "\n=== Checking indentation consistency ===" sed -n '/^[[:space:]]\+[^[:space:]]/p' cognee/tasks/repo_processor/get_non_code_files.py | uniq -w4 | wc -lLength of output: 1944
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
🧹 Nitpick comments (10)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
29-34
: Move tokenizer initialization outside the loop for efficiencyThe tokenizer is initialized within the loop, which may lead to unnecessary re-initializations and impact performance. Consider initializing the tokenizer once before the loop if
embedding_model
is provided.Apply this diff to move the tokenizer initialization:
+ tokenizer = None for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length): # Check if this sentence would exceed length limit - if embedding_model: - tokenizer = tiktoken.encoding_for_model(embedding_model) + if embedding_model and tokenizer is None: + tokenizer = tiktoken.encoding_for_model(embedding_model) if embedding_model: token_count = len(tokenizer.encode(sentence)) else: token_count = 0🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/documents/extract_chunks_from_documents.py (2)
1-2
: ImportOptional
from the correct moduleThe
Optional
type is imported but not used correctly. Ensure thatOptional
is imported fromtyping
, and all type annotations use the correct types.Apply this diff to adjust the import:
-from typing import Optional +from typing import Optional, ListAnd update the type hint in line 7:
async def extract_chunks_from_documents( - documents: list[Document], + documents: List[Document],🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
6-11
: Adjust default values and type hints for new parametersThe new parameters
embedding_model
andmax_tokens
should have consistent type hints and default values matching those in other files.Apply this diff to adjust the type hints and default values:
async def extract_chunks_from_documents( documents: List[Document], chunk_size: int = 1024, chunker='text_chunker', - embedding_model: Optional[str] = None, - max_tokens: Optional[int] = None, + embedding_model: Optional[str] = None, + max_tokens: Optional[Union[int, float]] = None, ):🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py (1)
Line range hint
12-16
: Consider making the buffer size configurableThe fixed buffer size of 1024 bytes might not be optimal for all cases. Consider making it configurable or adjusting based on the file size.
- text = file.read(1024) + text = file.read(self.buffer_size)Add to class initialization:
def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.buffer_size = 4096 # Default buffer size, can be configured🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/ImageDocument.py (1)
22-22
: Consider using a regular function instead of lambdaThe lambda function could be replaced with a more explicit generator function for better readability and maintainability.
+ def _get_text(self, text): + yield text + def read(self, chunk_size: int, chunker: str, embedding_model:Optional[str], max_tokens: Optional[int]): text = self.transcribe_image() chunker_func = ChunkerConfig.get_chunker(chunker) - chunker = chunker_func(self, chunk_size = chunk_size, get_text = lambda: [text], embedding_model=embedding_model, max_tokens=max_tokens) + chunker = chunker_func(self, chunk_size=chunk_size, get_text=lambda: self._get_text(text), embedding_model=embedding_model, max_tokens=max_tokens)🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
examples/python/code_graph_example.py (1)
58-65
: Use more precise time measurementFor more accurate timing, consider using
time.perf_counter()
instead oftime.time()
. Also, consider adding memory usage metrics.if args.time: import time - start_time = time.time() + start_time = time.perf_counter() asyncio.run(main(args.repo_path, args.include_docs)) - end_time = time.time() + end_time = time.perf_counter() print("\n" + "="*50) print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds") + import psutil + process = psutil.Process() + print(f"Memory Usage: {process.memory_info().rss / 1024 / 1024:.2f} MB") print("="*50 + "\n")cognee/api/v1/cognify/code_graph_pipeline.py (1)
74-74
: Consider making max_tokens configurableThe max_tokens value is hardcoded to 8192. Consider moving this to configuration to make it more maintainable and flexible.
- Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=8192), + Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=cognee_config.max_tokens),🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py (1)
17-22
: Add type hints for better code clarityConsider adding type hints for better code maintainability and IDE support.
- def __init__(self, document, get_text: callable, embedding_model: Optional[str] = None, max_tokens: Optional[int] = None, chunk_size: int = 1024): + def __init__(self, document: Document, get_text: Callable[[], Iterator[str]], embedding_model: Optional[str] = None, max_tokens: Optional[int] = None, chunk_size: int = 1024) -> None:🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
Line range hint
13-31
: Improve file reading efficiencyThe current implementation reads the file in fixed 1024-byte chunks, which might split words. Consider using a line-based approach for text files.
while True: - text = in_memory_file.read(1024) + text = in_memory_file.readline() if len(text.strip()) == 0: break🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/repo_processor/get_non_code_files.py (1)
28-44
: Remove duplicate extensions and improve organizationThere are several duplicate extensions in the ALLOWED_EXTENSIONS set. Also, consider organizing extensions by type for better maintainability.
ALLOWED_EXTENSIONS = { - '.txt', '.md', '.csv', '.json', '.xml', '.yaml', '.yml', '.html', - '.css', '.js', '.ts', '.jsx', '.tsx', '.sql', '.log', '.ini', - '.toml', '.properties', '.sh', '.bash', '.dockerfile', '.gitignore', - '.gitattributes', '.makefile', '.pyproject', '.requirements', - '.env', '.pdf', '.doc', '.docx', '.dot', '.dotx', '.rtf', - '.wps', '.wpd', '.odt', '.ott', '.ottx', '.txt', '.wp', - '.sdw', '.sdx', '.docm', '.dotm', - # Additional extensions for other programming languages - '.java', '.c', '.cpp', '.h', '.cs', '.go', '.php', '.rb', - '.swift', '.pl', '.lua', '.rs', '.scala', '.kt', '.sh', - '.sql', '.v', '.asm', '.pas', '.d', '.ml', '.clj', '.cljs', - '.erl', '.ex', '.exs', '.f', '.fs', '.r', '.pyi', - '.pdb', '.ipynb', '.rmd', '.cabal', '.hs', '.nim', - '.vhdl', '.verilog', '.svelte', '.html', '.css', '.scss', - '.less', '.json5', '.yaml', '.yml' + # Documentation and text files + '.txt', '.md', '.log', '.rtf', + # Configuration files + '.json', '.json5', '.yaml', '.yml', '.toml', '.ini', '.properties', '.env', + # Web files + '.html', '.css', '.scss', '.less', '.js', '.ts', '.jsx', '.tsx', '.svelte', + # Document files + '.pdf', '.doc', '.docx', '.dot', '.dotx', '.odt', '.ott', '.ottx', '.wp', + '.wps', '.wpd', '.sdw', '.sdx', '.docm', '.dotm', + # Programming languages + '.java', '.c', '.cpp', '.h', '.cs', '.go', '.php', '.rb', '.swift', + '.pl', '.lua', '.rs', '.scala', '.kt', '.sql', '.v', '.asm', '.pas', + '.d', '.ml', '.clj', '.cljs', '.erl', '.ex', '.exs', '.f', '.fs', + '.r', '.pyi', '.hs', '.nim', '.vhdl', '.verilog', + # Notebooks and special files + '.ipynb', '.rmd', '.cabal', + # Build and system files + '.sh', '.bash', '.dockerfile', '.gitignore', '.gitattributes', + '.makefile', '.pyproject', '.requirements' }🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cognee/api/v1/cognify/code_graph_pipeline.py
(1 hunks)cognee/modules/chunking/TextChunker.py
(2 hunks)cognee/modules/data/processing/document_types/Document.py
(2 hunks)cognee/modules/data/processing/document_types/ImageDocument.py
(2 hunks)cognee/modules/data/processing/document_types/PdfDocument.py
(2 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(2 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(2 hunks)cognee/tasks/chunks/chunk_by_paragraph.py
(5 hunks)cognee/tasks/documents/extract_chunks_from_documents.py
(1 hunks)cognee/tasks/repo_processor/get_non_code_files.py
(1 hunks)examples/python/code_graph_example.py
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/tasks/repo_processor/get_non_code_files.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/api/v1/cognify/code_graph_pipeline.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/ImageDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/Document.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/PdfDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
🪛 GitHub Actions: test | multimedia example
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 14-14: TypeError: AudioDocument.read() method does not accept 'embedding_model' as a parameter
🪛 GitHub Actions: test | multimedia notebook
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 14-14: TypeError: AudioDocument.read() method does not accept 'embedding_model' as a parameter
🔇 Additional comments (6)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
29-35
: 🛠️ Refactor suggestionHandle cases where
embedding_model
isNone
butmax_tokens
is providedIf
max_tokens
is provided without anembedding_model
, the token count will always be zero, rendering themax_tokens
parameter ineffective. Consider raising an error or defaulting to a standard tokenizer ifembedding_model
isNone
butmax_tokens
is specified.Do you want me to implement a default tokenizer or raise an error when
embedding_model
isNone
butmax_tokens
is provided?🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/PdfDocument.py (1)
12-12
: LGTM! Method signature update is consistent with interface changesThe addition of
embedding_model
andmax_tokens
parameters aligns with the PR objectives for improving the code graph pipeline. The implementation correctly passes these parameters to the chunker.Also applies to: 21-21
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py (1)
23-23
: Fix formatting issues with operator spacingRemove spaces around
=
operator in keyword arguments according to Ruff standards.✅ Verification successful
Confirmed: Remove spaces around
=
in keyword argumentsThe line shows inconsistent formatting where some keyword arguments have spaces around
=
while others don't. According to Ruff standards, there should be no spaces around=
in keyword arguments.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check specific formatting issues in the file rg -A 1 "chunk_size = " "cognee/modules/data/processing/document_types/TextDocument.py"Length of output: 227
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
examples/python/code_graph_example.py (2)
11-43
: LGTM! Well-structured argument parsingThe separation of argument parsing into a dedicated function improves code organization and maintainability. The type conversion using lambda functions is a clean approach.
26-30
: Review default values for mock flagsSetting mock flags to
True
by default might not be suitable for production use. Consider setting these toFalse
by default to prevent accidental use of mocks in production.- default=True, + default=False,Also applies to: 32-36
cognee/api/v1/cognify/code_graph_pipeline.py (1)
Line range hint
1-85
: Fix formatting issuesThe file requires formatting according to Ruff standards.
Run the following command to fix formatting:
✅ Verification successful
Fix formatting issues in code_graph_pipeline.py
The review comment is correct. The file has several formatting issues:
- Replace backslash line continuations with parentheses (lines 6, 17, 22)
- Remove trailing whitespace (line 82)
- Fix long lines exceeding line length limit (lines 45, 47, 63, 74, 75)
- Organize imports according to PEP 8
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Fix formatting issues using Ruff ruff format cognee/api/v1/cognify/code_graph_pipeline.pyLength of output: 153
Script:
#!/bin/bash # Check for Python style configuration files fd -e toml -e cfg -e ini -e yapf . -x cat {} # Check import formatting and line lengths rg -n "^.*\\\\.*$|^.{89,}$|^import\s+|^from\s+" "cognee/api/v1/cognify/code_graph_pipeline.py" # Look for inconsistent spacing rg -n "=\s{2,}|\s{2,}=|\s+$" "cognee/api/v1/cognify/code_graph_pipeline.py"Length of output: 14027
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
@@ -13,5 +14,5 @@ class Document(DataPoint): | |||
"type": "Document" | |||
} | |||
|
|||
def read(self, chunk_size: int, chunker = str) -> str: | |||
def read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update default values for new parameters in read
method
The new parameters embedding_model
and max_tokens
in the read
method lack default values, which may cause issues in subclasses that do not override these parameters. Set default values to None
to maintain compatibility.
Apply this diff to set default values:
def read(
self,
chunk_size: int,
- embedding_model: Optional[str],
- max_tokens: Optional[int],
+ embedding_model: Optional[str] = None,
+ max_tokens: Optional[int] = None,
chunker=str,
) -> str:
pass
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str: | |
def read( | |
self, | |
chunk_size: int, | |
embedding_model: Optional[str] = None, | |
max_tokens: Optional[int] = None, | |
chunker=str | |
) -> str: |
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
💡 Codebase verification
Update missing read methods in ImageDocument and AudioDocument
The following classes need to implement the read method with the updated signature read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str
:
cognee/modules/data/processing/document_types/ImageDocument.py
cognee/modules/data/processing/document_types/AudioDocument.py
🔗 Analysis chain
Ensure all subclasses override the updated read
method
The read
method signature has changed. Ensure that all subclasses of Document
, such as AudioDocument
, VideoDocument
, etc., override the read
method with the new parameters to prevent TypeError
.
The pipeline failure indicates that AudioDocument.read()
does not accept embedding_model
as a parameter. Please update all subclasses accordingly.
Run the following script to identify classes that need updating:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all subclasses of Document that need to update the `read` method.
# Find files containing classes that inherit from Document but do not have updated `read` methods
fd --type f --extension py '.*' | xargs grep -l 'class .*Document(Document):' | xargs grep -L 'def read(self, chunk_size'
# Alternatively, use ast-grep to find method definitions
ast-grep --pattern 'class $CLASS_NAME(Document): {$$$ def read(self, chunk_size: int, $$$): $$$ }' --ignore 'test_*'
Length of output: 490
Script:
#!/bin/bash
# Find Document subclasses
echo "=== Document subclasses ==="
ast-grep --pattern 'class $NAME(Document)'
echo -e "\n=== Read method implementations ==="
# Find read method implementations
ast-grep --pattern 'def read(self, $$$)'
echo -e "\n=== Backup search with ripgrep ==="
# Backup search with ripgrep
rg "class.*\(Document\)" -A 5
Length of output: 3351
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now it doesn't work with our Azure LLM settings
parser.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use environmental variables for the parameters instead of arguments. Also, overwriting environmental variables is not a good practice and can later cause misunderstanding.
We already use envs, and I think they are more appropriate/easier to handle in a production environment as we don't always have to structure the arguments + we don't mix envs and args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@borisarzentar What do you think?
|
||
if args.mock_embedding: | ||
os.environ["MOCK_EMBEDDING"] = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I wouldn't do. It can cause misunderstanding later, we should either use envs or args.
|
||
if args.mock_code_summary: | ||
os.environ["MOCK_CODE_SUMMARY"] = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
Fix trailing whitespaces and long lines
The code has formatting issues that need to be addressed:
- Remove trailing whitespaces from function parameters and empty lines
- Break down long lines (>100 characters):
- The
chunk_by_sentence
call- The condition check combining
word_count
andtoken_count
🔗 Analysis chain
Line range hint
1-96
: Fix formatting issues to pass pipeline checks.The code requires formatting according to Ruff standards. Run
ruff format
to fix the formatting issues.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify formatting issues ruff format --diff cognee/tasks/chunks/chunk_by_paragraph.pyLength of output: 157
Script:
#!/bin/bash # Check for tabs and trailing whitespaces echo "=== Checking for tabs ===" grep -P '\t' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No tabs found" echo -e "\n=== Checking for trailing whitespaces ===" grep -P ' $' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No trailing whitespaces found" echo -e "\n=== Checking for lines longer than 100 characters ===" awk 'length($0) > 100' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No long lines found" echo -e "\n=== Checking for inconsistent indentation ===" grep -P '^( *)[^ ]' cognee/tasks/chunks/chunk_by_paragraph.py | grep -Pv '^( {4})*[^ ]' || echo "No inconsistent indentation found"Length of output: 1260
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
♻️ Duplicate comments (1)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
39-39
:⚠️ Potential issueEnsure
max_tokens
is notNone
before comparison.This is a duplicate of a previous review comment. The condition needs to be fixed to prevent potential
TypeError
.- if current_word_count > 0 and (current_word_count + word_count > paragraph_length or current_token_count + token_count > max_tokens): + if current_word_count > 0 and ( + current_word_count + word_count > paragraph_length or + (max_tokens is not None and current_token_count + token_count > max_tokens) + ):🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
🧹 Nitpick comments (3)
cognee/modules/data/processing/document_types/AudioDocument.py (1)
22-22
: Improve code readability with proper formattingThe line is quite long and could benefit from better formatting. Consider breaking it into multiple lines for better readability.
- chunker = chunker_func(self, chunk_size = chunk_size, get_text = lambda: [text], embedding_model=embedding_model, max_tokens=max_tokens) + chunker = chunker_func( + self, + chunk_size=chunk_size, + get_text=lambda: [text], + embedding_model=embedding_model, + max_tokens=max_tokens + )🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py (2)
31-37
: Consider caching the tokenizer instance.Creating a new tokenizer for each sentence could impact performance. Consider creating it once at the beginning of the function when
embedding_model
is provided.def chunk_by_paragraph(...): current_token_count = 0 if not max_tokens: max_tokens = float("inf") + tokenizer = None + if embedding_model: + if embedding_model.startswith("azure/"): + embedding_model = embedding_model.split("/")[-1] + tokenizer = tiktoken.encoding_for_model(embedding_model) for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(...): - if embedding_model: - if embedding_model.startswith("azure/"): - embedding_model = embedding_model.split("/")[-1] - tokenizer = tiktoken.encoding_for_model(embedding_model) - token_count = len(tokenizer.encode(sentence)) - else: - token_count = 0 + token_count = len(tokenizer.encode(sentence)) if tokenizer else 0🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
32-33
: Make Azure model name extraction more robust.The current Azure model name extraction assumes a simple path format. Consider adding validation or using a more robust parsing method.
- if embedding_model.startswith("azure/"): - embedding_model = embedding_model.split("/")[-1] + if embedding_model.startswith("azure/"): + parts = embedding_model.split("/") + if len(parts) < 2: + raise ValueError("Invalid Azure model format. Expected 'azure/model_name'") + embedding_model = parts[-1]🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/modules/data/processing/document_types/AudioDocument.py
(2 hunks)cognee/tasks/chunks/chunk_by_paragraph.py
(5 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/data/processing/document_types/AudioDocument.py
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
🔇 Additional comments (6)
cognee/modules/data/processing/document_types/AudioDocument.py (3)
1-1
: LGTM! Clean import additionThe Optional type import is appropriately added to support the new type hints.
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
Line range hint
1-24
: Format code using ruffThe CI pipeline indicates formatting issues. Please run
ruff format
to fix formatting issues across the codebase.🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
16-16
: Verify Document interface implementation across all document typesThe read method signature has been updated with new optional parameters. Let's verify this change is consistent across all document types implementing the Document interface.
✅ Verification successful
Interface implementation verified across all document types
All document types (TextDocument, PdfDocument, ImageDocument, AudioDocument) implement the read method with consistent signatures, including the new optional parameters.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for all Document implementations and their read method signatures ast-grep --pattern 'class $CLASSNAME(Document): $$$ def read($$$): $$$'Length of output: 7077
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py (3)
1-15
: LGTM! Well-structured function signature with proper type hints.The updated imports and function signature are well-organized and properly typed. The new parameters
embedding_model
andmax_tokens
are correctly marked as optional.🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
25-27
: LGTM! Good initialization of token counting.The initialization of
current_token_count
and the default handling ofmax_tokens
is well-implemented.🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
Line range hint
41-96
: LGTM! Consistent token count tracking in chunk dictionaries.The token count field is consistently added to all chunk dictionary creations throughout the function.
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
@@ -71,7 +71,7 @@ async def run_code_graph_pipeline(repo_path, include_docs=True): | |||
Task(ingest_data_with_metadata, dataset_name="repo_docs", user=user), | |||
Task(get_data_list_for_user, dataset_name="repo_docs", user=user), | |||
Task(classify_documents), | |||
Task(extract_chunks_from_documents), | |||
Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=8192), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the embedding_engine shouldn't change in code and should only be changed through environment configuration I think we should remove it as a parameter to functions and instead get it dynamically with the getter when needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/python/code_graph_example.py (2)
23-24
: Refactor duplicate type conversion logic.The boolean conversion lambda function is repeated across multiple arguments. Consider extracting it into a helper function for better maintainability.
+def str_to_bool(value: str) -> bool: + return value.lower() in ("true", "1") + def parse_args(): parser = argparse.ArgumentParser() # ... other arguments ... parser.add_argument( "--include_docs", - type=lambda x: x.lower() in ("true", "1"), + type=str_to_bool, default=True, help="Whether or not to process non-code files" )Also applies to: 29-30, 35-36, 41-42
62-69
: Refactor timing logic into a reusable component.The timing logic could be encapsulated in a decorator or context manager for better reusability and cleaner code.
from contextlib import contextmanager import time @contextmanager def timer(description: str): start_time = time.time() yield end_time = time.time() duration = end_time - start_time separator = "=" * 50 print(f"\n{separator}") print(f"{description}: {duration:.2f} seconds") print(f"{separator}\n") # Usage in main: if args.time: with timer("Pipeline Execution Time"): asyncio.run(main(args.repo_path, args.include_docs)) else: asyncio.run(main(args.repo_path, args.include_docs))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/python/code_graph_example.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: profiler
🔇 Additional comments (4)
examples/python/code_graph_example.py (4)
1-11
: LGTM! Clean imports and well-structured main function.The imports are organized logically, and the main function follows good async/await patterns.
13-45
: Consider using environment variables consistently instead of command-line arguments.Previous reviews have indicated a preference for using environment variables in production environments. Consider refactoring to use environment variables exclusively to avoid mixing configuration methods.
29-31
: Review default values for mock flags.The mock flags default to
True
, which might not be suitable for production use. Consider defaulting toFalse
to prevent accidental mocking in production environments.Also applies to: 35-37
54-60
: Avoid overwriting environment variables at runtime.Setting environment variables at runtime based on command-line arguments is not recommended as it:
- Can cause side effects in other parts of the application
- Makes the configuration flow harder to understand
- May lead to race conditions in concurrent environments
Consider reading these values from environment variables at startup instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/python/code_graph_example.py (2)
40-42
: Move the time import to the top of the file.Conditional imports can make code harder to maintain and understand. It's better to import all required modules at the top of the file.
import argparse import asyncio import logging +import time from cognee.api.v1.cognify.code_graph_pipeline import run_code_graph_pipeline from cognee.shared.utils import setup_logging
40-47
: Consider extracting timing logic into a decorator or utility function.The timing logic could be made more reusable by extracting it into a decorator or utility function. This would make it easier to time other functions in the future and reduce code duplication.
def time_execution(func): async def wrapper(*args, **kwargs): start_time = time.time() result = await func(*args, **kwargs) end_time = time.time() print("\n" + "="*50) print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds") print("="*50 + "\n") return result return wrapper
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/python/code_graph_example.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: profiler
🔇 Additional comments (2)
examples/python/code_graph_example.py (2)
3-11
: LGTM! Clean imports and well-structured main function.The addition of logging import and the async main function implementation look good.
13-33
: Consider using environment variables instead of command-line arguments.Based on the team's established practices and previous discussions, environment variables are preferred over command-line arguments in production environments. This makes configuration more consistent and easier to manage.
Consider refactoring to use environment variables:
-def parse_args(): - parser = argparse.ArgumentParser() - parser.add_argument( - "--repo_path", - type=str, - required=True, - help="Path to the repository" - ) - parser.add_argument( - "--include_docs", - type=lambda x: x.lower() in ("true", "1"), - default=True, - help="Whether or not to process non-code files" - ) - parser.add_argument( - "--time", - type=lambda x: x.lower() in ("true", "1"), - default=True, - help="Whether or not to time the pipeline run" - ) - return parser.parse_args() +def get_config(): + return { + 'repo_path': os.environ['REPO_PATH'], + 'include_docs': os.getenv('INCLUDE_DOCS', 'true').lower() in ('true', '1'), + 'time': os.getenv('TIME_PIPELINE', 'true').lower() in ('true', '1') + }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Fix formatting issues in the file
The following formatting issues were found:
- Multiple lines exceed the 88-character limit (e.g., line 81:
_get_subchunk_token_counts(tokenizer, subchunk, max_subchunk_tokens, depth=depth + 1, max_depth=max_depth)
)- Several instances of multiple consecutive blank lines (e.g., lines 13-14, 17-18, 37-38)
Run a formatter like
black
orruff format
to automatically fix these issues.🔗 Analysis chain
Line range hint
1-167
: Fix formatting issues reported by Ruff.The CI pipeline reports that this file requires formatting. Please run
ruff format
to fix the formatting issues.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify formatting issues in the file ruff format --diff cognee/tasks/repo_processor/get_source_code_chunks.pyLength of output: 169
Script:
#!/bin/bash # Check maximum line length and show lines exceeding 88 characters (default black limit) echo "Lines exceeding 88 characters:" awk 'length > 88 { print NR ":" length ":" $0 }' cognee/tasks/repo_processor/get_source_code_chunks.py # Check for inconsistent indentation (not divisible by 4) echo -e "\nLines with indentation not divisible by 4:" awk '/^[[:space:]]/ { if (length($0) - length(ltrim($0)) % 4 != 0) print NR ":" $0 }' cognee/tasks/repo_processor/get_source_code_chunks.py # Show import statements order echo -e "\nImport statements:" grep -n "^import\|^from" cognee/tasks/repo_processor/get_source_code_chunks.py # Check for multiple consecutive blank lines echo -e "\nMultiple consecutive blank lines:" awk '/^$/ { blank++ } !/^$/ { if (blank > 1) printf "Lines %d-%d: %d consecutive blank lines\n", NR-blank, NR-1, blank; blank=0 }' cognee/tasks/repo_processor/get_source_code_chunks.pyLength of output: 2733
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (1)
Line range hint
71-78
: Improve error handling in chunk generation.The current error handling simply prints the exception and continues. Consider:
- Using specific exception types instead of generic Exception
- Proper logging instead of print statements
- Deciding whether to continue or raise based on error severity
- except Exception as e: - print(e) + except ValueError as e: + logger.error(f"Failed to generate chunk: {e}") + raise🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
🧹 Nitpick comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)
125-127
: Add error handling for model name extraction.While the dynamic model name extraction is a good improvement, consider adding error handling for cases where the model name format might be unexpected.
vector_engine = get_vector_engine() embedding_model = vector_engine.embedding_engine.model -model_name = embedding_model.split("/")[-1] +try: + model_name = embedding_model.split("/")[-1] + tokenizer = tiktoken.encoding_for_model(model_name) +except (IndexError, ValueError) as e: + logger.error(f"Failed to extract or validate model name from {embedding_model}: {e}") + raise ValueError(f"Invalid embedding model format: {embedding_model}") -tokenizer = tiktoken.encoding_for_model(model_name)🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
147-147
: Remove or update the outdated TODO comment.The TODO comment about adding support for other embedding models seems outdated since the code now dynamically handles models through the vector engine.
- # TODO: Add support for other embedding models, with max_token mapping
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
Line range hint
162-166
: Enhance error logging with more context.The error handling is good, but the error messages could be more specific to help with debugging.
- logger.error(f"Error processing code part: {e}") + logger.error(f"Error processing code part {code_part.id}: {str(e)}", exc_info=True) - logger.error(f"Error processing data point: {e}") + logger.error(f"Error processing data point {data_point.id}: {str(e)}", exc_info=True)🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (2)
1-15
: Consider organizing class attributes using type annotations.The class attributes could be better organized using Python's class variable syntax with type annotations for better code maintainability and type safety.
class TextChunker(): - document = None - max_chunk_size: int - chunk_index = 0 - chunk_size = 0 - token_count = 0 + document: Optional[Any] = None + max_chunk_size: int = 0 + chunk_index: int = 0 + chunk_size: int = 0 + token_count: int = 0🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
23-26
: Add type hints to improve code clarity.Consider adding type hints to the method parameters and return value for better code maintainability and IDE support.
- def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data): + def check_word_count_and_token_count( + self, + word_count_before: int, + token_count_before: int, + chunk_data: dict[str, Any] + ) -> bool:🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cognee/api/v1/cognify/code_graph_pipeline.py
(2 hunks)cognee/modules/chunking/TextChunker.py
(2 hunks)cognee/modules/data/processing/document_types/AudioDocument.py
(2 hunks)cognee/modules/data/processing/document_types/Document.py
(2 hunks)cognee/modules/data/processing/document_types/ImageDocument.py
(2 hunks)cognee/modules/data/processing/document_types/PdfDocument.py
(2 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(2 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(2 hunks)cognee/tasks/chunks/chunk_by_paragraph.py
(5 hunks)cognee/tasks/documents/extract_chunks_from_documents.py
(1 hunks)cognee/tasks/repo_processor/get_source_code_chunks.py
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- cognee/modules/data/processing/document_types/Document.py
- cognee/tasks/documents/extract_chunks_from_documents.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/api/v1/cognify/code_graph_pipeline.py
- cognee/tasks/chunks/chunk_by_paragraph.py
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/PdfDocument.py
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/chunking/TextChunker.py
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/tasks/repo_processor/get_source_code_chunks.py
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: profiler
🔇 Additional comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
8-8
: LGTM! Good refactoring to use vector engine.The addition of
get_vector_engine
import supports better encapsulation of model handling.🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (3)
17-21
: LGTM! Well-structured constructor changes.The addition of the optional
max_tokens
parameter with infinity as default is a good approach for backward compatibility.🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
32-33
: LGTM! Token counting implementation looks good.The integration of token counting with the existing word count logic is well implemented.
Also applies to: 37-40
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
Line range hint
1-97
: Run ruff format to fix formatting issues.The pipeline indicates that this file (among others) requires formatting with Ruff.
Run the following to verify and fix formatting issues:
✅ Verification successful
Run ruff format to fix formatting issues.
The project uses ruff for formatting (configured in pyproject.toml), and the file shows inconsistent formatting patterns that need to be fixed:
- Inconsistent class definition style
- Inconsistent spacing around operators and in function arguments
- Potential line length issues
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Show formatting issues in this file ruff format --diff cognee/modules/chunking/TextChunker.pyLength of output: 154
Script:
#!/bin/bash # Check spacing around operators and function arguments rg -U "(\w+\s*=\s*\w+)" --type py cognee/modules/chunking/TextChunker.py # Check class structure and spacing ast-grep --pattern 'class $_ { $$$ }' cognee/modules/chunking/TextChunker.py # Look for style guide or configuration fd "\.style" --type f fd "\.ruff" --type f fd "setup.cfg" --type fLength of output: 2134
Script:
#!/bin/bash # Look for Python project configuration fd "pyproject.toml" --type f # Check spacing patterns in other Python files rg -U "class.*:[\n\s]*(\w+\s*=\s*\w+)" --type py # Check for common style patterns rg -U "def.*:.*\n\s*" --type py cognee/modules/chunking/Length of output: 3294
Script:
#!/bin/bash # Check project configuration for style settings cat pyproject.tomlLength of output: 3814
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
130-132
: Add error handling for model name extraction.While centralizing the model configuration is good, the model name extraction could be more robust. The current implementation assumes a specific format and might fail if the model name doesn't contain a "/".
Consider adding error handling:
vector_engine = get_vector_engine() embedding_model = vector_engine.embedding_engine.model - model_name = embedding_model.split("/")[-1] + try: + model_name = embedding_model.split("/")[-1] + except (AttributeError, IndexError) as e: + logger.error(f"Invalid embedding model format: {embedding_model}") + model_name = "text-embedding-3-large" # fallback to default
157-157
: Update the TODO comment to reflect the new design.The TODO comment about adding support for other embedding models should be updated since the embedding model handling has been centralized through the vector engine.
Consider updating the TODO comment to:
- # TODO: Add support for other embedding models, with max_token mapping + # TODO: Add max_token mapping for different embedding models in vector engine configuration
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
cognee/api/v1/cognify/code_graph_pipeline.py
(2 hunks)cognee/modules/chunking/TextChunker.py
(2 hunks)cognee/modules/data/processing/document_types/AudioDocument.py
(2 hunks)cognee/modules/data/processing/document_types/Document.py
(2 hunks)cognee/modules/data/processing/document_types/ImageDocument.py
(2 hunks)cognee/modules/data/processing/document_types/PdfDocument.py
(2 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(2 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(2 hunks)cognee/tasks/chunks/chunk_by_paragraph.py
(5 hunks)cognee/tasks/documents/extract_chunks_from_documents.py
(1 hunks)cognee/tasks/repo_processor/get_non_code_files.py
(1 hunks)cognee/tasks/repo_processor/get_source_code_chunks.py
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- cognee/tasks/documents/extract_chunks_from_documents.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/chunking/TextChunker.py
- cognee/modules/data/processing/document_types/PdfDocument.py
- cognee/api/v1/cognify/code_graph_pipeline.py
- cognee/tasks/chunks/chunk_by_paragraph.py
- cognee/modules/data/processing/document_types/Document.py
- cognee/tasks/repo_processor/get_non_code_files.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)
8-8
: LGTM: Import addition aligns with centralized vector engine approach.The addition of
get_vector_engine
import supports the refactoring to use a centralized vector engine configuration.
172-172
: LGTM: Function call correctly updated.The call to
get_source_code_chunks_from_code_part
has been properly updated to align with the new signature, and appropriate error handling is maintained.
130-132
: Verify all callers are updated for the new signatures.Let's ensure all callers have been updated to work with the new function signatures that no longer accept the embedding model parameter.
Also applies to: 157-157, 172-172
✅ Verification successful
All callers have been properly updated
The changes are consistent across the codebase with no remaining references to the old function signatures. The only model references found are in the embedding infrastructure files where they belong.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for any remaining direct references to embedding models or old function signatures # Check for any remaining hardcoded embedding model references rg -l "text-embedding-[0-9]-large" --type py # Look for potential calls to get_source_code_chunks with embedding_model parameter ast-grep --pattern 'get_source_code_chunks($$$, embedding_model=$_)' # Look for potential calls to get_source_code_chunks_from_code_part with model_name parameter ast-grep --pattern 'get_source_code_chunks_from_code_part($$$, model_name=$_)'Length of output: 388
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Add test coverage for token-based chunking
The implementation includes token-based chunking features that aren't covered by tests:
- No test cases for the
max_tokens
parameter- No verification of the
token_count
field in chunk output- Missing tests for interaction between
max_tokens
andparagraph_length
limitsSuggested test additions:
- Test chunks respect token limits when
max_tokens
is set- Verify correct
token_count
values in chunk output- Test chunking behavior when both token and paragraph length limits are active
🔗 Analysis chain
Line range hint
1-51
: Verify ground truth values and test coverage.Based on the retrieved learning from PR #205, when changes are made to the chunking implementation, the ground truth values in the tests need to be updated accordingly. Additionally, consider adding test cases for:
- The new
max_tokens
parameter mentioned in the AI summary- Edge cases where token limits affect paragraph chunking
Let's verify the test coverage:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check test coverage for chunk_by_paragraph rg -A 5 "def test.*chunk_by_paragraph" cognee/tests/unit/processing/chunks/Length of output: 1557
Script:
#!/bin/bash # Check chunk_by_paragraph implementation and max_tokens parameter ast-grep --pattern 'def chunk_by_paragraph($$$)' # Search for max_tokens in test files rg "max_tokens" cognee/tests/unit/processing/chunks/Length of output: 103
Script:
#!/bin/bash # Find implementation files fd "chunks.py" --type f # Search for chunk_by_paragraph in all Python files rg -t py "def chunk_by_paragraph" -A 3 # Look for additional test files fd "chunk.*test.py" --type fLength of output: 825
Script:
#!/bin/bash # Check content of the other test file cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py # Get full implementation to understand max_tokens usage cat cognee/tasks/chunks/chunk_by_paragraph.pyLength of output: 5970
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/tasks/chunks/chunk_by_paragraph.py
(5 hunks)cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py
(2 hunks)cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tasks/chunks/chunk_by_paragraph.py
🧰 Additional context used
📓 Learnings (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
52-54
: LGTM! Improved readability with keyword arguments.The conversion from positional to keyword arguments enhances code clarity and maintainability.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
49-51
: LGTM! Consistent with the previous change.The modification to use keyword arguments is consistent with the previous change and improves code readability.
for chunk_data in chunk_by_paragraph( | ||
data=test_text, paragraph_length=12, batch_paragraphs=False | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
The max_tokens
parameter needs test coverage
The implementation shows that max_tokens
is a critical parameter affecting chunk boundaries and output metadata, but it lacks test coverage. Neither test file includes tests for:
- Chunks respecting the token limit
- Token count accuracy in output
- Edge cases with different token limits
- Interaction between
max_tokens
andparagraph_length
🔗 Analysis chain
Verify test coverage for the new max_tokens
parameter.
The chunk_by_paragraph
function has been updated to include an optional max_tokens
parameter, but it's not being tested here.
Let's verify the implementation and test coverage:
Consider adding test cases to verify the behavior with different max_tokens
values. I can help generate these test cases if needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check implementation and test coverage of max_tokens parameter
# Check the implementation
echo "Checking chunk_by_paragraph implementation:"
rg -A 5 "def chunk_by_paragraph" cognee/tasks/chunks/
# Check for existing tests of max_tokens
echo "Checking for max_tokens tests:"
rg -l "max_tokens.*chunk_by_paragraph" cognee/tests/
Length of output: 721
Script:
#!/bin/bash
# Get the complete implementation of chunk_by_paragraph
echo "Complete chunk_by_paragraph implementation:"
cat cognee/tasks/chunks/chunk_by_paragraph.py
# Find and show all test files related to chunk_by_paragraph
echo -e "\nAll chunk_by_paragraph test files:"
fd "test.*py" cognee/tests | xargs rg -l "chunk_by_paragraph"
# Show content of the main test file
echo -e "\nContent of the main test file:"
cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
Length of output: 6419
Script:
#!/bin/bash
# Check the content of the second test file
echo "Content of the second test file:"
cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py
Length of output: 2277
chunks = list( | ||
chunk_by_paragraph( | ||
data=input_text, paragraph_length=paragraph_length, batch_paragraphs=batch_paragraphs | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Add test coverage for the max_tokens parameter
The implementation in cognee/tasks/chunks/chunk_by_paragraph.py
includes an optional max_tokens
parameter, but there are no tests covering this functionality. Consider adding test cases to verify the behavior when:
max_tokens
is set to a specific valuemax_tokens
is None (default case)max_tokens
is set to different numeric types (int/float)
🔗 Analysis chain
LGTM! Consider adding test coverage for max_tokens parameter.
The change to use keyword arguments improves code readability. However, based on the AI summary, the chunk_by_paragraph
function now supports a max_tokens
parameter that isn't covered by these tests.
Let's verify the implementation of max_tokens
parameter:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if max_tokens parameter exists in the implementation
ast-grep --pattern 'def chunk_by_paragraph($params) {
$$$
}'
Length of output: 64
Script:
#!/bin/bash
# Search for chunk_by_paragraph function definition with correct Python syntax
ast-grep --pattern 'def chunk_by_paragraph($$$):
$$$'
# Fallback to ripgrep for function definition
rg "def chunk_by_paragraph" -A 5
# Check imports
rg "from.*chunk_by_paragraph.*import|import.*chunk_by_paragraph"
Length of output: 1022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
Line range hint
1-102
: Restructure tests for better organization and isolation.The current test structure combines all document type tests into a single function. Consider:
- Using a test class with proper setup/teardown
- Separating tests by document type
- Adding parameterized tests for different configurations
- Implementing proper test isolation
Example restructure:
import pytest from unittest import TestCase class TestUnstructuredDocument(TestCase): def setUp(self): # Setup test files and document instances self.setup_test_files() def setup_test_files(self): # Initialize test file paths and document instances pass def test_pptx_document(self): # PPTX specific tests pass def test_docx_document(self): # DOCX specific tests pass @pytest.mark.parametrize("chunk_size", [512, 1024, 2048]) def test_csv_document_with_different_chunks(self, chunk_size): # CSV tests with different chunk sizes pass def test_xlsx_document(self): # Excel specific tests pass
🧹 Nitpick comments (3)
cognee/tests/integration/documents/UnstructuredDocument_test.py (3)
71-77
: Enhance test coverage for PPTX document processing.The test could be improved in several ways:
- Add test cases for the new
max_tokens
parameter- Verify the total number of chunks processed
- Test edge cases with different chunk sizes
- Consider using test data constants instead of hard-coded values
Example improvement:
def test_pptx_document_read(): # Test with different max_tokens values chunks = list(pptx_document.read(chunk_size=1024, chunker="text_chunker", max_tokens=100)) assert len(chunks) > 0, "Expected at least one chunk" # Test without max_tokens chunks_no_limit = list(pptx_document.read(chunk_size=1024, chunker="text_chunker")) assert len(chunks_no_limit) >= len(chunks), "Expected more chunks without token limit" # Verify first chunk properties first_chunk = chunks[0] assert first_chunk.word_count == 19 assert len(first_chunk.text) == 104 assert first_chunk.cut_type == "sentence_cut"
Line range hint
87-95
: Improve CSV test data and coverage.The current test uses artificial data ("A A A A...") which may not represent real-world CSV scenarios. Consider:
- Using more realistic test data with actual column headers and varied content
- Verifying CSV-specific properties (column separation, handling of quotes, etc.)
- Testing with the new
max_tokens
parameter- Testing with different chunk sizes to verify consistent behavior
Example improvement:
def test_csv_document_read_with_realistic_data(): # Test with realistic CSV content csv_content = "Name,Age,City\nJohn Doe,30,New York\nJane Smith,25,Los Angeles" for chunk in csv_document.read(chunk_size=1024, chunker="text_chunker"): assert "," in chunk.text, "Expected CSV delimiter in chunk" assert chunk.text.count(",") <= 2, "Expected maximum 2 columns per row"
Line range hint
97-102
: Add Excel-specific test scenarios.The current test treats Excel files similarly to other document types. Consider adding tests for Excel-specific features:
- Multiple worksheet handling
- Formula handling
- Different cell types (text, numbers, dates)
- Empty cell handling
Example improvement:
def test_xlsx_document_read_comprehensive(): # Test with max_tokens chunks = list(xlsx_document.read(chunk_size=1024, chunker="text_chunker", max_tokens=100)) # Verify Excel-specific features found_sheets = set() for chunk in chunks: if "Sheet" in chunk.text: # Assuming sheet names are preserved in text found_sheets.add(chunk.text.split("Sheet")[1].split()[0]) assert len(found_sheets) > 0, "Expected content from at least one sheet"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cognee/modules/data/processing/document_types/AudioDocument.py
(2 hunks)cognee/modules/data/processing/document_types/Document.py
(2 hunks)cognee/modules/data/processing/document_types/ImageDocument.py
(2 hunks)cognee/modules/data/processing/document_types/PdfDocument.py
(2 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(2 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(2 hunks)cognee/tests/integration/documents/UnstructuredDocument_test.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/modules/data/processing/document_types/Document.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (4)
cognee/modules/data/processing/document_types/PdfDocument.py (4)
1-1
: LGTM! Clean import addition.The Optional type import follows Python typing best practices.
12-12
: LGTM! Well-structured method signature update.The read method signature is properly updated with:
- Type-safe Optional[int] parameter
- Sensible None default value
- Backward compatibility maintained
21-23
: LGTM! Clean parameter passing.The chunker_func call correctly propagates the max_tokens parameter while maintaining the existing structure.
Let's verify that other document types have consistent implementations:
✅ Verification successful
Consistent chunker_func implementation verified across document types
All document types (Text, Image, PDF, Audio) implement chunker_func calls consistently with identical parameter handling, including the max_tokens parameter.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for consistency in read method signatures across document types # Expected: All document types should have similar max_tokens parameter handling # Search for read method definitions in document type classes ast-grep --pattern 'class $_Document(Document): $$$ def read(self, chunk_size: int, chunker: str$_) { $$$ } $$$'Length of output: 126
Script:
#!/bin/bash # First, find all Python files that might contain document type implementations echo "=== Document Type Classes ===" ast-grep --pattern 'class $_Document(Document):' echo -e "\n=== Read Methods ===" # Look for read method implementations ast-grep --pattern 'def read(self, $_)' echo -e "\n=== Chunker Usage ===" # Look for chunker_func usage rg "chunker_func\(" -A 2Length of output: 1568
Line range hint
12-27
: Verify error handling for max_tokens.While the implementation looks good, we should ensure proper error handling when max_tokens is exceeded.
Let's check if there's consistent error handling across the codebase:
for paragraph_data in docx_document.read(chunk_size=1024, chunker="text_chunker"): | ||
assert 16 == paragraph_data.word_count, f" 16 != {paragraph_data.word_count = }" | ||
assert 145 == len(paragraph_data.text), f" 145 != {len(paragraph_data.text) = }" | ||
assert ( | ||
"sentence_end" == paragraph_data.cut_type | ||
), f" sentence_end != {paragraph_data.cut_type = }" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Inconsistent cut type values between document types
The codebase shows that DOCX uses "sentence_end" while PPTX uses "sentence_cut", but the chunking implementation in chunk_by_paragraph.py
suggests that "sentence_cut" should be used when cutting at word boundaries. This inconsistency appears to be a bug.
cognee/tasks/chunks/chunk_by_paragraph.py
: Uses "sentence_cut" as the standard cut typecognee/tests/integration/documents/UnstructuredDocument_test.py
: Expects "sentence_end" for DOCXcognee/tests/integration/documents/PowerPointDocument_test.py
: Expects "sentence_cut" for PPTX
🔗 Analysis chain
Document and verify the different cut types.
The test shows that DOCX uses "sentence_end" while PPTX uses "sentence_cut", but this difference isn't documented or verified across different scenarios.
Let's check if this behavior is documented:
Consider:
- Adding docstrings explaining the different cut types
- Testing with different chunk sizes to verify cut type consistency
- Adding test cases for the
max_tokens
parameter
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for documentation about cut types
rg -i "sentence_(end|cut)" --type md
rg -i "cut_type.*=.*['\"]sentence_(end|cut)['\"]" --type py -A 2 -B 2
Length of output: 114
Script:
#!/bin/bash
# Search for all cut_type assignments and chunker implementations
rg "cut_type.*=.*['\"].*['\"]" --type py -A 2 -B 2
rg "class.*Chunker" --type py -A 5
rg "def.*chunk" --type py -A 5
# Search for test files related to chunking
fd ".*chunk.*test.*\.py$"
Length of output: 31805
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/tasks/repo_processor/expand_dependency_graph.py (1)
9-11
: Standardize import spacingAdd a single blank line between the third-party imports and the standard library imports to follow PEP 8 guidelines.
from cognee.tasks.repo_processor.extract_code_parts import extract_code_parts - import logging
cognee/tasks/repo_processor/get_local_dependencies.py (1)
Line range hint
82-84
: Consider adding context to file operation errorsFor better debugging, consider including the file encoding being used when logging IO errors.
except IOError as e: - logger.error(f"Error opening {script_path}: {e}") + logger.error(f"Error opening {script_path} with default encoding: {e}") return []
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cognee/tasks/repo_processor/__init__.py
(0 hunks)cognee/tasks/repo_processor/expand_dependency_graph.py
(1 hunks)cognee/tasks/repo_processor/extract_code_parts.py
(1 hunks)cognee/tasks/repo_processor/get_local_dependencies.py
(1 hunks)
💤 Files with no reviewable changes (1)
- cognee/tasks/repo_processor/init.py
🧰 Additional context used
🪛 GitHub Actions: test | profiling
cognee/tasks/repo_processor/extract_code_parts.py
[error] Circular import detected: Cannot import 'logger' from 'cognee.tasks.repo_processor' due to circular dependency
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/tasks/repo_processor/extract_code_parts.py (1)
Line range hint
4-11
: Good fix for the circular import issue!The change from importing
logger
fromcognee.tasks.repo_processor
to using Python's standard logging module resolves the circular dependency issue indicated in the pipeline failure. This is a more maintainable approach that follows Python's logging best practices.🧰 Tools
🪛 GitHub Actions: test | profiling
[error] Circular import detected: Cannot import 'logger' from 'cognee.tasks.repo_processor' due to circular dependency
cognee/tasks/repo_processor/expand_dependency_graph.py (1)
11-11
: Well-structured logging with appropriate log levels!Good use of different logging levels:
- DEBUG for non-critical information (no code parts to add)
- WARNING for potential issues (empty code parts)
- ERROR for exceptions
Also applies to: 19-21, 54-56
cognee/tasks/repo_processor/get_local_dependencies.py (1)
13-15
: Excellent error handling with appropriate logging levels!The error handling has been significantly improved with proper logging:
- ERROR level for critical failures (file operations, code analysis)
- WARNING level for non-critical issues (path resolution)
Also applies to: 63-65, 82-84, 121-124, 132-135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/modules/cognify/config.py (1)
Line range hint
14-19
: Update to_dict() method to include max_tokens.The
to_dict()
method should be updated to include all configuration attributes, including the newly addedmax_tokens
. This ensures consistent serialization of the configuration.Apply this diff:
def to_dict(self) -> dict: return { "classification_model": self.classification_model, "summarization_model": self.summarization_model, + "max_tokens": self.max_tokens, }
🧹 Nitpick comments (1)
cognee/modules/cognify/config.py (1)
11-11
: Document the MAX_TOKENS environment variable.Since this configuration relies on an environment variable, it would be helpful to add a docstring explaining its purpose, expected value type, and any constraints.
Add documentation above the attribute:
+ # Maximum number of tokens allowed for document processing. + # Set via MAX_TOKENS environment variable. + # Expected value: positive integer or None max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/api/v1/cognify/code_graph_pipeline.py
(2 hunks)cognee/modules/cognify/config.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/api/v1/cognify/code_graph_pipeline.py
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/modules/cognify/config.py (2)
4-4
: LGTM!The import of
Optional
fromtyping
is correctly added to support type hinting for the newmax_tokens
attribute.
11-11
: Verify the usage of max_tokens across the codebase.Let's verify how this configuration is used in document processing classes mentioned in the AI summary.
✅ Verification successful
Implementation of max_tokens is complete and consistent
The max_tokens configuration is properly propagated through the entire document processing pipeline and correctly implemented in the chunking logic to limit token counts in document chunks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check the usage of max_tokens parameter in document processing classes # Search for document class definitions and their read methods echo "Checking document class implementations:" ast-grep --pattern 'class $DOC_CLASS($_): $$$ def read($$$): $$$' # Search for TextChunker usage of max_tokens echo -e "\nChecking TextChunker implementation:" rg -A 5 'class TextChunker' # Search for places where CognifyConfig.max_tokens is accessed echo -e "\nChecking max_tokens usage:" rg 'max_tokens.*=.*cognee_config\.max_tokens'Length of output: 8151
Script:
#!/bin/bash # Check complete implementation of TextChunker and related classes echo "Checking TextChunker implementation:" cat cognee/modules/chunking/TextChunker.py echo -e "\nChecking extract_chunks_from_documents implementation:" ast-grep --pattern 'def extract_chunks_from_documents($$$): $$$' echo -e "\nChecking additional max_tokens usages:" rg -g '!*.pyc' 'max_tokens' -A 3Length of output: 18909
|
||
|
||
class CognifyConfig(BaseSettings): | ||
classification_model: object = DefaultContentPrediction | ||
summarization_model: object = SummarizedContent | ||
|
||
max_tokens: Optional[int] = os.getenv("MAX_TOKENS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider type validation for MAX_TOKENS environment variable.
While the type hint correctly indicates Optional[int]
, the direct use of os.getenv()
doesn't perform type conversion. This could lead to runtime type errors if the environment variable exists but contains a non-integer value.
Apply this diff to add type validation:
- max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
+ max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
max_tokens: Optional[int] = os.getenv("MAX_TOKENS") | |
max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
evals/eval_swe_bench.py (2)
51-58
: Improve collections configuration and documentation.The hardcoded collections list could benefit from:
- Moving collections to a configuration file or constants
- Adding validation to ensure collections exist
- Documenting the purpose and importance of each collection
- Considering the search performance impact of collection order
+# At the top of the file +SEARCH_COLLECTIONS = { + "code_summary_text": "Contains code summary information", + "data_point_name": "Contains data point identifiers", + "document_chunk_text": "Contains document text chunks", + "entity_name": "Contains entity names", + "entity_type_name": "Contains entity type information", + "sourcecodechunk_source_code": "Contains source code chunks", +} # In the function - collections=[ - "code_summary_text", - "data_point_name", - "document_chunk_text", - "entity_name", - "entity_type_name", - "sourcecodechunk_source_code", - ], + collections=list(SEARCH_COLLECTIONS.keys()),
Line range hint
1-1
: Consider overall improvements to the evaluation script.
- Move the
asyncio
import to the top with other imports- Enhance package installation with version checks and retry logic
- Add more configuration options (e.g., timeout, retry attempts, logging level)
Would you like me to provide a detailed implementation for any of these improvements?
|
||
async for result in pipeline: | ||
async for result in run_code_graph_pipeline(repo_path, include_docs=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance error handling and result utilization.
The current implementation only prints the pipeline results without proper error handling or result processing. Consider:
- Adding try-catch block for the async iteration
- Implementing proper logging instead of print statements
- Storing or processing the pipeline results for later use
- async for result in run_code_graph_pipeline(repo_path, include_docs=True):
- print(result)
+ try:
+ results = []
+ async for result in run_code_graph_pipeline(repo_path, include_docs=True):
+ results.append(result)
+ logging.info(f"Pipeline result: {result}")
+ except Exception as e:
+ logging.error(f"Error in code graph pipeline: {str(e)}")
+ raise
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
157-157
: Update or remove the outdated TODO comment.The TODO comment about adding support for other embedding models (line 159) is now inconsistent with the centralized model management approach.
Either remove the comment or update it to reflect the current architecture:
- # TODO: Add support for other embedding models, with max_token mapping + # TODO: Add max_token mapping for different embedding models in vector engine configuration
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/repo_processor/get_source_code_chunks.py
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
8-8
: LGTM: Import aligns with vector engine integration.The addition of the vector engine import supports the centralized model management approach.
172-172
: LGTM: Consistent with centralized model management.The simplified function call aligns with the removal of the model_name parameter.
vector_engine = get_vector_engine() | ||
embedding_model = vector_engine.embedding_engine.model | ||
model_name = embedding_model.split("/")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for vector engine initialization and model name extraction.
The current implementation makes assumptions about the vector engine and model path format without proper error handling.
Consider this safer implementation:
- vector_engine = get_vector_engine()
- embedding_model = vector_engine.embedding_engine.model
- model_name = embedding_model.split("/")[-1]
+ try:
+ vector_engine = get_vector_engine()
+ if not vector_engine or not vector_engine.embedding_engine:
+ raise ValueError("Vector engine or embedding engine not initialized")
+ embedding_model = vector_engine.embedding_engine.model
+ # Handle cases where model path doesn't contain '/'
+ model_name = embedding_model.split("/")[-1] if "/" in embedding_model else embedding_model
+ except Exception as e:
+ logger.error(f"Failed to initialize vector engine or get model name: {e}")
+ # Fallback to default model
+ model_name = "text-embedding-3-large"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
vector_engine = get_vector_engine() | |
embedding_model = vector_engine.embedding_engine.model | |
model_name = embedding_model.split("/")[-1] | |
try: | |
vector_engine = get_vector_engine() | |
if not vector_engine or not vector_engine.embedding_engine: | |
raise ValueError("Vector engine or embedding engine not initialized") | |
embedding_model = vector_engine.embedding_engine.model | |
# Handle cases where model path doesn't contain '/' | |
model_name = embedding_model.split("/")[-1] if "/" in embedding_model else embedding_model | |
except Exception as e: | |
logger.error(f"Failed to initialize vector engine or get model name: {e}") | |
# Fallback to default model | |
model_name = "text-embedding-3-large" |
Improvements and fixes in the code graph pipeline:
Summary by CodeRabbit
Release Notes
New Features
max_tokens
parameter across multiple document processing methods to control token limits during chunkingImprovements
Configuration
max_tokens
configuration option to support token-based document processing limits