Feat: [cog-1089] Define pydantic models for descriptive graph metrics and input metrics #466
Conversation
Walkthrough
The changes introduce two new SQLAlchemy model classes, GraphMetricData and InputMetricData, in cognee/modules/data/models/MetricData.py, for storing descriptive graph metrics and input metrics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant MetricData
    participant Database
    User->>MetricData: Create GraphMetricData
    MetricData->>Database: Insert graph_metric_table
    Database-->>MetricData: Confirm insertion
```
Actionable comments posted: 5
🧹 Nitpick comments (1)
cognee/modules/data/models/MetricData.py (1)
22-23: Simplify timestamp defaults.
Instead of using lambda functions, you can directly pass the function reference.
```diff
- created_at = Column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))
- updated_at = Column(DateTime(timezone=True), onupdate=lambda: datetime.now(timezone.utc))
+ created_at = Column(DateTime(timezone=True), default=datetime.now)
+ updated_at = Column(DateTime(timezone=True), onupdate=datetime.now)
```
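Note that passing datetime.now directly drops the timezone.utc argument, so the stored values become naive local times. If the aim is only to avoid the lambdas while keeping timezone-aware timestamps, a sketch using functools.partial (assuming SQLAlchemy's column defaults accept a zero-argument partial, which recent versions do):

```python
from datetime import datetime, timezone
from functools import partial

from sqlalchemy import Column, DateTime

# partial(datetime.now, timezone.utc) behaves like the original lambda:
# called with no arguments, it returns an aware UTC datetime.
created_at = Column(DateTime(timezone=True), default=partial(datetime.now, timezone.utc))
updated_at = Column(DateTime(timezone=True), onupdate=partial(datetime.now, timezone.utc))
```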
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/modules/data/models/MetricData.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/modules/data/models/MetricData.py (2)
8-42: Consider using Pydantic with SQLAlchemy.
The PR objective mentions using Pydantic models, but this is a pure SQLAlchemy model. Consider using SQLModel or creating separate Pydantic models for API validation and documentation.
Let's check if other models in the project use Pydantic:
✅ Verification successful
Confirmed: Pydantic models should be added for API layer
The codebase extensively uses Pydantic for data validation across API endpoints and DTOs. Consider creating a separate Pydantic model for GraphMetricData to maintain consistency with the project's architecture, where SQLAlchemy handles database operations and Pydantic handles API validation. Example locations showing this pattern:

- cognee/api/DTO.py: Base DTOs using Pydantic
- cognee/shared/data_models.py: Domain models using Pydantic
- cognee/api/v1/cognify/routers/: API endpoints using Pydantic models

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for Pydantic usage in models
rg -l "from pydantic" "cognee/modules/data/models/"
```

Length of output: 51
Script:
```shell
#!/bin/bash
# Check for Pydantic/SQLModel in dependency files
echo "=== Checking dependencies ==="
fd -e txt -e toml "requirements|pyproject" -x cat {}

echo -e "\n=== Checking for Pydantic imports project-wide ==="
rg "from pydantic|import pydantic|from sqlmodel"

echo -e "\n=== Checking for BaseModel usage (Pydantic's base class) ==="
rg "class.*\(.*BaseModel.*\)"
```

Length of output: 15132
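To make the suggestion concrete, a minimal sketch of a companion Pydantic model (the class name GraphMetricDataDTO is hypothetical; the fields mirror the SQLAlchemy columns shown in this review):

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class GraphMetricDataDTO(BaseModel):
    """Pydantic counterpart of GraphMetricData for API validation."""

    num_nodes: int
    num_edges: int
    mean_degree: float
    edge_density: float
    num_connected_components: int
    sizes_of_connected_components: List[int]
    num_selfloops: Optional[int] = None
    diameter: Optional[int] = None
    avg_shortest_path_length: Optional[float] = None
    avg_clustering: Optional[float] = None
    created_at: datetime
    updated_at: Optional[datetime] = None
```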
47-47: Verify if additional input metrics are needed.
The model only tracks token count. Consider whether other input metrics should be captured for comprehensive monitoring.
Let's check if there are related issues or discussions:
✅ Verification successful
Current input metrics are sufficient for their intended purpose
The InputMetricData class appropriately focuses on token count, which is crucial for resource tracking. Other input characteristics are comprehensively covered by the extensive evaluation framework that includes correctness, comprehensiveness, diversity, and various other quality metrics.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for related issues about input metrics
gh issue list --search "input metrics in:title,body" --limit 5
```

Length of output: 64
Script:
```shell
#!/bin/bash
# Find MetricData usage and other metric patterns
echo "=== MetricData Usage ==="
rg "MetricData" -l

echo -e "\n=== Metric Collection Points ==="
rg "metrics?\." -l

echo -e "\n=== Other Metric Models ==="
fd -e py -e js -e ts ".*[Mm]etric.*"

echo -e "\n=== MetricData Implementation ==="
rg -A 10 "class MetricData"
```

Length of output: 622
Script:
```shell
#!/bin/bash
# Examine metric implementations
echo "=== MetricData Class ==="
cat cognee/modules/data/models/MetricData.py

echo -e "\n=== Evaluation Metrics ==="
echo "--- deepeval_metrics.py ---"
cat evals/deepeval_metrics.py

echo -e "\n--- qa_metrics_utils.py ---"
cat evals/qa_metrics_utils.py
```

Length of output: 7902
```diff
@@ -0,0 +1,57 @@
+from datetime import datetime, timezone
+
+from sqlalchemy import Column, DateTime, Float, Integer, ARRAY
```
Add primary key column import.
The models are missing primary key columns. Import and add an appropriate primary key (e.g., id) for both tables.
```diff
-from sqlalchemy import Column, DateTime, Float, Integer, ARRAY
+from sqlalchemy import Column, DateTime, Float, Integer, ARRAY, Identity
```
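For reference, a minimal sketch of the key this import enables (assuming the project's declarative Base; Identity requires SQLAlchemy 1.4+ and a backend with IDENTITY support, such as PostgreSQL 10+):

```python
from sqlalchemy import Column, Integer, Identity


class GraphMetricData(Base):  # Base: the project's declarative base
    __tablename__ = "graph_metric_table"

    # Auto-incrementing surrogate primary key.
    id = Column(Integer, Identity(), primary_key=True)
```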
```python
def to_json(self) -> dict:
    return {
        "num_tokens": self.num_tokens,
        "createdAt": self.created_at.isoformat(),
        "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
    }
```
🛠️ Refactor suggestion
Maintain consistent field naming in JSON serialization.
Similar to GraphMetricData, use a consistent field naming convention.
"num_tokens": self.num_tokens,
- "createdAt": self.created_at.isoformat(),
- "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
+ "created_at": self.created_at.isoformat(),
+ "updated_at": self.updated_at.isoformat() if self.updated_at else None,
```python
def to_json(self) -> dict:
    return {
        "num_nodes": self.num_nodes,
        "num_edges": self.num_edges,
        "mean_degree": self.mean_degree,
        "edge_density": self.edge_density,
        "num_connected_components": self.num_connected_components,
        "sizes_of_connected_components": self.sizes_of_connected_components,
        "num_selfloops": self.num_selfloops if self.num_selfloops else None,
        "diameter": self.diameter if self.diameter else None,
        "avg_shortest_path_length": self.avg_shortest_path_length
        if self.avg_shortest_path_length
        else None,
        "avg_clustering": self.avg_clustering if self.avg_clustering else None,
        "createdAt": self.created_at.isoformat(),
        "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
    }
```
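One caveat, separate from the naming issue below: the truthiness guards (e.g., self.num_selfloops if self.num_selfloops else None) coerce a legitimate value of 0 to None, so a graph with zero self-loops would serialize as null. Since None already serializes as null, the conditionals can simply be dropped; a sketch:

```python
# Nullable scalar fields can be emitted directly; None stays None and
# zero values (e.g., a graph with no self-loops) are preserved.
"num_selfloops": self.num_selfloops,
"diameter": self.diameter,
"avg_shortest_path_length": self.avg_shortest_path_length,
"avg_clustering": self.avg_clustering,
```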
🛠️ Refactor suggestion
Maintain consistent field naming convention in JSON serialization.
The JSON field names mix camelCase (createdAt, updatedAt) with snake_case (all other fields). Choose one convention and stick to it.
"avg_clustering": self.avg_clustering if self.avg_clustering else None,
- "createdAt": self.created_at.isoformat(),
- "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
+ "created_at": self.created_at.isoformat(),
+ "updated_at": self.updated_at.isoformat() if self.updated_at else None,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion from coderabbit.
@borisarzentar Thank you for confirming! The consistent naming will indeed improve the code's maintainability.
(\ /)
(^.^)
c(")(")
1057a14 to bbc39a7 (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (1)
cognee/modules/data/models/MetricData.py (1)
3-3: ⚠️ Potential issue
Add missing SQLAlchemy imports for table constraints.
The models require additional SQLAlchemy components for proper table constraints.
```diff
-from sqlalchemy import Column, DateTime, Float, Integer, ARRAY
+from sqlalchemy import Column, DateTime, Float, Integer, ARRAY, Identity, String, UniqueConstraint
```
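To illustrate where the new imports would land, a hypothetical fragment (run_label and the constraint name are placeholders for illustration, not part of the PR):

```python
from sqlalchemy import Column, Integer, String, Identity, UniqueConstraint


class GraphMetricData(Base):  # Base: the project's declarative base
    __tablename__ = "graph_metric_table"
    __table_args__ = (
        # Placeholder uniqueness rule, for illustration only.
        UniqueConstraint("run_label", name="uq_graph_metric_run_label"),
    )

    id = Column(Integer, Identity(), primary_key=True)
    run_label = Column(String, nullable=True)
```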
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/modules/data/models/MetricData.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: docker-compose-test
@alekszievr Please look at comments.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/modules/data/models/MetricData.py (1)
14-20: 🛠️ Refactor suggestion
Add NOT NULL constraints to essential metrics.
These metrics appear to be fundamental graph properties that should always be present.
```diff
- num_tokens = Column(Integer)
- num_nodes = Column(Integer)
- num_edges = Column(Integer)
- mean_degree = Column(Float)
- edge_density = Column(Float)
- num_connected_components = Column(Integer)
- sizes_of_connected_components = Column(ARRAY(Integer))
+ num_tokens = Column(Integer, nullable=False)
+ num_nodes = Column(Integer, nullable=False)
+ num_edges = Column(Integer, nullable=False)
+ mean_degree = Column(Float, nullable=False)
+ edge_density = Column(Float, nullable=False)
+ num_connected_components = Column(Integer, nullable=False)
+ sizes_of_connected_components = Column(ARRAY(Integer), nullable=False)
```
🧹 Nitpick comments (2)
cognee/modules/data/models/MetricData.py (2)
9-10: Add class-level documentation.
Based on the previous discussion about metric types, add a docstring to explain the purpose and significance of graph metrics (e.g., nodes, edges, connected components).
```diff
 class GraphMetricData(Base):
+    """
+    Stores descriptive metrics about the graph structure.
+
+    This table captures high-level properties of the graph, including:
+    - Basic metrics (nodes, edges)
+    - Connectivity metrics (components, density)
+    - Path metrics (diameter, shortest paths)
+    - Clustering metrics
+    """
     __tablename__ = "graph_metric_table"
```
21-24: Add validation constraints to optional metrics.
Consider adding check constraints to ensure valid values when present:

- diameter should be positive
- avg_shortest_path_length should be positive
- avg_clustering should be between 0 and 1

```diff
- num_selfloops = Column(Integer, nullable=True)
- diameter = Column(Integer, nullable=True)
- avg_shortest_path_length = Column(Float, nullable=True)
- avg_clustering = Column(Float, nullable=True)
+ num_selfloops = Column(Integer, CheckConstraint('num_selfloops >= 0'), nullable=True)
+ diameter = Column(Integer, CheckConstraint('diameter > 0'), nullable=True)
+ avg_shortest_path_length = Column(Float, CheckConstraint('avg_shortest_path_length > 0'), nullable=True)
+ avg_clustering = Column(Float, CheckConstraint('avg_clustering >= 0 AND avg_clustering <= 1'), nullable=True)
```
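Alternatively, the same checks can be declared at table level with explicit names, which makes them easier to reference in migration tooling such as Alembic; a minimal sketch (the constraint names are illustrative):

```python
from sqlalchemy import CheckConstraint

# Inside the model class body, alongside the column definitions:
__table_args__ = (
    CheckConstraint("diameter > 0", name="ck_graph_metric_diameter_positive"),
    CheckConstraint(
        "avg_clustering >= 0 AND avg_clustering <= 1",
        name="ck_graph_metric_avg_clustering_range",
    ),
)
```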
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/modules/data/models/MetricData.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (21)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/modules/data/models/MetricData.py (3)
1-7: LGTM! All imports are appropriate and used.
The imports cover all necessary components for the SQLAlchemy model implementation.
12-13: Clarify or implement the TODO comment.
The TODO suggests changing the ID to reflect the graph database's unique ID. This needs either:
- Implementation if requirements are clear
- Clarification in the comment about specific requirements
The UUID choice is good for distributed systems, but consider documenting the relationship with the graph database ID in the class docstring.
26-27: LGTM! Timestamp implementation follows best practices.
The timestamp fields are properly implemented with:
- Timezone awareness
- Automatic creation timestamp
- Automatic update timestamp
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
- Introduced GraphMetricData, enhancing data management capabilities.
- GraphMetricData includes metrics such as the number of nodes, edges, and clustering metrics.