fix(backend): few improvements for scaling branches #7752

fatih-acar · 2025-11-28T10:16:18Z

Summary by CodeRabbit

Refactor
- Prevented duplicate branch-hash processing to speed up branch handling and repository sync.
- Improved repository data handling (supports additional repository types) and added staging-branch detection.
- Repository sync flow now uses database-backed queries for more efficient synchronization.
Tests
- Updated assertions to accommodate the new repository field handling.
Changelog
- Noted performance improvements for branch creation and repository synchronization.

coderabbitai · 2025-11-28T10:16:35Z

Walkthrough

This pull request updates schema management, git models, tasks, utils, and tests. purge_inactive_branches now tracks processed branch hashes to avoid re-processing. RepositoryData gains model_config, a repository field (CoreRepository|CoreReadOnlyRepository|Node), and get_staging_branch(). sync_remote_repositories was refactored to use a database session with get_repositories_commit_per_branch instead of a client list. get_repositories_commit_per_branch adds a kind parameter, stores full repository objects on RepositoryData, and returns a mapping by repository. Tests now assert the repository identity separately and compare dumped data excluding repository. A changelog entry was added.

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(backend): few improvements for scaling branches' accurately describes the main changes in the PR, which focus on performance optimizations for handling large numbers of branches through improved tracking mechanisms in the schema manager and repository utilities.

codspeed-hq · 2025-11-28T10:22:22Z

CodSpeed Performance Report

Merging #7752 will not alter performance

Comparing fac-branch-scale-IFC-2059 (a9eea39) with stable (c964c21)

Summary

✅ 12 untouched

Without this fix, branch creation does not scale: inactive branches are purged on every create, which processes each branch. 10x create-branch speedup at 100 branches (1.5s vs 15s). Signed-off-by: Fatih Acar <fatih@opsmill.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

backend/infrahub/git/tasks.py (1)
207-207: Type annotation assumes CoreRepository but union type is broader.

The type annotation repository: CoreRepository = repository_data.repository assumes all results are CoreRepository, but repository_data.repository is typed as CoreRepository | CoreReadOnlyRepository | Node.

Since kind=InfrahubKind.REPOSITORY is passed to get_repositories_commit_per_branch, this assumption is likely safe in practice. However, for type safety, consider using cast() or adding a runtime assertion:
from typing import cast
repository = cast(CoreRepository, repository_data.repository)
Or validate explicitly:
repository = repository_data.repository
assert hasattr(repository, 'default_branch'), f"Expected CoreRepository, got {type(repository)}"
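A third option, sketched here under assumptions, is an explicit `isinstance` check. It narrows the type for mypy and still fails loudly at runtime, and unlike `assert` it is not stripped when Python runs with `-O`. The classes below are empty stand-ins rather than the real Infrahub protocols, and `require_core_repository` is a hypothetical helper. Whether `isinstance` works against the real protocol classes is not confirmed here.

```python
from __future__ import annotations


class Node:
    """Stand-in for the generic Node type."""


class CoreReadOnlyRepository:
    """Stand-in for the read-only repository type."""


class CoreRepository:
    """Stand-in for the repository type that exposes default_branch."""

    def __init__(self, default_branch: str) -> None:
        self.default_branch = default_branch


def require_core_repository(
    repository: CoreRepository | CoreReadOnlyRepository | Node,
) -> CoreRepository:
    """Return the repository narrowed to CoreRepository, or raise TypeError."""
    if not isinstance(repository, CoreRepository):
        raise TypeError(f"Expected CoreRepository, got {type(repository).__name__}")
    return repository


repo = require_core_repository(CoreRepository(default_branch="main"))
```

Compared with `cast()`, which only informs the type checker, this variant also validates the object at runtime.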

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43241fc and 0f6e2db.

📒 Files selected for processing (5)

backend/infrahub/core/schema/manager.py (1 hunks)
backend/infrahub/git/models.py (2 hunks)
backend/infrahub/git/tasks.py (5 hunks)
backend/infrahub/git/utils.py (3 hunks)
backend/tests/unit/git/test_utils.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

**/*.py

📄 CodeRabbit inference engine (.github/instructions/python-docstring.instructions.md)

**/*.py: Always use triple quotes (""") for Python docstrings
Follow Google-style docstring format for Python docstrings
Include brief one-line description in Python docstrings when applicable
Include detailed description in Python docstrings when applicable
Include Args/Parameters section without typing in Python docstrings when applicable
Include Returns section in Python docstrings when applicable
Include Raises section in Python docstrings when applicable
Include Examples section in Python docstrings when applicable

**/*.py: Use type hints for all function parameters and return values in Python
Use Async whenever possible in Python
Use async def for asynchronous functions in Python
Use await for asynchronous calls in Python
Use Pydantic models for dataclasses in Python
Use ruff and mypy for type checking and code validation in Python

Use ruff and mypy to validate and lint Python files

Files:

backend/infrahub/git/models.py
backend/infrahub/git/tasks.py
backend/infrahub/git/utils.py
backend/tests/unit/git/test_utils.py
backend/infrahub/core/schema/manager.py

backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use type hints for Python code in backend

Files:

backend/infrahub/git/models.py
backend/infrahub/git/tasks.py
backend/infrahub/git/utils.py
backend/tests/unit/git/test_utils.py
backend/infrahub/core/schema/manager.py

backend/infrahub/**/*.py

📄 CodeRabbit inference engine (backend/AGENTS.md)

backend/infrahub/**/*.py: Use async/await for all I/O operations to maintain async-first architecture
Type hint all function parameters and returns in Python code
Use Pydantic models for defining data structures instead of plain dictionaries
Use Query class pattern (extending infrahub.core.query.Query) for all database operations instead of unparameterized Cypher queries
Use Google-style docstrings with Args, Returns, and Raises sections for all functions
Use snake_case for function and variable names
Use PascalCase for class names
Use UPPER_SNAKE_CASE for constant definitions
Do not use unparameterized Cypher queries; always use parameterized queries to prevent injection
Do not block the event loop with synchronous I/O operations

Files:

backend/infrahub/git/models.py
backend/infrahub/git/tasks.py
backend/infrahub/git/utils.py
backend/infrahub/core/schema/manager.py

backend/tests/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Run backend tests with pytest or via invoke tasks

Name test files as test_<module>.py and mirror source structure in tests directory

Files:

backend/tests/unit/git/test_utils.py

🧠 Learnings (1)

📚 Learning: 2025-12-01T22:16:13.668Z

Learnt from: CR
Repo: opsmill/infrahub PR: 0
File: backend/AGENTS.md:0-0
Timestamp: 2025-12-01T22:16:13.668Z
Learning: Applies to backend/infrahub/**/*.py : Use Pydantic models for defining data structures instead of plain dictionaries

Applied to files:

backend/infrahub/git/models.py

🧬 Code graph analysis (3)

backend/infrahub/git/tasks.py (2)

backend/infrahub/git/utils.py (1)

get_repositories_commit_per_branch (27-64)

backend/infrahub/workers/dependencies.py (2)

get_database (70-71)

get_client (50-51)

backend/infrahub/git/utils.py (2)

backend/infrahub/core/manager.py (1)

NodeManager (78-1416)

backend/infrahub/git/models.py (2)

RepositoryBranchInfo (201-202)

RepositoryData (205-224)

backend/infrahub/core/schema/manager.py (2)

backend/infrahub/core/schema/schema_branch.py (3)

get (324-357)

get_all (449-456)

duplicate (292-301)

backend/infrahub/core/models.py (2)

nodes (62-64)

duplicate (501-503)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Cloudflare Pages

🔇 Additional comments (10)

backend/infrahub/core/schema/manager.py (1)

777-785: Optimization to avoid redundant schema hash collection looks correct.

The logic correctly skips branches that share an already-processed schema hash, reducing redundant work when multiple branches point to the same schema. The conditional on line 780 properly handles both cases: when a branch has no hash yet (new/untracked) and when the hash hasn't been seen before.

One edge case to verify: if active_branch is in _branch_hash_by_name but not in _branches, line 781-782 would add the hash to branch_processed, but line 783's get() would return None, skipping hash collection. This seems intentional (stale entry cleanup), but worth confirming this scenario is expected.
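The deduplication and the edge case above can be sketched as follows. This is a simplified illustration, not the actual `purge_inactive_branches` code: `collect_hashes`, `branch_hash_by_name`, and `branches` are hypothetical stand-ins for the manager's internal state, and the schema of a branch is reduced to a list of node hashes.

```python
from __future__ import annotations


def collect_hashes(
    active_branches: list[str],
    branch_hash_by_name: dict[str, str],
    branches: dict[str, list[str]],
) -> set[str]:
    """Collect node hashes to keep, visiting each distinct branch hash once."""
    hashes_to_keep: set[str] = set()
    processed: set[str] = set()

    for name in active_branches:
        branch_hash = branch_hash_by_name.get(name)
        # Skip branches whose schema hash was already handled.
        if branch_hash is not None and branch_hash in processed:
            continue
        if branch_hash is not None:
            processed.add(branch_hash)
        schema = branches.get(name)
        if schema is None:
            # Edge case: a tracked hash without a loaded schema collects nothing.
            continue
        hashes_to_keep.update(schema)
    return hashes_to_keep


kept = collect_hashes(
    active_branches=["main", "b1", "b2", "stale"],
    branch_hash_by_name={"main": "h1", "b1": "h1", "b2": "h2", "stale": "h3"},
    branches={"main": ["n1", "n2"], "b1": ["n1", "n2"], "b2": ["n3"]},
)
```

In this example `b1` is skipped because it shares the hash `h1` with `main`, and `stale` contributes nothing because it has no entry in `branches`.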

backend/tests/unit/git/test_utils.py (2)

34-47: Test assertions correctly adapted for the new repository field.

The two-step verification pattern (identity check on repository.id, then model_dump(exclude=["repository"]) for remaining fields) is the right approach for testing a Pydantic model containing a non-serializable object reference.

66-99: LGTM!

The multi-branch test correctly verifies repository identity and field values across different branches with modified commits.

backend/infrahub/git/models.py (2)

206-212: Model configuration and repository field look good.

The arbitrary_types_allowed=True is necessary since Node is not a Pydantic-native type. The union type properly accommodates both repository protocol types and the generic Node fallback.

220-224: Consider edge case: multiple branches with staging status.

The get_staging_branch() method returns the first branch with internal_status == "staging". If multiple staging branches are possible (even temporarily), this would return an arbitrary one based on dict iteration order.

If only one staging branch is ever valid, this is fine. Otherwise, consider returning a list or documenting the single-staging-branch assumption.
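The difference between the two behaviours can be sketched with plain dataclasses. This is an assumption-laden stand-in, not the real Pydantic `RepositoryData` model: `BranchInfo`, `RepoData`, the return types, and `get_staging_branches` are hypothetical, and only the `internal_status == "staging"` check comes from the comment above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BranchInfo:
    internal_status: str


@dataclass
class RepoData:
    # Maps branch name to its info; dicts preserve insertion order.
    branch_info: dict[str, BranchInfo] = field(default_factory=dict)

    def get_staging_branch(self) -> str | None:
        """Return the first staging branch in insertion order, if any."""
        for name, info in self.branch_info.items():
            if info.internal_status == "staging":
                return name
        return None

    def get_staging_branches(self) -> list[str]:
        """Return every staging branch, making multiple matches visible."""
        return [
            name
            for name, info in self.branch_info.items()
            if info.internal_status == "staging"
        ]


data = RepoData(
    branch_info={
        "main": BranchInfo("active"),
        "b1": BranchInfo("staging"),
        "b2": BranchInfo("staging"),
    }
)
```

With two staging branches, the single-value method silently returns whichever was inserted first, while the list variant exposes both.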

backend/infrahub/git/utils.py (2)

41-57: Repository object stored from first-encountered branch.

The repository object stored in RepositoryData comes from whichever branch is iterated first (line 55 only executes when repo_name not in repositories). Subsequent branch iterations update branches and branch_info but not the repository reference.

This means repository.commit.value, repository.internal_status.value, etc. reflect one specific branch's state, not necessarily the default branch or any specific branch. If downstream code expects the repository object to represent the default branch's state, this could cause subtle issues.

Consider explicitly tracking which branch the stored repository object came from, or ensuring the default branch is processed first.
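One way to make the stored object deterministic is to process the default branch first. The sketch below assumes a simplified input shape (branch name mapped to repository name mapped to repository object) and uses hypothetical helpers `order_default_first` and `pick_repository_objects`; it does not reflect how `get_repositories_commit_per_branch` actually iterates.

```python
from __future__ import annotations


def order_default_first(branch_names: list[str], default_branch: str) -> list[str]:
    """Return branch names with the default branch first, keeping the rest in order."""
    # sorted() is stable, and False sorts before True.
    return sorted(branch_names, key=lambda name: name != default_branch)


def pick_repository_objects(
    rows: dict[str, dict[str, str]], default_branch: str
) -> dict[str, str]:
    """Keep the first-seen object per repository, visiting the default branch first."""
    repositories: dict[str, str] = {}
    for branch in order_default_first(list(rows), default_branch):
        for repo_name, repo_obj in rows[branch].items():
            # Only the first encounter is stored, mirroring the behaviour described above.
            repositories.setdefault(repo_name, repo_obj)
    return repositories


rows = {
    "feature": {"repo-a": "repo-a@feature"},
    "main": {"repo-a": "repo-a@main", "repo-b": "repo-b@main"},
}
picked = pick_repository_objects(rows, default_branch="main")
```

Here `repo-a` resolves to the object from `main` even though `feature` appears first in the input.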

27-30: New kind parameter adds useful flexibility.

The parameterized kind allows callers to filter by repository schema type. The default InfrahubKind.GENERICREPOSITORY maintains backward compatibility while enabling the new InfrahubKind.REPOSITORY usage in tasks.py.

backend/infrahub/git/tasks.py (3)

198-204: Good refactor: DB-backed repository fetching reduces API overhead.

Moving from client.get_list_repositories to the database-backed get_repositories_commit_per_branch eliminates unnecessary network round-trips for repository discovery. The session scope is properly managed with async with db.start_session().
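The session scoping described above can be illustrated with a stand-in database object. `FakeDatabase`, `fetch_commits_per_branch`, and `sync_repositories` are hypothetical names for this sketch. Only the `async with db.start_session()` shape comes from the review, and the real helper's signature and return value may differ.

```python
from __future__ import annotations

import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class FakeDatabase:
    """Stand-in database that counts how many sessions are open."""

    def __init__(self) -> None:
        self.open_sessions = 0

    @asynccontextmanager
    async def start_session(self) -> AsyncIterator[FakeDatabase]:
        self.open_sessions += 1
        try:
            yield self
        finally:
            # The session is released even if the query raises.
            self.open_sessions -= 1


async def fetch_commits_per_branch(session: FakeDatabase) -> dict[str, dict[str, str]]:
    """Stand-in for the DB-backed helper; returns repository -> branch -> commit."""
    assert session.open_sessions == 1  # the query runs while the session is open
    return {"repo-a": {"main": "abc123"}}


async def sync_repositories(db: FakeDatabase) -> dict[str, dict[str, str]]:
    # Scope the session to the query only, then work on the returned data.
    async with db.start_session() as session:
        repositories = await fetch_commits_per_branch(session)
    return repositories


db = FakeDatabase()
result = asyncio.run(sync_repositories(db))
```

Keeping the `async with` block limited to the query means the session is not held open during the per-repository sync work that follows.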

221-228: LGTM: Repository field access is consistent.

The access patterns (repository.id, repository.name.value, repository.location.value, repository.default_branch.value) are consistent with CoreRepository protocol and properly extract values from the repository node.

263-271: The get_kind() method is correctly available on the repository object.

Line 268 calls repository.get_kind() where the repository is typed as CoreRepository (line 211). CoreRepository inherits from CoreNode through multiple parent classes (LineageOwner, LineageSource, CoreGenericRepository, CoreTaskTarget), and CoreNode corresponds to InfrahubNode, which implements the get_kind() method (backend/infrahub/core/node/__init__.py:112-114). The method is safe to call and will return self._schema.kind.

Signed-off-by: Fatih Acar <fatih@opsmill.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

changelog/+d9659fb5.fixed.md (1)
1-1: Consider more concise phrasing for the changelog entry.

The phrase "having a lot of" can be tightened for clarity. Consider rewording to something like "with many branches" or "at scale."

Apply this diff to improve conciseness:
-Improve branch creation and repository sync performance when having a lot of branches.
+Improve branch creation and repository sync performance when scaling with many branches.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f6e2db and a9eea39.

📒 Files selected for processing (1)

changelog/+d9659fb5.fixed.md (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

**/*.md: Use - for unordered lists in markdown files
Add blank line before/after headings, code blocks, and lists in markdown files
Use fenced code blocks with language identifier in markdown files
No trailing spaces or multiple consecutive blank lines in markdown files
No bare URLs in markdown files - use [text](url) format

Files:

changelog/+d9659fb5.fixed.md

🪛 LanguageTool

changelog/+d9659fb5.fixed.md

[style] ~1-~1: Consider using a synonym to be more concise.
Context: ...repository sync performance when having a lot of branches.

(A_LOT_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: E2E-testing-version-upgrade / From 1.3.0
GitHub Check: E2E-testing-playwright
GitHub Check: backend-benchmark
GitHub Check: E2E-testing-invoke-demo-start
GitHub Check: documentation
GitHub Check: backend-docker-integration
GitHub Check: backend-tests-integration
GitHub Check: backend-tests-functional
GitHub Check: backend-tests-unit
GitHub Check: Cloudflare Pages

github-actions bot added the group/backend Issue related to the backend (API Server, Git Agent) label Nov 28, 2025

fatih-acar force-pushed the fac-branch-scale-IFC-2059 branch 2 times, most recently from 4769e44 to 3c19703 Compare November 28, 2025 14:35

dgarros approved these changes Nov 28, 2025

fatih-acar changed the title ~~fix(backend): few improvements for scaling~~ fix(backend): few improvements for scaling branches Nov 28, 2025

fatih-acar force-pushed the fac-branch-scale-IFC-2059 branch from 3c19703 to b87cf09 Compare December 1, 2025 13:28

fatih-acar added 4 commits December 2, 2025 10:30

fix(backend): use direct db access when syncing repositories

07b323d

The SDK get_list_repositories doesn't scale. A workaround is to use the get_repositories_commit_per_branch helper that is similar but using DB queries to get the data. Signed-off-by: Fatih Acar <fatih@opsmill.com>

fix(backend): only sync required branches

30c9187

In the recurring Sync Git Repositories task, only sync branches that are flagged with the sync_with_git flag. Signed-off-by: Fatih Acar <fatih@opsmill.com>

Revert "fix(backend): only sync required branches"

0f6e2db

This reverts commit 23ab221. Not sure of the impact of this change (could break features related to repositories). Also, this change is not required for real world usage (syncing a lot of branches between two sync jobs).

fatih-acar force-pushed the fac-branch-scale-IFC-2059 branch from b87cf09 to 0f6e2db Compare December 2, 2025 09:44

fatih-acar marked this pull request as ready for review December 2, 2025 12:01

fatih-acar requested a review from a team as a code owner December 2, 2025 12:01

coderabbitai bot reviewed Dec 2, 2025

chore: add newsfragment

a9eea39

Signed-off-by: Fatih Acar <fatih@opsmill.com>

coderabbitai bot reviewed Dec 2, 2025

ogenstad approved these changes Dec 2, 2025

fatih-acar merged commit cd2ed7c into stable Dec 2, 2025
41 checks passed

fatih-acar deleted the fac-branch-scale-IFC-2059 branch December 2, 2025 13:35
