@dheerajturaga
Member

When using git DAG bundles, corrupted bare repositories can cause all tasks
landing on a host to fail with InvalidGitRepositoryError. This adds retry
logic that detects corrupted bare repositories, cleans them up, and attempts
to re-clone them once before failing.

Changes:

  • Add InvalidGitRepositoryError handling in _clone_bare_repo_if_required()
  • Implement cleanup and retry logic with shutil.rmtree()
  • Add comprehensive tests for both successful retry and retry failure scenarios
  • Ensure all existing tests continue to pass
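The recovery flow described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `ensure_bare_repo`, `InvalidRepoError`, `fake_open`, and `fake_clone` are hypothetical names, and the open/clone steps are injected callables so the sketch runs with the standard library only. In the real bundle code these would presumably map to GitPython's `Repo(path)`, `Repo.clone_from(..., bare=True)`, and `InvalidGitRepositoryError`.

```python
import shutil
from pathlib import Path


class InvalidRepoError(Exception):
    """Hypothetical stand-in for GitPython's InvalidGitRepositoryError."""


def ensure_bare_repo(path, open_repo, clone_repo):
    """Open the bare repo at `path`, re-cloning once if it looks corrupted.

    `open_repo(path)` and `clone_repo(path)` are injected callables
    (assumptions of this sketch, not Airflow APIs).
    """
    path = Path(path)
    if not path.exists():
        clone_repo(path)
    try:
        return open_repo(path)
    except InvalidRepoError:
        # Corrupted clone: remove the directory and retry exactly once.
        shutil.rmtree(path, ignore_errors=True)
        clone_repo(path)
        # A second failure propagates to the caller unchanged.
        return open_repo(path)


def fake_clone(path):
    """Illustrative clone: create a directory containing a HEAD file."""
    path.mkdir(parents=True)
    (path / "HEAD").write_text("ref: refs/heads/main\n")


def fake_open(path):
    """Illustrative open: treat a directory without HEAD as corrupted."""
    if not (path / "HEAD").is_file():
        raise InvalidRepoError(str(path))
    return path
```

Retrying only once keeps a persistently failing remote from looping forever on a worker, while still clearing the common case of a half-written clone left on disk.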

Contributor

@prdai prdai left a comment

Not required, but just a thought: we could use tenacity here instead of a custom retry loop. It might make the retry/backoff logic easier to read and maintain. For example:

from git import Repo
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(), reraise=True)
def clone_bare_repo(url, path, env=None):
    return Repo.clone_from(url, path, bare=True, env=env)
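
For readers unfamiliar with tenacity, the behavior that decorator asks for (up to three attempts, exponentially growing waits, original exception re-raised at the end) can be approximated with the standard library alone. The sketch below is only an illustration of the idea, not tenacity's implementation; `retry_with_backoff` and its parameters are hypothetical names.

```python
import functools
import time


def retry_with_backoff(attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped function with exponentially growing delays.

    `sleep` is injectable so the delays can be observed without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        # Out of attempts: re-raise the original exception.
                        raise
                    # Wait base_delay, then 2x, 4x, ... before the next try.
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

A library such as tenacity adds features this sketch lacks (wait caps, jitter, retrying only on selected exception types), which is part of the argument for using it over a hand-written loop.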

@dheerajturaga
Member Author

dheerajturaga commented Oct 11, 2025

cc: @jedcunningham @ephraimbuddy @kaxil @jscheffl

When using Git bundles with edge workers (EdgeExecutor), an unstable Git connection can cause a git clone or bare clone to fail. If the bare clone is left broken, all subsequent tasks on that worker fail until someone manually SSHes into the machine and tidies up the bare repo. This PR is an attempt to make that self-healing.

I'm hoping to get this into the next wave of provider releases. cc: @eladkal

@dheerajturaga
Member Author

Not required, but just a thought: we could use tenacity here instead of a custom retry loop. It might make the retry/backoff logic easier to read and maintain. For example:

from git import Repo
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(), reraise=True)
def clone_bare_repo(url, path, env=None):
    return Repo.clone_from(url, path, bare=True, env=env)

good idea!

Member

@potiuk potiuk left a comment

Nice!

@dheerajturaga dheerajturaga force-pushed the bugfix/corrupted-git-bundle-retry branch from 8465619 to 1b238ca Compare October 14, 2025 20:46
@dheerajturaga
Member Author

@potiuk this looks like an unrelated failure?

@dheerajturaga dheerajturaga force-pushed the bugfix/corrupted-git-bundle-retry branch from 1b238ca to 03de448 Compare October 15, 2025 22:01
Contributor

@jscheffl jscheffl left a comment

This looks good to me. I would have preferred to catch other AirflowExceptions directly in this PR rather than reverting those changes: in my view this is not a breaking change, since no Dag or user code touches these exceptions. Now some other PR needs to clean this up, but anyway, LGTM.

@jscheffl jscheffl requested a review from ephraimbuddy October 16, 2025 04:51
Member

@potiuk potiuk left a comment

LGTM. @ephraimbuddy ?

@potiuk potiuk merged commit 5320693 into apache:main Oct 16, 2025
112 checks passed
snreddygopu pushed a commit to Teradata/airflow that referenced this pull request Oct 16, 2025
* Fix corrupted bare Git repository recovery in DAG bundles

  When using git DAG bundles, corrupted bare repositories can cause all tasks
  landing on a host to fail with InvalidGitRepositoryError. This adds retry
  logic that detects corrupted bare repositories, cleans them up, and attempts
  to re-clone them once before failing.

  Changes:
  - Add InvalidGitRepositoryError handling in _clone_bare_repo_if_required()
  - Implement cleanup and retry logic with shutil.rmtree()
  - Add comprehensive tests for both successful retry and retry failure scenarios
  - Ensure all existing tests continue to pass

* Refactor git clone retry logic to use tenacity

* Ephraims suggestions
@dheerajturaga dheerajturaga deleted the bugfix/corrupted-git-bundle-retry branch October 16, 2025 18:40
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 17, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 19, 2025
TyrellHaywood pushed a commit to TyrellHaywood/airflow that referenced this pull request Oct 22, 2025

5 participants