-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: stuck hook issue when a Job resource has a ttlSecondsAfterFinished field set #646
base: master
Are you sure you want to change the base?
Conversation
c446acf
to
9e573ae
Compare
pkg/sync/sync_context.go
Outdated
if job.Spec.TTLSecondsAfterFinished == nil { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the downside(s) to always using a finalizer instead of only using one when this field is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question, the logic would be simpler and more generic if all hooks have a finalizer, cleanup would also be simpler.
pkg/sync/sync_context.go
Outdated
// processJobHookTask processes a hook task where the target object is a Job and has defined ttlSecondsAfterFinished. | ||
// This addresses the issue where a Job with a ttlSecondsAfterFinished set to a low value gets deleted fast and the hook phase gets stuck. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the finalizer feature be limited to just Jobs? I understand that Argo Workflows can also exhibit the same behavior. I could imagine any resource having the same issue if some process deletes the resource before Argo CD has a chance to observe the final resource state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, and yes, that is true.
I focused only on the immediate issue as a PoC, I am also in favour of a generic solution for the scenario where some external process deletes a resource during hook phase.
376d9d0
to
57d66fd
Compare
} | ||
} | ||
if mutated { | ||
task.targetObj.SetFinalizers(finalizers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I set it at all for the targetObj
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a glance, it's not clear to me.... would you mind investigating/documenting that liveObj
and targetObj
are on the syncTask
struct?
@crenshaw-dev I have updated the PR to make the solution more generic. This use case should be also covered by an integration test, but I guess that test would live in the argocd repository? |
@dejanzele yep! You can open a PR on the argo-cd repo temporarily replacing gitops-engine in go.mod with your fork and revision. |
a4ed0ec
to
f02987e
Compare
// In that case, we need to get the latest version of the object and retry the update. | ||
return retry.RetryOnConflict(retry.DefaultRetry, func() error { | ||
updateErr := sc.updateResource(task) | ||
if apierr.IsConflict(updateErr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conflicts happen quite often and without retries the E2E tests were running very flaky.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…it generic Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
f02987e
to
4e93b3e
Compare
Quality Gate passedIssues Measures |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #646 +/- ##
==========================================
+ Coverage 54.26% 54.45% +0.18%
==========================================
Files 64 64
Lines 6164 6330 +166
==========================================
+ Hits 3345 3447 +102
- Misses 2549 2607 +58
- Partials 270 276 +6 ☔ View full report in Codecov by Sentry. |
This is a proof-of-concept PR which would fix argoproj/argo-cd#21055
More info on the issue can be found in the linked GitHub issue.
Idea is to add a finalizer to Job resources which havettlSecondsAfterFinished
set and remove it after ArgoCD detects the hook completed.A simpler approach would be to unset thettlSecondsAfterFinished
but that would cause drift from defined vs actual state.The proposed solution is to attach a finalizer on all hook tasks and remove it after the argocd acknowledges the hook task is completed in the sync phase.
The same scenario which is described in the linked GitHub issue passes in this PR.
I welcome any feedback, as I think a lot of people on the community would like this to be fixed, and I'd be more than happy to adopt based on best direction for this issue.