-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(controller): Workflow hangs indefinitely during ContainerCreating if the Pod or Node unexpectedly dies #5585
Conversation
… if the Pod or Node unexpectedly dies Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #5585 +/- ##
==========================================
+ Coverage 47.01% 47.04% +0.02%
==========================================
Files 240 240
Lines 15002 15021 +19
==========================================
+ Hits 7053 7066 +13
- Misses 7048 7059 +11
+ Partials 901 896 -5
Continue to review full report at Codecov.
|
argo-workflows/workflow/controller/operator.go Lines 1118 to 1133 in 4e450e2
Cont>
|
@alexec Do you think we can move nodestatus to
|
workflow/controller/operator.go
Outdated
@@ -975,7 +975,8 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error { | |||
|
|||
// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it | |||
// again instead of marking it as an error. Check if that's the case. | |||
if node.Pending() { | |||
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock | |||
if node.Pending() && !node.IsPodCreated() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one valid scenario where node status will be pending
without pod creation. It is the template that is waiting for sync lock to acquire.
So what is better? @sarabala1979 I'm not sure what to do right now, I did wonder if gitops-engine is correct: https://github.com/argoproj/gitops-engine/blob/master/pkg/health/health_pod.go#L14 But no, not correct. What about
I know that we can be running, but status is things like
Do you know where to find the source code? |
To make this worse - a "waiting" container can be in error state and not recoverable! |
workflow/controller/operator.go
Outdated
@@ -976,9 +976,10 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error { | |||
// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it | |||
// again instead of marking it as an error. Check if that's the case. | |||
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock | |||
if node.Pending() && !node.IsPodCreated() { | |||
if node.Pending() && node.GetPendingReason() == wfv1.WaitingForSyncLock { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be ||
not &&
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be &&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The node should not fail/error if the node is in a pending state with WaitforSyncLock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure? &&
means that any node that is pending and not waiting for lock may be marked as error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I am sure. There is another scenario (pod. is not updated to informer) which will cover in recentlyStarted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simster7 - can I request another opinion on this? I feel like I'm going mad.
workflow/controller/operator.go
Outdated
@@ -976,9 +976,10 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error { | |||
// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it | |||
// again instead of marking it as an error. Check if that's the case. | |||
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock | |||
if node.Pending() && !node.IsPodCreated() { | |||
if node.Pending() && node.GetPendingReason() == wfv1.WaitingForSyncLock { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure? &&
means that any node that is pending and not waiting for lock may be marked as error
… if the Pod or Node unexpectedly dies (#5585)
Signed-off-by: Saravanan Balasubramanian sarabala1979@gmail.com
Checklist: