fix: make sure taskresult completed when mark node succeed when it has outputs #12537
…s outputs Signed-off-by: shuangkun <tsk2013uestc@163.com>
So, it sounds like the execution of a Workflow was previously allowed to continue even if the previous Step's Outputs weren't reconciled? Are you essentially preventing the next Step from running in that case?
Yes, I want to prevent the next step from running.
Yes, I think this was the case before. Under normal circumstances the output is processed before the pod status, because this resource is indeed created earlier. But at large scale, under high pressure, the two events may arrive at the API server in a different order, so they can be processed in the opposite order.
Is it possible to see if this worked on some older versions of code? I'm curious if something broke this. It seems like core functionality.
I see. So, maybe this is a good enough answer to my request that you test on an older version - perhaps this case is just an unusual one? I am kind of curious if other people have logged similar bugs.
I think this may be related to the introduction of taskresult resources in 3.4. It may be hard to support older versions, because they lack a record of whether the taskresult was processed. (Originally I needed to add this record, but found that it is included in the latest version.)
Yes, I tested it on version 3.4.12 for a few weeks. It looks good. There is no longer a "failed to evaluate expression" error like before.
@Garett-MacGowan do you want to look at this too?
I just skimmed it. I can take a proper look after I 😴. In general, if we're proceeding to next steps before outputs are reconciled, it seems important that we add the wait behavior. As you said, it seems like core functionality, so I'm surprised if it's not already accounted for. I'm wondering if this can be tested.
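For readers following the thread, the wait behavior discussed above can be sketched roughly as follows. This is only an illustration, not the controller's code: `completion` and `canMarkSucceeded` are hypothetical names, and treating a missing entry as "not yet completed" is an assumption made for the sketch.

```go
package main

import "fmt"

// completion is a hypothetical stand-in for the controller's record of which
// task results have been reconciled, keyed by node ID.
type completion map[string]bool

// canMarkSucceeded reports whether a node whose pod has already succeeded may
// be marked Succeeded. A node without outputs never waits; a node with outputs
// waits until its task result is recorded as completed.
// Assumption: a missing entry counts as "not yet completed".
func canMarkSucceeded(hasOutputs bool, nodeID string, c completion) bool {
	if !hasOutputs {
		return true
	}
	return c[nodeID]
}

func main() {
	c := completion{}
	// The pod status arrived first: the task result for hello1 is not in yet.
	fmt.Println(canMarkSucceeded(true, "hello1", c)) // false
	// The task result arrives and is reconciled.
	c["hello1"] = true
	fmt.Println(canMarkSucceeded(true, "hello1", c)) // true
}
```

With a gate like this, the next step is only scheduled once the outputs it references are available, regardless of which event reached the controller first.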
workflow/controller/operator.go
Outdated
}
// Check whether the node has output and whether its taskresult is in an incompleted state.
if tmpl.HasOutputs() && woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name) {
	woc.log.WithFields(log.Fields{"nodeID": newState.ID}).WithError(err).Error("Taskresult of the node not yet completed")
Same comment along the lines of what @juliev0 was saying, I don't think this is an error. We just need to flag needReconcileTaskResult. Could maybe just log it normally if you want the log?
right, probably a Debug line
By the way, do we need to call it for both woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name)?
Oh yeah, I was thinking this but had to step away and forgot to ask. I think it should just be tmpl.HasOutputs() && woc.wf.Status.IsTaskResultInCompleted(node.ID)
Maybe confusion from the comment here
seeing that comment about the comment made me think of it :)
By the way, do we need to call it for both woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name)?
Yes, I thought about this at first, but there is a problem: the key differs depending on whether the outputs are reported in pod annotations or in a taskresult. Maybe we can unify on either the pod name or the node ID. I think the node ID is better, what do you think?
Maybe I can add a func named pod.GetNodeId().
if x, ok := pod.Annotations[common.AnnotationKeyReportOutputsCompleted]; ok {
	woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
	resultName := pod.GetName()
	if x == "true" {
		woc.wf.Status.MarkTaskResultComplete(resultName)
	} else {
		woc.wf.Status.MarkTaskResultIncomplete(resultName)
	}
}
Yes, just unify to Node ID.
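As a rough illustration of what unifying on the node ID could look like, here is a small sketch. Everything in it is hypothetical: the `pod` type, the `nodeIDAnnotation` key, and the `nodeIDOf` and `markFromPod` helpers are made up for the example and are not the controller's API.

```go
package main

import "fmt"

// nodeIDAnnotation is a hypothetical annotation key used only in this sketch.
const nodeIDAnnotation = "example.invalid/node-id"

// pod is a hypothetical, trimmed-down pod: just a name and its annotations.
type pod struct {
	name        string
	annotations map[string]string
}

// nodeIDOf returns the node ID recorded on the pod, if any.
func nodeIDOf(p pod) (string, bool) {
	id, ok := p.annotations[nodeIDAnnotation]
	return id, ok
}

// markFromPod records completion under the node ID rather than the pod name,
// so that a later lookup by node ID finds it. It returns false if the pod
// carries no node ID.
func markFromPod(done map[string]bool, p pod, completed bool) bool {
	id, ok := nodeIDOf(p)
	if !ok {
		return false
	}
	done[id] = completed
	return true
}

func main() {
	done := map[string]bool{}
	p := pod{
		name:        "wf-hello1-123",
		annotations: map[string]string{nodeIDAnnotation: "wf-456"},
	}
	markFromPod(done, p, true)
	fmt.Println(done["wf-456"], done[p.name]) // true false
}
```

With a single key, the check in the reviewed condition only needs one lookup by node ID, whichever path reported the outputs.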
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com> Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
workflow/controller/operator.go
Outdated
	woc.log.WithField("workflow", woc.wf.ObjectMeta.Name).Info("pod reconciliation didn't complete, will retry")
	woc.requeue()
	return
}
Sorry, I just realized that we probably need to move the if err != nil clause above the if !podReconciliationCompleted {, since we can return err, false
hopefully after that we should be good, thank you for the iterations!
Yes, I changed it. Thanks!
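The ordering point in this thread can be sketched as below. The function `next` and its return strings are hypothetical and only illustrate why the error check goes first: when reconciliation fails it may also report "not completed", and checking that flag first would requeue and hide the error.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is a placeholder error used only in this sketch.
var errBoom = errors.New("boom")

// next decides what to do after pod reconciliation. The error is checked
// first, because a failed reconciliation can also report completed == false.
func next(err error, completed bool) string {
	if err != nil {
		return "fail: " + err.Error()
	}
	if !completed {
		return "requeue"
	}
	return "continue"
}

func main() {
	fmt.Println(next(errBoom, false)) // fail: boom
	fmt.Println(next(nil, false))     // requeue
	fmt.Println(next(nil, true))      // continue
}
```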
Signed-off-by: shuangkun <tsk2013uestc@163.com>
…s outputs (argoproj#12537) Signed-off-by: shuangkun <tsk2013uestc@163.com> Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com> Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
…s outputs (argoproj#12537) Signed-off-by: shuangkun <tsk2013uestc@163.com> Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com> Co-authored-by: Julie Vogelman <julievogelman0@gmail.com> Signed-off-by: Isitha Subasinghe <isubasinghe@student.unimelb.edu.au>
When my cluster has a lot of workflows, I hit some errors. When the number of workflows is not large, the errors do not occur.
My workflow has many templates like this, where the next step refers to an output of the previous step. For example, hello2a refers to hello1 in the parameter steps['hello1'].outputs.parameters['workflow_artifact_key'].
When I searched the logs, I found that the previous step's (hello1) "node changed" to succeeded was logged earlier than "task-result changed", and this caused hello2a's expression evaluation error. So I want to make sure the taskresult is completed before a node is marked as succeeded when it has outputs.
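The failure mode described here can be sketched roughly as follows. This is a simplified illustration, not how the controller evaluates expressions: the flat `scope` map, the dotted key format, the `resolve` helper, and the exact error text after "failed to evaluate expression" are all assumptions made for the example.

```go
package main

import "fmt"

// resolve looks up a parameter reference in a scope of already-recorded
// outputs. It fails when the referenced output has not been recorded yet.
func resolve(scope map[string]string, ref string) (string, error) {
	v, ok := scope[ref]
	if !ok {
		return "", fmt.Errorf("failed to evaluate expression: %q not found", ref)
	}
	return v, nil
}

func main() {
	const ref = "steps.hello1.outputs.parameters.workflow_artifact_key"
	scope := map[string]string{}

	// hello1 is already marked succeeded, but its outputs are not recorded yet,
	// so hello2a cannot resolve the reference.
	_, err := resolve(scope, ref)
	fmt.Println(err != nil) // true

	// Once the task result is reconciled, the same lookup succeeds.
	scope[ref] = "some-key"
	v, _ := resolve(scope, ref)
	fmt.Println(v) // some-key
}
```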