fix: Allow users to selectively retry specific failed nodes. Fixes #12543 #12553

Closed · 9 commits
15 changes: 13 additions & 2 deletions workflow/util/util.go
@@ -960,6 +960,15 @@ func FormulateRetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSucce
	log.Debugf("Reset %s node %s since it's a group node", node.Name, string(node.Phase))
	continue
} else {
	// If restartSuccessful is unset and nodeFieldSelector is set, only retry the specified failed node
	if !restartSuccessful && len(nodeFieldSelector) > 0 {
		if _, present := nodeIDsToReset[node.ID]; !present {
			newWF.Status.Nodes.Set(node.ID, node)
			// Skip the current iteration and move to the next node
			continue
		}
	}

	log.Debugf("Deleted %s node %s since it's not a group node", node.Name, string(node.Phase))
	deletedPods, podsToDelete = deletePodNodeDuringRetryWorkflow(wf, node, deletedPods, podsToDelete)
	log.Debugf("Deleted pod node: %s", node.Name)
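A minimal standalone sketch of the decision this hunk adds (the helper name and flattened parameters are hypothetical, not part of the Argo Workflows codebase): when restartSuccessful is unset and a node field selector is given, only selected nodes proceed to the reset/delete path; everything else is copied into the new workflow unchanged.

```go
package main

import "fmt"

// shouldReset mirrors the branch added to FormulateRetryWorkflow:
// with restartSuccessful unset and a selector present, a node is
// reset only if the selector matched it (i.e. it is in nodeIDsToReset).
// Hypothetical helper for illustration only.
func shouldReset(restartSuccessful, selectorSet, selected bool) bool {
	if !restartSuccessful && selectorSet {
		return selected
	}
	// Previous behavior: fall through and reset/delete the node.
	return true
}

func main() {
	fmt.Println(shouldReset(false, true, false)) // unselected node is kept as-is
	fmt.Println(shouldReset(false, true, true))  // selected node is reset
	fmt.Println(shouldReset(true, false, false)) // no selector: old behavior
}
```

Only the first case changes behavior relative to the pre-PR code: an unmatched node is now preserved instead of being deleted.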
@@ -1049,16 +1058,18 @@ func GetTemplateFromNode(node wfv1.NodeStatus) string {

func getNodeIDsToReset(restartSuccessful bool, nodeFieldSelector string, nodes wfv1.Nodes) (map[string]bool, error) {
	nodeIDsToReset := make(map[string]bool)
-	if !restartSuccessful || len(nodeFieldSelector) == 0 {
+	if len(nodeFieldSelector) == 0 {
		return nodeIDsToReset, nil
	}

selector, err := fields.ParseSelector(nodeFieldSelector)
	if err != nil {
		return nil, err
	} else {
		for _, node := range nodes {
			if SelectorMatchesNode(selector, node) {
				if !restartSuccessful && node.Phase == wfv1.NodeSucceeded {
Contributor:
I mentioned this before, but the latter condition would be better as part of the selector if possible
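One way to read this suggestion is to fold the phase constraint into the field selector string before parsing, rather than checking node.Phase separately inside the loop. A hedged sketch of that idea (the helper and the `phase` field name are assumptions for illustration; they are not confirmed API):

```go
package main

import "fmt"

// withPhaseConstraint appends a phase clause to the node field selector
// when successful nodes should not be restarted. Hypothetical helper;
// the "phase" field name is an assumption about the selector vocabulary.
func withPhaseConstraint(nodeFieldSelector string, restartSuccessful bool) string {
	if restartSuccessful || nodeFieldSelector == "" {
		return nodeFieldSelector
	}
	return nodeFieldSelector + ",phase!=Succeeded"
}

func main() {
	fmt.Println(withPhaseConstraint("name=step-a", false)) // clause appended
	fmt.Println(withPhaseConstraint("name=step-a", true))  // selector unchanged
}
```

With this shape, the selector alone decides which nodes match and the loop needs no extra phase check.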

mio4kon (Author), Feb 17, 2024:
@agilgur5 Hello. I want to confirm whether restartSuccessful takes priority over the selector: if a successful node is selected but restartSuccessful is not specified, is that node still retried? In that case the restartSuccessful parameter doesn't seem very meaningful. Or should it follow the logic in the diagram below?

[image: diagram of the proposed retry decision logic]

Contributor:
The diagram looks correct to me

mio4kon (Author), Feb 17, 2024:
> The diagram looks correct to me

@agilgur5 However, the diagram above may conflict a bit with the previous logic, which could break the existing e2e tests, for example TestFormulateRetryWorkflow/Nested_DAG_with_Non-group_Node_Selected. Do you think it's better to fix only the red logic in the figure below and keep the previous blue logic?

[image: diagram with the changed logic marked in red and the preserved logic in blue]

agilgur5 (Contributor), Feb 20, 2024:
Ahhhh -- based on the issue and your fix I strongly suspected there was something deeper incorrect with the previous logic. I actually asked in the contributors channel if someone more familiar with the retry logic could check this PR because of that suspicion.

> Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

Ah is that what you were doing with your initial/current fix?
I honestly think we should change it all to work as expected (the black boxes) since the previous logic feels very confusing/unexpected (and has confused users many a time before, as well as contributors).

Big thanks for drawing the diagrams here, those are super helpful! We may want to add a similar mermaid flowchart of this to the docs as well

Contributor:
If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005

(different retries, but similar concept that they both behaved unexpectedly)

mio4kon (Author):
> If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005
>
> (different retries, but similar concept that they both behaved unexpectedly)

By the way, do you have any plans to rewrite this part of the logic? Right now our business needs to be able to retry a single failed node. The changes in this PR aim to minimize the impact on the existing API logic while providing the ability to retry a single failed node. Of course, a broader refactoring might provide a more elegant solution. Do you know whether the official team has any plans in this regard? 😃

Contributor:
> Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Should be happening shortly per below

					continue
				}
				// traverse all children of the node
				var queue []string
				queue = append(queue, node.ID)
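A self-contained sketch of the selection rule getNodeIDsToReset implements after this change (simplified types; the match predicate stands in for SelectorMatchesNode, and the child-traversal queue shown above is omitted): matched nodes are collected, but a succeeded node is skipped unless restartSuccessful is set.

```go
package main

import "fmt"

type node struct {
	ID    string
	Phase string // e.g. "Failed", "Succeeded"
}

// getNodeIDsToReset is a simplified stand-in for the function in
// workflow/util/util.go; it collects IDs of nodes that match the
// selector, skipping succeeded nodes unless restartSuccessful is set.
func getNodeIDsToReset(restartSuccessful bool, match func(node) bool, nodes []node) map[string]bool {
	ids := make(map[string]bool)
	for _, n := range nodes {
		if !match(n) {
			continue
		}
		// Skip succeeded nodes unless the caller explicitly asked
		// to restart successful nodes as well.
		if !restartSuccessful && n.Phase == "Succeeded" {
			continue
		}
		ids[n.ID] = true
	}
	return ids
}

func main() {
	nodes := []node{{ID: "a", Phase: "Failed"}, {ID: "b", Phase: "Succeeded"}}
	all := func(node) bool { return true }
	fmt.Println(getNodeIDsToReset(false, all, nodes)) // only the failed node
	fmt.Println(getNodeIDsToReset(true, all, nodes))  // both nodes
}
```

This matches the diagram discussion above: the selector chooses candidates, and restartSuccessful then decides whether succeeded candidates are included.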