Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Allow users to selectively retry specific failed nodes. Fixes #12543 #12553

Closed
wants to merge 9 commits into from

Conversation

mio4kon
Copy link

@mio4kon mio4kon commented Jan 19, 2024

Fixes #12543

Motivation

Allow users to selectively retry specific failed nodes instead of retrying all failed nodes at once.

Modifications

Removed the restriction that required the simultaneous use of --node-field-selector and --restart-successful. Now, using --node-field-selector alone allows for individual retries of specific failed nodes, instead of retrying all failures.

Verification

image

--node-field-selector can be used independently.
./dist/argo retry fail-24ptx --node-field-selector name=fail-24ptx.BB -v

Regressively used in combination.
./dist/argo retry fail-mz9c4 --restart-successful --node-field-selector name=fail-mz9c4.A

@agilgur5 agilgur5 self-assigned this Jan 19, 2024
@agilgur5 agilgur5 added area/cli The `argo` CLI area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Jan 19, 2024
Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something feels off here -- nodeIdsToReset should be different if --restart-successful is set or not. This is making it unconditionally restart all successful nodes now. This might need more semantic refactoring

workflow/util/util_test.go Outdated Show resolved Hide resolved
workflow/util/util.go Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
@mio4kon mio4kon force-pushed the support-retry-part-faiil branch 2 times, most recently from 74c6717 to dc5048e Compare January 20, 2024 15:24
workflow/util/util.go Outdated Show resolved Hide resolved
@agilgur5
Copy link
Contributor

feat: [...]

Please re-title this PR as a fix:, since per #12543 (comment) this very much seems like a bug and not intended behavior

@mio4kon mio4kon changed the title feat: Allow users to selectively retry specific failed nodes . Fixes #12543 fix: Allow users to selectively retry specific failed nodes . Fixes #12543 Jan 22, 2024
@mio4kon
Copy link
Author

mio4kon commented Jan 22, 2024

feat: [...]

Please re-title this PR as a fix:, since per #12543 (comment) this very much seems like a bug and not intended behavior

done

@mio4kon
Copy link
Author

mio4kon commented Feb 4, 2024

@agilgur5 hello,Will this MR be merged into the trunk in the future?

selector, err := fields.ParseSelector(nodeFieldSelector)
if err != nil {
return nil, err
} else {
for _, node := range nodes {
if SelectorMatchesNode(selector, node) {
if !restartSuccessful && node.Phase == wfv1.NodeSucceeded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this before, but the latter condition would be better as part of the selector if possible

Copy link
Author

@mio4kon mio4kon Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agilgur5 Hello. Here I want to confirm whether restartSuccessful has the highest priority compared with selector. If a successful node is selected, but restartSuccessful is not specified, whether the successful node is still executed. In this case, the restartSuccessful parameter doesn't feel very meaningful. Or should it be the logic of the diagram below?

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagram looks correct to me

Copy link
Author

@mio4kon mio4kon Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagram looks correct to me

@agilgur5 However, the above diagram may be a bit in conflict with the previous logic, which may cause problems with the previous e2e test. For example, TestFormulateRetryWorkflow/Nested_DAG_with_Non-group_Node_Selected. Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

image

Copy link
Contributor

@agilgur5 agilgur5 Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh -- based on the issue and your fix I strongly suspected there was something deeper incorrect with the previous logic. I actually asked in the contributors channel if someone more familiar with the retry logic could check this PR because of that suspicion.

Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

Ah is that what you were doing with your initial/current fix?
I honestly think we should change it all to work as expected (the black boxes) since the previous logic feels very confusing/unexpected (and has confused users many a time before, as well as contributors).

Big thanks for drawing the diagrams here, those are super helpful! We may want to add a similar mermaid flowchart of this to the docs as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005

(different retries, but similar concept that they both behaved unexpectedly)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005

(different retries, but similar concept that they both behaved unexpectedly)

By the way, do you have any plans to rewrite this part of the logic? Right now, our business needs to be able to retry a single error node. The changes to this issue are to minimize the impact on existing API logic and to provide the ability to retry a single faulty node. Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Should be happening shortly per below

Copy link
Member

@isubasinghe isubasinghe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add two e2e tests please?
One that is a dag and one that is not.

Don't stress about invoking the server if there isn't infrastructure for those kind of tests (although I suspect you should be able to use REST) already.

Just add those two tests with comments linking them to this issue and PR.

@tooptoop4

This comment was marked as spam.

mio4kon and others added 7 commits February 17, 2024 15:50
Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: mio4kon <mio4kon@sina.com>
…t be reset

Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: mio4kon <mio4kon@sina.com>
@mio4kon
Copy link
Author

mio4kon commented Feb 17, 2024

Can you add two e2e tests please? One that is a dag and one that is not.

Don't stress about invoking the server if there isn't infrastructure for those kind of tests (although I suspect you should be able to use REST) already.

Just add those two tests with comments linking them to this issue and PR.

@isubasinghe add e2e tests : TestRetryWorkflowWithStepsWithSelectedFailNodes and TestRetryWorkflowWithDAGWithSelectedFailNodes, Please help to check it out workflow/util/util_test.go

@JasonChen86899
Copy link

JasonChen86899 commented Jun 4, 2024

Hi, I review the logic and this modification may cause problems. The current logic of the main branch is to not retry the error node with successful child nodes. This modification will result in some errors

@shuangkun
Copy link
Member

Hi, will this question be updated again? I encountered a new problem. I want to retry a hanging node while the workflow is running.

@isubasinghe
Copy link
Member

isubasinghe commented Oct 10, 2024

Just an update of the retry logic. I had to rewrite it from scratch, will create a PR very soon (within the day).
Closing this as a result, the current retry logic is very broken.

@agilgur5
Copy link
Contributor

agilgur5 commented Oct 10, 2024

For back-link reference, I mentioned this PR in #13692 (comment)

Closing this as a result, the current retry logic is very broken.

I assume Isitha's refactor will fix the diagram @mio4kon made above, which illustrates some of the broken logic

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Oct 10, 2024
@agilgur5
Copy link
Contributor

I had to rewrite it from scratch, will create a PR very soon (within the day).

See #13734

@agilgur5 agilgur5 changed the title fix: Allow users to selectively retry specific failed nodes . Fixes #12543 fix: Allow users to selectively retry specific failed nodes. Fixes #12543 Oct 10, 2024
@shuangkun
Copy link
Member

shuangkun commented Oct 13, 2024

Can nodes that did not fail also be retried? #13749

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying specific failed node does not work
6 participants