fix: Allow users to selectively retry specific failed nodes. Fixes #12543 #12553
2d45879 to e3fef71
Something feels off here: `nodeIdsToReset` should be different depending on whether `--restart-successful` is set or not. This is making it unconditionally restart all successful nodes now. This might need more semantic refactoring.
74c6717 to dc5048e
Please re-title this PR as a
done
@agilgur5 Hello, will this MR be merged into the trunk in the future?
    selector, err := fields.ParseSelector(nodeFieldSelector)
    if err != nil {
        return nil, err
    } else {
        for _, node := range nodes {
            if SelectorMatchesNode(selector, node) {
                if !restartSuccessful && node.Phase == wfv1.NodeSucceeded {
I mentioned this before, but the latter condition would be better as part of the selector if possible
@agilgur5 Hello. I want to confirm whether `restartSuccessful` takes priority over `selector`. If a successful node is selected but `restartSuccessful` is not specified, is the successful node still re-executed? If so, the `restartSuccessful` parameter doesn't feel very meaningful. Or should it follow the logic of the diagram below?
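One reading of the question above can be sketched in Go as follows. This is a simplified illustration, not the Argo implementation: `Node`, `Phase`, and `shouldReset` are hypothetical names, and the sketch assumes that a successful node matched by the selector is skipped unless `restartSuccessful` is set. How nodes that the selector does not match are treated is outside this sketch.

```go
package main

import "fmt"

// Phase is a simplified stand-in for the workflow node phase type.
type Phase string

const (
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// Node is a hypothetical, stripped-down node record for illustration only.
type Node struct {
	Name  string
	Phase Phase
}

// shouldReset reports whether a node that the selector has already matched
// would be reset. Assumption: a matched successful node is left alone
// unless restartSuccessful is set; any other matched node is reset.
func shouldReset(n Node, restartSuccessful bool) bool {
	if n.Phase == Succeeded && !restartSuccessful {
		return false
	}
	return true
}

func main() {
	// Both nodes are assumed to match the selector.
	nodes := []Node{
		{Name: "wf.A", Phase: Succeeded},
		{Name: "wf.B", Phase: Failed},
	}
	for _, rs := range []bool{false, true} {
		for _, n := range nodes {
			fmt.Printf("restartSuccessful=%v node=%s phase=%s reset=%v\n",
				rs, n.Name, n.Phase, shouldReset(n, rs))
		}
	}
}
```

Under this reading, `restartSuccessful` acts as a gate on top of the selector rather than as an override of it: the selector chooses candidates, and the flag decides whether successful candidates are included.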
The diagram looks correct to me
> The diagram looks correct to me

@agilgur5 However, the above diagram may conflict somewhat with the previous logic, which may cause problems with the previous e2e tests, for example `TestFormulateRetryWorkflow/Nested_DAG_with_Non-group_Node_Selected`. Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?
Ahhhh -- based on the issue and your fix I strongly suspected there was something deeper incorrect with the previous logic. I actually asked in the contributors channel if someone more familiar with the retry logic could check this PR because of that suspicion.
> Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

Ah is that what you were doing with your initial/current fix?
I honestly think we should change it all to work as expected (the black boxes) since the previous logic feels very confusing/unexpected (and has confused users many a time before, as well as contributors).
Big thanks for drawing the diagrams here, those are super helpful! We may want to add a similar `mermaid` flowchart of this to the docs as well.
If we rewrite this, we may want to release it as a breaking change similar to `retryStrategy` fixes from #11005 (different retries, but similar concept that they both behaved unexpectedly)
> If we rewrite this, we may want to release it as a breaking change similar to `retryStrategy` fixes from #11005 (different retries, but similar concept that they both behaved unexpectedly)

By the way, do you have any plans to rewrite this part of the logic? Right now, our business needs to be able to retry a single error node. The changes to this issue are to minimize the impact on existing API logic and to provide the ability to retry a single faulty node. Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃
> Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Should be happening shortly per below
Can you add two e2e tests please?
One that is a dag and one that is not.
Don't stress about invoking the server if there isn't already infrastructure for that kind of test (although I suspect you should be able to use REST).
Just add those two tests with comments linking them to this issue and PR.
…rgoproj#12543 Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com> Signed-off-by: mio4kon <mio4kon@sina.com>
…t be reset Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com> Signed-off-by: mio4kon <mio4kon@sina.com>
8bab59c to efd346b
Signed-off-by: mio4kon <mio4kon.dev@gmail.com>
9f08b9d to e0f1e9c
@isubasinghe add e2e tests :
Hi, I reviewed the logic and this modification may cause problems. The current logic of the
Hi, will this PR be updated again? I encountered a new problem: I want to retry a hanging node while the workflow is running.
Just an update on the retry logic: I had to rewrite it from scratch and will create a PR very soon (within the day).
For back-link reference, I mentioned this PR in #13692 (comment)
I assume Isitha's refactor will fix the diagram @mio4kon made above, which illustrates some of the broken logic
See #13734
Can nodes that did not fail also be retried? #13749
Fixes #12543
Motivation
Allow users to selectively retry specific failed nodes instead of retrying all failed nodes at once.
Modifications
Removed the restriction that required the simultaneous use of `--node-field-selector` and `--restart-successful`. Now, using `--node-field-selector` alone allows for individual retries of specific failed nodes, instead of retrying all failures.
Verification
`--node-field-selector` can be used independently.
./dist/argo retry fail-24ptx --node-field-selector name=fail-24ptx.BB -v
Regression check: the two flags still work when used in combination.
./dist/argo retry fail-mz9c4 --restart-successful --node-field-selector name=fail-mz9c4.A
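To illustrate what a selector such as `name=fail-24ptx.BB` is expected to pick out, here is a minimal Go sketch using only the standard library. It is a simplified stand-in and not the code in this PR: `parseEquality` and `selectByName` are hypothetical helpers, only the `key=value` form on the `name` field is handled, and the sibling node names `fail-24ptx.A` and `fail-24ptx.C` are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEquality splits a "key=value" selector string.
// Simplified stand-in: real field selectors support more operators.
func parseEquality(sel string) (string, string, error) {
	key, value, ok := strings.Cut(sel, "=")
	if !ok || key == "" {
		return "", "", fmt.Errorf("invalid selector %q: want key=value", sel)
	}
	return key, value, nil
}

// selectByName returns the node names that match a "name=<value>" selector.
// Any other field is rejected, since this sketch only models the name field.
func selectByName(names []string, sel string) ([]string, error) {
	key, value, err := parseEquality(sel)
	if err != nil {
		return nil, err
	}
	if key != "name" {
		return nil, fmt.Errorf("unsupported field %q in this sketch", key)
	}
	var matched []string
	for _, n := range names {
		if n == value {
			matched = append(matched, n)
		}
	}
	return matched, nil
}

func main() {
	// Hypothetical node names; only fail-24ptx.BB appears in the command above.
	names := []string{"fail-24ptx.A", "fail-24ptx.BB", "fail-24ptx.C"}
	matched, err := selectByName(names, "name=fail-24ptx.BB")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(matched)
}
```

The point of the sketch is only that an equality selector on `name` narrows the retry to a single node, which is the behavior the first verification command exercises.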