fix: Allow users to selectively retry specific failed nodes. Fixes #12543 #12553

mio4kon · 2024-01-19T15:33:22Z

Motivation

Allow users to selectively retry specific failed nodes instead of retrying all failed nodes at once.

Modifications

Removed the restriction that required the simultaneous use of --node-field-selector and --restart-successful. Now, using --node-field-selector alone allows for individual retries of specific failed nodes, instead of retrying all failures.

Verification

--node-field-selector can be used independently.
./dist/argo retry fail-24ptx --node-field-selector name=fail-24ptx.BB -v

Regression check: it still works in combination with --restart-successful.
./dist/argo retry fail-mz9c4 --restart-successful --node-field-selector name=fail-mz9c4.A

agilgur5

Something feels off here -- nodeIdsToReset should be different if --restart-successful is set or not. This is making it unconditionally restart all successful nodes now. This might need more semantic refactoring
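To make the concern concrete, here is a minimal sketch of how the set of nodes to reset could depend on the flag. The `Node` and `NodePhase` types and the `nodeIDsToReset` helper are simplified stand-ins invented for illustration; they are not the actual types or functions in workflow/util/util.go.

```go
package main

import "fmt"

// NodePhase and Node are hypothetical, simplified stand-ins for the
// workflow node status types; they are not the real Argo Workflows API.
type NodePhase string

const (
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
)

type Node struct {
	ID    string
	Name  string
	Phase NodePhase
}

// nodeIDsToReset returns the IDs of nodes accepted by the matches predicate.
// Succeeded nodes are only included when restartSuccessful is true, so the
// result differs depending on whether the flag is set.
func nodeIDsToReset(nodes []Node, matches func(Node) bool, restartSuccessful bool) []string {
	var ids []string
	for _, n := range nodes {
		if !matches(n) {
			continue
		}
		if !restartSuccessful && n.Phase == NodeSucceeded {
			continue
		}
		ids = append(ids, n.ID)
	}
	return ids
}

func main() {
	nodes := []Node{
		{ID: "a", Name: "wf.A", Phase: NodeSucceeded},
		{ID: "b", Name: "wf.B", Phase: NodeFailed},
	}
	all := func(Node) bool { return true }
	fmt.Println(nodeIDsToReset(nodes, all, false)) // prints [b]
	fmt.Println(nodeIDsToReset(nodes, all, true))  // prints [a b]
}
```

With the flag unset, only the failed node is reset even though the predicate matches both; with the flag set, the succeeded node is reset as well.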

workflow/util/util_test.go

workflow/util/util.go

agilgur5 · 2024-01-22T07:24:38Z

feat: [...]

Please re-title this PR as a fix:, since per #12543 (comment) this very much seems like a bug and not intended behavior

mio4kon · 2024-01-22T09:06:54Z

feat: [...]

Please re-title this PR as a fix:, since per #12543 (comment) this very much seems like a bug and not intended behavior

done

mio4kon · 2024-02-04T04:42:19Z

@agilgur5 Hello, will this PR be merged into the main branch in the future?

agilgur5 · 2024-02-08T23:17:26Z

workflow/util/util.go

 	selector, err := fields.ParseSelector(nodeFieldSelector)
 	if err != nil {
 		return nil, err
 	} else {
 		for _, node := range nodes {
 			if SelectorMatchesNode(selector, node) {
+				if !restartSuccessful && node.Phase == wfv1.NodeSucceeded {


I mentioned this before, but the latter condition would be better as part of the selector if possible
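One possible reading of this suggestion, sketched under assumptions: instead of checking the phase inside the loop, the phase condition is composed into the selector itself, so the loop only asks whether a node matches. The `Selector` type and the `byName`, `notSucceeded`, `and`, and `effectiveSelector` helpers below are hypothetical and written for illustration only; the real code uses `fields.Selector` and `SelectorMatchesNode`, whose behavior is not reproduced here.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a workflow node status.
type Node struct {
	Name  string
	Phase string
}

// Selector is an illustrative predicate type, not the fields.Selector
// interface used by the real code.
type Selector func(Node) bool

// byName matches nodes with exactly the given name.
func byName(name string) Selector {
	return func(n Node) bool { return n.Name == name }
}

// notSucceeded matches every node whose phase is not "Succeeded".
func notSucceeded() Selector {
	return func(n Node) bool { return n.Phase != "Succeeded" }
}

// and matches only when every given selector matches.
func and(sels ...Selector) Selector {
	return func(n Node) bool {
		for _, s := range sels {
			if !s(n) {
				return false
			}
		}
		return true
	}
}

// effectiveSelector folds the restartSuccessful rule into the selector:
// without the flag, succeeded nodes never match.
func effectiveSelector(user Selector, restartSuccessful bool) Selector {
	if restartSuccessful {
		return user
	}
	return and(user, notSucceeded())
}

func main() {
	sel := effectiveSelector(byName("wf.A"), false)
	fmt.Println(sel(Node{Name: "wf.A", Phase: "Succeeded"})) // prints false
	fmt.Println(sel(Node{Name: "wf.A", Phase: "Failed"}))    // prints true
}
```

The design point is that the caller's loop stays a single match check, and the flag's meaning lives in one place.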

@agilgur5 Hello. I want to confirm whether restartSuccessful takes priority over the selector. If a successful node is selected but restartSuccessful is not specified, should the successful node still be re-executed? If it is, the restartSuccessful parameter doesn't seem very meaningful. Or should the logic follow the diagram below?

The diagram looks correct to me

The diagram looks correct to me

@agilgur5 However, the diagram above conflicts somewhat with the previous logic, which may break the existing e2e tests, for example TestFormulateRetryWorkflow/Nested_DAG_with_Non-group_Node_Selected. Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

Ahhhh -- based on the issue and your fix I strongly suspected there was something deeper incorrect with the previous logic. I actually asked in the contributors channel if someone more familiar with the retry logic could check this PR because of that suspicion.

Do you think it's better to fix only the previous red logic in the figure below and keep the previous blue logic?

Ah is that what you were doing with your initial/current fix?
I honestly think we should change it all to work as expected (the black boxes) since the previous logic feels very confusing/unexpected (and has confused users many a time before, as well as contributors).

Big thanks for drawing the diagrams here, those are super helpful! We may want to add a similar mermaid flowchart of this to the docs as well

If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005

(different retries, but similar concept that they both behaved unexpectedly)

If we rewrite this, we may want to release it as a breaking change similar to retryStrategy fixes from #11005

(different retries, but similar concept that they both behaved unexpectedly)

By the way, do you have any plans to rewrite this part of the logic? Right now, our business needs to be able to retry a single failed node. The changes in this PR aim to minimize the impact on the existing API logic while providing the ability to retry a single failed node. Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Of course, a broader refactoring might provide a more elegant solution. Do you know if the official team has any plans in this regard? 😃

Should be happening shortly per below

isubasinghe

Can you add two e2e tests please?
One that is a dag and one that is not.

Don't stress about invoking the server if there isn't infrastructure for those kind of tests (although I suspect you should be able to use REST) already.

Just add those two tests with comments linking them to this issue and PR.

mio4kon · 2024-02-17T12:31:06Z

Can you add two e2e tests please? One that is a dag and one that is not.

Don't stress about invoking the server if there isn't infrastructure for those kind of tests (although I suspect you should be able to use REST) already.

Just add those two tests with comments linking them to this issue and PR.

@isubasinghe I added two e2e tests: TestRetryWorkflowWithStepsWithSelectedFailNodes and TestRetryWorkflowWithDAGWithSelectedFailNodes. Please take a look at workflow/util/util_test.go.

JasonChen86899 · 2024-06-04T05:56:21Z

Hi, I reviewed the logic, and this modification may cause problems. The current logic on the main branch does not retry an errored node that has successful child nodes, so this modification will result in some errors.

shuangkun · 2024-08-29T06:04:49Z

Hi, will this PR be updated again? I ran into a new problem: I want to retry a hanging node while the workflow is still running.

isubasinghe · 2024-10-10T04:51:34Z

Just an update on the retry logic. I had to rewrite it from scratch, will create a PR very soon (within the day).
Closing this as a result, the current retry logic is very broken.

agilgur5 · 2024-10-10T05:26:11Z

For back-link reference, I mentioned this PR in #13692 (comment)

Closing this as a result, the current retry logic is very broken.

I assume Isitha's refactor will fix the diagram @mio4kon made above, which illustrates some of the broken logic

agilgur5 · 2024-10-10T13:53:23Z

I had to rewrite it from scratch, will create a PR very soon (within the day).

See #13734

shuangkun · 2024-10-13T05:56:41Z

Can nodes that did not fail also be retried? #13749

mio4kon mentioned this pull request Jan 19, 2024

Retrying specific failed node does not work #12543

Open

mio4kon force-pushed the support-retry-part-faiil branch from 2d45879 to e3fef71 Compare January 19, 2024 15:43

agilgur5 mentioned this pull request Jan 19, 2024

fix: Allow users to selectively retry specific failed nodes. Fixes #12543 #12550

Closed

agilgur5 self-assigned this Jan 19, 2024

agilgur5 added the area/cli (the `argo` CLI) and area/retry-manual (manual workflow "Retry" action via API/CLI/UI; see retryStrategy for template-level retries) labels Jan 19, 2024

agilgur5 reviewed Jan 19, 2024

workflow/util/util_test.go (outdated)

workflow/util/util.go

workflow/util/util.go (outdated)

mio4kon force-pushed the support-retry-part-faiil branch 2 times, most recently from 74c6717 to dc5048e Compare January 20, 2024 15:24

agilgur5 reviewed Jan 22, 2024

workflow/util/util.go (outdated)

mio4kon changed the title ~~feat: Allow users to selectively retry specific failed nodes . Fixes #12543~~ fix: Allow users to selectively retry specific failed nodes . Fixes #12543 Jan 22, 2024

agilgur5 reviewed Feb 8, 2024

isubasinghe requested changes Feb 11, 2024

mio4kon and others added 7 commits February 17, 2024 15:50

feat: Allow users to selectively retry specific failed nodes . Fixes a…

fbf2ce0

…rgoproj#12543 Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

fix: add ut. Fixes argoproj#12550

0bdfc7d

Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

fix: ut error

1cc48c8

Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

fix: ut error

c60c808

Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

Apply suggestions from code review

57f2dfa

Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com> Signed-off-by: mio4kon <mio4kon@sina.com>

feat: restartSuccessful -- when it's false, successful nodes shouldn'…

c9f2b1a

…t be reset Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

Update workflow/util/util.go

efd346b

Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com> Signed-off-by: mio4kon <mio4kon@sina.com>

mio4kon force-pushed the support-retry-part-faiil branch from 8bab59c to efd346b Compare February 17, 2024 07:51

mio4kon and others added 2 commits February 17, 2024 16:09

Merge branch 'argoproj:main' into support-retry-part-faiil

42b0c80

fix: add e2d tests

e0f1e9c

Signed-off-by: mio4kon <mio4kon.dev@gmail.com>

mio4kon force-pushed the support-retry-part-faiil branch from 9f08b9d to e0f1e9c Compare February 17, 2024 12:28

mio4kon requested a review from isubasinghe February 18, 2024 03:44

agilgur5 mentioned this pull request Apr 12, 2024

fix: DAG with continueOn in error after retry. Fixes: #11395 #12817

Merged

agilgur5 removed the area/cli (the `argo` CLI) label Apr 12, 2024

This was referenced Oct 10, 2024

Deprecate retrying of failed nodes by default. #13692

Open

feat(ui): Retry a single workflow step manually #13343

Merged

isubasinghe closed this Oct 10, 2024

agilgur5 added the solution/superseded (this PR or issue has been superseded by another one; slightly different from a duplicate) label Oct 10, 2024

agilgur5 changed the title ~~fix: Allow users to selectively retry specific failed nodes . Fixes #12543~~ fix: Allow users to selectively retry specific failed nodes. Fixes #12543 Oct 10, 2024
