Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: DAG with continueOn in error after retry. Fixes: #11395 #12817

Merged
merged 11 commits into from
Apr 12, 2024

Conversation

shuangkun
Copy link
Member

@shuangkun shuangkun commented Mar 18, 2024

fix: DAG with continueOn in error after retry

Fixes: #11395

Motivation

Modifications

Verification

local test and e2e test.

After retry:

Before fix: Error and lose some nodes.
image
After fix: Failed and not lose nodes
image

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun marked this pull request as draft March 18, 2024 13:11
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Mar 18, 2024
@shuangkun shuangkun marked this pull request as ready for review March 18, 2024 14:33
Copy link
Member

@tczhao tczhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add something in util_test.go?
I feel we rely too much on e2e, would this be an issue?
Remember to test if the correct node is retried, it's not included in the current test

workflow/util/util.go Outdated Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
@tczhao tczhao self-assigned this Mar 19, 2024
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun
Copy link
Member Author

Could we add something in util_test.go? I feel we rely too much on e2e, would this be an issue? Remember to test if the correct node is retried, it's not included in the current test

I add a ut in util_test.go and test correct node retried. thanks!

@shuangkun shuangkun requested a review from tczhao March 26, 2024 06:34
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun
Copy link
Member Author

Hi, @tczhao can you take a look again? Thanks!

Copy link
Member

@tczhao tczhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

test/e2e/argo_server_test.go Outdated Show resolved Hide resolved
test/e2e/argo_server_test.go Outdated Show resolved Hide resolved
test/e2e/argo_server_test.go Outdated Show resolved Hide resolved
test/e2e/testdata/retry-workflow-with-continueon.yaml Outdated Show resolved Hide resolved
test/e2e/testdata/retry-workflow-with-continueon.yaml Outdated Show resolved Hide resolved
workflow/util/util_test.go Outdated Show resolved Hide resolved
workflow/util/util_test.go Outdated Show resolved Hide resolved
Signed-off-by: shuangkun <tsk2013uestc@163.com>
test/e2e/argo_server_test.go Outdated Show resolved Hide resolved
workflow/util/util_test.go Outdated Show resolved Hide resolved
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/continueOnRetry branch 2 times, most recently from 47ff6c6 to 028fb1b Compare April 6, 2024 17:43
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/continueOnRetry branch 2 times, most recently from 810d0c5 to 31b2889 Compare April 6, 2024 17:57
@shuangkun
Copy link
Member Author

Thanks! Modified it!

@shuangkun shuangkun requested a review from tczhao April 6, 2024 17:58
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Copy link
Member

@tczhao tczhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests. Thanks!

@terrytangyuan terrytangyuan merged commit 2eb2415 into argoproj:main Apr 12, 2024
32 of 39 checks passed
Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just looked this over, small comment.

@shuangkun you might want to look at #12156. I feel like there's potentially a deeper root cause here with the retry logic being buggy -- see also #12553 (comment)

also wanted to say thanks to @tczhao for the initial review -- you might've noticed earlier but I thumbs-up'd pretty much all of your comments 🙂

workflow/util/util.go Show resolved Hide resolved
@shuangkun
Copy link
Member Author

shuangkun commented Apr 12, 2024

@shuangkun you might want to look at #12156. I feel like there's potentially a deeper root cause here with the retry logic being buggy -- see also #12553 (comment)

OK,I will have a look.

@tczhao
Copy link
Member

tczhao commented Apr 12, 2024

Thank you for the warm words @agilgur5 ,
These are great thinkings. I will comment in the other PR #12931

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
@agilgur5
Copy link
Contributor

agilgur5 commented Apr 19, 2024

Backported cleanly into release-3.5 as 30f2e0d

We may want to backport this to release-3.4 as well, since #11395 was filed with 3.4.x

agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit 2eb2415)
@Joibel
Copy link
Member

Joibel commented May 2, 2024

This has caused the regression in #13003, so I suggest not backporting until that is resolved because I feel that is a bigger regression than the thing this fixes.

@agilgur5
Copy link
Contributor

agilgur5 commented May 2, 2024

I really think the root cause I mentioned above #12817 (review) needs a deep dive. There's probably a refactor needed for the manual retry logic to correct all of the issues

@JasonChen86899
Copy link

JasonChen86899 commented May 13, 2024

I really think the root cause I mentioned above #12817 (review) needs a deep dive. There's probably a refactor needed for the manual retry logic to correct all of the issues

There is an issue with this modification: the failed/error node was retained (with successful child nodes), causing this failed/error node to be unable to retry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/templates/dag
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DAG with continueOn in error after retry
6 participants