fix: DAG with continueOn in error after retry. Fixes: #11395 #12817
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Could we add something in util_test.go?
I feel we rely too much on e2e, would this be an issue?
Remember to test whether the correct node is retried; the current test does not cover that.
I added a unit test in util_test.go that checks the correct node is retried. Thanks!
Force-pushed from 0cde322 to 5ca8b69.
Hi @tczhao, can you take another look? Thanks!
👍
Force-pushed from 5ca8b69 to 05025d2.
Force-pushed from 05025d2 to ca25e6a.
Force-pushed from 47ff6c6 to 028fb1b.
Force-pushed from 810d0c5 to 31b2889.
Thanks! Modified it!
Force-pushed from 31b2889 to af31b8a.
Force-pushed from 95f5c32 to 55049ca.
Force-pushed from 55049ca to 95f5c32.
lgtm
Nice tests. Thanks!
just looked this over, small comment.
@shuangkun you might want to look at #12156. I feel like there's potentially a deeper root cause here with the retry logic being buggy -- see also #12553 (comment)
also wanted to say thanks to @tczhao for the initial review -- you might've noticed earlier but I thumbs-up'd pretty much all of your comments 🙂
OK, I will have a look.
This caused the regression in #13003, so I suggest not backporting it until that is resolved. I feel that is a bigger regression than the bug this fixes.
I really think the root cause I mentioned above (#12817 (review)) needs a deep dive. A refactor of the manual retry logic is probably needed to correct all of the issues.
There is an issue with this modification: a failed/error node that has successful child nodes is retained, so that failed/error node can no longer be retried.
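To make the concern above concrete, here is a minimal sketch of one possible retry-selection policy in which a failed or errored node is always reset, even when its children succeeded (as can happen with continueOn). The `Node` type, the `Phase` constants, and `nodesToReset` are hypothetical names invented for illustration; this is not the project's actual retry code, and resetting all descendants is only one assumed policy.

```go
package main

import "sort"

// Phase is a simplified, hypothetical stand-in for a workflow node phase.
type Phase string

const (
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
	Errored   Phase = "Error"
)

// Node is a hypothetical, minimal model of a workflow node.
type Node struct {
	Phase    Phase
	Children []string
}

// nodesToReset returns the IDs of every failed or errored node plus all of
// its descendants, sorted for deterministic output. The phase of a child
// never causes its failed parent to be retained.
func nodesToReset(nodes map[string]Node) []string {
	reset := map[string]bool{}

	// visit marks a node and, recursively, everything below it.
	var visit func(id string)
	visit = func(id string) {
		if reset[id] {
			return
		}
		reset[id] = true
		for _, child := range nodes[id].Children {
			visit(child)
		}
	}

	for id, n := range nodes {
		if n.Phase == Failed || n.Phase == Errored {
			visit(id)
		}
	}

	ids := make([]string, 0, len(reset))
	for id := range reset {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}
```

With a failed node `A` whose child `B` succeeded and an unrelated succeeded node `C`, this policy selects `A` and `B` for reset and leaves `C` alone, so `A` stays retryable.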
fix: DAG with continueOn in error after retry
Fixes: #11395
Motivation
Modifications
Verification
Local tests and e2e tests.
After retry:
Before fix: the workflow ends in Error and loses some nodes.
After fix: the workflow ends in Failed and no nodes are lost.