-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Correct SIGTERM handling. Fixes #10518 #10337 #10033 #10490 #10523
Conversation
FIY (😄): #10520 (comment) |
Could you test this out? Start the workflow controller with |
Ok~ |
@@ -16,7 +14,7 @@ func NewWaitCommand() *cobra.Command { | |||
Use: "wait", | |||
Short: "wait for main container to finish and save artifacts", | |||
Run: func(cmd *cobra.Command, args []string) { | |||
ctx := context.Background() | |||
ctx := cmd.Context() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest putting the code related to signal handling here. The current implementation affects all subcommands of argoexec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s intentional, as they’re currently not able to deal with SIGTERM. That said, each one needs to be updated to have this line in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data
was followed up in #12544 (comment)
@juliev0 @sarabala1979 v3.4 is broken for large artifacts. This PR should fix it. |
Just large artifacts? If this is related to the wait container failing with an exit code of 2, I was seeing that with my artifact GC e2e test (small artifacts). I can test your PR with that. Thanks for fixing it. |
Not exactly. Artifacts that take a longer than the reconciliation loop to upload. They tend to be large. |
Thanks for fixing this in your free time. It would be great generally to get a description of the change in the PR to help reviewers (like myself). :) I gather from @sxllwx 's comment here that the issue was that the wait container received a SIGTERM and called cancel() and then received another SIGTERM and failed with exit code 2? So, your change is to allow the wait container to play everything out rather than looking for SIGTERM, and instead have the main container respond to the SIGTERM, passing the context into the container so it can prematurely terminate, right? That does seem better. |
Well, I was definitely seeing premature exiting with exit code 2 intermittently with the Artifact GC test. This comment seemed to indicate it tends to affect short running Workflows, which definitely applies in my case. |
Okay, I've now run this on the ArtifactGC e2e test 10 times and the Workflows never failed with exit code 2. :) |
Does this need @sarabala1979 approval? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Alex Collins alex_collins@intuit.com
This MIGHT fix these:
Fixes #10518
Fixes #10337
Fixes #10033
Fixes #10490
Please do not open a pull request until you have checked ALL of these:
make pre-commit -B
to fix codegen and lint problems.If changes were requested, and you've made them, dismiss the review to get it reviewed again.