fix: Correct SIGTERM handling. Fixes #10518 #10337 #10033 #10490 #10523

Merged
merged 4 commits into from
Feb 23, 2023

Conversation

alexec
Contributor

@alexec alexec commented Feb 13, 2023

Signed-off-by: Alex Collins <alex_collins@intuit.com>

This MIGHT fix these:

Fixes #10518
Fixes #10337
Fixes #10033
Fixes #10490

Please do not open a pull request until you have checked ALL of these:

  • Create the PR as draft.
  • Run make pre-commit -B to fix codegen and lint problems.
  • Sign-off your commits (otherwise the DCO check will fail).
  • Use a conventional commit message (otherwise the commit message check will fail).
  • "Fixes #" is in both the PR title (for release notes) and this description (to automatically link and close the issue).
  • Add unit or e2e tests. Say how you tested your changes. If you changed the UI, attach screenshots.
  • GitHub checks are green.
  • Once required tests have passed, mark your PR "Ready for review".

If changes were requested, and you've made them, dismiss the review to get it reviewed again.

alexec and others added 4 commits February 12, 2023 13:34
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@sxllwx
Contributor

sxllwx commented Feb 13, 2023

FYI (😄): #10520 (comment)

@alexec
Contributor Author

alexec commented Feb 13, 2023

Could you test this out? Start the workflow controller with --executor-image quay.io/argoproj/argoexec:dev-sigterm (I've not tested that, so LMK if you have problems/find typos).

@sxllwx
Contributor

sxllwx commented Feb 13, 2023

> Could you test this out? Start the workflow controller with --executor-image quay.io/argoproj/argoexec:dev-sigterm (I've not tested that, so LMK if you have problems/find typos).

Ok~

@@ -16,7 +14,7 @@ func NewWaitCommand() *cobra.Command {
 		Use:   "wait",
 		Short: "wait for main container to finish and save artifacts",
 		Run: func(cmd *cobra.Command, args []string) {
-			ctx := context.Background()
+			ctx := cmd.Context()
Contributor

I suggest putting the code related to signal handling here. The current implementation affects all subcommands of argoexec.

Contributor Author

That’s intentional, as they’re currently not able to deal with SIGTERM. That said, each one needs to be updated to have this line in them.

Contributor

ok~

Contributor

This was followed up in #12544 (comment)

@alexec alexec marked this pull request as ready for review February 13, 2023 15:09
@alexec
Contributor Author

alexec commented Feb 16, 2023

@juliev0 @sarabala1979 v3.4 is broken for large artifacts. This PR should fix it.

@juliev0
Contributor

juliev0 commented Feb 16, 2023

> @juliev0 @sarabala1979 v3.4 is broken for large artifacts. This PR should fix it.

Just large artifacts? If this is related to the wait container failing with an exit code of 2, I was seeing that with my artifact GC e2e test (small artifacts). I can test your PR with that. Thanks for fixing it.

@alexec
Contributor Author

alexec commented Feb 16, 2023

Not exactly. Artifacts that take longer than the reconciliation loop to upload. They tend to be large.

@juliev0
Contributor

juliev0 commented Feb 16, 2023

Thanks for fixing this in your free time. It would be great generally to get a description of the change in the PR to help reviewers (like myself). :)

I gather from @sxllwx's comment here that the issue was that the wait container received a SIGTERM and called cancel() and then received another SIGTERM and failed with exit code 2?

So, your change is to allow the wait container to play everything out rather than looking for SIGTERM, and instead have the main container respond to the SIGTERM, passing the context into the container so it can prematurely terminate, right?

That does seem better.

@juliev0
Contributor

juliev0 commented Feb 16, 2023

> Not exactly. Artifacts that take longer than the reconciliation loop to upload. They tend to be large.

Well, I was definitely seeing premature exiting with exit code 2 intermittently with the Artifact GC test.

This comment seemed to indicate it tends to affect short running Workflows, which definitely applies in my case.

@juliev0
Contributor

juliev0 commented Feb 16, 2023

Okay, I've now run this on the ArtifactGC e2e test 10 times and the Workflows never failed with exit code 2. :)

@alexec
Contributor Author

alexec commented Feb 16, 2023

Does this need @sarabala1979 approval?

@JPZ13 JPZ13 mentioned this pull request Feb 23, 2023
3 tasks
Member

@sarabala1979 sarabala1979 left a comment

LGTM

@alexec alexec merged commit d75e37e into master Feb 23, 2023
@alexec alexec deleted the dev-sigterm branch February 23, 2023 23:31
terrytangyuan pushed a commit that referenced this pull request Mar 29, 2023
)

Signed-off-by: Alex Collins <alex_collins@intuit.com>
@yanxingponyai

yanxingponyai commented Apr 25, 2023

Hi @alexec, Thanks for your great work.
May I ask how this PR fixes #10033? I have read all the related comments, but I still do not understand.
