
[backend] "Signal command failed: command terminated with exit code 1" when terminating a pipeline #7361

Closed
emenendez opened this issue Feb 28, 2022 · 4 comments
Labels
area/backend kind/bug lifecycle/stale

Comments

@emenendez

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?: As part of a full Kubeflow 1.4 deployment on GKE.
  • KFP version: 1.7.0, packaged with Kubeflow 1.4
  • KFP SDK version: N/A

Steps to reproduce

  1. Create a new pipeline run.
  2. Click the "Terminate" link in the KFP web UI.
  3. Observe that the currently-running pod runs to completion before termination.

Expected result

In step 3 above, the currently-running pod should immediately terminate.

Materials and Reference

I have been able to determine the following so far:

  1. When the "Terminate" button is clicked, KFP adds activeDeadlineSeconds: 0 to the spec of the workflow being terminated. This is happening as expected.
  2. The Argo Worfklows controller notices this and attempts to kill the currently-running pod by executing sh -c kill -s USR2 $(pidof argoexec) in the main container of the running pod. This causes the Argo Workflows controller to log the following error:
time="2022-02-25T21:54:54.186Z" level=info msg="Applying sooner Workflow Deadline for pod emenendez-taxi-x5q66-345673846 at: 2022-02-25 21:53:08 +0000 UTC" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.186Z" level=info msg="Updating execution control of emenendez-taxi-x5q66-345673846: {\"deadline\":\"2022-02-25T21:53:08Z\"}" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.244Z" level=info msg="Patch pods 200"
time="2022-02-25T21:54:54.246Z" level=info msg="Signalling emenendez-taxi-x5q66-345673846 of updates" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.247Z" level=info msg="https://7.255.204.1:443/api/v1/namespaces/emenendez/pods/emenendez-taxi-x5q66-345673846/exec?command=sh&command=-c&command=kill+-s+USR2+%24%28pidof+argoexec%29&container=main&stderr=true&stdout=true&tty=false"
time="2022-02-25T21:54:54.292Z" level=info msg="Create pods/exec 101"
time="2022-02-25T21:54:54.390Z" level=warning msg="Signal command failed: command terminated with exit code 1" namespace=emenendez workflow=emenendez-taxi-x5q66

This appears to be a bug in the Argo Workflows controller: instead of executing sh -c kill -s USR2 $(pidof argoexec) in the main container, it should execute that command in the wait container, where argoexec actually runs (which is presumably why pidof finds nothing in main and the command exits with code 1).
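To make the container mix-up concrete, here is a minimal sketch of the equivalent exec call using the kubernetes Python client. It is an illustration of the mechanism, not the controller's Go code; the only difference between the buggy and fixed behaviour is the container argument:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

# argoexec runs in the "wait" sidecar, so signalling it there succeeds;
# running the same command in "main" finds no argoexec PID and exits 1.
output = stream(
    core.connect_get_namespaced_pod_exec,
    name="emenendez-taxi-x5q66-345673846",   # pod name from the logs above
    namespace="emenendez",
    container="wait",                        # the buggy code targeted "main"
    command=["sh", "-c", "kill -s USR2 $(pidof argoexec)"],
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
)
print(output)
```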

It appears this bug was introduced in argoproj/argo-workflows#5099, and fixed as part of an unrelated refactor in argoproj/argo-workflows#6022, which is included in Argo Workflows 3.2.0.

The specific buggy code is this block, which iterates over the main containers in a pod and sends the USR2 signal there.

Interestingly enough, I could not find an issue related to this bug in the Argo Workflows project.

Questions:

  1. Are there any workarounds or fixes for the inability to terminate a running pipeline pod, other than upgrading Argo Workflows?
  2. If not, could Argo Workflows be updated to a version with the fix?

Thanks so much!


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@zijianjoy
Collaborator

/assign @chensun

@mbaijal
Contributor

mbaijal commented Jun 14, 2022

@chensun @emenendez

Has this issue been fixed already? I am seeing a similar issue on Kubeflow 1.5, though the error logs do not show a Signal command failed error. Instead, the wait container seems to send the right signal to the main container, but the main container does not catch it. According to the Argo documentation, this is possibly because the process is not running as PID 1. Is there a workaround/fix for this?
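For illustration, this is roughly the kind of handler the main process would need in order to react to the termination signal. It is only a minimal sketch: the signal names are assumptions (the exact signal depends on the Argo version and executor configuration), and it only helps if the step's process really is PID 1 in the container, e.g. because the image entrypoint execs it directly.

```python
import signal
import sys
import time

def _handle_termination(signum, frame):
    # Clean up and exit promptly when Argo signals the main container.
    print(f"received signal {signum}, exiting", flush=True)
    sys.exit(1)

# SIGTERM is the usual default; treat these as assumptions for your setup.
signal.signal(signal.SIGTERM, _handle_termination)
signal.signal(signal.SIGINT, _handle_termination)

# Stand-in for the pipeline step's real work.
while True:
    time.sleep(1)
```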


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale label May 22, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
