
[backend] "Signal command failed: command terminated with exit code 1" when terminating a pipeline #7361

Closed
emenendez opened this issue Feb 28, 2022 · 4 comments
Labels
area/backend kind/bug lifecycle/stale

Comments

@emenendez

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?: As part of a full Kubeflow 1.4 deployment on GKE.
  • KFP version: 1.7.0, packaged with Kubeflow 1.4
  • KFP SDK version: N/A

Steps to reproduce

  1. Create a new pipeline run.
  2. Click the "Terminate" link in the KFP web UI.
  3. Observe that the currently-running pod runs to completion before termination.

Expected result

In step 3 above, the currently-running pod should immediately terminate.

Materials and Reference

I have been able to determine the following so far:

  1. When the "Terminate" button is clicked, KFP adds activeDeadlineSeconds: 0 to the spec of the workflow being terminated. This is happening as expected.
  2. The Argo Worfklows controller notices this and attempts to kill the currently-running pod by executing sh -c kill -s USR2 $(pidof argoexec) in the main container of the running pod. This causes the Argo Workflows controller to log the following error:
time="2022-02-25T21:54:54.186Z" level=info msg="Applying sooner Workflow Deadline for pod emenendez-taxi-x5q66-345673846 at: 2022-02-25 21:53:08 +0000 UTC" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.186Z" level=info msg="Updating execution control of emenendez-taxi-x5q66-345673846: {\"deadline\":\"2022-02-25T21:53:08Z\"}" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.244Z" level=info msg="Patch pods 200"
time="2022-02-25T21:54:54.246Z" level=info msg="Signalling emenendez-taxi-x5q66-345673846 of updates" namespace=emenendez workflow=emenendez-taxi-x5q66
time="2022-02-25T21:54:54.247Z" level=info msg="https://7.255.204.1:443/api/v1/namespaces/emenendez/pods/emenendez-taxi-x5q66-345673846/exec?command=sh&command=-c&command=kill+-s+USR2+%24%28pidof+argoexec%29&container=main&stderr=true&stdout=true&tty=false"
time="2022-02-25T21:54:54.292Z" level=info msg="Create pods/exec 101"
time="2022-02-25T21:54:54.390Z" level=warning msg="Signal command failed: command terminated with exit code 1" namespace=emenendez workflow=emenendez-taxi-x5q66

This appears to be a bug in the Argo Workflows controller: instead of executing sh -c kill -s USR2 $(pidof argoexec) in the main container, it should execute that command in the wait container, where argoexec actually runs (which is presumably why pidof finds nothing in main and the command exits with code 1).
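To make the container mix-up concrete, here is a minimal sketch of the equivalent exec call using the kubernetes Python client. It is an illustration of the mechanism, not the controller's Go code; the only difference between the buggy and fixed behaviour is the container argument:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

# argoexec runs in the "wait" sidecar, so signalling it there succeeds;
# running the same command in "main" finds no argoexec PID and exits 1.
output = stream(
    core.connect_get_namespaced_pod_exec,
    name="emenendez-taxi-x5q66-345673846",   # pod name from the logs above
    namespace="emenendez",
    container="wait",                        # the buggy code targeted "main"
    command=["sh", "-c", "kill -s USR2 $(pidof argoexec)"],
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
)
print(output)
```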

It appears this bug was introduced in argoproj/argo-workflows#5099, and fixed as part of an unrelated refactor in argoproj/argo-workflows#6022, which is included in Argo Workflows 3.2.0.

The specific buggy code is this block, which iterates over the main containers in a pod and sends the USR2 signal there.

Interestingly enough, I could not find an issue related to this bug in the Argo Workflows project.

Questions:

  1. Are there any workarounds or fixes for the inability to terminate a running pipeline pod, other than upgrading Argo Workflows?
  2. If not, could Argo Workflows be updated to a version with the fix?

Thanks so much!


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@zijianjoy
Collaborator

/assign @chensun

@mbaijal
Contributor

mbaijal commented Jun 14, 2022

@chensun @emenendez

Has this issue been fixed already? I am seeing a similar issue on Kubeflow 1.5, though the error logs do not show a Signal command failed error. Instead, the wait container seems to send the right signal to the main container, but the main container does not catch it. According to the Argo documentation, this is possibly because the process is not running as PID 1. Is there a workaround/fix for this?
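For illustration, this is roughly the kind of handler the main process would need in order to react to the termination signal. It is only a minimal sketch: the signal names are assumptions (the exact signal depends on the Argo version and executor configuration), and it only helps if the step's process really is PID 1 in the container, e.g. because the image entrypoint execs it directly.

```python
import signal
import sys
import time

def _handle_termination(signum, frame):
    # Clean up and exit promptly when Argo signals the main container.
    print(f"received signal {signum}, exiting", flush=True)
    sys.exit(1)

# SIGTERM is the usual default; treat these as assumptions for your setup.
signal.signal(signal.SIGTERM, _handle_termination)
signal.signal(signal.SIGINT, _handle_termination)

# Stand-in for the pipeline step's real work.
while True:
    time.sleep(1)
```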


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale label May 22, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
