Workflow can't be stopped/terminated when wait container was terminated while main container is still running. #11247

Closed
2 of 3 tasks
LiuYuuChenWorkspace opened this issue Jun 21, 2023 · 5 comments
Labels
area/executor; problem/more information needed (Not enough information has been provided to diagnose this issue.); problem/stale (This has not had a response in some time); type/support (User support issue - likely not a bug)

Comments

@LiuYuuChenWorkspace

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

The workflow can't be stopped or terminated once the wait container has terminated while the main container is still running. The Workflow spec and the affected pod's container states are below.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: gpu-evaluate-multi-step-49f294a6-b6b
  ...
  ...
spec:
  ...
  imagePullSecrets:
  - name: cloud-dragon-pull-secret
  nodeSelector:
    group: group-evaluate
  onExit: exit-handler
  podGC:
    strategy: OnPodCompletion
  podMetadata:
    annotations:
      multicluster.admiralty.io/elect: ""
      multicluster.admiralty.io/no-reservation: ""
  podPriorityClassName: free-check-privilege
  shutdown: Terminate
  ...
----------
apiVersion: v1
kind: Pod
metadata:
  name: gpu-evaluate-multi-step-49f294a6-b6b-849349960
...
spec:
  containers:
    wait:
      ...
      Command:
        argoexec
        wait
        --loglevel
        info
      State:          Terminated
        Reason:       Completed
        Message:      Step terminated
        Exit Code:    0
        Started:      Fri, 16 Jun 2023 19:29:05 +0800
        Finished:     Sun, 18 Jun 2023 10:27:16 +0800
      Ready:          False
     ...
    main:
      ...
      State:          Running
        Started:      Fri, 16 Jun 2023 19:29:06 +0800
      Ready:          True
      ...

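To spot other pods in the same state, a check like the following might help. It is only a sketch: the function name `wait_exited_before_main` is made up for illustration, and `stuck_pod` is a hand-written stand-in for the JSON that `kubectl get pod <name> -o json` returns, reduced to the fields shown in the description above.

```python
from typing import Any, Dict


def wait_exited_before_main(pod: Dict[str, Any]) -> bool:
    """Return True when the wait container has terminated but main is still running."""
    # Map each container name to its state object (running / terminated / waiting).
    states = {
        cs["name"]: cs.get("state", {})
        for cs in pod.get("status", {}).get("containerStatuses", [])
    }
    wait_done = "terminated" in states.get("wait", {})
    main_running = "running" in states.get("main", {})
    return wait_done and main_running


# Hand-written stand-in for the pod JSON, mirroring the container states above.
# 19:29:06 +0800 is 11:29:06 UTC.
stuck_pod = {
    "status": {
        "containerStatuses": [
            {"name": "wait", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
            {"name": "main", "state": {"running": {"startedAt": "2023-06-16T11:29:06Z"}}},
        ]
    }
}

print(wait_exited_before_main(stuck_pod))
```

The check only reads container states, so it says nothing about why the main container survived the kill.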
Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: gpu-evaluate-multi-step-49f294a6-b6b
  ...
  ...
spec:
  ...
  imagePullSecrets:
  - name: cloud-dragon-pull-secret
  nodeSelector:
    group: group-evaluate
  onExit: exit-handler
  podGC:
    strategy: OnPodCompletion
  podMetadata:
    annotations:
      multicluster.admiralty.io/elect: ""
      multicluster.admiralty.io/no-reservation: ""
  podPriorityClassName: free-check-privilege
  shutdown: Terminate
  ...

Logs from the workflow controller

kubectl logs -n argo workflow-controller-5b65ff6b84-l99z6 |grep gpu-evaluate-multi-step-49f294a6-b6b


time="2023-06-21T06:22:58.572Z" level=info msg="TaskSet Reconciliation" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:22:58.572Z" level=info msg=reconcileAgentPod namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:39:38.552Z" level=info msg="cleaning up pod" action=shutdownPod key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T06:39:38.553Z" level=info msg="https://10.233.0.1:443/api/v1/namespaces/evaluate/pods/gpu-evaluate-multi-step-49f294a6-b6b-849349960/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"
time="2023-06-21T06:39:38.579Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=evaluate pod=gpu-evaluate-multi-step-49f294a6-b6b-849349960 stderr="<nil>" stdout="<nil>"
time="2023-06-21T06:39:38.579Z" level=warning msg="failed to clean-up pod" action=shutdownPod error="unable to upgrade connection: container not found (\"wait\")" key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T06:42:58.545Z" level=info msg="Processing workflow" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:42:58.557Z" level=info msg="Shutting down pod gpu-evaluate-multi-step-49f294a6-b6b-849349960" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:42:58.578Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3503623152 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:42:58.578Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3005692030 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:42:58.578Z" level=info msg="TaskSet Reconciliation" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:42:58.578Z" level=info msg=reconcileAgentPod namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T06:59:38.557Z" level=info msg="cleaning up pod" action=shutdownPod key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T06:59:38.558Z" level=info msg="https://10.233.0.1:443/api/v1/namespaces/evaluate/pods/gpu-evaluate-multi-step-49f294a6-b6b-849349960/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"
time="2023-06-21T06:59:38.584Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=evaluate pod=gpu-evaluate-multi-step-49f294a6-b6b-849349960 stderr="<nil>" stdout="<nil>"
time="2023-06-21T06:59:38.584Z" level=warning msg="failed to clean-up pod" action=shutdownPod error="unable to upgrade connection: container not found (\"wait\")" key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T07:02:58.549Z" level=info msg="Processing workflow" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:02:58.559Z" level=info msg="Shutting down pod gpu-evaluate-multi-step-49f294a6-b6b-849349960" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:02:58.581Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3503623152 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:02:58.582Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3005692030 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:02:58.582Z" level=info msg="TaskSet Reconciliation" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:02:58.582Z" level=info msg=reconcileAgentPod namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:19:38.560Z" level=info msg="cleaning up pod" action=shutdownPod key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T07:19:38.560Z" level=info msg="https://10.233.0.1:443/api/v1/namespaces/evaluate/pods/gpu-evaluate-multi-step-49f294a6-b6b-849349960/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"
time="2023-06-21T07:19:38.586Z" level=info msg="signaled container" container=wait error="unable to upgrade connection: container not found (\"wait\")" namespace=evaluate pod=gpu-evaluate-multi-step-49f294a6-b6b-849349960 stderr="<nil>" stdout="<nil>"
time="2023-06-21T07:19:38.586Z" level=warning msg="failed to clean-up pod" action=shutdownPod error="unable to upgrade connection: container not found (\"wait\")" key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod
time="2023-06-21T07:22:58.556Z" level=info msg="Processing workflow" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:22:58.570Z" level=info msg="Shutting down pod gpu-evaluate-multi-step-49f294a6-b6b-849349960" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:22:58.593Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3503623152 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:22:58.593Z" level=info msg="Workflow step group node gpu-evaluate-multi-step-49f294a6-b6b-3005692030 not yet completed" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:22:58.593Z" level=info msg="TaskSet Reconciliation" namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
time="2023-06-21T07:22:58.593Z" level=info msg=reconcileAgentPod namespace=evaluate workflow=gpu-evaluate-multi-step-49f294a6-b6b
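For readability, the exec URL and the warning from the controller log above can be decoded with a few lines of Python. This is a sketch using only the standard library; `parse_logfmt` is a helper written for this note, not part of Argo, and the two sample lines are copied from the log above.

```python
import shlex
from urllib.parse import parse_qs, urlsplit


def parse_logfmt(line: str) -> dict:
    """Split a logfmt line into a dict; shlex handles the quoted, escaped values."""
    fields = {}
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


# Two lines copied from the controller log above.
url_line = 'time="2023-06-21T06:39:38.553Z" level=info msg="https://10.233.0.1:443/api/v1/namespaces/evaluate/pods/gpu-evaluate-multi-step-49f294a6-b6b-849349960/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"'
warn_line = r'time="2023-06-21T06:39:38.579Z" level=warning msg="failed to clean-up pod" action=shutdownPod error="unable to upgrade connection: container not found (\"wait\")" key=evaluate/gpu-evaluate-multi-step-49f294a6-b6b-849349960/shutdownPod'

# The msg field of the first line is the exec URL; its query string carries the command.
query = parse_qs(urlsplit(parse_logfmt(url_line)["msg"]).query)
print(query["container"], query["command"])
print(parse_logfmt(warn_line)["error"])
```

Decoded, the controller is trying to run `/bin/sh -c 'kill -15 $(pidof argoexec)'` inside the `wait` container, and the exec fails with a "container not found" error for that same container.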

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

time="2023-06-18T02:23:05.394Z" level=info msg="Patch pods 200"
time="2023-06-18T02:24:05.374Z" level=info msg="patching pod progress annotation" progress=
time="2023-06-18T02:24:05.374Z" level=info msg="Alloc=5995 TotalAlloc=3349328 Sys=32466 NumGC=1171 Goroutines=8"
time="2023-06-18T02:24:05.393Z" level=info msg="Patch pods 200"
time="2023-06-18T02:25:05.375Z" level=info msg="patching pod progress annotation" progress=
time="2023-06-18T02:25:05.392Z" level=info msg="Patch pods 200"
time="2023-06-18T02:26:05.375Z" level=info msg="patching pod progress annotation" progress=
time="2023-06-18T02:26:05.392Z" level=info msg="Patch pods 200"
time="2023-06-18T02:27:05.375Z" level=info msg="patching pod progress annotation" progress=
time="2023-06-18T02:27:05.392Z" level=info msg="Patch pods 200"
time="2023-06-18T02:27:15.456Z" level=info msg="Step terminated"
time="2023-06-18T02:27:15.456Z" level=info msg="Killing containers"
time="2023-06-18T02:27:15.460Z" level=info msg="Get pods 200"
time="2023-06-18T02:27:16.287Z" level=info msg="Main container completed"
time="2023-06-18T02:27:16.287Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-06-18T02:27:16.287Z" level=info msg="Saving logs"
time="2023-06-18T02:27:16.289Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: data/artifact-training-39-prod/evaluate/2023/06/16/c2372b66-8efa-4049-8640-d9af7bccaa7f/gpu-evaluate-multi-step-49f294a6-b6b-849349960/main.log"
time="2023-06-18T02:27:16.289Z" level=info msg="Creating minio client using static credentials" endpoint="10.233.58.195:8080"
time="2023-06-18T02:27:16.290Z" level=info msg="Saving file to s3" bucket=data-artifact endpoint="10.233.58.195:8080" key=data/artifact-training-39-prod/evaluate/2023/06/16/c2372b66-8efa-4049-8640-d9af7bccaa7f/gpu-evaluate-multi-step-49f294a6-b6b-849349960/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-06-18T02:27:16.487Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-06-18T02:27:16.487Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-06-18T02:27:16.487Z" level=info msg="No output parameters"
time="2023-06-18T02:27:16.487Z" level=info msg="Saving output artifacts"
time="2023-06-18T02:27:16.487Z" level=info msg="Staging artifact: evaluate_log"
time="2023-06-18T02:27:16.487Z" level=info msg="Staging /data/log/run.log from mirrored volume mount /mainctrfs/data/log/run.log"
time="2023-06-18T02:27:16.488Z" level=info msg="Taring /mainctrfs/data/log/run.log"
time="2023-06-18T02:27:16.543Z" level=info msg="Successfully staged /data/log/run.log from mirrored volume mount /mainctrfs/data/log/run.log"
time="2023-06-18T02:27:16.543Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/evaluate_log.tgz, key: data/artifact-training-39-prod/evaluate/2023/06/16/c2372b66-8efa-4049-8640-d9af7bccaa7f/gpu-evaluate-multi-step-49f294a6-b6b-849349960/evaluate_log.tgz"
time="2023-06-18T02:27:16.543Z" level=info msg="Creating minio client using static credentials" endpoint="10.233.58.195:8080"
time="2023-06-18T02:27:16.544Z" level=info msg="Saving file to s3" bucket=data-artifact endpoint="10.233.58.195:8080" key=data/artifact-training-39-prod/evaluate/2023/06/16/c2372b66-8efa-4049-8640-d9af7bccaa7f/gpu-evaluate-multi-step-49f294a6-b6b-849349960/evaluate_log.tgz path=/tmp/argo/outputs/artifacts/evaluate_log.tgz
time="2023-06-18T02:27:16.680Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/evaluate_log.tgz
time="2023-06-18T02:27:16.680Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/evaluate_log.tgz"
time="2023-06-18T02:27:16.680Z" level=info msg="Annotating pod with output"
time="2023-06-18T02:27:16.698Z" level=info msg="Patch pods 200"
time="2023-06-18T02:27:16.701Z" level=info msg="Killing sidecars []"
time="2023-06-18T02:27:16.701Z" level=info msg="Alloc=8687 TotalAlloc=3359557 Sys=32722 NumGC=1173 Goroutines=11"
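
To line up the timestamps across the three sources (pod description in +0800, wait and controller logs in UTC), here is a small arithmetic sketch. All values are copied from the output above and truncated to whole seconds; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

utc = timezone.utc
cst = timezone(timedelta(hours=8))  # the +0800 offset shown in the pod description

# "Finished" time of the wait container, from the pod description.
finished_local = datetime(2023, 6, 18, 10, 27, 16, tzinfo=cst)
# Last wait-container log line (UTC, truncated to the second).
last_wait_log = datetime(2023, 6, 18, 2, 27, 16, tzinfo=utc)
# "cleaning up pod" attempts in the controller log excerpt (UTC, truncated to the second).
cleanup_attempts = [
    datetime(2023, 6, 21, 6, 39, 38, tzinfo=utc),
    datetime(2023, 6, 21, 6, 59, 38, tzinfo=utc),
    datetime(2023, 6, 21, 7, 19, 38, tzinfo=utc),
]

# The pod description and the wait log agree on when the wait container exited.
print(finished_local.astimezone(utc) == last_wait_log)
# Time between the wait container exiting and the first cleanup attempt in the excerpt.
print(cleanup_attempts[0] - last_wait_log)
# Spacing between consecutive cleanup attempts.
intervals = [b - a for a, b in zip(cleanup_attempts, cleanup_attempts[1:])]
print(intervals)
```

So the wait container had already been gone for more than three days by the time of the controller log excerpt, and within that excerpt the shutdown is retried every 20 minutes with the same error.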
@terrytangyuan
Member

Can you try the latest release?

@juliev0
Contributor

juliev0 commented Jun 22, 2023

right, probably fixed by this

@juliev0 added the problem/more information needed label and removed the type/bug label on Jun 22, 2023
@stale

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale bot added the problem/stale label on Sep 17, 2023
@terrytangyuan removed the problem/stale label on Sep 20, 2023
Contributor

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions bot added the problem/stale label on Jan 12, 2024
Contributor

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 27, 2024
@agilgur5 added the area/executor and type/support labels on Jan 27, 2024