
[BUG] - Argo-Workflow pods linger after completed workflows on GCP #1573

Closed
iameskild opened this issue Nov 30, 2022 · 6 comments · Fixed by #1614
Labels
area: integration/Argo needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug type: bug 🐛 Something isn't working

Comments

@iameskild
Member

iameskild commented Nov 30, 2022

Describe the bug

When running a workflow via Argo-Workflows, the associated pod seems to linger even after the workflow has completed successfully.

Expected behavior

Once the workflow completes (whether it failed or succeeded), the associated pod should stop running as well.

OS and architecture in which you are running Nebari

GCP GKE

How to Reproduce the problem?

Run the hello-argo example workflow from the /argo UI. This only seems to happen on Nebari clusters running on GCP.
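For anyone who prefers to reproduce this from a terminal rather than the /argo UI, a rough equivalent (assuming the argo CLI is available and the workflows run in the dev namespace, as in the logs below) would be:

# Submit the upstream hello-world example and watch it until it completes
argo submit -n dev --watch \
  https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml

# Once the workflow reports Succeeded, check whether its pod is still Running
kubectl get pods -n dev -l workflows.argoproj.io/workflow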

Command output

main hello argo
wait time="2022-11-30T16:00:36.967Z" level=info msg="Starting Workflow Executor" executorType=docker version=v3.2.9
wait time="2022-11-30T16:00:36.975Z" level=info msg="Creating a docker executor"
wait time="2022-11-30T16:00:36.975Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev podName=awesome-python template="{\"name\":\"argosay\",\"inputs\":{\"parameters\":[{\"name\":\"message\",\"value\":\"hello argo\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"main\",\"image\":\"argoproj/argosay:v2\",\"command\":[\"/argosay\"],\"args\":[\"echo\",\"hello argo\"],\"resources\":{}}}" version="&Version{Version:v3.2.9,BuildDate:2022-03-02T21:41:01Z,GitCommit:ce91d7b1d0115d5c73f6472dca03ddf5cc2c98f4,GitTag:v3.2.9,GitTreeState:clean,GoVersion:go1.16.14,Compiler:gc,Platform:linux/amd64,}"
wait time="2022-11-30T16:00:36.975Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
wait time="2022-11-30T16:00:36.976Z" level=info msg="Starting deadline monitor"
wait time="2022-11-30T16:00:37.018Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:38.018Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
wait time="2022-11-30T16:00:38.052Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:39.052Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
wait time="2022-11-30T16:00:39.084Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:40.085Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
wait time="2022-11-30T16:00:40.116Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:41.117Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
wait time="2022-11-30T16:00:41.150Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:42.150Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"
Stream closed EOF for dev/awesome-python (main)
wait time="2022-11-30T16:00:42.183Z" level=info msg="listed containers" containers="map[]"
wait time="2022-11-30T16:00:43.183Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=dev --filter=label=io.kubernetes.pod.name=awesome-python"

Versions and dependencies used.

No response

Compute environment

No response

Integrations

No response

Anything else?

Looking at the workflow pod in k9s, it appears that only 1 of 2 init containers completed before the workflow started. The second init container might be waiting for a signal that it will never receive.
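For reference, the init container states can also be checked without k9s; assuming the pod name awesome-python from the logs above, something like:

kubectl get pod awesome-python -n dev \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'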

@iameskild iameskild added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage labels Nov 30, 2022
@ericdatakelly
Contributor

@iameskild Is this the hello world example you ran?

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

When I run this via the Argo UI, I see no log output, even after 15 minutes, so I'm curious why you have log messages and I don't. Maybe you have elevated permissions. It seems that the workflow completed for you, but not for me.
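In case it helps narrow down whether this is a permissions issue in the UI, the same logs can be pulled directly from the pod (assuming kubectl access to the namespace; <workflow-pod-name> is a placeholder for whatever pod name the UI shows):

kubectl logs -n dev <workflow-pod-name> -c main
kubectl logs -n dev <workflow-pod-name> -c wait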

@iameskild
Member Author

Hey @ericdatakelly I have been submitting the hello-argo workflow from the Argo-Workflow UI:

[Screenshots of the Argo-Workflows UI, 2022-12-01 22:55]

@trallard trallard added needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug area: integration/Argo and removed needs: triage 🚦 Someone needs to have a look at this issue and triage labels Dec 5, 2022
@ericdatakelly
Contributor

@iameskild FYI: I ran the same workflow and waited 20 minutes, but I still see no logs. (For anyone following along, this is the default workflow that is populated when the manifest editor is opened.)

@iameskild
Member Author

After looking into this some more, I believe the issue lies with the containerRuntimeExecutor setting for the Argo-Workflows controller. The default is docker for versions <= v3.2, and if we change it to emissary (the default from v3.3 onward), this works.

The docs for the workflow executors also call out that the docker executor is the least secure. Switching the workflow executor to emissary seems like the way to go.
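For context, a minimal sketch of the controller-level change being described (assuming the controller reads its configuration from the standard workflow-controller-configmap in the deployment namespace; in Nebari this would more likely be set through the Argo-Workflows Helm values rather than by editing the ConfigMap directly):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: dev
data:
  containerRuntimeExecutor: emissary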

After some more testing, I will open a PR for this fix.

cc @ericdatakelly

@ericdatakelly
Contributor

Thanks @iameskild! I just tried the default (hello argo) workflow and it shows that it succeeded. I think you are OK to open the PR unless you want to wait for me to make a custom workflow and test that.

@iameskild
Member Author

Thanks @ericdatakelly! I just opened the PR for the fix 👍
