volumeClaimGC OnWorkflowCompletion does not work when onExit/entrypoint template errors #10408

Closed
2 of 3 tasks
jessesuen opened this issue Jan 28, 2023 · 3 comments · Fixed by #10424
Labels
P3 Low priority type/bug

@jessesuen
Member

jessesuen commented Jan 28, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Using the OnWorkflowCompletion PVC cleanup strategy:

  volumeClaimGC:
    strategy: OnWorkflowCompletion

The PVC is not deleted if the onExit handler hits a template error (and possibly in other onExit failure scenarios).

$ k get pvc
NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
wf-gz95t-builddir   Pending                                      local-path     10s

$ argo list
NAME       STATUS   AGE   DURATION   PRIORITY   MESSAGE
wf-gz95t   Error    19s   10s        0          error in exit template execution : template 'notify' type is unknown

The key to this issue seems to be a combination of:

  • using WorkflowTemplates
  • an invalid onExit template

When I convert the WorkflowTemplate to a Workflow, the PVC cleanup happens as expected.

Version

latest as of 1/27

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: wft-pvc
spec:
  volumeClaimTemplates:
    - metadata:
        name: builddir
      spec:
        accessModes: [ "ReadWriteMany" ]
        resources:
          requests:
            storage: 1Mi

  volumeClaimGC:
    strategy: OnWorkflowCompletion

  entrypoint: whalesay
  onExit: notify

  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

  - name: notify
    # note this is a bad template

---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: wf-
spec:
  workflowTemplateRef:
    name: wft-pvc

Note that this seems to require the combination of WorkflowTemplates and onExit.

Logs from the workflow controller

$ kubectl logs -n argo deploy/workflow-controller | grep wf-gz95t
time="2023-01-28T01:49:06.711Z" level=info msg="Processing workflow" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.737Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.737Z" level=info msg="Creating pvc wf-gz95t-builddir" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.745Z" level=info msg="Pod node wf-gz95t initialized Pending" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.758Z" level=info msg="Created pod: wf-gz95t (wf-gz95t)" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.758Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.758Z" level=info msg=reconcileAgentPod namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:06.764Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=406075 workflow=wf-gz95t
time="2023-01-28T01:49:16.712Z" level=info msg="Processing workflow" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="task-result changed" namespace=argo nodeID=wf-gz95t workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=wf-gz95t old.message= old.phase=Pending old.progress=0/1 workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg=reconcileAgentPod namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Running OnExit handler: notify" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Skipped node wf-gz95t-309339311 initialized Error (message: template 'notify' type is unknown)" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Updated message  -> error in exit template execution : template 'notify' type is unknown" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Marking workflow completed" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.713Z" level=info msg="Checking daemoned children of " namespace=argo workflow=wf-gz95t
time="2023-01-28T01:49:16.722Z" level=info msg="cleaning up pod" action=deletePod key=argo/wf-gz95t-1340600742-agent/deletePod
time="2023-01-28T01:49:16.724Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=406101 workflow=wf-gz95t
time="2023-01-28T01:49:16.729Z" level=info msg="archiving workflow" namespace=argo uid=74611389-a1da-4b8a-9c96-947241c5afe8 workflow=wf-gz95t
time="2023-01-28T01:49:16.734Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/wf-gz95t/labelPodCompleted
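
The excerpt above logs "Creating pvc wf-gz95t-builddir" but nothing PVC-related after the workflow is marked completed. A minimal sketch for checking that in a longer log, assuming the key=value line format shown above; `parse_logfmt` and `pvc_messages` are hypothetical helper names, and because this issue does not show what the controller logs on a successful PVC deletion, the filter only looks for the substring "pvc":

```python
import shlex


def parse_logfmt(line):
    """Split one key=value log line into a dict (values may be double-quoted)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def pvc_messages(lines, workflow):
    """Return the msg field of every line for `workflow` that mentions a PVC."""
    found = []
    for line in lines:
        fields = parse_logfmt(line)
        msg = fields.get("msg", "")
        if fields.get("workflow") == workflow and "pvc" in msg.lower():
            found.append(msg)
    return found


# Two lines copied from the controller log above.
SAMPLE = [
    'time="2023-01-28T01:49:06.737Z" level=info msg="Creating pvc wf-gz95t-builddir" namespace=argo workflow=wf-gz95t',
    'time="2023-01-28T01:49:16.713Z" level=info msg="Marking workflow completed" namespace=argo workflow=wf-gz95t',
]

print(pvc_messages(SAMPLE, "wf-gz95t"))
```

Lines without a `workflow=` field (such as the "cleaning up pod" entries) are skipped by this filter.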

Logs from your workflow's wait container

N/A
@jessesuen jessesuen changed the title from "volumeClaimGC OnWorkflowCompletion does not work with failed onExit template" to "volumeClaimGC OnWorkflowCompletion does not work with WorkflowTemplate with invalid onExit template" Jan 28, 2023
@jiachengxu
Member

I managed to reproduce the issue and am working on a fix.

@sarabala1979 sarabala1979 added the P3 Low priority label Jan 30, 2023
@jessesuen jessesuen changed the title from "volumeClaimGC OnWorkflowCompletion does not work with WorkflowTemplate with invalid onExit template" to "volumeClaimGC OnWorkflowCompletion does not work when onExit template errors" Feb 1, 2023
@jessesuen
Member Author

jessesuen commented Feb 1, 2023

It turns out this has nothing to do with WorkflowTemplates. Here is a plain Workflow, with no WorkflowTemplate involved, that reproduces the problem:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: wft-pvc-
spec:
  volumeClaimTemplates:
    - metadata:
        name: builddir
      spec:
        accessModes: [ "ReadWriteMany" ]
        resources:
          requests:
            storage: 1Mi

  volumeClaimGC:
    strategy: OnWorkflowCompletion

  entrypoint: whalesay
  onExit: notify

  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

  - name: notify
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["exit 0"]
      volumeMounts:
      - name: foo # <<< invalid since no volume `foo`

@jiachengxu also pointed out that this could affect the entrypoint template as well. That still needs to be tested.

@jiachengxu
Member

jiachengxu commented Feb 1, 2023

I just tested the entrypoint, and it is indeed affected.
The following workflow can be used to reproduce the issue:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: wft-pvc-
spec:
  volumeClaimTemplates:
    - metadata:
        name: builddir
      spec:
        accessModes: [ "ReadWriteMany" ]
        resources:
          requests:
            storage: 1Mi

  volumeClaimGC:
    strategy: OnWorkflowCompletion

  entrypoint: whalesay # <<< referencing the invalid template
  onExit: notify

  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
      volumeMounts:
      - name: foo # <<< invalid since no volume `foo`

  - name: notify
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["exit 0"]

The workflow then ends in the Error phase and the PVC is left behind.
Validation of Workflow and WorkflowTemplate is already in place: https://github.com/argoproj/argo-workflows/blob/master/workflow/validate/validate.go#L213-L227
A deletePVCs call is needed on the entrypoint error path as well.
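
To illustrate the point, here is a small model of the control flow being asked for: PVC cleanup runs on every terminal path, including template errors in the entrypoint or the onExit handler. This is only a sketch under stated assumptions, not the controller's code and not the change made in #10424. `Workflow`, `run_template`, `delete_pvcs`, and `operate` are hypothetical stand-ins, and the model simplifies by skipping onExit when the entrypoint errors.

```python
class TemplateError(Exception):
    """Hypothetical error for a template that cannot be executed."""


class Workflow:
    """Hypothetical, simplified stand-in for a workflow object."""

    def __init__(self, name, gc_strategy):
        self.name = name
        self.gc_strategy = gc_strategy  # e.g. "OnWorkflowCompletion"
        self.pvcs = []
        self.phase = "Running"


def run_template(valid):
    # Stand-in for executing a template; an invalid one raises.
    if not valid:
        raise TemplateError("template is invalid")


def delete_pvcs(wf):
    # Stand-in for the cleanup step named in the comment above.
    if wf.gc_strategy == "OnWorkflowCompletion":
        wf.pvcs.clear()


def operate(wf, entrypoint_valid, on_exit_valid):
    """Run entrypoint, then onExit; clean up PVCs however the run ends."""
    wf.pvcs.append(wf.name + "-builddir")
    try:
        run_template(entrypoint_valid)
        run_template(on_exit_valid)
        wf.phase = "Succeeded"
    except TemplateError:
        wf.phase = "Error"
    finally:
        # Cleanup is not skipped when a template error ends the run early.
        delete_pvcs(wf)
    return wf
```

Calling `operate(Workflow("wf-a", "OnWorkflowCompletion"), True, False)` models the invalid onExit case: the phase becomes Error and the PVC list is still emptied.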

@jessesuen jessesuen changed the title from "volumeClaimGC OnWorkflowCompletion does not work when onExit template errors" to "volumeClaimGC OnWorkflowCompletion does not work when onExit/entrypoint template errors" Feb 1, 2023
jessesuen pushed a commit that referenced this issue Feb 8, 2023
…ed. Fixes #10408 (#10424)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this issue Feb 9, 2023
…ed. Fixes argoproj#10408 (argoproj#10424)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
GoshaDo pushed a commit to GoshaDo/argo-workflows that referenced this issue Feb 9, 2023
…ed. Fixes argoproj#10408 (argoproj#10424)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
Signed-off-by: goshado <goshatoo@gmail.com>
terrytangyuan pushed a commit that referenced this issue Mar 29, 2023
…ed. Fixes #10408 (#10424)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>