Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running #12636
So using archived workflows is an important piece of this. Updates to the live and archived workflows naturally have a race condition -- the archived version may lag behind the live one. Does the archived workflow's old status persist after, say, 10 minutes? Or does it then match the new status?
Not sure if you have a typo here or not -- it sounds like it hit the TTL during the suspend step. The TTL should take the retried, running state into account.
Once a workflow fails and has the secondsAfterFailure value configured, the TTL applies. You can see an example in the attached traces:
And the workflow is deleted 5 minutes later.
But if we launch a retry command and the workflow goes to the "Running" state, the workflow is still deleted even while it is running, as you can see in the logs. This behaviour occurs whether archiving is enabled or disabled. I would expect that if a retry is performed on a failed workflow and the workflow goes to the "Running" state, the workflow should not be deleted, because the retry operation has fixed the error and the workflow is running.
Thanks for investigating this behavior more. Per your analysis, it sounds like the suspend step is unrelated. Could you also answer the question I had regarding the Workflow Archive? Does it eventually update to the retried Workflow? If not, that might be another bug, an unhandled race condition.
Correct, the suspend step is unrelated.
It is not updated to the retried workflow; the failed workflow is displayed in the Archived tab. I attached two related images in my initial comment.
Yes, I saw those, but I was wondering if the Archived Workflow might update after, say, 10 more minutes -- i.e. whether the last screenshot stays the same or changes to match the retried Workflow. Per your response, it stays the same, so it seems like there's a secondary unhandled race condition here. Although fixing the TTL issue might resolve that race as well; the TTL GC is not anticipating an incomplete Workflow, so in this case GC is happening before archiving (since only completed Workflows get archived).
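To illustrate the ordering being described, here is a toy sketch (an assumption-laden illustration only: plain maps stand in for the live cluster and the Workflow Archive, and none of these names come from the actual controller or archiver code):

```go
package main

import "fmt"

// workflow is a toy stand-in for a live Workflow object.
type workflow struct {
	name       string
	phase      string // "Running", "Failed", "Succeeded", ...
	ttlExpired bool
}

// archive stores only completed Workflows, mirroring "only completed
// Workflows get archived".
func archive(archived map[string]string, wf workflow) {
	if wf.phase == "Failed" || wf.phase == "Error" || wf.phase == "Succeeded" {
		archived[wf.name] = wf.phase
	}
}

// ttlGC deletes the live object once its TTL timer fires, without re-checking
// the current phase -- the behavior under discussion.
func ttlGC(live map[string]workflow, wf workflow) {
	if wf.ttlExpired {
		delete(live, wf.name)
	}
}

func main() {
	live := map[string]workflow{}
	archived := map[string]string{}

	// 1. The Workflow fails: it is archived as Failed and its TTL starts.
	wf := workflow{name: "wf-1", phase: "Failed"}
	live[wf.name] = wf
	archive(archived, wf)

	// 2. A retry brings it back to Running; the archiver skips incomplete Workflows.
	wf.phase = "Running"
	live[wf.name] = wf
	archive(archived, wf)

	// 3. The TTL fires anyway and deletes the live, running Workflow.
	wf.ttlExpired = true
	ttlGC(live, wf)

	fmt.Println("live:", live)         // live: map[]
	fmt.Println("archived:", archived) // archived: map[wf-1:Failed]
}
```

The archive keeps the last Failed snapshot while the live, retried Workflow gets garbage-collected, which matches the screenshots above.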
@agilgur5 @manuelbmar We are also facing the same issue. We have set a TTL of 7 days for the workflow. Say the workflow failed 7 days ago and was retried 2 days ago, and it is in the Running state; the workflow was still deleted. Is there any resolution for this issue? It seems like a critical issue.
@agilgur5 Any updates? Are there any open issues or pull requests being worked on to resolve the problem?
This only occurs as a race condition with a combination of several features. Retries in particular should not be used frequently (as that would suggest there is an issue with the tasks themselves that should be fixed) and are also one of the most complex areas of the codebase. I.e. this is a low frequency + high complexity issue.
If there were updates, they would already be in the thread. Please follow proper open source etiquette.
You are also more than welcome to contribute, as you checked that you'd like to.
Clarification here: archiving only occurs for a completed Workflow, so this is a single bug. The solution is still likely to be to remove a Workflow from the TTL queue when it is retried.
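For illustration, a minimal sketch of what removing a retried Workflow from TTL consideration could look like. Since a pending item cannot simply be pulled back out of a delaying queue, this hypothetical version records retried keys in a side set that the delete worker consults when a TTL expires; the types and method names are invented for the example, not taken from the controller:

```go
package main

import (
	"fmt"
	"sync"
)

// ttlController is a hypothetical stand-in for the part of the controller
// that garbage-collects Workflows whose TTL has expired.
type ttlController struct {
	mu      sync.Mutex
	retried map[string]bool // "namespace/name" keys retried after being enqueued
}

func newTTLController() *ttlController {
	return &ttlController{retried: make(map[string]bool)}
}

// onRetry records that a failed Workflow was retried, so its pending TTL
// deletion should no longer apply.
func (c *ttlController) onRetry(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.retried[key] = true
}

// processKey is called when the delayed queue hands back a key whose TTL has
// expired; retried keys are skipped instead of deleted.
func (c *ttlController) processKey(key string) {
	c.mu.Lock()
	skip := c.retried[key]
	delete(c.retried, key)
	c.mu.Unlock()

	if skip {
		fmt.Println("skipping TTL delete, workflow was retried:", key)
		return
	}
	fmt.Println("deleting workflow:", key)
}

func main() {
	c := newTTLController()
	c.onRetry("argo/retried-wf")
	c.processKey("argo/retried-wf") // skipped: was retried
	c.processKey("argo/failed-wf")  // deleted: TTL expired, never retried
}
```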
Hi @agilgur5 @manuelbmar, we encountered a similar issue. We couldn't find a good way to remove retried workflows from the TTL queue, as there seems to be no built-in method for deleting elements from the delayed queue. As a temporary workaround, we have implemented a somewhat inelegant bypass, as in #12905, which incurs a query cost before the deletion operation, but it effectively addresses our current issue. Regarding the idea of removing elements from the queue: we would like to ask whether this direction is feasible.
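To make that trade-off concrete, here is a minimal, self-contained sketch of such a "query before delete" guard, assuming the TTL worker can re-fetch the live Workflow's phase right before acting. The types and function names below are illustrative placeholders, not the actual argo-workflows client API or the exact change in #12905:

```go
package main

import "fmt"

// WorkflowPhase mirrors the terminal/non-terminal phases a Workflow reports.
type WorkflowPhase string

const (
	PhaseRunning   WorkflowPhase = "Running"
	PhaseSucceeded WorkflowPhase = "Succeeded"
	PhaseFailed    WorkflowPhase = "Failed"
	PhaseError     WorkflowPhase = "Error"
)

// completed reports whether a phase is terminal.
func completed(p WorkflowPhase) bool {
	return p == PhaseSucceeded || p == PhaseFailed || p == PhaseError
}

// liveGetter abstracts the extra API query: fetch the current phase of the
// live Workflow just before the TTL deletion fires.
type liveGetter func(namespace, name string) (WorkflowPhase, error)

// shouldDelete re-checks the live object: if the Workflow was retried and is
// running again, the expired TTL entry is ignored.
func shouldDelete(get liveGetter, namespace, name string) (bool, error) {
	phase, err := get(namespace, name)
	if err != nil {
		return false, err // e.g. already gone; nothing to do
	}
	return completed(phase), nil
}

func main() {
	// Simulated lookup: the Workflow was retried and is Running again.
	get := func(ns, name string) (WorkflowPhase, error) { return PhaseRunning, nil }

	del, err := shouldDelete(get, "argo", "retried-wf")
	fmt.Println(del, err) // false <nil> -- the retried, running Workflow is kept
}
```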
Yea that could potentially add quite a lot of queries, since it's one more for every deletion 😕
We actually are implementing something similar in #12734 (see also #12538).
Pre-requisites
What happened/what did you expect to happen?
I am retrying a failed workflow that has steps set up with manual approval and ttl secondsAfterFailure.
Step-0 failed in the first execution. I made a call to the /retry method; the failed step has now completed successfully and I managed to advance our workflow correctly.
The workflow is now waiting for the next step with manual approval, as you can see in the screenshot.
But once the TTL secondsAfterFailure period has elapsed, a "workflow gone" message is displayed and the workflow status shown in the UI is the one from before the /retry action.
"workflow gone" image :
"archived workflows" image :
I think ttl secondsAfterFailure is not taking into account that the workflow is running, as it has a "suspend" step waiting for approval.
[edited by agilgur5: suspend is unrelated / a red herring in this case, see below]
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container