Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running #12636

Closed
4 tasks done
manuelbmar opened this issue Feb 7, 2024 · 10 comments · Fixed by #12905
Labels
  • area/gc: Garbage collection, such as TTLs, retentionPolicy, delays, and more
  • area/retry-manual: Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
  • P1: High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority
  • solution/suggested: A solution to the bug has been suggested. Someone needs to implement it.
  • type/bug

Comments

@manuelbmar

manuelbmar commented Feb 7, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I am retrying a failed workflow that has steps set up with manual approval and the TTL secondsAfterFailure.
Step-0 failed in the first execution. I called the /retry method, the failed step then succeeded, and the workflow advanced correctly.
image
The workflow is now waiting for the next step with manual approval, as you can see in the screenshot.
But once the TTL secondsAfterFailure has elapsed, a "workflow gone" message is displayed and the workflow status shown in the UI is the one from before the /retry action.
"workflow gone" image:
image
"archived workflows" image:
image
I think the TTL secondsAfterFailure is not taking into account that the workflow is running, as it has a "suspend" step waiting for approval. (edited by agilgur5: suspend is unrelated / red herring in this case, see below)

Version

v3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: "ttl-workflow-deploy-template"
spec:
 activeDeadlineSeconds: 1814400
 archiveLogs: true
 ttlStrategy:
   secondsAfterCompletion: 300 # Time to live after workflow is completed, replaces ttlSecondsAfterFinished
   secondsAfterSuccess: 300   # Time to live after workflow is successful
   secondsAfterFailure: 300    # Time to live after workflow fails
 entrypoint: deploy-workflow
 onExit: exit-handler
 templates:
    - name: exit-handler
      steps:
        - - name: send-report
            templateRef:
              name: workflow-template-whalesay-template
              template: whalesay-template
            arguments:
              parameters:
                 - name: message
                   value: "{{workflow.status}}"

    - name: deploy-workflow
      dag:
        tasks:
        - name: step-provision
          templateRef:
              name: workflow-template-whalesay-template
              template: whalesay-template
          arguments:
              parameters:
                - name: message
                  value: "step-provision"

        - name: step-0
          depends: "step-provision"
          templateRef:
            name: workflow-template-random-fail-template
            template: random-fail-template

        - name: step-0-approval
          depends: "step-0.Succeeded"
          template: step-approval

        - name: step-1
          depends: "step-0-approval"
          when: '{{tasks.step-0-approval.outputs.parameters.action}} == DEPLOY'
          templateRef:
            name: workflow-template-random-fail-template
            template: random-fail-template


        - name: step-1-approval
          depends: "step-1.Succeeded"
          template: step-approval

        - name: step-2
          depends: "step-1-approval"
          when: '{{tasks.step-1-approval.outputs.parameters.action}} == DEPLOY'
          templateRef:
            name: workflow-template-random-fail-template
            template: random-fail-template

        - name: step-2-approval
          depends: "step-2.Succeeded"
          template: step-approval

        - name: step-3
          depends: "step-2-approval"
          when: '{{tasks.step-2-approval.outputs.parameters.action}} == DEPLOY'
          templateRef:
            name: workflow-template-random-fail-template
            template: random-fail-template

    - name: step-approval
      suspend: {}
      inputs:
        parameters:
        - name: action
          default: 'DEPLOY'
          enum:
          - 'DEPLOY'
          - 'ROLLBACK'
        - name: skipAutovalidation
          default: false
      outputs:
        parameters:
          - name: action
            valueFrom:
              supplied: {}
          - name: skipAutovalidation
            valueFrom:
              supplied: {}

Logs from the workflow controller

time="2024-02-07T09:58:53.901Z" level=info msg="Updated phase Running -> Failed" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:53.901Z" level=info msg="Marking workflow completed" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:53.901Z" level=info msg="Marking workflow as pending archiving" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:53.907Z" level=info msg="cleaning up pod" action=deletePod key=argo-workflows-des/ttl-workflow-deploy-template-h557k-1340600742-agent/deletePod
time="2024-02-07T09:58:53.922Z" level=info msg="Workflow update successful" namespace=argo-workflows-des phase=Failed resourceVersion=2206381733 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:53.949Z" level=info msg="archiving workflow" namespace=argo-workflows-des uid=7a114a25-3f2d-42af-b128-f647bb1a3e59 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:53.953Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo-workflows-des/ttl-workflow-deploy-template-h557k-whalesay-template-1434212105/labelPodCompleted
time="2024-02-07T09:58:53.991Z" level=info msg="Queueing Failed workflow argo-workflows-des/ttl-workflow-deploy-template-h557k for delete in 5m0s due to TTL"
time="2024-02-07T09:58:57.983Z" level=info msg="Processing workflow" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:57.983Z" level=info msg="Task-result reconciliation" namespace=argo-workflows-des numObjs=0 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:57.983Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:58:57.983Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4259206060, taskName step-0"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4259206060, taskName step-0"
time="2024-02-07T09:58:57.984Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4259206060, taskName step-0"
time="2024-02-07T09:58:57.984Z" level=info msg="All of node ttl-workflow-deploy-template-h557k.step-0 dependencies [step-provision] completed" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:57.984Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:57.987Z" level=info msg="Pod node ttl-workflow-deploy-template-h557k-4259206060 initialized Pending" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:58.039Z" level=info msg="Created pod: ttl-workflow-deploy-template-h557k.step-0 (ttl-workflow-deploy-template-h557k-random-fail-template-4259206060)" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:58.039Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:58:58.039Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:58:58.040Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:58:58.040Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:58.040Z" level=info msg=reconcileAgentPod namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:58:58.040Z" level=warning msg="Coudn't obtain child for ttl-workflow-deploy-template-h557k-11827592, panicking"
time="2024-02-07T09:58:58.062Z" level=info msg="Workflow update successful" namespace=argo-workflows-des phase=Running resourceVersion=2206381913 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.014Z" level=info msg="Processing workflow" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.014Z" level=info msg="Task-result reconciliation" namespace=argo-workflows-des numObjs=0 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.014Z" level=info msg="node changed" namespace=argo-workflows-des new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=ttl-workflow-deploy-template-h557k-4259206060 old.message= old.phase=Pending old.progress=0/1 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:04.014Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:04.015Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:04.015Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.015Z" level=info msg=reconcileAgentPod namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:04.015Z" level=warning msg="Coudn't obtain child for ttl-workflow-deploy-template-h557k-11827592, panicking"
time="2024-02-07T09:59:04.030Z" level=info msg="Workflow update successful" namespace=argo-workflows-des phase=Running resourceVersion=2206382142 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.319Z" level=info msg="Processing workflow" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.319Z" level=info msg="Task-result reconciliation" namespace=argo-workflows-des numObjs=1 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.319Z" level=info msg="task-result changed" namespace=argo-workflows-des nodeID=ttl-workflow-deploy-template-h557k-4259206060 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.319Z" level=info msg="node changed" namespace=argo-workflows-des new.message= new.phase=Succeeded new.progress=0/1 nodeID=ttl-workflow-deploy-template-h557k-4259206060 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:16.320Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3686633912, taskName step-0-approval"
time="2024-02-07T09:59:16.320Z" level=info msg="All of node ttl-workflow-deploy-template-h557k.step-0-approval dependencies [step-0] completed" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.320Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=info msg="Suspend node ttl-workflow-deploy-template-h557k-3686633912 initialized Pending" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=info msg="node ttl-workflow-deploy-template-h557k.step-0-approval suspended" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=info msg="node ttl-workflow-deploy-template-h557k-3686633912 phase Pending -> Running" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:16.323Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:16.323Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=info msg=reconcileAgentPod namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.323Z" level=warning msg="Coudn't obtain child for ttl-workflow-deploy-template-h557k-11827592, panicking"
time="2024-02-07T09:59:16.349Z" level=info msg="Workflow update successful" namespace=argo-workflows-des phase=Running resourceVersion=2206382789 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:16.355Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo-workflows-des/ttl-workflow-deploy-template-h557k-random-fail-template-4259206060/labelPodCompleted
time="2024-02-07T09:59:26.341Z" level=info msg="Processing workflow" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=info msg="Task-result reconciliation" namespace=argo-workflows-des numObjs=1 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=info msg="task-result changed" namespace=argo-workflows-des nodeID=ttl-workflow-deploy-template-h557k-4259206060 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:26.341Z" level=info msg="node ttl-workflow-deploy-template-h557k.step-0-approval suspended" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4275983679, taskName step-1"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-3362124077, taskName step-1-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-4292761298, taskName step-2"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-2029873582, taskName step-2-approval"
time="2024-02-07T09:59:26.341Z" level=warning msg="was unable to obtain the node for ttl-workflow-deploy-template-h557k-14571621, taskName step-3"
time="2024-02-07T09:59:26.341Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=info msg=reconcileAgentPod namespace=argo-workflows-des workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T09:59:26.341Z" level=warning msg="Coudn't obtain child for ttl-workflow-deploy-template-h557k-11827592, panicking"
time="2024-02-07T09:59:26.351Z" level=info msg="Workflow update successful" namespace=argo-workflows-des phase=Running resourceVersion=2206382789 workflow=ttl-workflow-deploy-template-h557k
time="2024-02-07T10:03:14.000Z" level=info msg="Deleting garbage collected workflow 'argo-workflows-des/ttl-workflow-deploy-template-h557k'"
time="2024-02-07T10:03:14.012Z" level=info msg="Successfully deleted 'argo-workflows-des/ttl-workflow-deploy-template-h557k'"

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@agilgur5 added the following labels on Feb 18, 2024:
  • area/retry-manual: Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
  • P1: High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority
@agilgur5 changed the title on Feb 18, 2024, from "Retry failed workflow with suspend steps and ttl secondsAfterFailure configured shows failed initial status after ttl secondsAfterFailure completed" to "Retry failed workflow with suspend and ttl shows failed initial status after secondsAfterFailure"
@agilgur5
Contributor

The workflow is now waiting for the next step with manual approval, as you can see in the screenshot.
But once the TTL secondsAfterFailure has elapsed, a "workflow gone" message is displayed and the workflow status shown in the UI is the one from before the /retry action.
[...]
"archived workflows" image:

So using archived workflows is an important piece of this. Updates to the live and archived workflows naturally have a race condition -- the archived version may lag behind the live one.

Does the archived workflow's old status persist after, say, 10 minutes? Or does it then match the new status?

I think the TTL secondsAfterFailure is not taking into account that the workflow is running, as it has a "suspend" step waiting for approval.

Not sure if you have a typo here or not -- it sounds like it hit the TTL during the suspend based on your description, no?

The TTL should take suspended time into account; I believe it is a pure calculation from startedAt, not from how much time the Workflow has spent actively running.

@agilgur5 added the following labels on Feb 18, 2024:
  • problem/more information needed: Not enough information has been provided to diagnose this issue.
  • area/suspend-resume: Suspending and resuming workflows
@manuelbmar
Author

manuelbmar commented Feb 22, 2024

Once a workflow with ttlStrategy.secondsAfterFailure: 300 configured fails, it is queued for deletion.

You can see an example in the attached traces:

time="2024-02-07T09:58:53.991Z" level=info msg="Queueing Failed workflow argo-workflows-des/ttl-workflow-deploy-template-h557k for delete in 5m0s due to TTL"

And the workflow is deleted 5 minutes later:

time="2024-02-07T10:03:14.012Z" level=info msg="Successfully deleted 'argo-workflows-des/ttl-workflow-deploy-template-h557k'"

But if we launch a retry command and the workflow goes to the "running" state, the workflow is still deleted while it is running, as you can see in the logs.

This behaviour occurs whether archiving is enabled or disabled.

I would expect that if a retry is performed on a failed wf and the workflow goes to "running" state, the workflow should not be deleted because the retry operation has fixed the error and the workflow is running.
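
The sequence described above can be replayed with a small simulation. This is only a sketch under assumptions: `NaiveTTLQueue`, `Workflow`, and `simulate_retry_then_expiry` are hypothetical names, not Argo code, and the queue is a plain heap that deletes whatever key comes due without looking at the workflow's current phase.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Workflow:
    name: str
    phase: str = "Running"  # "Running", "Failed", or "Succeeded"


@dataclass
class NaiveTTLQueue:
    """Hypothetical delay queue that deletes a key when its timer fires."""

    store: Dict[str, Workflow]  # live workflows, keyed by name
    _heap: List[Tuple[float, str]] = field(default_factory=list)

    def enqueue(self, name: str, now: float, ttl_seconds: float) -> None:
        # Schedule deletion at a fixed time; nothing ever cancels this entry.
        heapq.heappush(self._heap, (now + ttl_seconds, name))

    def run_due(self, now: float) -> List[str]:
        """Delete every workflow whose timer has fired, without re-checking phase."""
        deleted = []
        while self._heap and self._heap[0][0] <= now:
            _, name = heapq.heappop(self._heap)
            if self.store.pop(name, None) is not None:
                deleted.append(name)
        return deleted


def simulate_retry_then_expiry() -> Tuple[str, List[str]]:
    wf = Workflow("ttl-workflow")
    queue = NaiveTTLQueue(store={wf.name: wf})
    wf.phase = "Failed"                             # t=0s: the workflow fails
    queue.enqueue(wf.name, now=0, ttl_seconds=300)  # queued for delete in 5m
    wf.phase = "Running"                            # t=4s: manual retry
    deleted = queue.run_due(now=300)                # t=300s: the old timer fires
    return wf.phase, deleted
```

In this model the workflow is in the Running phase at the moment it is deleted, which matches the behaviour reported in the logs.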

@agilgur5
Contributor

agilgur5 commented Feb 23, 2024

Thanks for investigating this behavior more.
Yes, in the case of a secondsAfter* TTL, it should be reset when there is a retry. I suspect that the Workflow is just never removed from the queue when a retry is triggered (retries are also currently handled a bit incorrectly: #12538)
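
A minimal sketch of what "reset on retry" could mean, assuming the deadline is derived from the completion time plus the matching `ttlStrategy` field. The field meanings follow the comments in the reproduction template; the class names, the precedence of the specific fields over `secondsAfterCompletion`, and the `retry` helper are assumptions for illustration, not the controller's code.

```python
from dataclasses import dataclass
from typing import Optional

COMPLETED_PHASES = {"Succeeded", "Failed", "Error"}


@dataclass
class TTLStrategy:
    seconds_after_completion: Optional[int] = None
    seconds_after_success: Optional[int] = None
    seconds_after_failure: Optional[int] = None


@dataclass
class WorkflowStatus:
    phase: str                           # e.g. "Running", "Succeeded", "Failed"
    finished_at: Optional[float] = None  # epoch seconds; None while running


def ttl_seconds(status: WorkflowStatus, strategy: TTLStrategy) -> Optional[int]:
    """Pick the TTL that applies to the current phase (assumed precedence)."""
    if status.phase == "Failed" and strategy.seconds_after_failure is not None:
        return strategy.seconds_after_failure
    if status.phase == "Succeeded" and strategy.seconds_after_success is not None:
        return strategy.seconds_after_success
    if status.phase in COMPLETED_PHASES:
        return strategy.seconds_after_completion
    return None  # a running workflow has no secondsAfter* deadline


def expires_at(status: WorkflowStatus, strategy: TTLStrategy) -> Optional[float]:
    """Deadline for deletion, or None if the workflow must not be collected."""
    ttl = ttl_seconds(status, strategy)
    if ttl is None or status.finished_at is None:
        return None
    return status.finished_at + ttl


def retry(status: WorkflowStatus) -> WorkflowStatus:
    # Hypothetical model of a manual retry: running again, no completion time.
    return WorkflowStatus(phase="Running", finished_at=None)
```

With this model, a workflow that failed at `t=1000` with `seconds_after_failure=300` expires at `t=1300`, while the same workflow after `retry` has no deadline at all until it completes again.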

Per your analysis, that sounds like the suspend step is unrelated then, right?

Could you also answer the question I had regarding the Workflow Archive? Does it eventually update to the retried Workflow? If not, that might be another bug, an unhandled race

@agilgur5 removed the label problem/more information needed (Not enough information has been provided to diagnose this issue.) on Feb 23, 2024
@manuelbmar
Author

manuelbmar commented Feb 26, 2024

Per your analysis, that sounds like the suspend step is unrelated then, right?

Correct, the suspend step is unrelated.

Could you also answer the question I had regarding the Workflow Archive? Does it eventually update to the retried Workflow? If not, that might be another bug, an unhandled race

It is not updated to the retried workflow. The failed workflow is displayed in the archived tab. I attached two related images in my initial comment.

@agilgur5 changed the title on Feb 26, 2024, from "Retry failed workflow with suspend and ttl shows failed initial status after secondsAfterFailure" to "Retry failed workflow with ttl deleted after initial secondsAfterFailure"
@agilgur5 removed the label area/suspend-resume (Suspending and resuming workflows) on Feb 26, 2024
@agilgur5
Contributor

I attached two related images in my initial comment

Yes, I saw those, but I was wondering whether the Archived Workflow might update after, say, 10 more minutes, i.e. whether the last screenshot stays the same or changes to match the retried Workflow.

Per your response, it stays the same, so it seems like there's a secondary unhandled race condition here. Although fixing the TTL issue might resolve that race as well; the TTL GC is not anticipating an incomplete Workflow, so in this case GC is happening before archiving (since only completed Workflows get archived)

@agilgur5 added the label area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) on Feb 26, 2024
@agilgur5 changed the title on Feb 26, 2024, from "Retry failed workflow with ttl deleted after initial secondsAfterFailure" to "Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running"
@malisettikalyan

malisettikalyan commented Feb 29, 2024

@agilgur5 @manuelbmar We are also facing the same issue. We have set a TTL of 7 days for the workflow. Say a workflow failed 7 days ago, was retried 2 days ago, and is now in the Running state. The workflow is still deleted.

Is there any resolution for this issue? It seems to be a critical issue.

@manuelbmar
Author

@agilgur5 Any updates? Are there any open issues or pull requests being worked on to resolve the problem?

@agilgur5
Contributor

agilgur5 commented Mar 14, 2024

It seems to be a critical issue

This only occurs as a race condition with a combination of several features. Retries in particular should not be used frequently (as that would suggest there is an issue with the tasks themselves that should be fixed) and are also one of the most complex areas of the codebase. I.e. this is a low frequency + high complexity issue.
It is labeled P1 solely due to its upvotes (which, given they occurred almost instantly, are likely to have come from a single organization)

@agilgur5 Any updates? Are there any open issues or pull requests being worked on to resolve the problem?

If there were updates, they would already be in the thread. Please follow proper open source etiquette.

You are also more than welcome to contribute as you checked that you'd like to do.
Argo also recently started a sustainability effort.

Although fixing the TTL issue might resolve that race as well; the TTL GC is not anticipating an incomplete Workflow, so in this case GC is happening before archiving (since only completed Workflows get archived)

Clarification here, archiving only occurs for a completed Workflow, so this is a single bug. The solution is still likely to be to remove a Workflow from the TTL queue when it is retried.
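
To make the "single bug" point concrete, here is a small sketch of why the archive keeps showing the pre-retry status. It assumes, as stated above, that a snapshot is written to the archive only when a workflow reaches a completed phase; `WorkflowStores` and its methods are hypothetical names, not Argo APIs.

```python
from typing import Dict

COMPLETED_PHASES = {"Succeeded", "Failed", "Error"}


class WorkflowStores:
    """Hypothetical model of the live cluster state plus the workflow archive."""

    def __init__(self) -> None:
        self.live: Dict[str, str] = {}     # workflow name -> current phase
        self.archive: Dict[str, str] = {}  # workflow name -> last archived phase

    def update_phase(self, name: str, phase: str) -> None:
        self.live[name] = phase
        if phase in COMPLETED_PHASES:
            # Archiving happens only for completed workflows.
            self.archive[name] = phase

    def gc_delete(self, name: str) -> None:
        # TTL garbage collection removes the live object, not the archive row.
        self.live.pop(name, None)


def replay_issue_timeline() -> WorkflowStores:
    stores = WorkflowStores()
    stores.update_phase("wf", "Running")  # first execution
    stores.update_phase("wf", "Failed")   # step-0 fails; snapshot archived
    stores.update_phase("wf", "Running")  # manual retry; nothing new to archive
    stores.gc_delete("wf")                # stale TTL timer fires mid-run
    return stores
```

After the replay, the live workflow is gone and the only remaining record is the archived `Failed` snapshot, so no separate archive race is needed to explain the screenshots.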

@agilgur5 added the label solution/suggested (A solution to the bug has been suggested. Someone needs to implement it.) on Mar 14, 2024
siwet added a commit to siwet/argo-workflows that referenced this issue Apr 6, 2024
@siwet
Contributor

siwet commented Apr 6, 2024

Hi @agilgur5 @manuelbmar, we encountered a similar issue where we couldn't find a good solution for removing retried workflows from the TTL queue, as there seems to be no built-in method for deleting elements from the delayed queue.

As a temporary workaround, we have implemented a somewhat inelegant bypass, as in #12905, which incurs a query cost before the deletion operation, but it effectively addresses our current issue.

Regarding the idea of removing elements from the queue:

  • Adding a label named workflows.argoproj.io/retried to workflows that have been retried.
  • Adding logic in the handler function of the gc_controller.go informer to handle workflows with the 'retried' label, such as keeping a record of retried workflows and skipping them during dequeuing from the delayed queue.

Would this direction be feasible?
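
To make the two bullets above concrete, here is a rough, stdlib-only Go sketch of the idea. It is not taken from gc_controller.go: the `workflow` struct, `gcTracker`, `observe`, and `drain` are hypothetical names, and treating the presence of the proposed label as "skip deletion" is an assumption of this sketch.

```go
package main

import "fmt"

// retriedLabel is the label name proposed in this thread. Treating its
// presence as "skip TTL deletion" is an assumption of this sketch.
const retriedLabel = "workflows.argoproj.io/retried"

// workflow is a hypothetical, minimal stand-in for the real Workflow object.
type workflow struct {
	Key    string
	Labels map[string]string
}

// gcTracker remembers which workflow keys were seen with the retried label.
type gcTracker struct {
	retried map[string]bool
}

func newGCTracker() *gcTracker {
	return &gcTracker{retried: map[string]bool{}}
}

// observe plays the role of the informer handler: it records labelled workflows.
func (g *gcTracker) observe(wf workflow) {
	if _, ok := wf.Labels[retriedLabel]; ok {
		g.retried[wf.Key] = true
	}
}

// drain simulates dequeuing expired keys from the delayed queue: keys recorded
// as retried are skipped, all others are returned for deletion.
func (g *gcTracker) drain(expired []string) []string {
	var toDelete []string
	for _, key := range expired {
		if g.retried[key] {
			continue
		}
		toDelete = append(toDelete, key)
	}
	return toDelete
}

// skipRetriedScenario: both keys come off the queue, but only the workflow
// without the retried label is handed over for deletion.
func skipRetriedScenario() []string {
	g := newGCTracker()
	g.observe(workflow{Key: "ns/failed-wf"})
	g.observe(workflow{
		Key:    "ns/retried-wf",
		Labels: map[string]string{retriedLabel: "true"},
	})
	return g.drain([]string{"ns/failed-wf", "ns/retried-wf"})
}

func main() {
	fmt.Println(skipRetriedScenario()) // prints [ns/failed-wf]
}
```

The sketch never clears a record, so a workflow that is retried and later fails again would be skipped forever. A real implementation would need to decide when to forget a key.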

@agilgur5
Contributor

agilgur5 commented Apr 9, 2024

We encountered a similar issue where we couldn't find a good solution for removing retried workflows from the TTL queue, as there seems to be no built-in method for deleting elements from the delayed queue.

The client-go workqueue doesn't have a Remove function, but can't you just call Done early basically?

As a temporary workaround, we have implemented a somewhat inelegant bypass, as in #12905, which incurs a query cost before the deletion operation, but it effectively addresses our current issue.

Yea that could potentially add quite a lot of queries, since it's one more for every deletion 😕
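
To illustrate where that cost comes from, here is a minimal Go model of a re-check before deletion. It is a sketch under stated assumptions and not the code from #12905: `workflowState`, `store`, `shouldDelete`, and `collect` are hypothetical names, and an in-memory map with a lookup counter stands in for the real query.

```go
package main

import (
	"fmt"
	"time"
)

// workflowState is a hypothetical, minimal view of a workflow's latest state.
type workflowState struct {
	Completed  bool
	FinishedAt time.Time
}

// store stands in for whatever query returns the latest workflow state.
// lookups counts how many queries were issued.
type store struct {
	workflows map[string]workflowState
	lookups   int
}

func (s *store) get(key string) (workflowState, bool) {
	s.lookups++
	wf, ok := s.workflows[key]
	return wf, ok
}

// shouldDelete re-checks at dequeue time that the workflow is complete and
// that its TTL has elapsed since it finished.
func shouldDelete(wf workflowState, ttl time.Duration, now time.Time) bool {
	if !wf.Completed {
		return false // e.g. retried and running again
	}
	return !now.Before(wf.FinishedAt.Add(ttl))
}

// collect handles keys coming off the TTL queue. Every key costs one lookup
// before the deletion decision is made.
func collect(s *store, dequeued []string, ttl time.Duration, now time.Time) []string {
	var deleted []string
	for _, key := range dequeued {
		wf, ok := s.get(key)
		if !ok || !shouldDelete(wf, ttl, now) {
			continue
		}
		delete(s.workflows, key)
		deleted = append(deleted, key)
	}
	return deleted
}

// recheckScenario: 7-day TTL, one workflow failed 7 days ago, the other was
// retried and is running again when its old queue entry fires.
func recheckScenario() ([]string, int) {
	now := time.Date(2024, 2, 29, 0, 0, 0, 0, time.UTC)
	day := 24 * time.Hour
	s := &store{workflows: map[string]workflowState{
		"ns/failed-wf":  {Completed: true, FinishedAt: now.Add(-7 * day)},
		"ns/retried-wf": {Completed: false},
	}}
	deleted := collect(s, []string{"ns/failed-wf", "ns/retried-wf"}, 7*day, now)
	return deleted, s.lookups
}

func main() {
	deleted, lookups := recheckScenario()
	fmt.Println(deleted, lookups) // prints [ns/failed-wf] 2
}
```

The running workflow survives, but both dequeued keys needed a lookup, which is the per-deletion overhead mentioned above.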

  • Adding a label named workflows.argoproj.io/retried to workflows that have been retried.

We actually are implementing something similar in #12734 (see also #12538)

siwet added a commit to siwet/argo-workflows that referenced this issue Apr 10, 2024
…j#12636)

Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
agilgur5 added a commit that referenced this issue Apr 13, 2024
…#12905)

Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 13, 2024
agilgur5 pushed a commit that referenced this issue Apr 19, 2024
…#12905)

Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
(cherry picked from commit 2095621)
siwet added a commit to siwet/argo-workflows that referenced this issue Apr 21, 2024
Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this issue May 6, 2024
…j#12636) (argoproj#12905)

Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this issue May 7, 2024
…j#12636) (argoproj#12905)

Signed-off-by: Shiwei Tang <siwe.tang@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>