fix: Do not reset the root node by default. Fixes #13196 #13198
Conversation
A single-node workflow with a failed exit handler cannot be retried correctly.
Can you add a test case?
Sure.
/retest
Thanks for fixing. Would you mind taking me through what happened to help me understand? I don't know how the exit handler is related. First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help) Thanks!
The root node of …
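For reference, a minimal sketch of the pre-fix reset behavior under discussion, with assumed simplified names (`wf`, `newWF`, `resetNode`); this is a paraphrase of the retry path, not the exact argo-workflows source:

```go
// Pre-fix retry behavior (simplified sketch, assumed names): the root
// node is reset to Running unconditionally, while every other node is
// reset only when it failed or errored.
for _, node := range wf.Status.Nodes {
	if node.Name == wf.ObjectMeta.Name {
		// Root node: always reset, even if it already Succeeded.
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
		continue
	}
	if node.FailedOrError() {
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
	} else {
		// Succeeded nodes are carried over untouched.
		newWF.Status.Nodes[node.ID] = node
	}
}
```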
Got it, thanks. I see the difference is the Group node would be a DAG, Steps, etc., which doesn't run. And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed, and would've had the same issue?
If there is only a single node without an exit handler, the workflow will be …
I'm confused. Is this not a single node Workflow?:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: entrypoint-fail
spec:
  entrypoint: failing-entrypoint
  templates:
    - name: failing-entrypoint
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["exit 1"]
```
Yes, this is a single-node workflow. But the root node is of `Pod` type, which does not need to be reset, since its pod will be deleted after a manual retry:

```
# argo retry entrypoint-fail
INFO[2024-08-25T00:25:22.288Z] Deleting pod  podDeleted=entrypoint-fail
INFO[2024-08-25T00:25:22.295Z] Workflow to be dehydrated  Workflow Size=902
Name:                entrypoint-fail
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
 Completed           False
Created:             Sun Aug 25 00:24:35 +0800 (47 seconds ago)
Started:             Sun Aug 25 00:25:22 +0800 (now)
Duration:            0 seconds
Progress:            0/1
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)
```
So, just trying to understand - this statement isn't necessarily true in all cases, right? After all, the Workflow I have above is an example of an unsucceeded Workflow, right?

But you're saying that in my example, the Root node is of type `Pod`. And in your example, what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

Thank you for bearing with me. I'm guessing your change is a good one. I just want to make sure I fully understand.
Okay, I just decided to try running yours. This is the result:

```yaml
nodes:
  workflow-exit-handler-fail:
    displayName: workflow-exit-handler-fail
    finishedAt: "2024-08-25T18:42:37Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail
    name: workflow-exit-handler-fail
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail/main.log
      exitCode: "0"
    phase: Succeeded
    progress: 1/1
    resourcesDuration:
      cpu: 0
      memory: 5
    startedAt: "2024-08-25T18:42:28Z"
    templateName: echo
    templateScope: local/workflow-exit-handler-fail
    type: Pod
  workflow-exit-handler-fail-925896592:
    children:
      - workflow-exit-handler-fail-1000917866
    displayName: workflow-exit-handler-fail.onExit
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-925896592
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit
    nodeFlag:
      hooked: true
    outboundNodes:
      - workflow-exit-handler-fail-2100426797
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: exit-handler
    templateScope: local/workflow-exit-handler-fail
    type: Steps
  workflow-exit-handler-fail-1000917866:
    boundaryID: workflow-exit-handler-fail-925896592
    children:
      - workflow-exit-handler-fail-2100426797
    displayName: '[0]'
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-1000917866
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit[0]
    nodeFlag: {}
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateScope: local/workflow-exit-handler-fail
    type: StepGroup
  workflow-exit-handler-fail-2100426797:
    boundaryID: workflow-exit-handler-fail-925896592
    displayName: exit-handler-task
    finishedAt: "2024-08-25T18:42:44Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail-2100426797
    message: Error (exit code 1)
    name: workflow-exit-handler-fail.onExit[0].exit-handler-task
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail-fail-2100426797/main.log
      exitCode: "1"
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: fail
    templateScope: local/workflow-exit-handler-fail
    type: Pod
```

So, root is a `Pod` here too.
The phase of the root node and the workflow is usually the same if there is no exit handler. Here the workflow is `Failed` while the root node is `Succeeded`, because the failed exit handler made the workflow fail.
I didn't notice this before. I agree with you that this special logic for resetting the root node seems useless now, since …

Perhaps to keep it consistent with the workflow phase?
Hey @terrytangyuan - I know this code was written a couple of years back, but do you have any input for this question?
It should be fine to remove that special handling.
Thanks for the quick response @terrytangyuan!
Signed-off-by: oninowang <oninowang@tencent.com>
fix: Do not reset the root node by default. Fixes #13196 (argoproj#13198) Signed-off-by: oninowang <oninowang@tencent.com>
Fixes #13196
Motivation
A single-node workflow with a failed exit handler cannot be retried correctly.
Modifications
Only reset the root node when it is `FailedOrError`; remove the default reset logic. A rough sketch of the shape of this change is below.
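A minimal sketch of the changed behavior, reusing the assumed simplified names (`wf`, `newWF`, `resetNode`) from the sketch earlier in the thread; this illustrates the shape of the fix, not the exact diff:

```go
// Post-fix retry behavior (simplified sketch, assumed names): the root
// node is no longer special-cased; like any other node, it is reset
// only when it is Failed or Errored.
for _, node := range wf.Status.Nodes {
	if node.FailedOrError() {
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
	} else {
		// A Succeeded root node (e.g. the single Pod in a workflow whose
		// exit handler failed) keeps its phase, so only the failed
		// exit-handler nodes re-run on retry.
		newWF.Status.Nodes[node.ID] = node
	}
}
```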
Verification

Local test and e2e test.