fix: Do not reset the root node by default. Fixes #13196 #13198
Conversation
A single-node workflow with a failed exit handler cannot be retried correctly.
Can you add a test case?
Sure.
/retest
Thanks for fixing. Would you mind taking me through what happened to help me understand? I don't know how the exit handler is related. First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help) Thanks!
The root node of …
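For reference, a minimal sketch of the pre-fix reset behavior under discussion, with assumed simplified names (`wf`, `newWF`, `resetNode`); this is a paraphrase of the retry path, not the exact argo-workflows source:

```go
// Pre-fix retry behavior (simplified sketch, assumed names): the root
// node is reset to Running unconditionally, while every other node is
// reset only when it failed or errored.
for _, node := range wf.Status.Nodes {
	if node.Name == wf.ObjectMeta.Name {
		// Root node: always reset, even if it already Succeeded.
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
		continue
	}
	if node.FailedOrError() {
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
	} else {
		// Succeeded nodes are carried over untouched.
		newWF.Status.Nodes[node.ID] = node
	}
}
```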
Got it, thanks. I see the difference is the Group node would be a DAG, Steps, etc., which doesn't run. And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed, and would've had the same issue?
If there is only a single node without an exit handler, the workflow will be …
I'm confused. Is this not a single node Workflow?:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: entrypoint-fail
spec:
  entrypoint: failing-entrypoint
  templates:
    - name: failing-entrypoint
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["exit 1"]
```
Yes, this is a single-node workflow. But the root node is of `Pod` type, which does not need to be reset, since its pod will be deleted after a manual retry:

```
# argo retry entrypoint-fail
INFO[2024-08-25T00:25:22.288Z] Deleting pod  podDeleted=entrypoint-fail
INFO[2024-08-25T00:25:22.295Z] Workflow to be dehydrated  Workflow Size=902
Name:                entrypoint-fail
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
 Completed           False
Created:             Sun Aug 25 00:24:35 +0800 (47 seconds ago)
Started:             Sun Aug 25 00:25:22 +0800 (now)
Duration:            0 seconds
Progress:            0/1
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)
```
So, just trying to understand - this statement isn't necessarily true in all cases, right? After all, the Workflow I have above is an example of an unsucceeded Workflow, right?

But you're saying that in my example, the Root node is of type `Pod`. And in your example, what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

Thank you for bearing with me. I'm guessing your change is a good one. I just want to make sure I fully understand.
Okay, I just decided to try running yours. This is the result:

```yaml
nodes:
  workflow-exit-handler-fail:
    displayName: workflow-exit-handler-fail
    finishedAt: "2024-08-25T18:42:37Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail
    name: workflow-exit-handler-fail
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail/main.log
      exitCode: "0"
    phase: Succeeded
    progress: 1/1
    resourcesDuration:
      cpu: 0
      memory: 5
    startedAt: "2024-08-25T18:42:28Z"
    templateName: echo
    templateScope: local/workflow-exit-handler-fail
    type: Pod
  workflow-exit-handler-fail-925896592:
    children:
      - workflow-exit-handler-fail-1000917866
    displayName: workflow-exit-handler-fail.onExit
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-925896592
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit
    nodeFlag:
      hooked: true
    outboundNodes:
      - workflow-exit-handler-fail-2100426797
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: exit-handler
    templateScope: local/workflow-exit-handler-fail
    type: Steps
  workflow-exit-handler-fail-1000917866:
    boundaryID: workflow-exit-handler-fail-925896592
    children:
      - workflow-exit-handler-fail-2100426797
    displayName: '[0]'
    finishedAt: "2024-08-25T18:42:47Z"
    id: workflow-exit-handler-fail-1000917866
    message: child 'workflow-exit-handler-fail-2100426797' failed
    name: workflow-exit-handler-fail.onExit[0]
    nodeFlag: {}
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateScope: local/workflow-exit-handler-fail
    type: StepGroup
  workflow-exit-handler-fail-2100426797:
    boundaryID: workflow-exit-handler-fail-925896592
    displayName: exit-handler-task
    finishedAt: "2024-08-25T18:42:44Z"
    hostNodeName: lima-rancher-desktop
    id: workflow-exit-handler-fail-2100426797
    message: Error (exit code 1)
    name: workflow-exit-handler-fail.onExit[0].exit-handler-task
    outputs:
      artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail-fail-2100426797/main.log
      exitCode: "1"
    phase: Failed
    progress: 0/1
    resourcesDuration:
      cpu: 0
      memory: 2
    startedAt: "2024-08-25T18:42:40Z"
    templateName: fail
    templateScope: local/workflow-exit-handler-fail
    type: Pod
```

So, root is a `Pod` here too.
The phase of the root node and the workflow is usually the same if there is no exit handler. Here the workflow is `Failed` while the root node is `Succeeded`, because the failed exit handler made the workflow fail.
I didn't notice this before. I agree with you that this special logic for resetting the root node seems useless now, since …

Perhaps to keep it consistent with the workflow phase?
Hey @terrytangyuan - I know this code was written a couple of years back, but do you have any input for this question?
It should be fine to remove that special handling.
Thanks for the quick response @terrytangyuan!
Signed-off-by: oninowang <oninowang@tencent.com>
fix: Do not reset the root node by default. Fixes #13196 (argoproj#13198) Signed-off-by: oninowang <oninowang@tencent.com>
Fixes #13196
Motivation
A single-node workflow with a failed exit handler cannot be retried correctly.
Modifications
Only reset the root node when it is `FailedOrError`; remove the default reset logic. A rough sketch of the shape of this change is below.
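A minimal sketch of the changed behavior, reusing the assumed simplified names (`wf`, `newWF`, `resetNode`) from the sketch earlier in the thread; this illustrates the shape of the fix, not the exact diff:

```go
// Post-fix retry behavior (simplified sketch, assumed names): the root
// node is no longer special-cased; like any other node, it is reset
// only when it is Failed or Errored.
for _, node := range wf.Status.Nodes {
	if node.FailedOrError() {
		newWF.Status.Nodes[node.ID] = resetNode(*node.DeepCopy())
	} else {
		// A Succeeded root node (e.g. the single Pod in a workflow whose
		// exit handler failed) keeps its phase, so only the failed
		// exit-handler nodes re-run on retry.
		newWF.Status.Nodes[node.ID] = node
	}
}
```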
Verification

Local test and e2e test.