Some parallel Steps may get stuck in the Running phase when a workflow is retried manually #12010

Open
2 of 3 tasks
jswxstw opened this issue Oct 16, 2023 · 1 comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries P3 Low priority type/bug

Comments

@jswxstw
Member

jswxstw commented Oct 16, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

image

As the screenshot shows, the workflow has three steps that run in parallel: step-case (Failed), retry-step-case (Failed), and retry-step-group-case (Succeeded).
If I retry this workflow manually, retry-step-group-case(0) gets stuck in the Running phase even after the workflow has completed.
image

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-fail-with-argument-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: code
        value: 1
  templates:
  - name: main
    steps:
    - - name: retry-step-group-case
        template: fail-step-group
      - name: retry-step-case
        template: fail-with-argument-with-retry
        arguments:
          parameters:
            - name: code
              value: '{{workflow.parameters.code}}'
      - name: step-case
        template: fail-with-argument
        arguments:
          parameters:
            - name: code
              value: '{{workflow.parameters.code}}'
  - name: fail-with-argument-with-retry
    inputs:
      parameters:
        - name: code
    container:
      image: python:alpine3.6
      command: [python, -c]
      args: ["import sys; sys.exit({{inputs.parameters.code}})"]
    retryStrategy:
      limit: "5"
      backoff:
        duration: "5"
        factor: "2"
        maxDuration: "1m"
  - name: fail-with-argument
    inputs:
      parameters:
        - name: code
    container:
      image: python:alpine3.6
      command: [python, -c]
      args: ["import sys; sys.exit({{inputs.parameters.code}})"]
  - name: fail-with-rate
    container:
      image: python:alpine3.6
      command: [python, -c]
      args: ["import random; import sys; exit_code = random.choice([0, 0, 1]); print(exit_code); sys.exit(exit_code)"]
  - name: fail-step-group
    steps:
    - - name: step1
        template: fail-with-rate
    - - name: step2
        template: fail-with-rate
    - - name: step3
        template: fail-with-rate
    retryStrategy:
      limit: "5"
      backoff:
        duration: "5"
        factor: "2"
        maxDuration: "1m"

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@agilgur5 agilgur5 added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries P3 Low priority labels Oct 16, 2023
@jswxstw
Member Author

jswxstw commented Oct 18, 2023

workflow_server.go RetryWorkflow() -> util.go FormulateRetryWorkflow()

case wfv1.NodeError, wfv1.NodeFailed, wfv1.NodeOmitted:
	if isGroupNode(node) {
		// retry-step-group-case(0) is a Steps node, i.e. a group node,
		// so it is reset to the Running phase here.
		// Its parent, retry-step-group-case, is already Succeeded, so the
		// child's Running phase does not affect workflow execution, and
		// nothing moves the child out of Running afterwards.
		newNode := node.DeepCopy()
		newWF.Status.Nodes.Set(newNode.ID, resetNode(*newNode))
		log.Debugf("Reset %s node %s since it's a group node", node.Name, string(node.Phase))
		continue
	} else {
		log.Debugf("Deleted %s node %s since it's not a group node", node.Name, string(node.Phase))
		deletedPods, podsToDelete = deletePodNodeDuringRetryWorkflow(wf, node, deletedPods, podsToDelete)
		log.Debugf("Deleted pod node: %s", node.Name)
		deletedNodes[node.ID] = true
	}
