fix: Mark non-fulfilled taskSetNodes error when agent pod failed. Fixes #12703 #12723

Merged on Jul 23, 2024 (5 commits)

jswxstw (Member) commented on Mar 1, 2024

Fixes #12703

Motivation

  1. When the agent pod fails, the workflow is marked Error but the affected nodes are left in a non-fulfilled state.
  2. `argo-cluster-role` lacks permission to get the secret `argo-workflows-agent-ca-certificates`.

Modifications

  1. Mark non-fulfilled taskSetNodes as Error when the agent pod fails.
  2. Add `get` permission on the secret `argo-workflows-agent-ca-certificates` to `argo-cluster-role`.

Verification

Unit tests (UT).

Delete the `hello-executor-plugin` ServiceAccount to simulate an agent pod failure.

  1. Single-step example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-
spec:
  entrypoint: main
  templates:
    - name: main
      plugin:
        hello: { }
# argo get hello-q8h7f
Name:                hello-q8h7f
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Error
Message:             create agent pod failed with reason:"serviceaccounts "hello-executor-plugin" not found"
Conditions:
 PodRunning          False
 Completed           True
Created:             Fri Mar 01 15:55:18 +0800 (53 seconds ago)
Started:             Fri Mar 01 15:55:18 +0800 (53 seconds ago)
Finished:            Fri Mar 01 15:55:19 +0800 (52 seconds ago)
Duration:            1 second
Progress:            0/1

STEP            TEMPLATE  PODNAME  DURATION  MESSAGE
 ⚠ hello-q8h7f  main                         create agent pod failed with reason:"serviceaccounts "hello-executor-plugin" not found"

  2. Parallel-steps example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
      - - name: plugin
          template: hello-plugin
        - name: container
          template: hello-container
    - name: hello-plugin
      plugin:
        hello: {}
    - name: hello-container
      container:
        image: docker/whalesay:latest
        command: [sh, -c]
        args: ["echo hello"]
# argo get hello-dtkm9
Name:                hello-dtkm9
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             child 'hello-dtkm9-3577871718' failed
Conditions:
 PodRunning          False
 Completed           True
Created:             Fri Mar 01 15:55:35 +0800 (28 seconds ago)
Started:             Fri Mar 01 15:55:35 +0800 (28 seconds ago)
Finished:            Fri Mar 01 15:55:45 +0800 (18 seconds ago)
Duration:            10 seconds
Progress:            1/2
ResourcesDuration:   0s*(1 cpu),4s*(100Mi memory)

STEP              TEMPLATE         PODNAME                                 DURATION  MESSAGE
 ✖ hello-dtkm9    main                                                               child 'hello-dtkm9-3577871718' failed
 └─┬─✔ container  hello-container  hello-dtkm9-hello-container-1541734788  7s
   └─⚠ plugin     hello-plugin                                                       create agent pod failed with reason:"serviceaccounts "hello-executor-plugin" not found"

jswxstw marked this pull request as draft on March 1, 2024 02:38
jswxstw marked this pull request as ready for review on March 1, 2024 07:40
jswxstw force-pushed the fix-12703 branch 8 times, most recently from 4ce87ba to bafcc48, on March 5, 2024 09:26
agilgur5 added the area/agent (Argo Agent that runs for HTTP and Plugin templates) label on Jul 2, 2024

juliev0 (Contributor) commented on Jul 19, 2024

Using node-level errors seems good as far as I can tell. I want to confirm: when a standard (non-Agent) Pod fails to be created, I assume the Node ends up as NodeError, so this would behave the same way?

jswxstw (Member, Author) commented on Jul 22, 2024

You are right: if creating a non-agent pod fails with a non-transient error, or the pod errors during execution, the node is marked as NodeError.

…n to `argo-cluster-role`. Fixes argoproj#12703

Signed-off-by: oninowang <oninowang@tencent.com>
…ror. Fixes argoproj#12703

Signed-off-by: oninowang <oninowang@tencent.com>
Signed-off-by: oninowang <oninowang@tencent.com>
jswxstw requested a review from juliev0 on July 23, 2024 02:04
dosubot (bot) added the lgtm (This PR has been approved by a maintainer) label on Jul 23, 2024
juliev0 merged commit 1ed1368 into argoproj:main on Jul 23, 2024
28 checks passed
Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Sep 19, 2024
Joibel pushed a commit that referenced this pull request Sep 20, 2024
Linked issue: Workflow is Error but node is not Error when Agent error