
fix: Clean up pods of fulfilled nodes when workflow manual retry. Fix… #12105

Open · wants to merge 2 commits into main

Conversation


@jswxstw jswxstw commented Oct 30, 2023

Fixes #12028

Motivation

PodGC strategy: OnWorkflowSuccess.
Pods of the previously successful steps were not cleaned up after a manual workflow retry.

Modifications

Pods of fulfilled nodes will be relabeled completed=false when the workflow is manually retried.
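
For illustration only, a minimal sketch of what such a relabel could look like with client-go (the package and function name here are hypothetical, not this PR's actual code; the full label key is assumed to be workflows.argoproj.io/completed):

```go
// Hedged sketch, not the PR's actual implementation: flip the "completed"
// label on a pod back to "false" so the controller reconsiders it for PodGC.
package retrysketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func relabelPodIncomplete(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	// Merge patch that only touches the label; everything else is left as-is.
	patch := []byte(`{"metadata":{"labels":{"workflows.argoproj.io/completed":"false"}}}`)
	if _, err := kubeClient.CoreV1().Pods(namespace).Patch(ctx, podName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("failed to relabel pod %s/%s: %w", namespace, podName, err)
	}
	return nil
}
```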

Verification

Workflow Demo

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-fail-with-argument-
spec:
  entrypoint: main
  podGC:
    strategy: OnWorkflowSuccess
  arguments:
    parameters:
      - name: code
        value: 1
  templates:
  - name: main
    steps:
    - - name: print
        template: print-with-argument
        arguments:
          parameters:
            - name: code
              value: '{{workflow.parameters.code}}'
    - - name: check-status
        template: fail-with-argument
        arguments:
          parameters:
            - name: code
              value: '{{workflow.parameters.code}}'
  - name: print-with-argument
    inputs:
      parameters:
        - name: code
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["echo {{inputs.parameters.code}}"]
  - name: fail-with-argument
    inputs:
      parameters:
        - name: code
    container:
      image: python:alpine3.6
      command: [python, -c]
      args: ["import sys; sys.exit({{inputs.parameters.code}})"]
1. The workflow failed on its first execution.
2. The failed workflow was retried manually; it then executed successfully and reached the Succeeded phase.
3. All pods were successfully cleaned up.
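
A hedged reproduction with the argo CLI might look like this (the workflow name suffix is a placeholder, and the -p override on argo retry assumes a CLI version that supports retry parameters):

```shell
# Submit the demo above; check-status exits 1, so the workflow fails.
argo submit --watch workflow-fail-with-argument.yaml

# Manually retry, overriding the parameter so the workflow can succeed.
argo retry workflow-fail-with-argument-xxxxx -p code=0

# With podGC strategy OnWorkflowSuccess, no pods should remain afterwards.
kubectl get pods -l workflows.argoproj.io/workflow=workflow-fail-with-argument-xxxxx
```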

Signed-off-by: oninowang <oninowang@tencent.com>
@jswxstw jswxstw closed this Oct 30, 2023
@jswxstw jswxstw reopened this Oct 30, 2023
@agilgur5 added the area/api, area/controller, area/retry-manual, and area/server labels and removed the area/api label on Oct 30, 2023
```
@@ -29,6 +29,7 @@ rules:
- list
- watch
- delete
- patch
```
@agilgur5 agilgur5 Oct 30, 2023

This is a pretty minor permission given the Server already has delete, but I was thinking this logic may actually make more sense in the Controller.

When the Controller detects a retry, it can check the Workflow's child Pods.

Need to think a bit more about how that would work, but that would preserve the existing separation of duties between Server and Controller, where the Server is just a simple intermediary for users that can be bypassed with correct RBAC.
The Server primarily reads and listens to changes, and its modifications are limited to signaling the Controller to perform an action (via a label, for instance). The Controller logic is actually responsible for performing the actions themselves (such as retrying).
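
As a hedged illustration of that split (hypothetical label and function names, not existing Argo code), the Server-side change could be as small as a label patch on the Workflow, which the Controller then reconciles:

```go
// Hedged sketch: the Server only marks the Workflow; the Controller, which
// already watches Workflows, notices the label, resets/relabels or deletes
// the child pods as needed, and then clears the label.
package retrysketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

func signalManualRetry(ctx context.Context, wfClient wfclientset.Interface, namespace, name string) error {
	// Hypothetical signal label; a real implementation would pick its own key.
	patch := []byte(`{"metadata":{"labels":{"workflows.argoproj.io/needs-retry":"true"}}}`)
	_, err := wfClient.ArgoprojV1alpha1().Workflows(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

The point of the sketch is that only the Controller ever touches Pods; the Server's write is limited to the Workflow object it already manages.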

@agilgur5 agilgur5 Oct 30, 2023

Hmmm I see that podsToDelete already broke the separation a bit... we honestly may want to refactor that too...

@terrytangyuan do you have any thoughts on this? Specifically, the current behavior with the Server having Pod modification logic actually invalidates your answer in #12027 (comment). I think it ideally should behave the way you described there (and how I described above), if possible.

@terrytangyuan

You are right. RBAC modifications are required for my answer in #12027 (comment).

@agilgur5

@terrytangyuan I was actually looking for your thoughts on the approach. I think we should refactor this logic to not have the Server perform anything but a label modification, and then the Controller should actually delete, reset, etc child Pods as needed (as that is the Controller's responsibility, not the Server's)

@sarabala1979 sarabala1979 Nov 27, 2023

@terrytangyuan I agree with @agilgur5. The Server should not update or delete anything other than Argo Workflow CRDs. If a user updates the Workflow spec directly, this will not work. The Controller should handle cleaning up the workflow pods based on the PodGC strategy.
I remember @ishitasequeira proposed deleting all workflow pods label-based (similar to workflow-name-based deletion) once the workflow completes.
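
For illustration, label-based cleanup could look like the following (the label keys assume Argo's standard pod labels; the workflow name is a placeholder):

```shell
# Delete every pod that belongs to the completed workflow, selected by label
# rather than by enumerating pod names.
kubectl delete pods -l 'workflows.argoproj.io/workflow=workflow-fail-with-argument-xxxxx,workflows.argoproj.io/completed=true'
```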

@agilgur5

Julie agreed in #12419 (comment) as well, so I filed #12538 as a tracking issue for this refactor

Member

SGTM

```
@@ -299,6 +300,20 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
}
}

for _, podName := range podsToReset {
```
@agilgur5 agilgur5 Oct 30, 2023

I was gonna say that this may make sense to place into a helper function, but I see that #7988 (comment) explicitly moved the kubeClient logic out of the helper function (when retry was introduced for archived workflows in #7988).

So this matches existing behavior with podsToDelete above. If we refactored these into the Controller (per above comment), this would go away though 🤔
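
If it were factored back into a helper, the loop could collapse to something like this sketch (relabelPodIncomplete is the hypothetical helper sketched in the PR description above, not code from this PR):

```go
for _, podName := range podsToReset {
	// Flip workflows.argoproj.io/completed back to "false" via the shared helper.
	if err := relabelPodIncomplete(ctx, kubeClient, wf.Namespace, podName); err != nil {
		return nil, err
	}
}
```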

@agilgur5 (Member) commented Oct 30, 2023

Thanks for taking the time to investigate and fix this!

Per the in-line comments, we may want to refactor some existing logic that this PR highlights as potentially having broken the Controller/Server separation of responsibilities.

I think the current code more or less makes sense within that existing context (the re-processing checks in the Controller are a bit confusing, though that's partially due to the existing logic; retries are also one of the most complex parts of the codebase), but we may want to change that existing code.

@jswxstw (Member, Author) commented Oct 31, 2023

Retries (both automatic and manual) are indeed one of the most complex parts of the codebase, and they bring a lot of problems like #12009 and #12010.
Moving the implementation into the Controller really makes sense; I'm glad to see this improvement.
