Move `retry` Pod deletions out of Server and into Controller for proper separation of duties #12538

agilgur5 · 2024-01-17T20:19:04Z

Summary

Currently, the Server and Controller are designed to be independent, and the Server is not strictly necessary for any operation. Most of the Server's functionality is a simple CRUD wrapper that a user could replicate themselves via kubectl. When the Server has to communicate with the Controller, it typically signals it by adding a label to a Workflow.

This separation of duties is important to keep consistent and currently holds for all but one case: for the retry operation, the Server itself deletes the Pods of a Workflow, which is something the Controller should do instead. The Server also shouldn't need permission to delete Pods, as it currently does.

Use Cases

I (and then others) noticed this in #12105 (comment) and #12419 (comment) and were pretty surprised when we saw this.

Removing this functionality from the Server will make it more secure, since it will no longer need delete permissions on Pods, which the Controller already has.

It will also make it possible to do a retry with just kubectl by adding a label to the Workflow CR, as is intended and as was thought possible per #12027 (comment).
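As a rough sketch of what such a label-based trigger could look like, the Go snippet below builds the JSON merge patch that either the Server or a user could apply to a Workflow. The label key `workflows.example.com/retry` and the function name `retryLabelPatch` are hypothetical; this issue does not name the actual label the Controller would watch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// retryLabelKey is a hypothetical label key used only for illustration;
// the issue does not specify which label the Controller would watch.
const retryLabelKey = "workflows.example.com/retry"

// retryLabelPatch returns a JSON merge patch that sets the retry label
// on a Workflow. Under this proposal, applying such a patch would be the
// only write the Server (or a kubectl user) makes to request a retry.
func retryLabelPatch() ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{retryLabelKey: "true"},
		},
	}
	return json.Marshal(patch)
}

func main() {
	patch, err := retryLabelPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

With kubectl, the equivalent would be something along the lines of `kubectl label workflow <name> workflows.example.com/retry=true`, again assuming that hypothetical key.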

Implementation details

The Server should only label the Workflow
The Server should no longer need delete pods permissions
The Controller should detect that label as a trigger for the retry
The Controller should perform the Pod deletion and then initiate the retry
The Controller should handle bugs / missing functionality such as those in "fix: Clean up pods of fulfilled nodes when workflow manual retry. Fix…" #12105 and "Completed pods were not all cleaned when workflow is succeeded after manual retry" #12028
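The steps above could be sketched on the Controller side roughly as follows. This is a minimal illustration under stated assumptions, not the actual Controller code: `Workflow` is a simplified stand-in for the real CR, `PodDeleter` and `reconcileRetry` are hypothetical names, and the label key is the same made-up one as before. An in-memory fake replaces the Kubernetes API so the sketch runs without a cluster.

```go
package main

import "fmt"

// retryLabelKey is a hypothetical label key; the issue does not name one.
const retryLabelKey = "workflows.example.com/retry"

// Workflow is a simplified stand-in for the Workflow CR, not the real type.
type Workflow struct {
	Name   string
	Labels map[string]string
	Pods   []string // names of Pods belonging to nodes that will be retried
	Phase  string
}

// PodDeleter is a hypothetical abstraction over the Kubernetes Pods API.
type PodDeleter interface {
	DeletePod(name string) error
}

// reconcileRetry sketches the Controller side: it does nothing unless the
// retry label is present, deletes the Workflow's Pods, and only then clears
// the label and marks the Workflow as running again.
func reconcileRetry(wf *Workflow, pods PodDeleter) (bool, error) {
	if wf.Labels[retryLabelKey] != "true" {
		return false, nil
	}
	for _, name := range wf.Pods {
		if err := pods.DeletePod(name); err != nil {
			// The label stays in place so a later reconcile can try again.
			return false, fmt.Errorf("delete pod %s: %w", name, err)
		}
	}
	wf.Pods = nil
	delete(wf.Labels, retryLabelKey)
	wf.Phase = "Running"
	return true, nil
}

// fakePods records deletions in memory so the sketch runs without a cluster.
type fakePods struct{ deleted []string }

func (f *fakePods) DeletePod(name string) error {
	f.deleted = append(f.deleted, name)
	return nil
}

func main() {
	wf := &Workflow{
		Name:   "example",
		Labels: map[string]string{retryLabelKey: "true"},
		Pods:   []string{"example-a", "example-b"},
		Phase:  "Failed",
	}
	pods := &fakePods{}
	retried, err := reconcileRetry(wf, pods)
	fmt.Println(retried, err, pods.deleted, wf.Phase)
}
```

Clearing the label only after every Pod deletion succeeds keeps the trigger idempotent in this sketch: a failed reconcile leaves the request visible on the Workflow instead of silently dropping it.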

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.

terrytangyuan · 2024-02-18T04:21:47Z

SGTM

agilgur5 added the type/feature, area/controller, and area/retry-manual labels Jan 17, 2024

agilgur5 mentioned this issue Jan 17, 2024

feat: delete pods in parallel to speed up retryworkflow #12419

Merged

agilgur5 added the type/security and area/server labels Jan 17, 2024

agilgur5 mentioned this issue Jan 17, 2024

fix: Clean up pods of fulfilled nodes when workflow manual retry. Fix… #12105

Open

agilgur5 added the solution/suggested label Jan 17, 2024

agilgur5 mentioned this issue Feb 2, 2024

REQUEST: Promotion to Approver for @agilgur5 argoproj/argoproj#277

Closed

agilgur5 mentioned this issue Feb 10, 2024

feat: speed up retry archived workflow #12624

Closed

This was referenced Feb 23, 2024

Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running #12636

Closed

Casbin RBAC for Argo Server #6490

Open

shuangkun linked a pull request Mar 9, 2024 that will close this issue

refactor: change the logic of delete pod during retry. Fixes: #12538 #12734

Open

agilgur5 mentioned this issue Apr 3, 2024

workflows list page delay in showing retry status from details page #12868

Open

agilgur5 assigned shuangkun Apr 9, 2024

agilgur5 mentioned this issue Apr 12, 2024

retry DAG workflow with depends failed with msg Ancestor task node step2 not found #12924

Closed

agilgur5 mentioned this issue May 15, 2024

Resume/suspend/terminate/stop will result in invalid state #2942

Open
