
feat: add finalizer to workflow pod to prevent pod deleted. Fixes #8783 #9058

Closed
wants to merge 33 commits

Conversation

rohankmr414
Member

@rohankmr414 rohankmr414 commented Jun 25, 2022

Signed-off-by: Rohan Kumar rohan@pipekit.io

Fixes #8783

@dpadhiar
Member

dpadhiar commented Jun 27, 2022

@rohankmr414 could you write a test case or add an example of this to your PR description?

@juliev0 @sarabala1979 the e2e tests for this PR are all failing with a 15m0s timeout; something to note, I assume.

@juliev0
Contributor

juliev0 commented Jun 27, 2022

@rohankmr414 Regarding the test failures, we should determine whether they're "real" or intermittent. Have you tried running them locally? You should be able to run "make [testname]", e.g. "make TestMarkDaemonedPodSucceeded".

@sarabala1979
Member

@rohankmr414 This is a nice fix. Please add a unit test and an e2e test, and fix the test failures.

@rohankmr414
Member Author

=== RUN   TestDaemonPodSuite/TestMarkDaemonedPodSucceeded
Submitting workflow  daemoned-pod-completed-
Waiting 1m0s for workflow metadata.name=daemoned-pod-completed-x6jsz
 ? daemoned-pod-completed-x6jsz Workflow 0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ◷ daemoned                     Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ◷ daemoned                     Pod      0s      PodInitializing

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ◷ daemon-dependent             Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ◷ daemon-dependent             Pod      0s      PodInitializing

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ● daemon-dependent             Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ● daemon-dependent             Pod      0s      

 ✔ daemoned-pod-completed-x6jsz   Workflow 9s      
 └ ✔ daemoned-pod-completed-x6jsz DAG      9s      
 └ ✔ daemoned                     Pod      3s      
 └ ✔ daemon-dependent             Pod      5s      

 ✔ daemoned-pod-completed-x6jsz   Workflow 9s      
 └ ✔ daemoned-pod-completed-x6jsz DAG      9s      
 └ ✔ daemoned                     Pod      3s      
 └ ✔ daemon-dependent             Pod      5s      

Condition "to be Succeeded" met after 9s
Checking expectation daemoned-pod-completed-x6jsz
daemoned-pod-completed-x6jsz : Succeeded 
=== PASS: DaemonPodSuite/TestMarkDaemonedPodSucceeded
    --- PASS: TestDaemonPodSuite/TestMarkDaemonedPodSucceeded (9.54s)
PASS

[Screenshot from 2022-07-13 19-56-19]

@sarabala1979 @juliev0 For some reason, the tests are passing in my local env with the finalizer; I'm not sure what's happening here.

@juliev0
Contributor

juliev0 commented Jul 13, 2022

[quoted rohankmr414's test output and comment above]

Hmm, sometimes there are intermittent failures unfortunately. We have a bug logged for that (there's an idea that the issue is caused by some overall timeout being too short). Can you push an "empty commit" (i.e. add a space or line or something) to re-trigger the tests? Then you can see whether the same tests fail every time or only intermittently. (Sorry for the trouble.)

@rohankmr414
Member Author

@juliev0 Actually I did a rebase and force-push, which re-triggered the tests, and the same tests are failing again. I think the reason is that in some tests pods are being terminated and get stuck in a Terminating state due to the finalizer, which somehow prevents the next test from running. It happened in my local env as well.
This is after running make test-functional E2E_SUITE_TIMEOUT=15m STATIC_FILES=false: the test TestDaemonTemplateRef prevents TestMarkDaemonedPodSucceeded from being run, as seen in https://github.com/argoproj/argo-workflows/runs/7320582957?check_suite_focus=true
[Screenshot from 2022-07-13 22-11-45]
[Screenshot from 2022-07-13 22-12-08]

If I remove the finalizer manually, the stuck pods are terminated and the tests run fine.
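
As a side note, here is a rough client-go sketch of that manual cleanup (the helper name and variables are hypothetical, not code from this PR): clearing the pod's finalizers so a stuck pod can terminate, equivalent to kubectl patch pod <name> -p '{"metadata":{"finalizers":null}}'.

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizers strips all finalizers from a pod via a JSON merge patch so
// that a pod stuck in Terminating can finish deleting.
func removeFinalizers(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := clientset.CoreV1().Pods(namespace).Patch(ctx, podName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}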

@juliev0
Contributor

juliev0 commented Jul 13, 2022

Oh, so it was after you did a rebase, meaning some recent commit in master caused it to change for you? Sorry about that.

When you say "some test pods are being terminated" do you mean the Controller is not the one deleting them and therefore they're not going through your logic to remove the finalizer? Unfortunately, I'm not an expert on those tests. Can you investigate and modify the tests if needed?

@rohankmr414
Member Author

@juliev0 The error is not because of the rebase; I guess it is because of some bug in the pod deletion logic.

@juliev0
Contributor

juliev0 commented Jul 13, 2022

@juliev0 The error is not because of the rebase; I guess it is because of some bug in the pod deletion logic.

Okay, thanks for the clarification. If you are able to investigate the bug that would be great.

@rohankmr414 rohankmr414 force-pushed the finalizer branch 4 times, most recently from 762c791 to b164fd4 on July 21, 2022 20:49
@rohankmr414
Member Author

@juliev0 @sarabala1979 The issue occurs when running multiple tests: some pods from the previous test are terminated without being labelled as completed, which prevents the next test from running. When the tests are run individually, the pods are labelled as completed successfully and cleaned up later; I'm not really sure what's happening here.

@juliev0
Contributor

juliev0 commented Jul 26, 2022

@juliev0 @sarabala1979 The issue occurs when running multiple tests: some pods from the previous test are terminated without being labelled as completed, which prevents the next test from running. When the tests are run individually, the pods are labelled as completed successfully and cleaned up later; I'm not really sure what's happening here.

Sorry for the lack of response. I'm new to the project so I don't have any insider information. :) It looks like you've made some recent commits; are you able to address this test issue?

@rohankmr414 rohankmr414 force-pushed the finalizer branch 7 times, most recently from 68ff951 to f1ff236 on August 1, 2022 14:38
Contributor

@alexec alexec left a comment


I don't think any e2e tests should be changed. If you changed them, then maybe the code does not work.

Code looks correct to me otherwise.

@alexec alexec added the pinned label Aug 10, 2022
* fix: Fix test fail flakey on finalizer branch

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
Signed-off-by: Rohan Kumar <rohan@pipekit.io>
@juliev0
Contributor

juliev0 commented Nov 29, 2023

Sorry for the late response. Thanks for fixing the DCO. I know that can be annoying. (For the future, it seems reasonable that we could make an exception here and there for that but I’m not sure what the precedent is.)

@sakai-ast
Contributor

sakai-ast commented Nov 30, 2023

Thank you for fixing the DCO error. However, another CI test is now failing...
I have tried to reproduce the failed test, but could not fully reproduce the problem this time.

@juliev0
I believe this is the first time this CI test has failed on this PR, and it looks flaky.
I think this PR is ready to merge; should we still address this failed test?

@juliev0
Contributor

juliev0 commented Nov 30, 2023

Unfortunately there are flakey tests. The typical thing to do in that case is to push repeated “empty commits” until everything passes.

@sakai-ast
Contributor

@rohankmr414
If you have time, it would be greatly appreciated if you could address the above until the CI passes.

@sakai-ast
Contributor

Thank you for your help.

@juliev0
All tests have passed. Could you please confirm?

Operation: "replace",
Path: "/metadata/labels/workflows.argoproj.io~1completed",
Value: "true",
},
Contributor


is this label right? It's not supposed to be "workflows.argoproj.io/completed", is it?

Contributor


if so, you can use common.LabelKeyCompleted

Contributor


okay, thanks
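
For reference, a minimal sketch of what this suggestion could look like: build the JSON-patch path from the shared constant instead of hard-coding the label (the ~1 is the JSON Pointer escape for /). The helper name here is hypothetical and the PR's actual test code may differ.

import (
	"strings"

	"github.com/argoproj/argo-workflows/v3/workflow/common"
)

// completedLabelPatchPath builds the JSON-patch path for the completed label
// from common.LabelKeyCompleted ("workflows.argoproj.io/completed"), escaping
// "/" as "~1" per RFC 6901 (JSON Pointer).
func completedLabelPatchPath() string {
	return "/metadata/labels/" + strings.ReplaceAll(common.LabelKeyCompleted, "/", "~1")
}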

propagation := metav1.DeletePropagationBackground
err := pods.Delete(ctx, podName, metav1.DeleteOptions{
err = pods.Delete(ctx, podName, metav1.DeleteOptions{
Contributor


I assume that in general, a given Pod reaches the labelPodCompleted and then the deletePod logic? Do you know if it's possible for that not to happen?

Contributor


The execution of the deletePod logic is decided by the following function.

func determinePodCleanupAction(
	selector labels.Selector,
	podLabels map[string]string,
	strategy wfv1.PodGCStrategy,
	workflowPhase wfv1.WorkflowPhase,
	podPhase apiv1.PodPhase,
) podCleanupAction {
	switch {
	case !selector.Matches(labels.Set(podLabels)): // if the pod will never be deleted, label it now
		return labelPodCompleted
	case strategy == wfv1.PodGCOnPodNone:
		return labelPodCompleted
	case strategy == wfv1.PodGCOnWorkflowCompletion && workflowPhase.Completed():
		return deletePod
	case strategy == wfv1.PodGCOnWorkflowSuccess && workflowPhase == wfv1.WorkflowSucceeded:
		return deletePod
	case strategy == wfv1.PodGCOnPodCompletion:
		return deletePod
	case strategy == wfv1.PodGCOnPodSuccess && podPhase == apiv1.PodSucceeded:
		return deletePod
	case strategy == wfv1.PodGCOnPodSuccess && podPhase == apiv1.PodFailed:
		return labelPodCompleted
	case workflowPhase.Completed():
		return labelPodCompleted
	}
	return ""
}

Which action is chosen depends on the Pod GC strategy; a small illustration follows the strategy list below.
https://argoproj.github.io/argo-workflows/fields/#:~:text=pod%20GC%20queue.-,strategy,-string

# Pod GC strategy must be one of the following:
# * OnPodCompletion - delete pods immediately when pod is completed (including errors/failures)
# * OnPodSuccess - delete pods immediately when pod is successful
# * OnWorkflowCompletion - delete pods when workflow is completed
# * OnWorkflowSuccess - delete pods when workflow is successful
# Default: do not delete pods
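
As a small illustration of the mapping above (assuming the call sits in the same controller package, since determinePodCleanupAction is unexported, and using made-up inputs): with the OnPodSuccess strategy a succeeded pod is deleted immediately, while a failed pod is only labelled as completed.

// Illustrative only; the selector, labels, and phases are hypothetical values.
// labels is k8s.io/apimachinery/pkg/labels, apiv1 is k8s.io/api/core/v1.
selector := labels.Everything() // assumed: no spec.podGC.labelSelector configured

action := determinePodCleanupAction(
	selector,
	map[string]string{},    // pod labels
	wfv1.PodGCOnPodSuccess, // spec.podGC.strategy
	wfv1.WorkflowRunning,   // workflow phase
	apiv1.PodSucceeded,     // pod phase
)
// action == deletePod

action = determinePodCleanupAction(
	selector,
	map[string]string{},
	wfv1.PodGCOnPodSuccess,
	wfv1.WorkflowRunning,
	apiv1.PodFailed,
)
// action == labelPodCompleted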

@@ -168,6 +168,9 @@ const (
// Finalizer to block deletion of the workflow if deletion of artifacts fail for some reason.
FinalizerArtifactGC = workflow.WorkflowFullName + "/artifact-gc"

// Finalizer blocks deletion of pods until we've captured their status.
Finalizer = workflow.WorkflowFullName + "/status"

Contributor


Maybe these 2 finalizers should be named to distinguish what they're being applied to? Like:

FinalizerWorkflowArtifactGC
FinalizerPodStatus

What do you think?
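
A sketch of what that renaming could look like, using the names from this suggestion (the constants eventually merged may differ):

const (
	// Finalizer to block deletion of the workflow if deletion of artifacts fails for some reason.
	FinalizerWorkflowArtifactGC = workflow.WorkflowFullName + "/artifact-gc"

	// Finalizer to block deletion of a workflow pod until the controller has captured its status.
	FinalizerPodStatus = workflow.WorkflowFullName + "/status"
)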

@juliev0
Contributor

juliev0 commented Dec 6, 2023

There is a danger in this PR of some edge case causing that finalizer to remain on the pod, leaving the pod undeletable unless manually patched... hence Alex's request to add a feature flag (I think that would generally be an environment variable - not sure if it should default to false or true).

Also, do you know for a fact that the deletePod logic in operator.go is the only way in the Controller for a pod to be deleted? (For example, I notice that the Pod is owned by the Workflow (OwnerReference here), which means there could theoretically be a scenario in which it's being automatically deleted when the Workflow gets deleted some of the time.) Could you help alleviate any concern about that by talking through the flow, including any edge cases you can think of?
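
For context, such a feature flag would most likely be an environment variable read by the controller. A minimal sketch, assuming a hypothetical POD_STATUS_CAPTURE_FINALIZER variable that defaults to disabled (the follow-up work may name and default it differently):

import (
	"os"
	"strconv"
)

// podStatusFinalizerEnabled is a hypothetical feature-flag check: the pod
// status finalizer is only added when the environment variable parses as true.
func podStatusFinalizerEnabled() bool {
	enabled, err := strconv.ParseBool(os.Getenv("POD_STATUS_CAPTURE_FINALIZER"))
	if err != nil {
		return false // assumed default: finalizer disabled
	}
	return enabled
}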

sakai-ast added a commit to sakai-ast/argo-workflows that referenced this pull request Dec 22, 2023
…rgoproj#8783 Continuing Work of argoproj#9058

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
sakai-ast added a commit to sakai-ast/argo-workflows that referenced this pull request Dec 26, 2023
…rgoproj#8783 Continuing Work of argoproj#9058

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
@sakai-ast
Contributor

I still have a strong need for the finalizer logic in my organization. However, since this PR is not mine, it is difficult for me to contribute to it quickly. Therefore, I have created the following PR, which builds on this one:
#12413

hence Alex's request to add a feature flag (I think that would generally be an environment variable - not sure if it should default to false or true).

I implemented a feature flag in the above PR.

Also, do you know for a fact that the deletePod logic in operator.go is the only way in the Controller for a pod to be deleted?

As far as I know, yes.

(For example, I notice that the Pod is owned by the Workflow (OwnerReference here), which means there could theoretically be a scenario in which it's being automatically deleted when the Workflow gets deleted some of the time.)

Currently, Argo Workflows allows users to choose the pod deletion logic, as noted in the comment below.
#9058 (comment)

If the pod deletion logic were solely reliant on OwnerReference, users would not have this choice. Therefore, I believe the current pod deletion logic is necessary.
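
For readers unfamiliar with the OwnerReference point above: the controller sets the Workflow as each pod's controller owner, so Kubernetes garbage collection can also cascade-delete pods when the Workflow object itself is deleted, independently of the explicit deletePod cleanup action. A rough sketch of that relationship (not this PR's exact code):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/argoproj/argo-workflows/v3/pkg/apis/workflow"
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// Sketch: pods created for a Workflow carry a controller OwnerReference to it,
// so deleting the Workflow lets the Kubernetes garbage collector remove the
// pods too, regardless of the controller's own pod cleanup queue.
func ownedByWorkflow(wf *wfv1.Workflow) []metav1.OwnerReference {
	return []metav1.OwnerReference{
		*metav1.NewControllerRef(wf, wfv1.SchemeGroupVersion.WithKind(workflow.WorkflowKind)),
	}
}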

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Dec 29, 2023
juliev0 pushed a commit that referenced this pull request Jan 29, 2024
…8783 Continuing Work of #9058 (#12413)

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
@agilgur5
Contributor

agilgur5 commented Jan 29, 2024

Closing this as superseded by #12413, which was just merged. Big thanks to everyone who worked on it!

@agilgur5 agilgur5 closed this Jan 29, 2024
@juliev0
Contributor

juliev0 commented Jan 29, 2024

Closing this as superseded by #12413, which was just merged. Big thanks to everyone who worked on it!

Good call

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 4, 2024
…rgoproj#8783 Continuing Work of argoproj#9058 (argoproj#12413)

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
…rgoproj#8783 Continuing Work of argoproj#9058 (argoproj#12413)

Signed-off-by: Atsushi Sakai <sakai.at24@gmail.com>
@agilgur5 agilgur5 added this to the v3.6.0 milestone Apr 12, 2024
@acornegruta

Hi, can you give us an approximate date when this will be released, please? Even with automatic retries, pods are being randomly deleted; it happens almost on a daily basis.

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Jun 12, 2024
@agilgur5 agilgur5 changed the title feat: add finalizer to workflow pod to prevent "pod deleted". Fixes #8783 feat: add finalizer to workflow pod to prevent pod deleted. Fixes #8783 Jun 12, 2024
@zonybob

zonybob commented Aug 19, 2024

For anyone waiting for 3.6 to fix this issue (like I was/am), I thought I'd drop a Kyverno policy that works well for us to block OpenShift from evicting "Completed" pods that haven't yet been seen/accounted for by Argo. It blocks the evict call until Argo labels the pod with workflows.argoproj.io/completed: true.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disable-eviction-too-fast
spec:
  admission: true
  background: false
  rules:
    - context:
        - apiCall:
            method: GET
            urlPath: /api/v1/namespaces/{{request.object.metadata.namespace}}/pods/{{request.object.metadata.name}}
          name: wfpod
      match:
        any:
          - resources:
              kinds:
                - v1/Pod.eviction
              operations:
                - CREATE
      name: Prevent race condition on Workflow Pods
      skipBackgroundRequests: true
      validate:
        deny:
          conditions:
            all:
              - key: "{{ wfpod.metadata.ownerReferences[].kind }}"
                operator: AnyIn
                value:
                  - Workflow
              - key: "{{ wfpod.status.phase }}"
                operator: Equals
                value: Succeeded
              - key:
                  '{{ wfpod.metadata.labels."workflows.argoproj.io/completed" || foo
                  | to_string(@) }}'
                operator: Equals
                value: "false"
        message: Eviction is too soon after job completion
  validationFailureAction: Enforce


Successfully merging this pull request may close these issues.

Use finalizer to prevent pod deleted