Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent driver pod from being deleted before its status is processed by the operator (#2054) #2076

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

imtzer
Copy link

@imtzer imtzer commented Jun 29, 2024

🛑 Important:

Please open an issue to discuss significant work before you start. We appreciate your contributions and don't want your efforts to go to waste!

For guidelines on how to contribute, please review the CONTRIBUTING.md document.

Purpose of this PR

The backgroud of this pr is discribed in #2054, driver pod is deleted before it should be use in operator, now the sparkapp state transformation is trigger by list-watch.
For example, if a new sparkapp submit successfully and driver pod has deleted after compeleted status but before operator start handling sparkapp state transformation to SubmittedState, the code is in here getAndUpdateAppState, the function getAndUpdateAppState need to get driver pod but it was deleted, and sparkapp become failed and error message show 'driver pod not found'.
So to add finalier to driver pod to prevent it

Proposed changes:

  • Add finalizer to driver pod and remove it in suitable time

Change Category

Indicate the type of change by marking the applicable boxes:

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Checklist

Before submitting your PR, please review the following:

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

oldFinalizer := pod.Finalizers
var newFinalizer []string
for _, finalizer := range oldFinalizer {
if finalizer != webhook.DriverFinalize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Do we need a check to verify that the web hook is enabled?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @peter-mcclonski , glad to your review, if webhook is disabled, driver pod will not contains the finalizer, so it does not need a check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be deleted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, this file may be created by vscode, I will remove it

Comment on lines +617 to +619
if err := c.removeDriverPodFinalizer(app); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering when you try to delete a running SparkApplication, whether spark operator will need to wait the app to be completed or failed, and then it can remove the finalizer. If this is true, for a long running Spark job, like Spark streaming job, maybe it will never turned into completed or failed state and the finalizer will never be removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, finalizer is used for controling pod deletion, if the streaming driver pod is running, the owner spark job will never turn into completed or failed state, but if driver pod is deleted manually, or become status out of pod pending and pod running status, the streaming sparkapp will finally turn into completed or failed, at that time, driver finalizer will be deleted

@imtzer imtzer force-pushed the master branch 3 times, most recently from 31ce452 to 06efe5b Compare July 6, 2024 14:41
@imtzer
Copy link
Author

imtzer commented Jul 6, 2024

Hi team, I have fix some previous review problem and this code is running fine in local cluster, please help code review again, thanks!

Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peter-mcclonski
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…by the operator (kubeflow#2054)

Signed-off-by: imtzer <a912470496@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants