Robustness to driver pod taking time to create #2315

Open
wants to merge 10 commits into
base: master


@Tom-Newton Tom-Newton commented Nov 10, 2024

Purpose of this PR

Improve reliability when the driver pod takes some time to be created.
Closes: #2302

Proposed changes:

  • Add a grace period for the driver pod to be created.
  • Grace period is controlled by a new config option driver-pod-creation-grace-period. Default is 10 seconds.
  • Add 3 new tests around reconciling submitted status, including the new grace period functionality. This is the majority of the lines changed.
  • Expose the new config option in the Helm chart and add a Helm test.

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly - I think it's sufficient just to add a description of the new argument in the Helm chart and regenerate the Helm docs.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Logs from a real example on our prod cluster where the Spark application was saved by the grace period added in this PR:
Explore-logs-2024-11-11 12_09_13.txt


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
@Tom-Newton Tom-Newton force-pushed the tomnewton/robustness_to_driver_pod_taking_time_to_create branch from a3d6607 to bbe7510 Compare November 10, 2024 17:27
@Tom-Newton
Contributor Author

Sorry for the direct ping. @ChenYi015 are you the right person to review this?


Successfully merging this pull request may close these issues.

Robustness to the driver pod taking some time to create