[SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s #29496
Conversation
Restart the watcher when we receive a version changed from k8s. Backport of SPARK-24266 to branch-2.4. In collaboration with Mike Royle.
Can one of the admins verify this patch?
Mistakenly forgot to retarget this at branch-2.4; now fixed.
@liyinan926 FYI
Please retest this.
FWIW, we have yet to see a hang for our Hourly job.
Thank you for making a backporting PR, @jkleckner. BTW, cc @holdenk since she is the committer of the original PR.
@dongjoon-hyun Ok, thanks. I'll close the request.
@dongjoon-hyun After talking with Mike, we aren't sure how or whether to proceed, since this is an attempt to backport #28423, which is already merged to master. What would be the steps?
You can make another backporting PR against branch-3.0 while keeping this PR open independently. After that PR is merged, we can revisit here.
My blind spot while reading was that I thought it had already been merged into 3.0 and didn't realize it was only on master; sorry.
Ok, I created PR #29533 for branch-3.0.
We have also run into problems due to this issue. Since it is not clear: is this fix going to be part of the next 2.4.x release?
You are on the wrong thread. Currently, we are working on #29533. Please comment there to support @jkleckner. After that PR, we may restart the discussion on 2.4.x.
@SQUIDwarrior FYI, I have an image with the patch here [1] if you want to try it out (and of course you can build it for yourself). It was built from this tag [2]. In fact, it would be good to get independent confirmation that this patch works in your environment.
[1] registry.gitlab.com/jkleckner/spark/spark:v2.4.7-cphy1
@jkleckner Hi Jim, I tried your fix (applying the patch on top of 2.4.6) and ran into a strange issue that makes spark-submit quit immediately while the driver keeps going. Edit: It seems that the current code in the PR is set to never wait on application completion. So if the application hasn't completed, spark-submit still terminates, since the current code just quits and does not wait. Not sure if this change is intentional or not; we likely want to check the flag that controls waiting for application completion, unless it's not present in 2.4.6.
@shockdm This is an attempt to backport the master branch fix, which has that behavior [1]. We launch using the spark-operator-on-k8s [4] mechanism with this version [5] and have not tried any other launch method.
[1] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala#L114
@jkleckner We are launching directly via a plain spark-submit.
@shockdm Post back on the thread here if you determine that this is the cause. Very likely it would apply to the master branch as well. If you have an example you would like me to run, something like spark-pi with specific arguments, I would be happy to run it in my environment.
if (time != null || time != "") time else "N/A"
}

override def watchOrStop(sId: String): Boolean = if (hasCompleted()) {
This line is different in the master branch. It checks whether the client was set to wait for the completion of the driver, not simply hasCompleted(), which should normally return false here, causing spark-submit to terminate immediately.
For the 2.4.7 port you may want to add an additional parameter to LoggingPodStatusWatcherImpl, specifically waitForCompletion: Boolean, that can be checked in this if statement. Current master uses a KubernetesDriverConf instead of passing the parameters separately:
Line 40 in 5e6173e:
private[k8s] class LoggingPodStatusWatcherImpl(conf: KubernetesDriverConf)
However, in 2.4.6 KubernetesDriverConf is not yet introduced; instead, the appId and maybeLoggingInterval are passed individually. A simple solution is to add waitForCompletion as a parameter and use it in the above if statement.
This has worked well for me.
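For anyone following along, here is a minimal sketch of what that suggestion could look like on the 2.4 code path. The constructor shape (appId and maybeLoggingInterval passed individually) follows the description above; the waitForCompletion flag, the completion latch, and the pod-phase check are illustrative assumptions, not the code from shockdm#1:

```scala
package org.apache.spark.deploy.k8s.submit

import java.util.concurrent.CountDownLatch

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Sketch: a 2.4-style watcher constructor with the suggested waitForCompletion flag,
// so watchOrStop can decide whether to block on the driver pod at all.
private[k8s] class LoggingPodStatusWatcherImpl(
    appId: String,
    maybeLoggingInterval: Option[Long],
    waitForCompletion: Boolean) extends Watcher[Pod] {

  private val podCompletedLatch = new CountDownLatch(1)

  private def hasCompleted(): Boolean = podCompletedLatch.getCount == 0

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("Unknown")
    if (phase == "Succeeded" || phase == "Failed") {
      podCompletedLatch.countDown()  // driver pod reached a terminal phase
    }
  }

  override def onClose(e: KubernetesClientException): Unit = {
    podCompletedLatch.countDown()
  }

  // Block only when the submitter asked to wait for application completion;
  // otherwise return true right away so spark-submit exits by design, not by accident.
  def watchOrStop(sId: String): Boolean = {
    if (!waitForCompletion) {
      true
    } else {
      podCompletedLatch.await()
      hasCompleted()
    }
  }
}
```

The point here is only the guard on waitForCompletion; the HTTP 410 restart logic discussed elsewhere in this PR is left out of this sketch for brevity.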
@shockdm Yes, you are right. And it is great that you have it working!
We're happy to make the change but would you mind sharing the code so as to be consistent with what you already have?
@shockdm ping
@jkleckner sorry, spent some time doing test runs to make sure it fixed our use case. I'll share the final code shortly.
@shockdm ping
@jkleckner Thank you for the reminder, and sorry for the delay! Check out shockdm#1
Your patches look like an improvement, thank you @shockdm.
There are enough differences that perhaps you might be willing to submit them?
Regarding the original comment by @dongjoon-hyun:
"Thank you for making a backporting PR, @jkleckner. Could you make a backporting PR to branch-3.0 first? To prevent any regression from 2.4 to 3.0, we have a policy for the backporting sequence. We cannot backport this to branch-2.4 directly."
What would be the correct approach to submit this backport at the moment?
@dongjoon-hyun Is the merging, or just the submitting, of a PR to branch-3.0 a prerequisite to submitting this PR or an improved one?
Note that the implementations inherently diverge because the base branches have diverged.
What changes were proposed in this pull request?
This patch processes the HTTP Gone event and restarts the pod watcher.
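For readers unfamiliar with the failure mode, the following is a rough, hedged sketch of the general idea: detect an HTTP 410 Gone close from the fabric8 client and resubscribe the watch instead of waiting forever. The class and method names (RestartableDriverWatcher, watchOrStop, reset) and the pod name/namespace are assumptions for illustration, not the actual code in this PR:

```scala
import java.net.HttpURLConnection.HTTP_GONE
import java.util.concurrent.CountDownLatch

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watch, Watcher}

// Sketch only: a watcher that distinguishes "driver finished" from
// "watch connection dropped with 410 Gone, please resubscribe".
class RestartableDriverWatcher extends Watcher[Pod] {
  @volatile private var resourceTooOld = false
  @volatile private var latch = new CountDownLatch(1)

  def reset(): Unit = {
    resourceTooOld = false
    latch = new CountDownLatch(1)
  }

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("Unknown")
    if (phase == "Succeeded" || phase == "Failed") {
      latch.countDown()  // driver pod reached a terminal phase
    }
  }

  override def onClose(e: KubernetesClientException): Unit = {
    // The API server closes watches whose resourceVersion is too old with 410 Gone.
    if (e != null && e.getCode == HTTP_GONE) {
      resourceTooOld = true
    }
    latch.countDown()
  }

  // True when watching can stop for good, false when the watch should be restarted.
  def watchOrStop(): Boolean = {
    latch.await()
    !resourceTooOld
  }
}

object WatchDriverPod {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()   // in-cluster / kubeconfig defaults
    val driverPod = client.pods()
      .inNamespace("default")                    // assumed namespace
      .withName("spark-pi-driver")               // assumed driver pod name
    val watcher = new RestartableDriverWatcher

    var done = false
    while (!done) {
      watcher.reset()
      val watch: Watch = driverPod.watch(watcher)
      try {
        done = watcher.watchOrStop()             // false => 410 Gone, resubscribe
      } finally {
        watch.close()
      }
    }
    client.close()
  }
}
```

The essential point is that onClose inspects the close code: a 410 means the cached resourceVersion is too old, so the submission client re-establishes the watch rather than blocking on a connection that will never deliver another event.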
Why are the changes needed?
This is a backport of PR #28423 to branch-2.4.
The reasons are explained in SPARK-24266: Spark jobs using the k8s resource scheduler may hang.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This was rolled into an image that is deployed into our test cluster.
We have not seen the hangs since enabling this patch.