[SPARK-22845] [Scheduler] Modify spark.kubernetes.allocation.batch.delay to take time instead of int #20032
Conversation
Test build #85185 has finished for PR 20032.
```diff
@@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
         .watch(new ExecutorPodsWatcher()))

     allocatorExecutor.scheduleWithFixedDelay(
-      allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS)
+      allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS)
```
Just checking that this definitely returns ms? Looks good then.
Why not use conf.getTimeAsMs for podAllocationInterval? It takes care of format specification by users and would be consistent (it is a private var anyway).
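For illustration, a standalone sketch of what getTimeAsMs does with user-supplied time strings (the key name is taken from this PR; the printed values are what Spark's time-string parsing should produce):

```scala
import org.apache.spark.SparkConf

// Minimal demo (not the backend code) of getTimeAsMs normalizing whatever
// unit the user wrote down to milliseconds.
object GetTimeAsMsDemo extends App {
  val conf = new SparkConf(loadDefaults = false)
  for (v <- Seq("250ms", "2s", "1m")) {
    conf.set("spark.kubernetes.allocation.batch.delay", v)
    println(s"$v -> ${conf.getTimeAsMs("spark.kubernetes.allocation.batch.delay")} ms")
  }
  // Expected: 250ms -> 250 ms, 2s -> 2000 ms, 1m -> 60000 ms
}
```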
Done. That's much better, thanks!
Is there any performance consideration due to very rapid allocation requests caused by a very low batch delay? For example, if it is set to 1? (And what happens if it is set to 0 or a negative value?)
Within the allocator's control loop, all requests for executor pods against the k8s API are asynchronous, so each iteration neither takes very long nor blocks. If a user sets a very low value for the delay, it wouldn't necessarily produce more requests, because pod creation is still rate limited by the time it takes for the executors launched in the previous round to become ready (typically around 1s, if not higher). So, in the worst case, we may end up polling the state of our internal data structures quickly, adding to the driver's CPU load; but those data structures are updated asynchronously by watches, so this neither increases the load on the k8s API server nor fetches any state over the network. I can add a caveat to the documentation that setting it to a very low number of ms may increase load on the driver. If the intent of reducing the batch interval is to spin up N executors as quickly as possible when the resources are present, I'd recommend that they use the …
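To make the shape of that loop concrete, here is a minimal self-contained sketch of the pattern described above. This is not the actual KubernetesClusterSchedulerBackend code; names such as pendingExecutors and requestExecutorPodAsync are illustrative stand-ins.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object AllocatorLoopSketch {
  // In the real backend, state like this is updated asynchronously by k8s watches.
  val pendingExecutors = TrieMap.empty[String, String]
  val runningExecutors = TrieMap.empty[String, String]
  val targetExecutors = 10
  val podAllocationIntervalMs = 1000L

  // Stand-in for the non-blocking pod creation request against the k8s API.
  def requestExecutorPodAsync(): Unit = println("requesting one executor pod")

  def main(args: Array[String]): Unit = {
    val allocatorExecutor = Executors.newSingleThreadScheduledExecutor()
    val allocatorRunnable = new Runnable {
      override def run(): Unit = {
        // New pods are only requested once the previous batch is no longer
        // pending, so a very low delay mostly re-reads in-memory state: it
        // adds driver CPU load but no extra traffic to the k8s API server.
        if (pendingExecutors.isEmpty && runningExecutors.size < targetExecutors) {
          requestExecutorPodAsync()
        }
      }
    }
    allocatorExecutor.scheduleWithFixedDelay(
      allocatorRunnable, 0L, podAllocationIntervalMs, TimeUnit.MILLISECONDS)
  }
}
```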
Test build #85208 has finished for PR 20032.
We have a check preventing that in the option itself. The value should be strictly greater than 0 ms.
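For reference, a sketch of how such a guard can be written with Spark's ConfigBuilder API. This is paraphrased from memory rather than quoted from the source tree, and the package line is only there because the API is private[spark]:

```scala
package org.apache.spark.deploy.k8s

import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

object AllocationConfigSketch {
  // A time-typed entry that rejects non-positive values at read time.
  val KUBERNETES_ALLOCATION_BATCH_DELAY =
    ConfigBuilder("spark.kubernetes.allocation.batch.delay")
      .doc("Time to wait between each round of executor pod allocation.")
      .timeConf(TimeUnit.MILLISECONDS)
      .checkValue(_ > 0, "Allocation batch delay must be a positive time value.")
      .createWithDefaultString("1s")
}
```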
```diff
@@ -86,7 +86,7 @@ private[spark] class KubernetesClusterSchedulerBackend(

   private val initialExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(conf)

-  private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
+  private val podAllocationInterval = conf.getTimeAsMs(KUBERNETES_ALLOCATION_BATCH_DELAY.key)
```
This should be just conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY). The config's unit is already ms.
That was the previous state, actually. Not sure what the best practice is here. @mridulm, thoughts? getTimeAsMs seemed better as it might protect us from changes in the config, if that ever happens. It enforces the contract that the value must be milliseconds, which is essential for the allocator to function correctly.
conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY) returns a Long if it's a time conf. That's how time configs are expected to be used. You don't need podAllocationInterval.toLong later on like you had before.
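As a toy re-creation of that typed-entry idea (none of this is Spark's actual internals; it only mimics why the typed get makes .toLong unnecessary):

```scala
// The entry itself fixes the value type to Long milliseconds.
final case class TimeEntryMs(key: String, defaultMs: Long)

final class Conf(settings: Map[String, String]) {
  // A typed get: a TimeEntryMs always yields milliseconds as a Long.
  def get(entry: TimeEntryMs): Long =
    settings.get(entry.key).map(parseMs).getOrElse(entry.defaultMs)

  private def parseMs(v: String): Long =
    if (v.endsWith("ms")) v.stripSuffix("ms").trim.toLong
    else if (v.endsWith("s")) v.stripSuffix("s").trim.toLong * 1000L
    else v.trim.toLong // bare numbers read as milliseconds
}

object TypedTimeConfDemo extends App {
  val KUBERNETES_ALLOCATION_BATCH_DELAY =
    TimeEntryMs("spark.kubernetes.allocation.batch.delay", defaultMs = 1000L)
  val conf = new Conf(Map("spark.kubernetes.allocation.batch.delay" -> "500ms"))
  val podAllocationInterval: Long = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
  println(podAllocationInterval) // 500, already a Long in ms -- no .toLong needed
}
```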
Done.
I (incorrectly) assumed KUBERNETES_ALLOCATION_BATCH_DELAY was a String and not a ConfigEntry. @vanzin's suggestion is much more elegant in comparison.
Fixed, thanks for the reviews. @mridulm, the questions about performance are super helpful for us, also to keep track of how we're doing wrt other cluster managers and user expectations. Thanks! :)
Test build #85214 has finished for PR 20032.
Merging to master.
Description of #19946 (merged), which tracked this PR in its TODO list:

What changes were proposed in this pull request?
This PR contains documentation on the usage of the Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build the docker images required to use the integration. The changes detailed here are covered by #19717 and #19468, which have already merged.

How was this patch tested?
The script has been in use for releases on our fork. The rest is documentation.

cc rxin mateiz (shepherd)
k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko
reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. (#20007)
- [x] Change references to docker to instead say "container" (#19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032)

Author: foxish <ramanathana@google.com>
Closes #19946 from foxish/update-k8s-docs.
What changes were proposed in this pull request?
Fixes a configuration option that was taking an int where it should take a time value. See the discussion in #19946 (comment).
Made the granularity milliseconds as opposed to seconds, since there is a use case for sub-second reactions to scale up rapidly, especially with dynamic allocation.
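A hypothetical user-facing snippet under these assumptions (the value is invented; the point is the new millisecond granularity):

```scala
import org.apache.spark.SparkConf

object BatchDelayExample extends App {
  // Before this change the setting was a bare int interpreted as seconds
  // (e.g. "5"); after it, the config is a time value with millisecond
  // granularity, so sub-second delays like this become expressible.
  val conf = new SparkConf(loadDefaults = false)
    .set("spark.kubernetes.allocation.batch.delay", "500ms")
  println(conf.getTimeAsMs("spark.kubernetes.allocation.batch.delay")) // 500
}
```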
How was this patch tested?
TODO: manual run of integration tests against this PR.
PTAL
cc/ @mccheah @liyinan926 @kimoonkim @vanzin @mridulm @jiangxb1987 @ueshin