
[SPARK-22845] [Scheduler] Modify spark.kubernetes.allocation.batch.delay to take time instead of int #20032

Closed
wants to merge 3 commits into from

Conversation

@foxish (Contributor) commented Dec 20, 2017

What changes were proposed in this pull request?

Fixes a configuration option that took an int but should take a time value. Discussion in #19946 (comment).
The granularity is now milliseconds rather than seconds, since there is a use case for sub-second reactions in order to scale up rapidly, especially with dynamic allocation (illustrated briefly below).
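As a rough illustration (not part of the patch itself), the property can now be given as a time string; the value below is only an example:

```scala
import org.apache.spark.SparkConf

// Illustrative only: sub-second delays become expressible with millisecond granularity.
val conf = new SparkConf()
  .set("spark.kubernetes.allocation.batch.delay", "500ms") // parsed as a time value, 500 ms
```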

How was this patch tested?

TODO: manual run of integration tests against this PR.
PTAL

cc/ @mccheah @liyinan926 @kimoonkim @vanzin @mridulm @jiangxb1987 @ueshin

@SparkQA commented Dec 20, 2017

Test build #85185 has finished for PR 20032 at commit 48a3326.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -217,7 +217,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
       .watch(new ExecutorPodsWatcher()))

     allocatorExecutor.scheduleWithFixedDelay(
-      allocatorRunnable, 0L, podAllocationInterval, TimeUnit.SECONDS)
+      allocatorRunnable, 0L, podAllocationInterval.toLong, TimeUnit.MILLISECONDS)
Member
Just checking this definitely returns ms? Looks good then.

@mridulm (Contributor) Dec 20, 2017

Why not use conf.getTimeAsMs for podAllocationInterval? It takes care of whatever time format the user specifies and would be consistent (it is a private var anyway).

Contributor Author (foxish)

Done. That's much better, thanks!

@mridulm (Contributor) left a comment

Is there any performance concern due to very rapid allocation requests (from a very low batch delay)? For example, if it is set to 1 ms.
(And what happens if it is set to 0 or a negative value?)

@foxish (Contributor, Author) commented Dec 20, 2017

@mridulm

Within the allocator's control loop, all requests for executor pods are made asynchronously against the k8s API, so each iteration doesn't take long or block. If a user sets a very low delay, that wouldn't necessarily produce more requests, because pod creation is still rate limited by the time it takes for the executors launched in the previous round to become ready (typically around 1s, if not higher). So, in the worst case, we may end up polling the state of our internal data structures quickly, adding to the driver's CPU load; but those data structures are updated asynchronously by watches, so this neither increases load on the K8s API server nor fetches any state over the network. I can add a caveat to the documentation that setting this to a very low number of ms may increase load on the driver.

If the intent of reducing the batch interval is to spin up N executors as quickly as possible when the resources are present, I'd recommend using the batch.size parameter instead and setting it equal to num_executors. That achieves the same effect and schedules all executors in a single run of the allocator's control loop (see the sketch below).
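A minimal sketch of that recommendation, assuming an application that wants 50 executors (the number is illustrative; the property names are the existing Spark-on-K8s settings):

```scala
import org.apache.spark.SparkConf

val numExecutors = 50 // illustrative target executor count
val conf = new SparkConf()
  .set("spark.executor.instances", numExecutors.toString)
  // With batch.size equal to the executor count, one pass of the allocator's
  // control loop requests every pod, so there is no need to shrink batch.delay.
  .set("spark.kubernetes.allocation.batch.size", numExecutors.toString)
```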

@SparkQA commented Dec 20, 2017

Test build #85208 has finished for PR 20032 at commit 4adb04b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@foxish (Contributor, Author) commented Dec 20, 2017

> (And what happens if it is set to 0 or a negative value?)

We have a check preventing that in the config option itself; the value must be strictly greater than 0 ms.
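For reference, a minimal sketch (not the verbatim source) of how such a check can be expressed with Spark's internal ConfigBuilder API, which is only usable from within Spark's own packages:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

val KUBERNETES_ALLOCATION_BATCH_DELAY =
  ConfigBuilder("spark.kubernetes.allocation.batch.delay")
    .doc("Time to wait between rounds of executor allocation.")
    // Parses time strings such as "500ms" or "1s" and exposes the value in milliseconds.
    .timeConf(TimeUnit.MILLISECONDS)
    // Rejects 0 and negative values when the config is read.
    .checkValue(delay => delay > 0, "Allocation batch delay must be a positive time value.")
    .createWithDefaultString("1s")
```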

@@ -86,7 +86,7 @@ private[spark] class KubernetesClusterSchedulerBackend(

   private val initialExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(conf)

-  private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
+  private val podAllocationInterval = conf.getTimeAsMs(KUBERNETES_ALLOCATION_BATCH_DELAY.key)
Contributor (@vanzin)

This should be just conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY). The config's unit is already ms.

Contributor Author (foxish)

That was actually the previous state. Not sure what the best practice is here; @mridulm, thoughts? getTimeAsMs seemed better because it might protect us from future changes to the config, and it enforces the contract that the value is in milliseconds, which is essential for the allocator to function correctly.

Contributor (@vanzin)

conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY) returns a Long if it's a time conf. That's how time configs are expected to be used.

You don't need podAllocationInterval.toLong later on like you had before.
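A minimal sketch of the two access patterns being compared, assuming KUBERNETES_ALLOCATION_BATCH_DELAY is declared with .timeConf(TimeUnit.MILLISECONDS) and that conf is the backend's SparkConf (these lines belong inside the scheduler backend, where Spark's internal accessors are visible):

```scala
// Typed access: the ConfigEntry already yields a Long in milliseconds,
// so no .toLong is needed when passing it to scheduleWithFixedDelay.
val podAllocationInterval: Long = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)

// String-keyed alternative: re-parses the raw value by key and converts it to ms.
val podAllocationIntervalAlt: Long = conf.getTimeAsMs(KUBERNETES_ALLOCATION_BATCH_DELAY.key)
```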

Contributor Author (foxish)

Done.

Contributor (@mridulm)

I (incorrectly) assumed KUBERNETES_ALLOCATION_BATCH_DELAY was a String and not a ConfigEntry.
@vanzin's suggestion is much more elegant in comparison.

@foxish (Contributor, Author) commented Dec 20, 2017

Fixed, thanks for the reviews. @mridulm, the questions about performance are super helpful for us to keep track of how we're doing relative to other cluster managers and to user expectations. Thanks! :)

@SparkQA commented Dec 20, 2017

Test build #85214 has finished for PR 20032 at commit 7427204.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Dec 21, 2017

Merging to master.

asfgit closed this in 0114c89 on Dec 21, 2017
asfgit pushed a commit that referenced this pull request Dec 22, 2017
What changes were proposed in this pull request?

This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by #19717 and #19468 which have merged already.

How was this patch tested?
The script has been in use for releases on our fork. Rest is documentation.

cc rxin mateiz (shepherd)
k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko
reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. (#20007)
- [x] Change references to docker to instead say "container" (#19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032)

Author: foxish <ramanathana@google.com>

Closes #19946 from foxish/update-k8s-docs.