Build pods and the use of the z2jh user-scheduler #854
Comments
For mybinder.org, spreading the build pods around the cluster as much as possible is a feature. A build pod can use a lot of CPU, much more than a typical notebook pod. It is more like a batch job workload (executing code as fast as possible, and lots of it) than the interactive workload (a human types, thinks, pauses, reads, rarely executes code). Build pods typically last much less long than the timescale on which we want to scale down. Most builds are done in a few minutes; the longer ones take a few hours. The "period" of scale up and down for the cluster as a whole is more like a quarter of a day, so waiting an extra hour to scale down isn't a problem. We had a situation a while back where build pods (for unknown reasons) preferred to schedule on the same node. This is the PR I made to add anti-affinity between them: #834
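For reference, a soft (preferred) pod anti-affinity along the lines of what #834 adds might look roughly like this sketch using the Kubernetes Python client; the `component: binderhub-build` label is an assumption for illustration and not necessarily the exact label BinderHub sets on build pods:

```python
from kubernetes import client

# Prefer (but do not require) that build pods land on different nodes
# by repelling pods that carry the same build label on the same host.
build_anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"component": "binderhub-build"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)

# The affinity would then be attached to the build pod's spec, e.g.:
# build_pod.spec.affinity = build_anti_affinity
```

Because this is "preferred" rather than "required", the scheduler can still co-locate build pods when the cluster is full, which keeps it compatible with scale-down.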
I think user-scheduler + anti-affinity is fine, but not using the user-scheduler for these pods will add a lot of cost because nodes will fail to scale down, I think. Related: #855, which discusses the image locality of user pods. There is probably some tuning that needs to be done. Is there a good metric for the benefit of spreading the build pods? Perhaps the average build time?
I disagree. The main reason is that there doesn't seem to be a problem with this right now. If we can show that downscaling is actually prevented by assigning build pods with the default scheduler, then we can think about how to fix it. The chances that a downscale is prevented (for an overly long time) by a build pod are small, and in practice it doesn't seem to happen.

There is also a bit of a "conspiracy" happening in that we tend to have about as many build pods running as we have nodes in the cluster. A running build pod uses all of its allowed resources, whereas a user pod uses nearly none of its resources. This is the main reason you want to spread the builds around instead of clumping them together (which tends to be the effect of the current user-scheduler).

If we want to tune the assignment of user pods to nodes based on image locality, it also makes sense to me to assign builds to random nodes, to prevent all the newer images ending up on one node, which then preferentially gets all the newly launched pods. Any time we introduce a pattern things get tricky; random assignment/sampling is a surprisingly good and unbiased method for doing stuff.

When we start a build we have a clean slate. I can't think of a reason why a build would complete faster on any of the nodes we have that is not related to CPU utilisation. Launching a user pod is different: there is a reason to prefer some nodes over others, namely that the image is already on the node. For me this means we want to randomly select a node to build on, or even better: select the one which has the lowest actual CPU utilisation (not based on limits and guarantees).
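As a rough illustration of the "lowest actual CPU utilisation" idea, here is a minimal sketch assuming metrics-server is installed so node usage is exposed via the `metrics.k8s.io` API; the function names are hypothetical and this is not how BinderHub currently assigns builds:

```python
import random

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m", "1", "123456789n") to cores."""
    suffixes = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
    if quantity[-1] in suffixes:
        return float(quantity[:-1]) * suffixes[quantity[-1]]
    return float(quantity)


def pick_build_node() -> str:
    """Pick the node with the lowest *actual* CPU usage, falling back to random."""
    config.load_kube_config()
    nodes = [n.metadata.name for n in client.CoreV1Api().list_node().items]
    try:
        usage = client.CustomObjectsApi().list_cluster_custom_object(
            group="metrics.k8s.io", version="v1beta1", plural="nodes"
        )
        cpu_by_node = {
            item["metadata"]["name"]: parse_cpu(item["usage"]["cpu"])
            for item in usage["items"]
        }
        return min(nodes, key=lambda n: cpu_by_node.get(n, float("inf")))
    except ApiException:
        # No metrics-server available: random assignment is still an unbiased fallback.
        return random.choice(nodes)
```

The chosen node name could then be set as `node_name` on the build pod's spec, which bypasses the scheduler entirely; whether that is preferable to a scheduler-level policy is part of what this thread is weighing.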
Useful background:
About the "as many build pods as there are nodes in the cluster": that sounds like a daemonset, or like a hard pod anti-affinity makes it so. I'm not understanding the setup well enough, but:
So if we observe no issues with downscaling, I figure either my assumptions are a bad approximation of reality, or my logic is flawed. If there are no issues at all, then I'd be greatly surprised, unless we evict users' build pods from time to time on downscaling nodes and crash their builds, or we simply don't schedule build pods on the user nodes.
I think it is just a coincidence that the number of requested builds (at any one time) happens to be roughly equal to (or lower than) the number of nodes in the cluster. It could change tomorrow: suddenly a lot more builds could be requested, or they could run for a lot longer (or we could switch to much bigger instances and hence reduce the number of nodes). I think the reason we don't see an effect from this is that there just aren't that many build pods running at any given moment and that they typically don't run for very long: https://grafana.mybinder.org/d/nDQPwi7mk/node-activity?refresh=1m&panelId=44&fullscreen&orgId=1&from=now-2d&to=now
Closed in favor of #946.
Without using the user-scheduler to schedule the build pods, they will spread out on the available nodes with the default scheduler. After #853 they will be able to, and will therefore prefer to, schedule on empty user nodes, and will therefore block scale-down of those nodes unless there are 0 build pods on them for more than 10 minutes.
In the case of mybinder.org's typical usage, I think this actually becomes an issue.
I think we should make the build pods schedule like the user pods are scheduled, now that this toleration is in place (see the sketch below).
/cc: @jhamman @betatim
/ erik from mobile
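A minimal sketch of what "schedule build pods like user pods" could look like with the Kubernetes Python client follows; the taint key, scheduler name, and build image below are assumptions about a typical z2jh/mybinder.org deployment rather than values taken from the chart:

```python
from kubernetes import client

# Sketch: a build pod that tolerates the user-node taint and is scheduled by the
# z2jh user-scheduler, so it packs onto nodes the same way user pods do.
# "binder-user-scheduler" and "hub.jupyter.org/dedicated" are assumed names.
build_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="build-example", labels={"component": "binderhub-build"}
    ),
    spec=client.V1PodSpec(
        scheduler_name="binder-user-scheduler",
        tolerations=[
            client.V1Toleration(
                key="hub.jupyter.org/dedicated",
                operator="Equal",
                value="user",
                effect="NoSchedule",
            )
        ],
        containers=[
            client.V1Container(
                name="image-build",
                image="quay.io/jupyterhub/repo2docker:main",  # hypothetical image tag
            )
        ],
    ),
)
```

Handing the pod to the user-scheduler trades the spreading behaviour discussed above for better bin-packing and easier scale-down, which is exactly the trade-off this issue is about.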