Build pods and the use of the z2jh user-scheduler #854
Comments
For mybinder.org, spreading the build pods around the cluster as much as possible is a feature. A build pod can use a lot of CPU, much more than a typical notebook pod. It is more like a batch job workload (executing code as fast as possible, and lots of it) than the interactive workload (a human types, thinks, pauses, reads, rarely executes code). Build pods typically last much less long than the timescale on which we want to scale down. Most builds are done in a few minutes; the longer ones take a few hours. The "period" of scale up and down for the cluster as a whole is more like a quarter of a day, so waiting an extra hour to scale down isn't a problem. We had a situation a while back where build pods (for unknown reasons) preferred to schedule on the same node. This is the PR I made to add anti-affinity between them: #834
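For reference, a soft (preferred) pod anti-affinity along the lines of what #834 adds might look roughly like this sketch using the Kubernetes Python client; the `component: binderhub-build` label is an assumption for illustration and not necessarily the exact label BinderHub sets on build pods:

```python
from kubernetes import client

# Prefer (but do not require) that build pods land on different nodes
# by repelling pods that carry the same build label on the same host.
build_anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"component": "binderhub-build"}
                    ),
                    topology_key="kubernetes.io/hostname",
                ),
            )
        ]
    )
)

# The affinity would then be attached to the build pod's spec, e.g.:
# build_pod.spec.affinity = build_anti_affinity
```

Because this is "preferred" rather than "required", the scheduler can still co-locate build pods when the cluster is full, which keeps it compatible with scale-down.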
I think user-scheduler + anti-affinity is fine, but not using the user-scheduler for these pods will add a lot of cost because nodes will fail to scale down, I think. Related: #855, which discusses the image locality of user pods. There is probably some tuning that needs to be done. Is there a good metric for the benefit of spreading the build pods? Perhaps the average build time?
I disagree. The main reason is that there doesn't seem to be a problem with this right now. If we can show that downscaling is actually prevented by assigning build pods with the default scheduler, then we can think about how to fix it. The chances that a downscale is prevented (for an overly long time) by a build pod are small, and in practice it doesn't seem to happen.

There is also a bit of a "conspiracy" happening in that we tend to have about as many build pods running as we have nodes in the cluster. A running build pod uses all of its allowed resources, whereas a user pod uses nearly none of its resources. This is the main reason you want to spread the builds around instead of clumping them together (which tends to be the effect of the current user-scheduler).

If we want to tune the assignment of user pods to nodes based on image locality, it also makes sense to me to assign builds to random nodes, to prevent all the newer images ending up on one node, which then preferentially gets all the newly launched pods. Any time we introduce a pattern things get tricky; random assignment/sampling is a surprisingly good and unbiased method for doing stuff.

When we start a build we have a clean slate. I can't think of a reason why a build would complete faster on any of the nodes we have that is not related to CPU utilisation. Launching a user pod is different: there is a reason to prefer some nodes over others, namely that the image is already on the node. For me this means we want to randomly select a node to build on, or even better: select the one which has the lowest actual CPU utilisation (not based on limits and guarantees).
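As a rough illustration of the "lowest actual CPU utilisation" idea, here is a minimal sketch assuming metrics-server is installed so node usage is exposed via the `metrics.k8s.io` API; the function names are hypothetical and this is not how BinderHub currently assigns builds:

```python
import random

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m", "1", "123456789n") to cores."""
    suffixes = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
    if quantity[-1] in suffixes:
        return float(quantity[:-1]) * suffixes[quantity[-1]]
    return float(quantity)


def pick_build_node() -> str:
    """Pick the node with the lowest *actual* CPU usage, falling back to random."""
    config.load_kube_config()
    nodes = [n.metadata.name for n in client.CoreV1Api().list_node().items]
    try:
        usage = client.CustomObjectsApi().list_cluster_custom_object(
            group="metrics.k8s.io", version="v1beta1", plural="nodes"
        )
        cpu_by_node = {
            item["metadata"]["name"]: parse_cpu(item["usage"]["cpu"])
            for item in usage["items"]
        }
        return min(nodes, key=lambda n: cpu_by_node.get(n, float("inf")))
    except ApiException:
        # No metrics-server available: random assignment is still an unbiased fallback.
        return random.choice(nodes)
```

The chosen node name could then be set as `node_name` on the build pod's spec, which bypasses the scheduler entirely; whether that is preferable to a scheduler-level policy is part of what this thread is weighing.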
Useful background:
About the "as many build pods as there are nodes in the cluster": that sounds like a daemonset, or like a hard pod anti-affinity makes it so. I'm not understanding the setup well enough, but:
So if we observe no issues with downscaling, I figure either my assumptions are a bad approximation of reality, or my logic is flawed. If there are no issues at all, then I'd be greatly surprised, unless we evict users' build pods from time to time on downscaling nodes and crash their builds, or we simply don't schedule build pods on the user nodes.
I think it is just a coincidence that the number of requested builds (at any one time) happens to be roughly equal to (or lower than) the number of nodes in the cluster. It could change tomorrow: suddenly a lot more builds could be requested, or they could run for a lot longer (or we could switch to much bigger instances and hence reduce the number of nodes). I think the reason we don't see an effect from this is that there just aren't that many build pods running at any given moment and that they typically don't run for very long: https://grafana.mybinder.org/d/nDQPwi7mk/node-activity?refresh=1m&panelId=44&fullscreen&orgId=1&from=now-2d&to=now
Closed in favor of #946.
Without using the user-scheduler to schedule the build pods, they will spread out on the available nodes with the default scheduler. After #853 they will be able to, and will therefore prefer to, schedule on empty user nodes, and will therefore block scale-down of those nodes unless there are 0 build pods on them for more than 10 minutes.
In the case of mybinder.org's typical usage, I think this actually becomes an issue.
I think we should make the build pods schedule like the user pods are scheduled, now that this toleration is in place (see the sketch below).
/cc: @jhamman @betatim
/ erik from mobile
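A minimal sketch of what "schedule build pods like user pods" could look like with the Kubernetes Python client follows; the taint key, scheduler name, and build image below are assumptions about a typical z2jh/mybinder.org deployment rather than values taken from the chart:

```python
from kubernetes import client

# Sketch: a build pod that tolerates the user-node taint and is scheduled by the
# z2jh user-scheduler, so it packs onto nodes the same way user pods do.
# "binder-user-scheduler" and "hub.jupyter.org/dedicated" are assumed names.
build_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="build-example", labels={"component": "binderhub-build"}
    ),
    spec=client.V1PodSpec(
        scheduler_name="binder-user-scheduler",
        tolerations=[
            client.V1Toleration(
                key="hub.jupyter.org/dedicated",
                operator="Equal",
                value="user",
                effect="NoSchedule",
            )
        ],
        containers=[
            client.V1Container(
                name="image-build",
                image="quay.io/jupyterhub/repo2docker:main",  # hypothetical image tag
            )
        ],
    ),
)
```

Handing the pod to the user-scheduler trades the spreading behaviour discussed above for better bin-packing and easier scale-down, which is exactly the trade-off this issue is about.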