
Optimizing scheduling of build pods #946

Closed · consideRatio opened this issue Sep 8, 2019 · 5 comments

consideRatio (Member) commented Sep 8, 2019

This issue is about how to optimize the scheduling of BinderHub-specific build pods (BPs).

Out of scope of this issue is the discussion on how to schedule the user pods, which could be done with image locality (ImageLocalityPriority configuration) in mind.

Scheduling goals

  1. Historic repo build node locality (Historic locality)
A build pod rebuilding a repo is believed to finish faster on a node that has previously built that repo, even if the repo has changed since, for example only in its README.md. So, when possible, we want to schedule a repo's build pod on a node where it has been built before.
  2. Not blocking scale down in autoscaling clusters
We would also like to avoid scheduling build pods on nodes that the cluster autoscaler may want to scale down.

What is our desired scheduling practice?

We need to decide how we actually want the build pods to be scheduled. There is no obvious way to do it, and it is typically hard to optimize for both performance and autoscaling viability at the same time.

Boilerplate desired scheduling practice

I'll now provide a boilerplate idea, as a starting point, for how to schedule the build pods.

  1. We make the scheduler aware of whether this is an autoscaling cluster. If it is, we greatly reduce the score of almost empty nodes that could be scaled down.
  2. We schedule the build pod on nodes with a history of building this repo.

Technical solution implementation details

We must utilize a non-default scheduler. We could use the default kube-scheduler binary and customize its behavior through configuration, or we could write our own. Writing your own scheduler is certainly possible, but I think it would add too much complexity.

  • We utilize a custom scheduler, but like z2jh's scheduler we deploy an official kube-scheduler binary with a customized configuration, and reference it from the build pod's specification using the spec.schedulerName field.

    [spec.schedulerName] If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler. --- Kubernetes PodSpec documentation.

  • We customize the kube-scheduler binary's behavior through a provided config, just like in z2jh, but we try to use node annotations somehow. For example, we could make the build pod annotate the node it runs on with the repo it attempts to build, so that the scheduler can later try to place builds of that repo on the same node.

  • We make the BinderHub builder pod annotate the node by communicating with the k8s API. To allow BinderHub to communicate like this, some RBAC setup is required, for example like z2jh's user-scheduler's RBAC. It will need a ServiceAccount, a ClusterRole, and a ClusterRoleBinding, where the ClusterRole defines that it should be allowed to read and write annotations on nodes. For an example of a pod communicating with the k8s API, we can learn the relevant parts from z2jh's image-awaiter, which also communicates with the k8s API. (A sketch of what the annotation call could look like follows this list.)

  • We make the BinderHub image-cleaner also clean up the associated node annotations when it cleans images off nodes, which would require similar RBAC permissions to those the build pods need in order to annotate the nodes in the first place.
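
A minimal sketch, assuming the RBAC described above is in place, of what the node-annotation call could look like with the official kubernetes Python client. The helper name and annotation key prefix are made up for illustration and are not existing BinderHub API.

```python
# Hypothetical sketch: record on a node which repo was built there, via the
# k8s API. Requires a ServiceAccount bound to a ClusterRole that allows
# "get" and "patch" on nodes.
import hashlib

from kubernetes import client, config


def annotate_node_with_repo(node_name, repo_url):
    """Record that repo_url has been built on node_name (illustrative helper)."""
    config.load_incluster_config()  # authenticate with the mounted ServiceAccount
    core_v1 = client.CoreV1Api()

    # Annotation keys are length- and character-restricted, so hash the repo
    # URL into the key and keep the full URL as the value.
    repo_hash = hashlib.sha256(repo_url.encode()).hexdigest()[:16]
    annotation_key = f"binder.example.org/built-{repo_hash}"  # made-up prefix

    patch = {"metadata": {"annotations": {annotation_key: repo_url}}}
    core_v1.patch_node(node_name, patch)
```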

Knowledge reference

kube-scheduler configuration

https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#kube-scheduler-implementation

kube-scheduler source code

https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler

Kubernetes scheduling basics

From this video, you should pick up the role of a scheduler, and that the default kube-scheduler binary considers predicates (a.k.a. filtering) and priorities (a.k.a. scoring).

Kubernetes scheduling deep dives

Past related discussion

#854

consideRatio (Member, Author) commented Sep 9, 2019

@betatim and I were speaking a lot about this, and these are my notes that can help us implement something.

Notes

Implementation idea (v1)

  • Tie a (Cluster)Role using a (Cluster)RoleBinding to the existing ServiceAccount mounted by the BinderHub pod, in order to be allowed to use the k8s API to inspect the nodes, the DinD DaemonSet pods, or both, to figure out the kubernetes.io/hostname label of all the nodes that DinD pods have been scheduled to.
    • Use these k8s API permissions to figure out the available kubernetes.io/hostname label values on the nodes where DinD DaemonSet pods reside, and save them to an in-memory list. This needs to be done when a build pod is to be created.
    • Throttle these kinds of requests to the k8s API to roughly one every 30-60 seconds and fall back to a cached result for throttled requests. The throttle timescale should relate to the time it takes to autoscale up and down.
  • Implement a rendezvous hashing algorithm that deterministically allocates repo names to nodes, and recommend the resulting node to the build pod through a preferred node affinity (a sketch follows this list).
    • Verify its behavior and unit test it. Hopefully it can be done so that with X nodes, adding one only moves a fraction of the repos to the new node and lets the rest remain where they were.
  • Add a step to the build pod creation process where two different preferred node label affinities are added, weighted to favor one as primary and keep the other as a backup.
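
A minimal sketch of the rendezvous hashing idea, only to illustrate the determinism and minimal-reshuffling properties mentioned above; the name rank_nodes is illustrative, not existing BinderHub API.

```python
# Rendezvous (highest-random-weight) hashing sketch: every (repo, node) pair
# gets a deterministic score, and nodes are ranked by that score per repo.
import hashlib


def rank_nodes(repo_url, node_names):
    """Return node names ordered by preference for building repo_url.

    Adding or removing a node only changes the ranking for the repos whose
    top node was the one added/removed; everything else stays put.
    """
    def score(node):
        digest = hashlib.sha256(f"{repo_url}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(node_names, key=score, reverse=True)


# The first two entries could feed the primary and backup preferred node
# affinities mentioned above:
# primary, backup = rank_nodes(repo_url, hostnames)[:2]
```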

Algorithm pro / cons

  • Stateless!!!
  • Distributes build pods of different repositories at random across nodes, which is good and bad
  • It cannot guarantee historic locality when the cluster autoscales, but it does a good job without keeping any state about where repos have been built in the past.

Future improvements

  • It could keep build state in memory and fall back to rendezvous hashing.
  • It could do the above, but also persist the state in node labels (which indirectly uses the k8s etcd key-value store) or somewhere else persistent. While this may slightly improve the result, it also adds complexity to our codebase and may not be considered an improvement overall.

Standalone points

  • A container's requested CPU remains available to other containers on the node if they manage to be scheduled there; the request is not blocking. In the k8s docs we can read that CPU requests correspond to the docker command line flag --cpu-shares.
  • We want the DinD pods to utilize as much CPU as possible without taking it all from the user pods. As I see it, this can be done in two ways. One option is a low resource request and no limit: the DinD pod could then grab as much CPU as, say, another user, but could also end up squeezed down to very little CPU if it competes with for example 20 other users for 100% of the CPU. The other option is a high CPU request and a limit: that would force CPU away from the users but still leave some for them by capping the DinD pod at a portion of the node's CPU. I think the best option is to request about as much as one or two users and leave it unlimited. At worst it steals CPU from users with peaking CPU load, and at best it allows full utilization of the CPU cores without hogging all of the CPU from the users. Overall, I think best practice should be DinD pods with CPU requests similar to the user pods: low and unlimited. (A sketch of such a resource spec follows this list.)
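
To make the "request about as much as a user, leave it unlimited" option concrete, here is an illustrative resource spec using the Python kubernetes client; the numbers and image name are placeholders, not recommendations from this thread.

```python
# Illustrative only: low CPU request, no CPU limit, so the DinD container gets
# a guaranteed baseline (competing via cpu-shares) but can burst to idle cores.
from kubernetes import client

dind_resources = client.V1ResourceRequirements(
    # Placeholder values, roughly "one or two users' worth" of CPU.
    requests={"cpu": "500m", "memory": "1Gi"},
    # No limits set: the container can burst to spare capacity and is squeezed
    # back via cpu-shares when user pods peak.
)

dind_container = client.V1Container(
    name="dind",
    image="docker:dind",  # placeholder image reference
    resources=dind_resources,
)
```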

betatim (Member) commented Sep 10, 2019

It was great discussing this and seeing how we started with a fairly complicated idea like "write a custom scheduler" and now have a much simpler solution!

There is another implementation of rendezvous hashing here.

To get the possible values of the kubernetes.io/hostname label it seems we can describe each of the DinD pods in the daemonset. Their Node field contains a value that is (on our GKE and OVH clusters) the same as the one used in the label. This means that to compute the list of possible node names we describe the daemonset to get the Selector (e.g. name=ovh-dind), select pods with that (kubectl get pods -l name=ovh-dind), then inspect the Node field of each pod we found. This is nice because we don't need to inspect the nodes themselves, so we don't need a cluster-level role.
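
A sketch of that lookup, assuming the official kubernetes Python client; the namespace and daemonset names are placeholders.

```python
# Sketch of the lookup above: read the DinD DaemonSet's selector, list its
# pods with that selector, and collect the node names the pods landed on.
# Only namespaced permissions are needed, no cluster-level role.
from kubernetes import client, config


def dind_node_names(namespace="binder", daemonset_name="dind"):  # placeholder names
    config.load_incluster_config()
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    ds = apps_v1.read_namespaced_daemon_set(daemonset_name, namespace)
    selector = ",".join(
        f"{key}={value}" for key, value in ds.spec.selector.match_labels.items()
    )

    pods = core_v1.list_namespaced_pod(namespace, label_selector=selector)
    # On our GKE and OVH clusters, spec.node_name matches kubernetes.io/hostname.
    return sorted({pod.spec.node_name for pod in pods.items if pod.spec.node_name})
```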

The following would be my suggestion for how to split this into several steps that can be tackled individually:

  1. create a utility class that can be used to get a ranked list of nodes for a given repository URL. This utility class could use a dummy algorithm (random sorting) to figure out all the mechanics of delivering the right information to the class (for example: is the URL of the repo available when we schedule the build pod?)
  2. (depends on 1.) create an implementation of the utility class that uses rendezvous hashing to compute the ranked list
  3. create a function that obtains the list of possible node names and uses the at_most_every decorator from health.py to throttle API calls
  4. combine all previous steps to assign a preferred node affinity to each build pod (a sketch follows this list). This behaviour should be configurable via a config option and off by default, so the current behaviour remains the default.
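
A sketch of what step 4's preferred node affinity could look like via the Python kubernetes client, reusing the hypothetical ranked node list from the earlier notes; the weights are placeholders.

```python
# Sketch: turn the two top-ranked nodes into weighted "preferred" (soft) node
# affinities on the build pod, so scheduling still succeeds even if neither
# node is available.
from kubernetes import client


def build_pod_affinity(ranked_nodes):
    preferred_terms = [
        client.V1PreferredSchedulingTerm(
            weight=weight,  # placeholder weights: primary 100, backup 50
            preference=client.V1NodeSelectorTerm(
                match_expressions=[
                    client.V1NodeSelectorRequirement(
                        key="kubernetes.io/hostname",
                        operator="In",
                        values=[node],
                    )
                ]
            ),
        )
        for node, weight in zip(ranked_nodes[:2], (100, 50))
    ]
    return client.V1Affinity(
        node_affinity=client.V1NodeAffinity(
            preferred_during_scheduling_ignored_during_execution=preferred_terms
        )
    )
```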

What do you think? And do you want to tackle one of these already?


Maybe we can start a new issue on "Resource requests for build and DIND pods" to discuss what the options are and how to configure things.

consideRatio (Member, Author) commented:

@betatim I think this may be quite fun to implement, but I have a long list of things to work on already so I figure I'll leave this to you :)

I'd be very happy to review whenever work is done and continue the discussion on implementation aspects.

betatim (Member) commented Sep 11, 2019

I've started work on 1 (and a little bit of 2). PR coming soon.

consideRatio changed the title from "Optimizing scheduling of build pods with a custom scheduler" to "Optimizing scheduling of build pods" on Sep 11, 2019
betatim (Member) commented Nov 15, 2019

I think #949 and follow-ups implemented this, so I'll close this. Maybe we can make new issues for some of the possible improvements/ideas we had beyond what is implemented.

betatim closed this as completed on Nov 15, 2019