[WIP] Add a secondary scheduler with policy we can tweak #543

Closed
wants to merge 2 commits

Conversation


@yuvipanda yuvipanda commented Feb 27, 2018

Still running into kubernetes/kubernetes#60469,
which makes this unusable with an autoscaler.

policy.json content comes from kubernetes/kubernetes#59401

Fixes #542

@yuvipanda

Setting the MostRequestedPriority priority's weight to something like 10 or 100 seems to give us the behavior we want, but everything is stonewalled by kubernetes/kubernetes#60469.
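For reference, the weight sits on the priority entry in policy.json; a minimal sketch of the relevant fragment (the weight value here is just illustrative):

"priorities": [
  { "name": "MostRequestedPriority", "weight": 100 }
]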

@consideRatio consideRatio commented May 15, 2018

@yuvipanda The KubeCon 2018 videos are out, and they were very relevant for me to look at while considering implementing a scheduler, especially for the singleuser-server pods! I got excited! :D

Presentations regarding scheduling

Other stuff regarding scheduling

roleRef:
  kind: ClusterRole
  name: {{ .Chart.Name }}-{{ .Release.Name }}-scheduler
  apiGroup: rbac.authorization.k8s.io
@consideRatio consideRatio Jun 16, 2018

We can use name: system:kube-scheduler instead; that means we don't need to define our own ClusterRole, since we reuse the already defined one.

Picked that up from this presentation. Also found it later in this kubernetes documentation.
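With that, the roleRef above would look something like this (a sketch, keeping the rest of the ClusterRoleBinding as in this PR):

roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io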

@consideRatio consideRatio Jun 17, 2018

But... if we want to support running our scheduler with replicas > 1, we should add entries to the ClusterRole under resourceNames, as described in configure multiple schedulers - part 3. So if that is the case, we may need to keep using a custom-defined ClusterRole.

metadata:
  name: {{ .Chart.Name }}-scheduler-config
data:
  policy.json: |

Should this be policy.cfg ?

kube-scheduler documentation: (screenshot omitted)

- --leader-elect=true
- --scheduler-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler
- --lock-object-namespace={{ .Release.Namespace }}
- --lock-object-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock

Research note

The --leader-elect flag is supposed to be enabled "[...] when running replicated components for high availability." and it defaults to true.

Reading configure multiple schedulers - section 3, they write that if you want to set up leader election, you must update the following...

--leader-elect
--lock-object-namespace
--lock-object-name

And one must also add the name of the scheduler to the ClusterRole, under resourceNames in the relevant rule.
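A hedged sketch of what that rule could look like, assuming the lock object is an Endpoints object named like the --lock-object-name flag in this PR (the verbs and resource here are my reading of the docs, not verified):

rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["{{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock"]
    verbs: ["get", "update"]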


While reading the kube-scheduler documentation I understood it as: --lock-object-name should refer to an Endpoints object by default, but it can also be a ConfigMap if we also pass --leader-elect-resource-lock=configmaps.

Since we don't have an Endpoints object for our scheduler (we don't currently have a Service for it, and I'm unaware of any need for one), we should specify the ConfigMap we use.

        command:
        - /usr/local/bin/kube-scheduler
        - --address=0.0.0.0
        - --scheduler-name=jupyterhub-scheduler
        - --policy-configmap=scheduler-config
        - --policy-configmap-namespace={{ .Release.Namespace }}
        - --leader-elect=true
        - --leader-elect-resource-lock=configmaps
        - --lock-object-name=scheduler-config
        - --lock-object-namespace={{ .Release.Namespace }}
        - --v=4

- --scheduler-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler
- --lock-object-namespace={{ .Release.Namespace }}
- --lock-object-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock
- -v=4
@consideRatio consideRatio Jun 17, 2018

I did not see information about this in the documentation; it is some verbosity level, I figure. I've seen --v used with the kube-scheduler binary as well (as opposed to -v). Does it matter? Hmmm...

@consideRatio

Notes

@consideRatio consideRatio commented Jun 27, 2018

TL;DR

I set the image using this helper, allowing for a potential override of the kube-scheduler version but defaulting to the cluster's version.

{{- /*
Renders the kube-scheduler's image based on .Values.scheduler.image.name and
optionally on .Values.scheduler.image.tag. The default tag is set to the
cluster's Kubernetes version.
*/}}
{{- define "jupyterhub.scheduler.image" -}}
{{- $name := .Values.scheduler.image.name -}}
{{- $valuesVersion := .Values.scheduler.image.tag -}}
{{- $clusterVersion := (split "-" .Capabilities.KubeVersion.GitVersion)._0 -}}
{{- $tag := $valuesVersion | default $clusterVersion -}}
{{ $name }}:{{ $tag }}
{{- end }}
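The helper would then be consumed in the scheduler Deployment's container spec roughly like this (a sketch, assuming Helm's include function; the surrounding spec is omitted):

containers:
  - name: kube-scheduler
    image: {{ include "jupyterhub.scheduler.image" . }}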

Regarding the kube-scheduler binary version

I found this Helm documentation to be very useful along with the available tags in the gcr image repo.

The goal is to adjust this line.

          image: gcr.io/google_containers/kube-scheduler-amd64:v1.10.4

We could use the .Chart object provided by Helm for templates, which gives access to Chart.yaml information, or use the .Capabilities object, which holds information about the Kubernetes cluster.

When we use .Capabilities we would be in sync with the actual cluster after a helm upgrade, but we may get out of sync if a master upgrade is made without a subsequent helm upgrade.
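As a worked example of the default: on the cluster inspected below, .Capabilities.KubeVersion.GitVersion is v1.10.4-gke.2, and the split in the helper strips the provider suffix, so a standalone snippet like this (not chart code) would render v1.10.4:

{{ (split "-" .Capabilities.KubeVersion.GitVersion)._0 }}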


Options

  1. .Capabilities
  2. values.yaml
  3. .Chart / Hardcode

The downside of 1. is that we must run helm upgrades to get fresh template renderings. The downside of 2. is that we need manual configuration in values.yaml.

Perhaps the best choice would be to go for a values.yaml override, but with a default that relies on .Capabilities? Waiiit... that makes no sense, that would also require a helm upgrade... But it would allow some configuration that might be useful... I'm quite decided on this now: allow for an override, but rely on .Capabilities unless a tag: 1.11.0 override (or similar) is made.

# values.yaml
scheduler:
  image:
    name: gcr.io/google_containers/kube-scheduler-amd64
    tag:

Research

I fetched some details about what .Capabilities and .Chart actually render to.

A template file to inspect the output

# inspect-helm-objects.yaml
BEGIN .CAPABILITIES
{{ .Capabilities | toYaml }}
END .CAPABILITIES

BEGIN .CHART
{{ .Chart | toYaml }}
END .CHART

A command to inspect

helm install --dry-run jupyterhub -f tools/lint-chart-values.yaml

The output

BEGIN .CAPABILITIES
APIVersions:
  admissionregistration.k8s.io/v1alpha1: {}
  admissionregistration.k8s.io/v1beta1: {}
  apiextensions.k8s.io/v1beta1: {}
  apiregistration.k8s.io/v1: {}
  apiregistration.k8s.io/v1beta1: {}
  apps/v1: {}
  apps/v1beta1: {}
  apps/v1beta2: {}
  authentication.k8s.io/v1: {}
  authentication.k8s.io/v1beta1: {}
  authorization.k8s.io/v1: {}
  authorization.k8s.io/v1beta1: {}
  autoscaling/v1: {}
  autoscaling/v2beta1: {}
  batch/v1: {}
  batch/v1beta1: {}
  batch/v2alpha1: {}
  certificates.k8s.io/v1beta1: {}
  events.k8s.io/v1beta1: {}
  extensions/v1beta1: {}
  metrics.k8s.io/v1beta1: {}
  networking.k8s.io/v1: {}
  policy/v1beta1: {}
  rbac.authorization.k8s.io/v1: {}
  rbac.authorization.k8s.io/v1alpha1: {}
  rbac.authorization.k8s.io/v1beta1: {}
  scalingpolicy.kope.io/v1alpha1: {}
  scheduling.k8s.io/v1alpha1: {}
  settings.k8s.io/v1alpha1: {}
  storage.k8s.io/v1: {}
  storage.k8s.io/v1alpha1: {}
  storage.k8s.io/v1beta1: {}
  v1: {}
KubeVersion:
  buildDate: 2018-06-15T21:48:39Z
  compiler: gc
  gitCommit: eb2e43842aaa21d6f0bb65d6adf5a84bbdc62eaf
  gitTreeState: clean
  gitVersion: v1.10.4-gke.2
  goVersion: go1.9.3b4
  major: "1"
  minor: 10+
  platform: linux/amd64
TillerVersion:
  git_commit: 20adb27c7c5868466912eebdf6664e7390ebe710
  git_tree_state: clean
  sem_ver: v2.9.1

END .CAPABILITIES

BEGIN .CHART
appVersion: v0.9.1
description: Multi-user Jupyter installation
home: https://z2jh.jupyter.org
icon: https://jupyter.org/assets/hublogo.svg
kubeVersion: '>=1.9.0-0'
name: jupyterhub
sources:
- https://github.com/jupyterhub/zero-to-jupyterhub-k8s
tillerVersion: '>=2.9.1-0'
version: v0.7-dev

END .CHART

@consideRatio

About RBAC

I'm not happy about needing to create a ClusterRoleBinding, but I figure we must in order to have a scheduler that works.

About the policies

The scheduler can have performance issues. How can we minimize them? I figure we might be able to remove some node filters (a.k.a. predicates) or some node preferences (a.k.a. priorities), and that would reduce the workload.

The heavy work is probably done in the preferences / priorities.
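A hedged sketch of that idea, combining the weighted MostRequestedPriority mentioned earlier with a reduced set of predicates (which names are safe to keep or drop would need testing; these are examples from the default set, not this chart's actual policy.json):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "PodFitsHostPorts" },
    { "name": "MatchNodeSelector" },
    { "name": "NoVolumeZoneConflict" }
  ],
  "priorities": [
    { "name": "MostRequestedPriority", "weight": 100 }
  ]
}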

@consideRatio consideRatio commented Jun 27, 2018

About kube-scheduler

https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler.md
https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler_algorithm.md

A custom NodeLabelPriority or NodeLabelPredicate does not care about the value of the label, only whether it is present or not.

The MetadataPriority stuff seems to be a mixed bag of affinities etc.

Resources

@consideRatio

Wieeeeeeeeeeeeee this took some time but your checklist was excellent @yuvipanda !

@consideRatio consideRatio commented Jul 1, 2018

The kube-scheduler binary's documentation deprecates flags such as --policy-configmap=<name of configmap> without saying what to use instead. But I found this code and figure they will support setting the Policy ConfigMap through a KubeSchedulerConfiguration object of the apiGroup componentconfig/v1alpha1.
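A hedged sketch of what such a config could look like, with field names as I read them in that componentconfig/v1alpha1 code (the object names mirror the flags used above; none of this is verified against a running cluster):

apiVersion: componentconfig/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: jupyterhub-scheduler
algorithmSource:
  policy:
    configMap:
      namespace: {{ .Release.Namespace }}
      name: scheduler-config
leaderElection:
  leaderElect: true
  lockObjectNamespace: {{ .Release.Namespace }}
  lockObjectName: scheduler-config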

That would probably even allow us to use the default scheduler, but that could influence things at a broader level than we want our chart to touch, so we should probably still deploy our own.


Additional documentation of the KubeSchedulerConfiguration object

@consideRatio

Continued on in #758
