[WIP] Add a secondary scheduler with policy we can tweak #543

Closed
wants to merge 2 commits

Conversation


@yuvipanda yuvipanda commented Feb 27, 2018

Still running into kubernetes/kubernetes#60469,
which makes this unusable with an autoscaler.

policy.json content comes from kubernetes/kubernetes#59401

Fixes #542

@yuvipanda

Setting the MostRequestedPriority priority's weight to something like 10 or 100 seems to give us the behavior we want, but everything is stonewalled by kubernetes/kubernetes#60469.
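For reference, the weight sits on the priority entry in policy.json; a minimal sketch of the relevant fragment (the weight value here is just illustrative):

"priorities": [
  { "name": "MostRequestedPriority", "weight": 100 }
]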

@consideRatio consideRatio commented May 15, 2018

@yuvipanda The KubeCon 2018 videos are out, and they were very relevant for me to look at while considering implementing a scheduler, especially for the singleuser-server pods! I got excited! :D

Presentations regarding scheduling

Other stuff regarding scheduling

roleRef:
  kind: ClusterRole
  name: {{ .Chart.Name }}-{{ .Release.Name }}-scheduler
  apiGroup: rbac.authorization.k8s.io
@consideRatio consideRatio Jun 16, 2018

We can use name: system:kube-scheduler instead; that means we don't need to define our own ClusterRole, since we reuse the already defined one.

Picked that up from this presentation. Also found it later in this kubernetes documentation.
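With that, the roleRef above would look something like this (a sketch, keeping the rest of the ClusterRoleBinding as in this PR):

roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io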

@consideRatio consideRatio Jun 17, 2018

But... if we want to support running our scheduler with replicas > 1, we should add entries to the ClusterRole under resourceNames, as described in configure multiple schedulers - part 3. So if that is the case, we may need to keep using a custom-defined ClusterRole.

metadata:
  name: {{ .Chart.Name }}-scheduler-config
data:
  policy.json: |

Should this be policy.cfg ?

kube-scheduler documentation: (screenshot omitted)

- --leader-elect=true
- --scheduler-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler
- --lock-object-namespace={{ .Release.Namespace }}
- --lock-object-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock

Research note

The --leader-elect flag is supposed to be enabled "[...] when running replicated components for high availability." and it defaults to true.

Reading configure multiple schedulers - section 3, they write that if you want to set up leader election, you must update the following...

--leader-elect
--lock-object-namespace
--lock-object-name

And one must also add the name of the scheduler to the ClusterRole, under resourceNames in the relevant rule.
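A hedged sketch of what that rule could look like, assuming the lock object is an Endpoints object named like the --lock-object-name flag in this PR (the verbs and resource here are my reading of the docs, not verified):

rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["{{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock"]
    verbs: ["get", "update"]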


While reading the kube-scheduler documentation I understood it as: --lock-object-name should refer to an Endpoints object by default, but it can also be a ConfigMap if we also pass --leader-elect-resource-lock=configmaps.

Since we don't have an Endpoints object for our scheduler (we don't currently have a Service for it, and I'm unaware of any need for one), we should specify the ConfigMap we use.

        command:
        - /usr/local/bin/kube-scheduler
        - --address=0.0.0.0
        - --scheduler-name=jupyterhub-scheduler
        - --policy-configmap=scheduler-config
        - --policy-configmap-namespace={{ .Release.Namespace }}
        - --leader-elect=true
        - --leader-elect-resource-lock=configmaps
        - --lock-object-name=scheduler-config
        - --lock-object-namespace={{ .Release.Namespace }}
        - --v=4

- --scheduler-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler
- --lock-object-namespace={{ .Release.Namespace }}
- --lock-object-name={{ .Chart.Name }}-{{ .Release.Name }}-scheduler-lock
- -v=4
@consideRatio consideRatio Jun 17, 2018

I did not see information about this in the documentation; it is some verbosity level, I figure. I've seen --v used with the kube-scheduler binary as well (as opposed to -v). Does it matter? Hmmm...

@consideRatio

Notes

@consideRatio consideRatio commented Jun 27, 2018

TL;DR

I set the image using this helper, allowing for a potential override of the kube-scheduler version but defaulting to the cluster's version.

{{- /*
Renders the kube-scheduler's image based on .Values.scheduler.image.name and
optionally on .Values.scheduler.image.tag. The default tag is set to the
cluster's Kubernetes version.
*/}}
{{- define "jupyterhub.scheduler.image" -}}
{{- $name := .Values.scheduler.image.name -}}
{{- $valuesVersion := .Values.scheduler.image.tag -}}
{{- $clusterVersion := (split "-" .Capabilities.KubeVersion.GitVersion)._0 -}}
{{- $tag := $valuesVersion | default $clusterVersion -}}
{{ $name }}:{{ $tag }}
{{- end }}
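The helper would then be consumed in the scheduler Deployment's container spec roughly like this (a sketch, assuming Helm's include function; the surrounding spec is omitted):

containers:
  - name: kube-scheduler
    image: {{ include "jupyterhub.scheduler.image" . }}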

Regarding the kube-scheduler binary version

I found this Helm documentation to be very useful along with the available tags in the gcr image repo.

The goal is to adjust this line.

          image: gcr.io/google_containers/kube-scheduler-amd64:v1.10.4

We could use the .Chart object provided by Helm for templates, which gives access to Chart.yaml information, or use the .Capabilities object, which holds information about the Kubernetes cluster.

When we use .Capabilities we would be in sync with the actual cluster after a helm upgrade, but we may get out of sync if a master upgrade is made without a subsequent helm upgrade.
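As a worked example of the default: on the cluster inspected below, .Capabilities.KubeVersion.GitVersion is v1.10.4-gke.2, and the split in the helper strips the provider suffix, so a standalone snippet like this (not chart code) would render v1.10.4:

{{ (split "-" .Capabilities.KubeVersion.GitVersion)._0 }}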


Options

  1. .Capabilities
  2. values.yaml
  3. .Chart / Hardcode

The downside of 1. is that we must run helm upgrades to get fresh template renderings. The downside of 2. is that we need manual configuration in values.yaml.

Perhaps the best choice would be to go for a values.yaml override, but with a default that relies on .Capabilities? Waiiit... that makes no sense, that would also require a helm upgrade... But it would allow some configuration that might be useful... I'm quite decided on this now: allow for an override, but rely on .Capabilities unless a tag: 1.11.0 override (or similar) is made.

# values.yaml
scheduler:
  image:
    name: gcr.io/google_containers/kube-scheduler-amd64
    tag:

Research

I fetched some details about what .Capabilities and .Chart actually render to.

A template file to inspect the output

# inspect-helm-objects.yaml
BEGIN .CAPABILITIES
{{ .Capabilities | toYaml }}
END .CAPABILITIES

BEGIN .CHART
{{ .Chart | toYaml }}
END .CHART

A command to inspect

helm install --dry-run jupyterhub -f tools/lint-chart-values.yaml

The output

BEGIN .CAPABILITIES
APIVersions:
  admissionregistration.k8s.io/v1alpha1: {}
  admissionregistration.k8s.io/v1beta1: {}
  apiextensions.k8s.io/v1beta1: {}
  apiregistration.k8s.io/v1: {}
  apiregistration.k8s.io/v1beta1: {}
  apps/v1: {}
  apps/v1beta1: {}
  apps/v1beta2: {}
  authentication.k8s.io/v1: {}
  authentication.k8s.io/v1beta1: {}
  authorization.k8s.io/v1: {}
  authorization.k8s.io/v1beta1: {}
  autoscaling/v1: {}
  autoscaling/v2beta1: {}
  batch/v1: {}
  batch/v1beta1: {}
  batch/v2alpha1: {}
  certificates.k8s.io/v1beta1: {}
  events.k8s.io/v1beta1: {}
  extensions/v1beta1: {}
  metrics.k8s.io/v1beta1: {}
  networking.k8s.io/v1: {}
  policy/v1beta1: {}
  rbac.authorization.k8s.io/v1: {}
  rbac.authorization.k8s.io/v1alpha1: {}
  rbac.authorization.k8s.io/v1beta1: {}
  scalingpolicy.kope.io/v1alpha1: {}
  scheduling.k8s.io/v1alpha1: {}
  settings.k8s.io/v1alpha1: {}
  storage.k8s.io/v1: {}
  storage.k8s.io/v1alpha1: {}
  storage.k8s.io/v1beta1: {}
  v1: {}
KubeVersion:
  buildDate: 2018-06-15T21:48:39Z
  compiler: gc
  gitCommit: eb2e43842aaa21d6f0bb65d6adf5a84bbdc62eaf
  gitTreeState: clean
  gitVersion: v1.10.4-gke.2
  goVersion: go1.9.3b4
  major: "1"
  minor: 10+
  platform: linux/amd64
TillerVersion:
  git_commit: 20adb27c7c5868466912eebdf6664e7390ebe710
  git_tree_state: clean
  sem_ver: v2.9.1

END .CAPABILITIES

BEGIN .CHART
appVersion: v0.9.1
description: Multi-user Jupyter installation
home: https://z2jh.jupyter.org
icon: https://jupyter.org/assets/hublogo.svg
kubeVersion: '>=1.9.0-0'
name: jupyterhub
sources:
- https://github.com/jupyterhub/zero-to-jupyterhub-k8s
tillerVersion: '>=2.9.1-0'
version: v0.7-dev

END .CHART

@consideRatio

About RBAC

I'm not happy about needing to create a ClusterRoleBinding, but I figure we must in order to have a scheduler that works.

About the policies

The scheduler can have performance issues. How can we minimize them? I figure we might be able to remove some node filters (a.k.a. predicates) or some node preferences (a.k.a. priorities), and that would reduce the workload.

The heavy work is probably done in the preferences / priorities.
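A hedged sketch of that idea, combining the weighted MostRequestedPriority mentioned earlier with a reduced set of predicates (which names are safe to keep or drop would need testing; these are examples from the default set, not this chart's actual policy.json):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "PodFitsHostPorts" },
    { "name": "MatchNodeSelector" },
    { "name": "NoVolumeZoneConflict" }
  ],
  "priorities": [
    { "name": "MostRequestedPriority", "weight": 100 }
  ]
}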

@consideRatio consideRatio commented Jun 27, 2018

About kube-scheduler

https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler.md
https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler_algorithm.md

A custom NodeLabelPriority or NodeLabelPredicate does not care about the value of the label, only whether it is present or not.

The MetadataPriority stuff seems to be a mixed bag of affinities etc.

Resources

@consideRatio

Wieeeeeeeeeeeeee this took some time but your checklist was excellent @yuvipanda !

@consideRatio consideRatio commented Jul 1, 2018

The kube-scheduler binary's documentation deprecates flags such as --policy-configmap=<name of configmap> without saying what to use instead. But I found this code and figure they will support setting the Policy ConfigMap through a KubeSchedulerConfiguration object of the apiGroup componentconfig/v1alpha1.
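A hedged sketch of what such a config could look like, with field names as I read them in that componentconfig/v1alpha1 code (the object names mirror the flags used above; none of this is verified against a running cluster):

apiVersion: componentconfig/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: jupyterhub-scheduler
algorithmSource:
  policy:
    configMap:
      namespace: {{ .Release.Namespace }}
      name: scheduler-config
leaderElection:
  leaderElect: true
  lockObjectNamespace: {{ .Release.Namespace }}
  lockObjectName: scheduler-config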

That would probably even allow us to use the default scheduler, but that could influence things at a broader level than we want our chart to touch, so we should probably still deploy our own.


Additional documentation of the KubeSchedulerConfiguration object

@consideRatio

Continued on in #758
