
[TO BE SPLIT UP] User scheduler, user placeholders, node pool taints #758

Closed

Conversation

@consideRatio (Member) commented Jul 10, 2018

NOTE

This PR will be split up into pieces, see #841

This PR

While a cluster autoscaler that adds and removes nodes can improve cloud resource efficiency, some issues exist when using it with this chart. This PR makes cluster autoscaling more robust and efficient to use.

PR Status

  • Writing documentation
  • Considering how to get it merged; breaking the PR apart into various pieces is at least partially plausible, but a lot of the pieces are entangled.

Fixed and alleviated issues

  • Scaling down is inefficient because new pods spread out across the nodes
    • Solved by having a custom scheduler pack the users tightly instead of spreading them out
  • Scaling down nodes often fails because a system pod like kube-dns or a chart pod like hub or proxy has been scheduled on them instead of staying together on one node
  • Scaling up just in time forces users to wait several minutes
    • Alleviated by JupyterHub 0.9, which allows kubespawner to show waiting users a spawning progress report that includes messages about a triggered scale up.
    • Solved (requires k8s 1.11) with user-placeholder pods. You can configure, for example, two placeholder pods; they trigger a node scale up when they cannot schedule, just like a user would, but when a user arrives and there is no room to schedule it, the user-placeholder pods make room for the real user. See the kubernetes documentation on pod priority and preemption for the details, and the sketch after this list.
  • It is troublesome to test a cluster autoscaler setup
    • Alleviated by the introduction of user-dummy pods, which mock real users but are easier to add / remove for testing purposes.
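
A minimal sketch of the pod priority and preemption mechanism the user-placeholder pods rely on, assuming a k8s 1.10 alpha cluster where the API group is still scheduling.k8s.io/v1alpha1. The class names, values, and descriptions below are illustrative, not the chart's actual manifests.

# Illustrative only: a low-priority class for placeholder pods and a
# cluster-default class for real user pods. The scheduler preempts pods of the
# lower class whenever a higher-priority pod cannot otherwise be scheduled.
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: user-placeholder-priority    # hypothetical name
value: -10
globalDefault: false
description: "Placeholder pods, evicted to make room for real users."
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: default-user-priority        # hypothetical name
value: 0
globalDefault: true
description: "Real user pods, and anything without an explicit priorityClassName."
EOF

A placeholder pod then only needs to set priorityClassName to the low-priority class and request roughly the same resources as a typical user pod.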

Feature availability

K8s 1.11 is required for pod preemption and pod priority. At the moment I'm writing this, k8s 1.11 is not yet available on GKE, and Helm lacks support for k8s 1.11, so we need to await Helm 2.11. We can help them out by testing their k8s 1.11 support.

Test this PR

To test all of the functionality, you must currently use a k8s 1.10 cluster with alpha features enabled. On GKE that requires a new cluster to be set up. I've written down the steps needed to do so below.

Tooling setup

# Latest gcloud
sudo apt-get install google-cloud-sdk

# Latest kubectl
sudo apt-get install kubectl

# Latest Helm and Z2JH charts
# NOTE: an empty DESIRED_VERSION makes the installer script fetch the latest release
export DESIRED_VERSION=
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add consideratio https://consideratio.github.io/helm-chart/
helm repo update
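
# Optional sanity check (not part of the original steps): verify the tooling
# and the configured chart repositories before creating the cluster.
gcloud version
kubectl version --client
helm version --client
helm repo list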

Cluster setup

  1. Set up an alpha-enabled cluster on GKE with a core node pool for the hub / proxy / user-scheduler, labeled hub.jupyter.org/node-purpose: core
  2. Create one node pool for the users, labeled hub.jupyter.org/node-purpose: user, and taint it with hub.jupyter.org_dedicated=user:NoSchedule
# 0. Initial configuration...
CLUSTER_NAME=alpha-cluster
ZONE=europe-west4-a
gcloud config set container/cluster "${CLUSTER_NAME}"
gcloud config set compute/zone $ZONE
gcloud config set container/new_scopes_behavior true

# 1. Create the alpha-cluster and the core node pool...
# NOTE: To create a non-alpha-cluster...
# - Remove --enable-kubernetes-alpha
# - Change --no-enable-autorepair to --enable-autorepair
# NOTE: to help Adam with Helm develop k8s 1.11 support, we can specify cluster version 1.11.0-gke.1
VERSION=latest
gcloud beta container clusters create "${CLUSTER_NAME}" \
--machine-type n1-standard-1 \
--num-nodes 1 \
--enable-kubernetes-alpha \
--cluster-version $VERSION \
--no-enable-autorepair \
--enable-ip-alias \
--node-labels hub.jupyter.org/node-purpose=core

# 2. Create the user pool
# NOTE: 
#   The taint "hub.jupyter.org_dedicated" would preferably be "hub.jupyter.org/dedicated"
#   but the gcloud tool does not allow using the character /
gcloud beta container node-pools create user-pool \
--machine-type n1-standard-2 \
--num-nodes 0 \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 5 \
--no-enable-autorepair \
--node-labels hub.jupyter.org/node-purpose=user \
--node-taints hub.jupyter.org_dedicated=user:NoSchedule

# 3. Allow for a temporary single point of failure of the kube-dns service
# NOTE:
#   Unless you have two core nodes, an additional kube-dns pod to avoid a single point of
#   failure is pointless: kube-dns pods are only allowed to schedule on the core nodes due
#   to the user node pool's taint, and the extra pod's ~0.25 CPU request would leave too
#   little room for the core pods to fit on a cheap single-core node.
kubectl patch configmap --namespace kube-system kube-dns-autoscaler \
--patch '{"data": {"linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":16,\"preventSinglePointFailure\":false}"}}'

Chart installation

# 4. Setup Helm
## 4.1 Init Helm
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller

## 4.2 Verify Helm
helm version

# 4.3 Secure Helm
kubectl --namespace kube-system patch deployment tiller-deploy --type json \
--patch '[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'

# 5. Additional configuration
NAMESPACE=jhub
kubectl config set-context $(kubectl config current-context) --namespace="${NAMESPACE}"

# 6. Install chart
helm upgrade --install jhub consideratio/jupyterhub --version v0.7
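
# 7. Optionally tune the new scheduling behavior instead of relying on the chart
#    defaults. The scheduling.podPriority and scheduling.userPlaceholder keys come
#    from the split-up commit titles; the sub-keys and values below are assumptions
#    to adjust to your needs.
cat > config.yaml <<EOF
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 2
EOF
helm upgrade --install jhub consideratio/jupyterhub --version v0.7 --values config.yaml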

Try it

This PR allows for robust autoscaling: placeholder pods create a headroom so that nodes are added before they are actually required, and the placeholders are evicted whenever a real user needs the room.

By yourself

The demo assumes you have set up the user node pool's machine type and the single-user resource requests so that 4 users fit per user node, for example as sketched below.
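
A sizing sketch, assuming the chart's singleuser.cpu / singleuser.memory keys and the n1-standard-2 user pool created above (2 vCPUs, 7.5 GB); the numbers are illustrative, not values taken from this PR.

# Roughly 4 x (0.25 CPU, 1G) per node, leaving headroom for per-node system pods.
# Append to the config.yaml used when installing the chart and upgrade the release.
cat >> config.yaml <<EOF
singleuser:
  cpu:
    guarantee: 0.25
  memory:
    guarantee: 1G
EOF
helm upgrade --install jhub consideratio/jupyterhub --version v0.7 --values config.yaml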

#### To monitor what's happening
# Terminal window 1 - User nodes
watch -t -n 0.5 'echo "# User nodes"; echo; kubectl get nodes --selector hub.jupyter.org/node-purpose=user | cut -c 1-56'

# Terminal window 2 - Pending pods
watch -t -n 0.5 'echo "# Pending pods"; echo; kubectl get pods --field-selector=status.phase=Pending | cut -c 1-56'

# Terminal window 3 - Scheduled pods
watch -t -n 0.5 'echo "# Scheduled pods"; echo; kubectl describe node --selector=hub.jupyter.org/node-purpose=user | grep -E "user-placeholder|user-dummy|Namespace" | cut -c 1-56'

#### To make things happen
# let two placeholder pods create some headroom, spawning the first user node
kubectl patch statefulset user-placeholder --patch '{"spec": {"replicas": 2}}'

# Let four user dummies fill up a node, evicting the placeholders, which then trigger a new user node
kubectl scale sts/user-dummy --replicas 4

# remove all user dummies...
kubectl scale sts/user-dummy --replicas 0

# and reschedule two, noticing that they pack tightly onto the same node as the placeholders
kubectl scale sts/user-dummy --replicas 2
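
# Terminal window 4 (optional, not part of the original demo) - recent
# preemption and cluster-autoscaler activity from the event stream
watch -t -n 5 'kubectl get events --sort-by=.lastTimestamp | grep -iE "preempt|triggeredscaleup|scale" | tail -n 15'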

Look at a demo

(animated demo GIF of the autoscaling behavior described above)

TODO

Remaining...

  • Update the guide documentation
  • Merge kubespawner#205 and bump it
  • Allow easier review of this PR
  • Add a warning about activating podPriority, as it could cause old user pods to get preempted by new ones (NOTE: patching existing pods to give them the new priority was not allowed)

Future...

When k8s 1.11 is required by the chart...

  • Switch to scheduling.k8s.io/v1beta1 as the API group for the PriorityClass resource when this is testable on GKE without an alpha-enabled cluster.

consideRatio force-pushed the reworked-pack-scheduler-pr branch several times between Jul 11 and Aug 14, 2018, and added commits to consideRatio/zero-to-jupyterhub-k8s that referenced this pull request.
@consideRatio consideRatio added this to the 0.7 milestone Jul 21, 2018
@consideRatio consideRatio changed the title from [WIP] User scheduler, user placeholders, node pool taints to [TO BE SPLIT UP] User scheduler, user placeholders, node pool taints Aug 14, 2018
@minrk minrk removed this from the 0.7 milestone Aug 16, 2018
@consideRatio (Member, Author) commented:

Closed and tracked in #841

schedulerName: user-scheduler
{{- end }}
{{- $_ := merge (dict "podKind" "user") . }}
{{- $dummy := include "jupyterhub.prepareScope" $_ }}
@consideRatio (Member, Author) commented on the lines above:

This line uses a helper to set information in the $_ variable. The relevant information, for example the users' affinity info, can be extracted thanks to it being provided in config.yaml as explicit fields rather than within extraConfig @minrk.

consideRatio and minrk added commits that referenced this pull request (Aug 19 – Sep 20, 2018), including:

  • [Part 3 of #758] Remove remnant pod-culler image
  • [Part 4 of #758] Bump k8s resources to v1.9 available API
  • [Part 5 of #758] Update outdated logic for hub restarts
  • [Part 6 of #758] Storage labels: configurable extras
  • [Part 7 of #758] Added `singleuser.tolerations`, for node taints
  • [Part 8 of #758] Added singleuser affinity configuration
  • [Part 10 of #758] Added `scheduling.podPriority`
  • [Part 11 of #758] Added `scheduling.userPlaceholder`