
[TO BE SPLIT UP] User scheduler, user placeholders, node pool taints #758

Closed

Conversation

@consideRatio (Member) commented Jul 10, 2018

NOTE

This PR will be split up into pieces, see #841

This PR

While a cluster autoscaler that adds and removes nodes can improve cloud resource efficiency, some issues exist when using it with this chart. This PR makes cluster autoscaling more robust and efficient to use.

PR Status

  • Writing documentation
  • Considering how to get it merged; breaking the PR apart into various pieces is at least partially plausible, but a lot of the pieces are entangled.

Fixed and alleviated issues

  • Scaling down is inefficient because new pods spread out across the nodes
    • Solved by having a custom scheduler pack the users tightly instead of spreading them out
  • Scaling down nodes often fails because a system pod like kube-dns or a chart pod like hub or proxy has been scheduled on them instead of staying together on one node
  • Scaling up just in time forces users to wait several minutes
    • Alleviated by JupyterHub 0.9, which allows kubespawner to show waiting users a spawning progress report that includes messages about a triggered scale up.
    • Solved (requires k8s 1.11) with user-placeholder pods. You can configure, for example, two placeholder pods; they trigger a node scale up when they cannot schedule, just like a user would, but when a user arrives and there is no room to schedule it, the user-placeholder pods make room for the real user. See the kubernetes documentation on pod priority and preemption for the details, and the sketch after this list.
  • It is troublesome to test a cluster autoscaler setup
    • Alleviated by the introduction of user-dummy pods, which mock real users but are easier to add / remove for testing purposes.
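
A minimal sketch of the pod priority and preemption mechanism the user-placeholder pods rely on, assuming a k8s 1.10 alpha cluster where the API group is still scheduling.k8s.io/v1alpha1. The class names, values, and descriptions below are illustrative, not the chart's actual manifests.

# Illustrative only: a low-priority class for placeholder pods and a
# cluster-default class for real user pods. The scheduler preempts pods of the
# lower class whenever a higher-priority pod cannot otherwise be scheduled.
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: user-placeholder-priority    # hypothetical name
value: -10
globalDefault: false
description: "Placeholder pods, evicted to make room for real users."
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: default-user-priority        # hypothetical name
value: 0
globalDefault: true
description: "Real user pods, and anything without an explicit priorityClassName."
EOF

A placeholder pod then only needs to set priorityClassName to the low-priority class and request roughly the same resources as a typical user pod.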

Feature availability

K8s 1.11 is required for pod preemption and pod priority. At the moment I'm writing this, k8s 1.11 is not yet available on GKE, and Helm lacks support for k8s 1.11, so we need to await Helm 2.11. We can help them out by testing their k8s 1.11 support.

Test this PR

To test all of the functionality, you must currently use a k8s 1.10 cluster with alpha features enabled. On GKE that requires a new cluster to be set up. I've written down the steps needed to do so below.

Tooling setup

# Latest gcloud
sudo apt-get install google-cloud-sdk

# Latest kubectl
sudo apt-get install kubectl

# Latest Helm and Z2JH charts
# NOTE: an empty DESIRED_VERSION makes the installer script fetch the latest release
export DESIRED_VERSION=
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add consideratio https://consideratio.github.io/helm-chart/
helm repo update
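
# Optional sanity check (not part of the original steps): verify the tooling
# and the configured chart repositories before creating the cluster.
gcloud version
kubectl version --client
helm version --client
helm repo list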

Cluster setup

  1. Set up an alpha-enabled cluster on GKE with a core node pool for the hub / proxy / user-scheduler, labeled hub.jupyter.org/node-purpose: core
  2. Create one node pool for the users, labeled hub.jupyter.org/node-purpose: user, and taint it with hub.jupyter.org_dedicated=user:NoSchedule
# 0. Initial configuration...
CLUSTER_NAME=alpha-cluster
ZONE=europe-west4-a
gcloud config set container/cluster "${CLUSTER_NAME}"
gcloud config set compute/zone $ZONE
gcloud config set container/new_scopes_behavior true

# 1. Create the alpha-cluster and the core node pool...
# NOTE: To create a non-alpha-cluster...
# - Remove --enable-kubernetes-alpha
# - Change --no-enable-autorepair to --enable-autorepair
# NOTE: to help Adam with Helm develop k8s 1.11 support, we can specify cluster version 1.11.0-gke.1
VERSION=latest
gcloud beta container clusters create "${CLUSTER_NAME}" \
--machine-type n1-standard-1 \
--num-nodes 1 \
--enable-kubernetes-alpha \
--cluster-version $VERSION \
--no-enable-autorepair \
--enable-ip-alias \
--node-labels hub.jupyter.org/node-purpose=core

# 2. Create the user pool
# NOTE: 
#   The taint "hub.jupyter.org_dedicated" would preferably be "hub.jupyter.org/dedicated"
#   but the gcloud tool does not allow using the character /
gcloud beta container node-pools create user-pool \
--machine-type n1-standard-2 \
--num-nodes 0 \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 5 \
--no-enable-autorepair \
--node-labels hub.jupyter.org/node-purpose=user \
--node-taints hub.jupyter.org_dedicated=user:NoSchedule

# 3. Allow for a temporary single point of failure of the kube-dns service
# NOTE:
#   Unless you have two core nodes, an additional kube-dns pod to avoid a single point of
#   failure is pointless: kube-dns pods are only allowed to schedule on the core nodes due
#   to the user node pool's taint, and the extra pod's ~0.25 CPU request would leave too
#   little room for the core pods to fit on a cheap single-core node.
kubectl patch configmap --namespace kube-system kube-dns-autoscaler \
--patch '{"data": {"linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":16,\"preventSinglePointFailure\":false}"}}'

Chart installation

# 4. Setup Helm
## 4.1 Init Helm
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller

## 4.2 Verify Helm
helm version

# 4.3 Secure Helm
kubectl --namespace kube-system patch deployment tiller-deploy --type json \
--patch '[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'

# 5. Additional configuration
NAMESPACE=jhub
kubectl config set-context $(kubectl config current-context) --namespace="${NAMESPACE}"

# 6. Install chart
helm upgrade --install jhub consideratio/jupyterhub --version v0.7
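
# 7. Optionally tune the new scheduling behavior instead of relying on the chart
#    defaults. The scheduling.podPriority and scheduling.userPlaceholder keys come
#    from the split-up commit titles; the sub-keys and values below are assumptions
#    to adjust to your needs.
cat > config.yaml <<EOF
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 2
EOF
helm upgrade --install jhub consideratio/jupyterhub --version v0.7 --values config.yaml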

Try it

This PR allows for robust autoscaling: placeholder pods create a headroom so that nodes are added before they are actually required, and the placeholders are evicted whenever a real user needs the room.

By yourself

The demo assumes you have set up the user node pool's machine type and the single-user resource requests so that 4 users fit per user node, for example as sketched below.
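
A sizing sketch, assuming the chart's singleuser.cpu / singleuser.memory keys and the n1-standard-2 user pool created above (2 vCPUs, 7.5 GB); the numbers are illustrative, not values taken from this PR.

# Roughly 4 x (0.25 CPU, 1G) per node, leaving headroom for per-node system pods.
# Append to the config.yaml used when installing the chart and upgrade the release.
cat >> config.yaml <<EOF
singleuser:
  cpu:
    guarantee: 0.25
  memory:
    guarantee: 1G
EOF
helm upgrade --install jhub consideratio/jupyterhub --version v0.7 --values config.yaml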

#### To monitor what's happening
# Terminal window 1 - User nodes
watch -t -n 0.5 'echo "# User nodes"; echo; kubectl get nodes --selector hub.jupyter.org/node-purpose=user | cut -c 1-56'

# Terminal window 2 - Pending pods
watch -t -n 0.5 'echo "# Pending pods"; echo; kubectl get pods --field-selector=status.phase=Pending | cut -c 1-56'

# Terminal window 3 - Scheduled pods
watch -t -n 0.5 'echo "# Scheduled pods"; echo; kubectl describe node --selector=hub.jupyter.org/node-purpose=user | grep -E "user-placeholder|user-dummy|Namespace" | cut -c 1-56'

#### To make things happen
# let two placeholder pods create some headroom, spawning the first user node
kubectl patch statefulset user-placeholder --patch '{"spec": {"replicas": 2}}'

# Let four user dummies fill up a node, evicting the placeholders, which then trigger a new user node
kubectl scale sts/user-dummy --replicas 4

# remove all user dummies...
kubectl scale sts/user-dummy --replicas 0

# and reschedule two, noticing that they pack tightly onto the same node as the placeholders
kubectl scale sts/user-dummy --replicas 2
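
# Terminal window 4 (optional, not part of the original demo) - recent
# preemption and cluster-autoscaler activity from the event stream
watch -t -n 5 'kubectl get events --sort-by=.lastTimestamp | grep -iE "preempt|triggeredscaleup|scale" | tail -n 15'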

Look at a demo

(animated demo GIF of the autoscaling behavior described above)

TODO

Remaining...

  • Update the guide documentation
  • Merge kubespawner#205 and bump it
  • Allow easier review of this PR
  • Add a warning about activating podPriority, as it could cause old user pods to get preempted by new ones (NOTE: patching existing pods to give them the new priority was not allowed)

Future...

When k8s 1.11 is required by the chart...

  • Switch to scheduling.k8s.io/v1beta1 as the API group for the PriorityClass resource when this is testable on GKE without an alpha-enabled cluster.

consideRatio force-pushed the reworked-pack-scheduler-pr branch several times between Jul 11 and Aug 14, 2018, and added commits to consideRatio/zero-to-jupyterhub-k8s that referenced this pull request.
@consideRatio consideRatio added this to the 0.7 milestone Jul 21, 2018
@consideRatio consideRatio changed the title from [WIP] User scheduler, user placeholders, node pool taints to [TO BE SPLIT UP] User scheduler, user placeholders, node pool taints Aug 14, 2018
@minrk minrk removed this from the 0.7 milestone Aug 16, 2018
@consideRatio (Member, Author) commented:

Closed and tracked in #841

schedulerName: user-scheduler
{{- end }}
{{- $_ := merge (dict "podKind" "user") . }}
{{- $dummy := include "jupyterhub.prepareScope" $_ }}
@consideRatio (Member, Author) commented on the lines above:

This line uses a helper to set information in the $_ variable. The relevant information, for example the users' affinity info, can be extracted thanks to it being provided in config.yaml as explicit fields rather than within extraConfig @minrk.

consideRatio and minrk added commits that referenced this pull request (Aug 19 – Sep 20, 2018), including:

  • [Part 3 of #758] Remove remnant pod-culler image
  • [Part 4 of #758] Bump k8s resources to v1.9 available API
  • [Part 5 of #758] Update outdated logic for hub restarts
  • [Part 6 of #758] Storage labels: configurable extras
  • [Part 7 of #758] Added `singleuser.tolerations`, for node taints
  • [Part 8 of #758] Added singleuser affinity configuration
  • [Part 10 of #758] Added `scheduling.podPriority`
  • [Part 11 of #758] Added `scheduling.userPlaceholder`