New Hub: Carbon Plan #291
Hi all! Here's some info to fill in the missing info above.

What we have right now is something that looks pretty similar to the Pangeo-Cloud-Federation hubs. In fact, https://github.com/carbonplan/hub is a fork of that project with two hubs in it: one on GCP, one on Azure. The Azure hub needs to be moved to a new account, so it will need a rebuild. The Google hub is in more of a maintenance mode and doesn't need too much work beyond updating to the new dask-hub chart.

A lot of our devops falls into two main areas:

1. Environment management -- we are working on multiple projects that require bespoke environments.
2. Custom resources -- sometimes this means GPUs, sometimes it means large VMs, and other times it's dask-related stuff.

We're likely to need more than one image per hub (more like 1+ per project we work on), so that's a consideration worth discussing.

Setup Information
Important Information
@jhamman can you create AWS users with full access for me (yuvipanda@2i2c.org) and @damianavila (damianavila@2i2c.org)?
- staging and prod clusters that are exactly the same, with just domain differences
- Uses traditional autohttps + LoadBalancer to get traffic into the cluster. Could be nginx-ingress later on if necessary.
- Manual DNS entries for staging.carbonplan.2i2c.cloud and carbonplan.2i2c.cloud. Initial manual deploy with `proxy.https.enabled` set to false to complete deployment, fetch externalIP of `proxy-public` service, set up DNS, then re-deploy with `proxy.https.enabled` set to true.

Ref 2i2c-org#291
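For illustration, here is a rough sketch of that two-phase bring-up as a script. This is not taken from this repository; the release name, chart reference, and values file names are assumptions, and the same steps can of course be run by hand with helm and kubectl.

```python
import json
import subprocess

RELEASE = "carbonplan-hub"   # hypothetical release name
CHART = "dask/daskhub"       # assumes the dask Helm repo is added as "dask"

def helm_deploy(values_file: str) -> None:
    """Run a helm upgrade --install with the given values file."""
    subprocess.run(
        ["helm", "upgrade", "--install", RELEASE, CHART, "-f", values_file],
        check=True,
    )

# Step 1: deploy with proxy.https.enabled set to false so the release can
# complete before any DNS records exist.
helm_deploy("values-https-disabled.yaml")

# Step 2: fetch the external IP (or hostname) of the proxy-public service and
# manually create the DNS entries for staging.carbonplan.2i2c.cloud and
# carbonplan.2i2c.cloud pointing at it.
svc = json.loads(
    subprocess.run(
        ["kubectl", "get", "svc", "proxy-public", "-o", "json"],
        check=True,
        capture_output=True,
    ).stdout
)
ingress = svc["status"]["loadBalancer"]["ingress"][0]
print("Point DNS at:", ingress.get("ip") or ingress.get("hostname"))

# Step 3: once DNS resolves, re-deploy with proxy.https.enabled set to true so
# autohttps can provision certificates.
helm_deploy("values-https-enabled.yaml")
```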
I'm making dask worker instances spot instances now. Still need to figure out how users can effectively select instance size for dask workers.
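One possible shape for that, as a minimal sketch: expose a dask-gateway cluster option that maps an instance-size choice onto a nodeSelector for the worker pods. The option name, node label, and size list below are assumptions for illustration, not configuration from this hub; the snippet would sit in the gateway's extraConfig like the examples further down.

```python
from dask_gateway_server.options import Options, Select

def cluster_options(user):
    def options_handler(options):
        return {
            # Pin workers to the node group matching the chosen size
            # (assumes the worker node groups carry a "node-size" label).
            "worker_extra_pod_config": {
                "nodeSelector": {"node-size": options.instance_size},
            },
        }

    return Options(
        Select(
            "instance_size",
            ["xlarge", "2xlarge", "4xlarge"],
            default="xlarge",
            label="Worker node size",
        ),
        handler=options_handler,
    )

c.Backend.cluster_options = cluster_options
```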
Should users really select instance size rather than pod CPU/memory requests, though? I have two suggestions based on experience with learning-2-learn/l2lhub-deployment on AWS; I don't know if they are relevant, but here goes.
Example suggestion 1 - Configuration of AWS instance groups

This is from `infra/ekctl-cluster-config.yaml`.

```yaml
# Important about spot nodes!
#
# "Due to the Cluster Autoscaler’s limitations (more on that in the next
# section) on which Instance type to expand, it’s important to choose
# instances of the same size (vCPU and memory) for each InstanceGroup."
#
# ref: https://medium.com/riskified-technology/run-kubernetes-on-aws-ec2-spot-instances-with-zero-downtime-f7327a95dea
#
- name: worker-xlarge
  availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
  minSize: 0
  maxSize: 20
  desiredCapacity: 0
  volumeSize: 80
  labels:
    worker: "true"
  taints:
    worker: "true:NoSchedule"
  tags:
    k8s.io/cluster-autoscaler/node-template/label/worker: "true"
    k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
  iam:
    withAddonPolicies:
      autoScaler: true
  # Spot instance configuration
  instancesDistribution: # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
    instanceTypes:
      - m5.xlarge # 57 pods, 4 cpu, 16 GB
      - m5a.xlarge # 57 pods, 4 cpu, 16 GB
      - m5n.xlarge # 57 pods, 4 cpu, 16 GB
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0
    spotAllocationStrategy: "capacity-optimized" # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/
- name: worker-2xlarge
  availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
  minSize: 0
  maxSize: 20
  desiredCapacity: 0
  volumeSize: 80
  labels:
    worker: "true"
  taints:
    worker: "true:NoSchedule"
  tags:
    k8s.io/cluster-autoscaler/node-template/label/worker: "true"
    k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
  iam:
    withAddonPolicies:
      autoScaler: true
  # Spot instance configuration
  instancesDistribution: # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
    instanceTypes:
      - m5.2xlarge # 57 pods, 8 cpu, 32 GB
      - m5a.2xlarge # 57 pods, 8 cpu, 32 GB
      - m5n.2xlarge # 57 pods, 8 cpu, 32 GB
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0
    spotAllocationStrategy: "capacity-optimized" # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/
- name: worker-4xlarge
  availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
  minSize: 0
  maxSize: 20
  desiredCapacity: 0
  volumeSize: 80
  labels:
    worker: "true"
  taints:
    worker: "true:NoSchedule"
  tags:
    k8s.io/cluster-autoscaler/node-template/label/worker: "true"
    k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
  iam:
    withAddonPolicies:
      autoScaler: true
  # Spot instance configuration
  instancesDistribution: # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
    instanceTypes:
      - m5.4xlarge # 233 pods, 16 cpu, 64 GB
      - m5a.4xlarge # 233 pods, 16 cpu, 64 GB
      - m5n.4xlarge # 233 pods, 16 cpu, 64 GB
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0
    spotAllocationStrategy: "capacity-optimized" # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/
```

Example suggestion 2 - Configuration of dask worker requests

This is Helm values configuring a daskhub Helm chart deployment.

```yaml
daskhub:
  jupyterhub:
    singleuser:
      extraEnv:
        # The default worker image matches the singleuser image.
        DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
        DASK_DISTRIBUTED__DASHBOARD_LINK: '/user/{JUPYTERHUB_USER}/proxy/{port}/status'
        DASK_LABEXTENSION__FACTORY__MODULE: 'dask_gateway'
        DASK_LABEXTENSION__FACTORY__CLASS: 'GatewayCluster'

  # Reference on the configuration options:
  # https://github.com/dask/dask-gateway/blob/master/resources/helm/dask-gateway/values.yaml
  dask-gateway:
    gateway:
      prefix: "/services/dask-gateway" # Connect to Dask-Gateway through a JupyterHub service.
      auth:
        type: jupyterhub # Use JupyterHub to authenticate with Dask-Gateway
      extraConfig:
        # This configuration represents options that can be presented to users
        # that want to create a Dask cluster using dask-gateway. For more
        # details, see https://gateway.dask.org/cluster-options.html
        #
        # The goal is to provide a simple configuration that allows the user some
        # flexibility while also fitting well on AWS nodes that all have a 1:4
        # ratio between CPU and GB of memory. By providing the username label,
        # we help administrators track user pods.
        option_handler: |
          from dask_gateway_server.options import Options, Select, String, Mapping

          def cluster_options(user):
              def option_handler(options):
                  if ":" not in options.image:
                      raise ValueError("When specifying an image you must also provide a tag")
                  extra_labels = {
                      "hub.jupyter.org/username": user.name,
                  }
                  chosen_worker_cpu = int(options.worker_specification.split("CPU")[0])
                  chosen_worker_memory = 4 * chosen_worker_cpu
                  # We multiply the requests by a fraction to ensure that the
                  # workers fit well within a node that needs some resources
                  # reserved for system pods.
                  return {
                      "image": options.image,
                      "worker_cores": 0.80 * chosen_worker_cpu,
                      "worker_cores_limit": chosen_worker_cpu,
                      "worker_memory": "%fG" % (0.90 * chosen_worker_memory),
                      "worker_memory_limit": "%fG" % chosen_worker_memory,
                      "scheduler_extra_pod_labels": extra_labels,
                      "worker_extra_pod_labels": extra_labels,
                      "environment": options.environment,
                  }

              return Options(
                  Select(
                      "worker_specification",
                      ["1CPU, 4GB", "2CPU, 8GB", "4CPU, 16GB", "8CPU, 32GB", "16CPU, 64GB"],
                      default="1CPU, 4GB",
                      label="Worker specification",
                  ),
                  String("image", default="my-custom-image:latest", label="Image"),
                  Mapping("environment", {}, label="Environment variables"),
                  handler=option_handler,
              )

          c.Backend.cluster_options = cluster_options
        idle: |
          # timeout after 30 minutes of inactivity
          c.KubeClusterConfig.idle_timeout = 1800
```
@consideRatio yeah, agreed. We perhaps need to figure out a way to keep this config in sync with kubespawner's profiles.
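A minimal sketch of one way to do that: keep a single list of worker sizes and derive both the dask-gateway Select choices and the kubespawner profile_list entries from it. All names below are hypothetical and only illustrate the idea of a shared source of truth.

```python
# Shared source of truth for worker/user-server sizes (hypothetical).
WORKER_SIZES = [
    {"cpu": 1, "memory_gb": 4},
    {"cpu": 2, "memory_gb": 8},
    {"cpu": 4, "memory_gb": 16},
    {"cpu": 8, "memory_gb": 32},
    {"cpu": 16, "memory_gb": 64},
]

def dask_worker_choices():
    # Strings in the same "1CPU, 4GB" format the Select option above uses.
    return [f"{s['cpu']}CPU, {s['memory_gb']}GB" for s in WORKER_SIZES]

def kubespawner_profiles():
    # Entries suitable for c.KubeSpawner.profile_list, built from the same sizes.
    return [
        {
            "display_name": f"{s['cpu']} CPU, {s['memory_gb']} GB",
            "kubespawner_override": {
                "cpu_limit": s["cpu"],
                "mem_limit": f"{s['memory_gb']}G",
            },
        }
        for s in WORKER_SIZES
    ]
```

Both config files would then import or template in these helpers, so adding a size in one place updates both the notebook profiles and the dask worker options.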
Can you expand on these thoughts, @yuvipanda? Thx!
Hey all - so the Carbon Plan hub exists now, right? What else do we need to do before resolving this issue?
I think we need to:
Cool - I've updated the top comment with these new steps (feel free to update it yourself in general if you like!)
@jhamman try again? I'll have a PR up shortly, but a deploy seems to help.
They apparently have less than 250G total allocatable space, despite having 256G total RAM.

Ref 2i2c-org#291
(#430)
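For context on why allocatable space matters here: Kubernetes reserves part of each node's RAM for system daemons, so memory requests have to target the allocatable figure rather than the instance's nominal memory. A hypothetical kubespawner profile entry illustrating the idea (the numbers and display name are placeholders, not this hub's actual values):

```python
c.KubeSpawner.profile_list = [
    {
        "display_name": "Large memory node (placeholder)",
        "kubespawner_override": {
            # Request less than the node's *allocatable* memory (which sits
            # below its 256G of physical RAM), leaving headroom for system
            # pods so the user pod can actually be scheduled.
            "mem_guarantee": "230G",
            "mem_limit": "250G",
        },
    },
]
```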
Seems to be working now. Thanks @yuvipanda!
Hey folks! A few updates / requests / questions after using the hub for a few weeks:

Things are generally working great

@orianac, @tcchiao, and I have been using the hubs daily. The hub has been quite stable and feature complete, so nice work on the initial rollout!

Config update requests

A few things we'd like to change in the configuration:
Questions
Currently, users have to set these manually, so limiting it to 32 prevents provisioning instances larger than that. Ref 2i2c-org#291 (comment)
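For context, a minimal sketch of what setting these manually can look like from the user side, assuming the gateway exposes worker_cores and worker_memory options (the option names and units are assumptions, not confirmed in this thread):

```python
from dask_gateway import Gateway

gateway = Gateway()           # address/auth picked up from the hub environment
options = gateway.cluster_options()
options.worker_cores = 4      # rejected by the gateway if above any configured cap
options.worker_memory = 16    # e.g. GB; exact units depend on the gateway config
cluster = gateway.new_cluster(options)
cluster.scale(4)
```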
- Requested by Joe in 2i2c-org#291 (comment)
- Refresh auth credentials, they had expired. Fixed in 2i2c-org#381
Done
Done, I removed the upper limit
Same set of instances you see when you create a new user server. Dask and notebook nodes mirror config.
Good chunk of fiddling happening on the GCP setup, not so much on AWS yet.
@jhamman in a recent meeting we decided to close out this issue and consider it "finished" since it is just the "first deployment" issue, and the CP hub has been running for a couple of months now. We've got the two extra issues about spot instances and dask workers on our deliverables board and can track those improvements there. I'll close this, but if you've got a strong objection to that feel free to speak up and we can discuss!
Background
CarbonPlan is a non-profit that does work at the intersection of data science / climate modeling / advocacy. @jhamman has been running a few JupyterHubs for CP for a while now, and he'd like to transfer operational duties to 2i2c.
This is likely a bit more complex than the "standard Pangeo hubs" we have set up. I believe that @jhamman has a couple of hubs that they run (perhaps he can provide context below).
Setup Information
Important Information
Deploy To Do
Follow up issues