New Hub: CarbonPlan / Azure #800

Closed
9 tasks done
jhamman opened this issue Oct 31, 2021 · 34 comments

jhamman commented Oct 31, 2021

Hub Description

CarbonPlan is a non-profit organization that builds open data and tools to accelerate the deployment of high-quality, transparent climate solutions. We are working on a new project that uses the Microsoft Azure cloud platform. The hub we'd like to see deployed is a "Pangeo-style" hub, including spot instances for Dask workloads and GPU instances for some single-user sessions.

Community Representative(s)

Important dates

No specific dates. We'll be using this hub for exploratory data analysis and prototyping in November, data production in December, and continued research through at least April 2022.

Target start date

11/5/2021

Preferred Cloud Provider

Microsoft Azure

Do you have your own billing account?

  • Yes, I have my own billing account.

Hub Authentication Type

GitHub Authentication (e.g., @MyGitHubHandle)

Hub logo

No response

Hub logo URL

https://carbonplan.org/

Hub image service

Dockerhub / quay.io

Hub image

TBD; for now: pangeo/pangeo-notebook:latest

Extra features you'd like to enable

Other relevant information

No response

Hub ID

carbonplan-azure

Hub Cluster

(New!) carbonplan-azure

Hub URL

azure.carbonplan.2i2c.cloud

Hub Template

daskhub

Tasks to deploy the hub

@sgibson91

Confirming that @jhamman has given me access to the Azure subscription to be used 🎉

sgibson91 moved this from Todo 👍 to In Progress ⚡ in Sprint Board Nov 15, 2021
@sgibson91

I have opened a PR to deploy a new cluster on Azure here: #833

@sgibson91

PR for new hubs is here: #838

@sgibson91

I opened #840 to track the object storage connection

@sgibson91

Hey @jhamman the hubs are up!


jhamman commented Nov 22, 2021

Woohoo!

The main thing we need to change to get us functional is to swap out the user image to carbonplan/cmip6-downscaling-single-user:latest.

From there, we'll be able to start using the hub in earnest and can give further feedback as it comes up.

@sgibson91

@jhamman new image was applied in #844

@choldgraf

I can't see any Grafana dashboards for the Azure hub, here's a screenshot of the dashboards page:

[screenshot: Grafana dashboards page]

I feel like I have run into this before, but can't remember how to make the dashboards pop up. Can somebody point me in the right direction?

@sgibson91

@choldgraf I haven't installed them yet... Whoops!

@sgibson91

@choldgraf the Grafana charts now exist

@sgibson91

@jhamman As a first step in tackling #840 and #841, I need to configure Terraform to create a service principal. To do that, I believe I need to be promoted from Contributor to Owner on the subscription. Is it OK to do that?


jhamman commented Dec 1, 2021

I need to be promoted from Contributor to Owner on the subscription.

Done!


jhamman commented Dec 2, 2021

Question for @sgibson91 - we've been mostly running "Small" profile instances but we'd like to scale up to the larger instances configured for our hub. However, none of the larger instances seem to successfully launch. Here's an example log after attempting to launch a "Large" instance:

2021-12-02T18:35:22.576262Z [Warning] 0/5 nodes are available: 1 Insufficient memory, 4 node(s) didn't match Pod's node affinity.

Event log
Server requested
2021-12-02T18:32:59.161361Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:32:59.172820Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:33:09Z [Normal] pod triggered scale-up: [{aks-nblarge-34239724-vmss 0->1 (max: 20)}]
2021-12-02T18:34:50.570896Z [Warning] 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:35:22.576262Z [Warning] 0/5 nodes are available: 1 Insufficient memory, 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:38:11Z [Normal] pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 Insufficient memory
Spawn failed: Timeout

What is curious here is that k8s does add a node to the cluster, but afterward the pod still doesn't seem to fit (insufficient memory).

cc @orianac


sgibson91 commented Dec 7, 2021

Sorry @jhamman - I have not forgotten about this, I have just been swamped with the incredibly tight deadlines for the AGU meeting next week.

I suspect that we are running into this issue on the other nodepools, and I will need to play with the memory settings below until the spawned pods fit on the nodes:

profileList:
  # The mem-guarantees are here so k8s doesn't schedule other pods
  # on these nodes.
  - display_name: "Small: E2s v4"
    description: "~2 CPU, ~15G RAM"
    kubespawner_override:
      # Explicitly unset mem_limit, so it overrides the default memory limit we set in
      # basehub/values.yaml
      mem_limit: null
      mem_guarantee: 12G
      node_selector:
        hub.jupyter.org/node-size: Standard_E2s_v4
  - display_name: "Medium: E4s v4"
    description: "~4 CPU, ~30G RAM"
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 29G
      node_selector:
        hub.jupyter.org/node-size: Standard_E4s_v4
  - display_name: "Large: E8s v4"
    description: "~8 CPU, ~60G RAM"
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 60G
      node_selector:
        hub.jupyter.org/node-size: Standard_E8s_v4
  - display_name: "Huge: E32s v4"
    description: "~32 CPU, ~256G RAM"
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 240G
      node_selector:
        hub.jupyter.org/node-size: Standard_E32s_v4
  - display_name: "Very Huge: M64s v2"
    description: "~64 CPU, ~1024G RAM"
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 990G
      node_selector:
        hub.jupyter.org/node-size: Standard_M64s_v2
  - display_name: "Very Very Huge: M128s v2"
    description: "~128 CPU, ~2048G RAM"
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 2000G
      node_selector:
        hub.jupyter.org/node-size: Standard_M128s_v2

@sgibson91

Also, porting the conversation around not being able to create a Service Principal from this PR:

Hmmmm, seems like I still don't have enough privileges - @jhamman?

Error: Could not list existing service principals

with azuread_service_principal.service_principal[0],
on service-principal.tf line 1, in resource "azuread_service_principal" "service_principal":
1: resource "azuread_service_principal" "service_principal" {

ServicePrincipalsClient.BaseClient.Get(): unexpected status 403 with OData error: Authorization_RequestDenied: Insufficient privileges to complete the
operation.

-- @sgibson91

Thanks @sgibson91 for working on this. Happy to adjust permissions as needed on our side. Confirming that right now, you do have the Owner role in the subscription being used for our deployment.

-- @jhamman

I think just giving me the "Owner" role on the subscription wasn't enough; there are also some permissions around registering applications in the linked Azure AD tenancy that will need to be enabled.
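
For context, here is a minimal, illustrative Terraform sketch of the kind of resources involved (the names and assigned role are made up, not the actual 2i2c config). Creating the azuread_* resources is what needs application-registration permissions in the Azure AD tenancy, on top of the subscription-level Owner role:

# Hypothetical app registration backing the service principal.
resource "azuread_application" "service_principal" {
  display_name = "hub-deployer"  # illustrative name
}

# The service principal itself, tied to the app registration above.
resource "azuread_service_principal" "service_principal" {
  application_id = azuread_application.service_principal.application_id
}

data "azurerm_subscription" "current" {}

# Grant the service principal access to the subscription so it can manage resources.
resource "azurerm_role_assignment" "service_principal" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"  # illustrative; the real role may differ
  principal_id         = azuread_service_principal.service_principal.object_id
}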


sgibson91 commented Dec 7, 2021

Quick update on the nodepool scaling:

I have the medium, large, and huge nodepools working on the staging hub by reducing the memory guarantees (draft PR #878). Basically, the problem was that the memory required for k8s admin stuff plus the memory requested for the user server was greater than the memory available on the node.

I'm still having problems with the vhuge and vvhuge nodepools and get a different error for those.

pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

@sgibson91

Also, referring to #871, we think updating the Azure Fileshare to use the NFS v4.1 protocol would resolve this issue, and this is supported in Terraform by the enabled_protocol attribute. My concern is that making this change forces the resource to be recreated and therefore deletes any work saved there. I may have to temporarily spin up another cluster to test this on. If it works, we may need to plan how to save any work locally from the Fileshare so it is not destroyed during the recreate.
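
For reference, a hedged sketch of what that change looks like in Terraform (the resource and storage account names here are illustrative, not the actual config). NFS shares need a Premium FileStorage storage account, and enabled_protocol is a create-time setting, so flipping it forces the share to be destroyed and recreated:

resource "azurerm_storage_share" "homes" {
  name                 = "homes"
  storage_account_name = azurerm_storage_account.homes.name  # assumed Premium FileStorage account
  quota                = 100     # size in GB; illustrative
  enabled_protocol     = "NFS"   # default is "SMB"; changing this recreates the share
}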


jhamman commented Dec 7, 2021

Thanks @sgibson91!

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

I can help check on this. Just having the medium, large, and huge machines working is a great start though.


jhamman commented Dec 7, 2021

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

Our quota for Standard MSv2 Family vCPUs was 0 in West Europe. I've requested an increase to 400.

@sgibson91

Our quota for Standard MSv2 Family vCPUs was 0 in West Europe. I've requested an increase to 400.

Great, but we're using the westus2 location, no?


jhamman commented Dec 7, 2021

Wow! I should have caught this much sooner 🤦. We are currently in westus2 but we should be in West Europe. I should have included the region we needed to be in when I opened this issue.

This actually explains some performance/latency issues I've been seeing when using the cluster.

@sgibson91

Whoops! I'll get on it first thing tomorrow! Hopefully, transferring shouldn't be too disruptive 🤞🏻

@sgibson91

@jhamman I think moving the location of the resources will be very disruptive, as I get this error message when I try to change the location input in Terraform:

Error: Get "http://localhost/api/v1/namespaces/azure-file": dial tcp [::1]:80: connect: connection refused

with kubernetes_namespace.homes,
on storage.tf line 16, in resource "kubernetes_namespace" "homes":
16: resource "kubernetes_namespace" "homes" {

So I suspect I'll have to destroy and recreate everything. The good news is that I can probably resolve #871 at the same time. The bad news is that home directories will need to be saved locally, as they'll be wiped. How do you want to proceed?
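
For background on why this is so disruptive, here is a sketch with made-up names (not the real config): location is a create-time property on most azurerm resources, so changing the location variable means Terraform has to replace the resource group, the AKS cluster inside it, and the storage holding home directories, rather than update them in place:

variable "location" {
  type    = string
  default = "westeurope"  # previously "westus2"
}

resource "azurerm_resource_group" "hub" {
  name     = "hub-rg"      # hypothetical
  location = var.location  # changing this forces replacement
}

resource "azurerm_kubernetes_cluster" "hub" {
  name                = "hub-cluster"  # hypothetical
  location            = azurerm_resource_group.hub.location
  resource_group_name = azurerm_resource_group.hub.name
  dns_prefix          = "hub"

  default_node_pool {
    name       = "core"
    vm_size    = "Standard_E2s_v4"
    node_count = 1
  }

  identity {
    type = "SystemAssigned"
  }
}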


jhamman commented Dec 8, 2021

@sgibson91 - we can download our own home directories today. That should make starting over much easier. I'll update here when that is done.

@sgibson91

Thank you @jhamman!


jhamman commented Dec 8, 2021

@sgibson91 - we're all buttoned up. Feel free to destroy / recreate whenever you need to.

@sgibson91

This is great, thanks! Will probably do it tomorrow :)

@sgibson91

WIP PR to bring the hubs back online: #887

It implements the NFS protocol on the Fileshare, which we hope will fix #871, but at the minute I'm struggling to get it to mount.

@sgibson91

I will get it right this time! 782dc63


sgibson91 commented Dec 10, 2021

@jhamman the hubs are up and running again in the correct location and I can confirm that the NFS protocol switch fixed #871!!! 🎉🎉🎉


However, I am still seeing this error on the vhuge and vvhuge nodes:

I'm still having problems with the vhuge and vvhuge nodepools and get a different error for those.

 pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up

Rather than a "number of M-series machines in the location" quota, I wonder if there's a "maximum amount of CPU/memory available in the cluster" quota we're hitting, as this Discourse thread suggests: https://discourse.jupyter.org/t/backoff-after-failed-scale-up/3331


jhamman commented Dec 10, 2021

@sgibson91 - this is fantastic. We'll start testing now.

I'll also look at the quota issue again now that we have a hub in the correct region.

@sgibson91

Hey @jhamman - if it's OK with you, I will close this issue, and we can use support@2i2c.org to surface any new work that needs to happen on this hub.


jhamman commented Dec 17, 2021

Sounds good to me @sgibson91! Thanks for all the work so far.

jhamman closed this as completed Dec 17, 2021
Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Dec 17, 2021
@choldgraf

thanks so much @sgibson91 for all of your work getting this set up, and improving our Azure deployment infrastructure in the process ✨
