New Hub: CarbonPlan / Azure #800

jhamman · 2021-10-31T20:33:36Z

Hub Description

CarbonPlan is a non-profit organization that builds open data and tools to accelerate the deployment of quality and transparent climate solutions. We are working on a new project that is utilizing the Microsoft Azure cloud platform. The hub we'd like to see deployed is a "Pangeo-style" hub, including spot instances for Dask workloads and GPU instances for some single-user sessions.

Community Representative(s)

Important dates

No specific dates. We'll be using this hub for exploratory data analysis and prototyping in November, data production in December, and continued research through at least April 2022.

Target start date

11/5/2021

Preferred Cloud Provider

Microsoft Azure

Do you have your own billing account?

Yes, I have my own billing account.

Hub Authentication Type

GitHub Authentication (e.g., @MyGitHubHandle)

Hub logo

No response

Hub logo URL

https://carbonplan.org/

Hub image service

Dockerhub / quay.io

Hub image

tbd, for now: pangeo/pangeo-notebook:latest

Extra features you'd like to enable

Dedicated Scalable Dask Cluster with Dask Gateway

Other relevant information

No response

Hub ID

carbonplan-azure

Hub Cluster

(New!) carbonplan-azure

Hub URL

azure.carbonplan.2i2c.cloud

Hub Template

daskhub

Tasks to deploy the hub

Engineer who will deploy the hub is assigned
Deploy information filled in above
Initial Cluster deployment PR: Create cluster on Azure for CarbonPlan #833
Initial Hub deployment PR: Deploying Hubs to Azure cluster for CarbonPlan #838
Administrators able to log on
Community Representative satisfied with hub environment
Hub now in steady-state

The text was updated successfully, but these errors were encountered:

sgibson91 · 2021-11-11T16:13:30Z

Confirming that @jhamman has given me access to the Azure subscription to be used 🎉

sgibson91 · 2021-11-15T14:10:27Z

I have opened a PR to deploy a new cluster on Azure here: #833

sgibson91 · 2021-11-18T15:18:28Z

PR for new hubs is here: #838

sgibson91 · 2021-11-19T16:27:40Z

I opened #840 to track the object storage connection

sgibson91 · 2021-11-19T16:58:44Z

Hey @jhamman the hubs are up!

Staging: https://staging.azure.carbonplan.2i2c.cloud/
Prod: https://prod.azure.carbonplan.2i2c.cloud/

jhamman · 2021-11-22T18:22:33Z

Woohoo!

The main thing we need to change to get us functional is to swap out the user image to carbonplan/cmip6-downscaling-single-user:latest.

From there, we'll be able to start using the hub in earnest and can give further feedback as it comes up.

sgibson91 · 2021-11-23T11:01:21Z

@jhamman new image was applied in #844

choldgraf · 2021-11-29T23:29:12Z

I can't see any Grafana dashboards for the Azure hub, here's a screenshot of the dashboards page:

I feel like I have run into this before, but can't remember how to make the dashboards pop up. Can somebody point me in the right direction?

sgibson91 · 2021-11-30T01:00:14Z

@choldgraf I haven't installed them yet... Whoops!

sgibson91 · 2021-11-30T10:15:17Z

@choldgraf the grafana charts now exist

sgibson91 · 2021-11-30T10:53:51Z

@jhamman The first step in me tackling #840 and #841, I need to configure terraform to create a service principal. To create that, I believe I need to be promoted from Contributer to Owner on the subscription. Is it ok to do that?

jhamman · 2021-12-01T04:54:52Z

I need to be promoted from Contributer to Owner on the subscription.

Done!

jhamman · 2021-12-02T18:40:05Z

Question for @sgibson91 - we've been mostly running "Small" profile instances but we'd like to scale up to the larger instances configured for our hub. However, none of the larger instances seem to successfully launch. Here's an example log after attempting to launch a "Large" instance:

2021-12-02T18:35:22.576262Z [Warning] 0/5 nodes are available: 1 Insufficient memory, 4 node(s) didn't match Pod's node affinity.

Event log
Server requested
2021-12-02T18:32:59.161361Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:32:59.172820Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:33:09Z [Normal] pod triggered scale-up: [{aks-nblarge-34239724-vmss 0->1 (max: 20)}]
2021-12-02T18:34:50.570896Z [Warning] 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:35:22.576262Z [Warning] 0/5 nodes are available: 1 Insufficient memory, 4 node(s) didn't match Pod's node affinity.
2021-12-02T18:38:11Z [Normal] pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 Insufficient memory
Spawn failed: Timeout

What is curious here is that k8s does add a node to the cluster but afterward, the pod does seem to fit (insufficient memory).

cc @orianac

sgibson91 · 2021-12-07T10:32:38Z

Sorry @jhamman - I have not forgotten about this, I have just been swamped with the incredibly tight deadlines for the AGU meeting next week.

I suspect that we are running into this issue on the other nodepools and I will need to play with the memory settings below until the spawned pods fit on the nodes

infrastructure/config/hubs/carbonplan-azure.cluster.yaml

Lines 59 to 105 in 26752d0

    
           profileList: 
        
             # The mem-guarantees are here so k8s doesn't schedule other pods 
        
             # on these nodes. 
        
             - display_name: "Small: E2s v4" 
        
               description: "~2 CPU, ~15G RAM" 
        
               kubespawner_override: 
        
                 # Explicitly unset mem_limit, so it overrides the default memory limit we set in 
        
                 # basehub/values.yaml 
        
                 mem_limit: null 
        
                 mem_guarantee: 12G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_E2s_v4 
        
             - display_name: "Medium: E4s v4" 
        
               description: "~4 CPU, ~30G RAM" 
        
               kubespawner_override: 
        
                 mem_limit: null 
        
                 mem_guarantee: 29G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_E4s_v4 
        
             - display_name: "Large: E8s v4" 
        
               description: "~8 CPU, ~60G RAM" 
        
               kubespawner_override: 
        
                 mem_limit: null 
        
                 mem_guarantee: 60G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_E8s_v4 
        
             - display_name: "Huge: E32s v4" 
        
               description: "~32 CPU, ~256G RAM" 
        
               kubespawner_override: 
        
                 mem_limit: null 
        
                 mem_guarantee: 240G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_E32s_v4 
        
             - display_name: "Very Huge: M64s v2" 
        
               description: "~64 CPU, ~1024G RAM" 
        
               kubespawner_override: 
        
                 mem_limit: null 
        
                 mem_guarantee: 990G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_M64s_v2 
        
             - display_name: "Very Very Huge: M128s v2" 
        
               description: "~128 CPU, ~2048G RAM" 
        
               kubespawner_override: 
        
                 mem_limit: null 
        
                 mem_guarantee: 2000G 
        
                 node_selector: 
        
                   hub.jupyter.org/node-size: Standard_M182s_v2

sgibson91 · 2021-12-07T11:02:30Z

Also, porting conversation around not being able to create a Service Principal from this PR

Hmmmm, seems like I still don't have enough privileges - @\jhamman?

Error: Could not list existing service principals

with azuread_service_principal.service_principal[0],
on service-principal.tf line 1, in resource "azuread_service_principal" "service_principal":
1: resource "azuread_service_principal" "service_principal" {

ServicePrincipalsClient.BaseClient.Get(): unexpected status 403 with OData error: Authorization_RequestDenied: Insufficient privileges to complete the
operation.

-- @sgibson91

Thanks @sgibson91 for working on this. Happy to adjust permissions as needed on our side. Confirming that right now, you do have the Owner role in the subscription being used for our deployment.

-- @jhamman

I think just giving me "Owner" role on the subscription wasn't enough, and there will also be some permissions in the linked Azure AD tenancy around registering applications that will also need to be enabled.

sgibson91 · 2021-12-07T17:16:31Z

Quick update on the nodepool scalings:

I have the medium, large and huge nodepools working on the staging hub by reducing the memory guarantee. (Draft PR #878) Basically the problem was the combination of the memory required for k8s admin stuff + the memory requested for the user server was greater than the memory of the node.

I'm still having problems with the vhuge and vvhuge nodepools and get a different error for those.

pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

sgibson91 · 2021-12-07T17:35:02Z

Also referring to #871, we think updating the Azure Fileshare to use NFS v4.1 protocol would resolve this issue and this is supported in terraform by the enabled_protocol attribute. My concern is that making this change forces the resource to recreate and therefore delete any work saved there. I may have to temporarily spin up another cluster to test this on. If it works, we may need to plan how to save any work locally from the Fileshare so it is not destroyed during the recreate.

jhamman · 2021-12-07T17:36:51Z

Thanks @sgibson91!

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

I can help check on this. Just having the medium, large, and huge machines working is a great start through.

jhamman · 2021-12-07T17:46:58Z

These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.

Our quota for Standard MSv2 Family vCPUs was 0 in West Europe. I've requested an increase to 400.

sgibson91 · 2021-12-07T18:28:18Z

Our quota for Standard MSv2 Family vCPUs was 0 in West Europe. I've requested an increase to 400.

Great, but we're using the westus2 location, no?

jhamman · 2021-12-07T18:50:17Z

Wow! I should have caught this much sooner 🤦 . We are currently in westus2 but we should be in West Europe. I should have included the region we needed to be in when I opened this issue.

This actually explains some performance/latency issues I've been seeing when using the cluster.

sgibson91 · 2021-12-07T19:05:22Z

Whoops! I'll get on it first thing tomorrow! Hopefully, transferring shouldn't be too disruptive 🤞🏻

sgibson91 · 2021-12-08T10:42:55Z

@jhamman I think moving the location of the resources will be very disruptive as I get this error message when I try to change the location input in terraform

Error: Get "http://localhost/api/v1/namespaces/azure-file": dial tcp [::1]:80: connect: connection refused

with kubernetes_namespace.homes,
on storage.tf line 16, in resource "kubernetes_namespace" "homes":
16: resource "kubernetes_namespace" "homes" {

So I suspect I'll have to destroy and recreate everything. Good news is, I can probably resolve #871 at the same time. Bad news is that home directories will need to be saved locally as they'll be wiped. How do you want to proceed?

jhamman · 2021-12-08T16:23:28Z

@sgibson91 - we can download our own home directories today. That should make starting over much easier. I'll update here when that is done.

sgibson91 · 2021-12-08T16:24:15Z

Thank you @jhamman!

jhamman · 2021-12-08T17:52:50Z

@sgibson91 - we're all buttoned up. Feel free to destroy / recreate whenever you need to.

sgibson91 · 2021-12-08T17:58:55Z

This is great, thanks! Will probably do it tomorrow :)

sgibson91 · 2021-12-09T15:40:41Z

WIP PR trying to bring the hubs back online #887

It implements NFS protocol on the Fileshare which we hope will fix #871 but at the minute, I'm struggling to get it to mount

sgibson91 · 2021-12-09T19:48:21Z

I will get it right this time! 782dc63

sgibson91 · 2021-12-10T15:07:46Z

@jhamman the hubs are up and running again in the correct location and I can confirm that the NFS protocol switch fixed #871!!! 🎉🎉🎉

However I am still seeing this error on the vhuge and vvhuge nodes

I'm still having problems with the vhuge and vvhuge nodepools and get a different error for those.
 pod didn't trigger scale-up: 6 node(s) didn't match Pod's node affinity, 6 node(s) had taint {k8s.dask.org_dedicated: worker}, that the pod didn't tolerate, 1 in backoff after failed scale-up

Rather than a "number of M-series machines in the location" quota, I wonder if there's a "maximum amount of CPU/Memory available in the cluster" quota we're hitting, as this discourse thread suggests https://discourse.jupyter.org/t/backoff-after-failed-scale-up/3331

jhamman · 2021-12-10T16:13:19Z

@sgibson91 - this is fantastic. We'll start testing now.

I'll also look at the quota issue again now that we have a hub in the correct region.

sgibson91 · 2021-12-17T12:16:54Z

Hey @jhamman - if it's ok with you, I will close this issue and we can use support@2i2c.org to surface any new work that needs to happen on this hub?

jhamman · 2021-12-17T17:22:49Z

Sounds good to me @sgibson91! Thanks for all the work so far.

choldgraf · 2021-12-17T20:11:33Z

thanks so much @sgibson91 for all of your work getting this set up, and improving our Azure deployment infrastructure in the process ✨

jhamman added the type: hub label Oct 31, 2021

choldgraf mentioned this issue Nov 5, 2021

Migrate UToronto hub to our pilot-hubs repository #638

Closed

15 tasks

choldgraf added this to Sprint Board Nov 10, 2021

choldgraf added the support label Nov 10, 2021

choldgraf moved this to Todo 👍 in Sprint Board Nov 10, 2021

sgibson91 self-assigned this Nov 10, 2021

sgibson91 moved this from Todo 👍 to In Progress ⚡ in Sprint Board Nov 15, 2021

sgibson91 mentioned this issue Nov 19, 2021

Setting up read/write access to blob storage for Azure #840

Closed

5 tasks

sgibson91 mentioned this issue Nov 22, 2021

Team Sync - Monday, November 22nd 2i2c-org/team-compass#310

Closed

sgibson91 mentioned this issue Nov 23, 2021

Change CarbonPlan's user image on Azure hubs #844

Merged

This was referenced Dec 3, 2021

Add utoronto staging hub #866

Merged

Deploy a new cluster for the U. Toronto community #853

Closed

sgibson91 mentioned this issue Dec 7, 2021

Tweak memory guarantees for CarbonPlan's Azure hubs #878

Merged

choldgraf added this to DEPRECATED Engineering and Product Backlog Dec 7, 2021

sgibson91 mentioned this issue Dec 8, 2021

Update New Hub template: Add a new field asking about resource location #881

Merged

choldgraf moved this to In progress in DEPRECATED Engineering and Product Backlog Dec 8, 2021

jhamman closed this as completed Dec 17, 2021

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Dec 17, 2021

choldgraf moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Jan 7, 2022

sgibson91 mentioned this issue Aug 10, 2022

[Decommission Hub] Azure Hub for Carbon Plan #1618

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New Hub: CarbonPlan / Azure #800

New Hub: CarbonPlan / Azure #800

jhamman commented Oct 31, 2021 •

edited

Loading

sgibson91 commented Nov 11, 2021

sgibson91 commented Nov 15, 2021

sgibson91 commented Nov 18, 2021

sgibson91 commented Nov 19, 2021

sgibson91 commented Nov 19, 2021

jhamman commented Nov 22, 2021

sgibson91 commented Nov 23, 2021

choldgraf commented Nov 29, 2021

sgibson91 commented Nov 30, 2021

sgibson91 commented Nov 30, 2021

sgibson91 commented Nov 30, 2021

jhamman commented Dec 1, 2021

jhamman commented Dec 2, 2021 •

edited

Loading

sgibson91 commented Dec 7, 2021 •

edited

Loading

sgibson91 commented Dec 7, 2021

sgibson91 commented Dec 7, 2021 •

edited

Loading

sgibson91 commented Dec 7, 2021

jhamman commented Dec 7, 2021

jhamman commented Dec 7, 2021

sgibson91 commented Dec 7, 2021

jhamman commented Dec 7, 2021

sgibson91 commented Dec 7, 2021

sgibson91 commented Dec 8, 2021

jhamman commented Dec 8, 2021

sgibson91 commented Dec 8, 2021

jhamman commented Dec 8, 2021

sgibson91 commented Dec 8, 2021

sgibson91 commented Dec 9, 2021

sgibson91 commented Dec 9, 2021

sgibson91 commented Dec 10, 2021 •

edited

Loading

jhamman commented Dec 10, 2021

sgibson91 commented Dec 17, 2021

jhamman commented Dec 17, 2021

choldgraf commented Dec 17, 2021

New Hub: CarbonPlan / Azure #800

New Hub: CarbonPlan / Azure #800

Comments

jhamman commented Oct 31, 2021 • edited Loading

Hub Description

Community Representative(s)

Important dates

Target start date

Preferred Cloud Provider

Do you have your own billing account?

Hub Authentication Type

Hub logo

Hub logo URL

Hub image service

Hub image

Extra features you'd like to enable

Other relevant information

Hub ID

Hub Cluster

Hub URL

Hub Template

Tasks to deploy the hub

sgibson91 commented Nov 11, 2021

sgibson91 commented Nov 15, 2021

sgibson91 commented Nov 18, 2021

sgibson91 commented Nov 19, 2021

sgibson91 commented Nov 19, 2021

jhamman commented Nov 22, 2021

sgibson91 commented Nov 23, 2021

choldgraf commented Nov 29, 2021

sgibson91 commented Nov 30, 2021

sgibson91 commented Nov 30, 2021

sgibson91 commented Nov 30, 2021

jhamman commented Dec 1, 2021

jhamman commented Dec 2, 2021 • edited Loading

sgibson91 commented Dec 7, 2021 • edited Loading

sgibson91 commented Dec 7, 2021

sgibson91 commented Dec 7, 2021 • edited Loading

sgibson91 commented Dec 7, 2021

jhamman commented Dec 7, 2021

jhamman commented Dec 7, 2021

sgibson91 commented Dec 7, 2021

jhamman commented Dec 7, 2021

sgibson91 commented Dec 7, 2021

sgibson91 commented Dec 8, 2021

jhamman commented Dec 8, 2021

sgibson91 commented Dec 8, 2021

jhamman commented Dec 8, 2021

sgibson91 commented Dec 8, 2021

sgibson91 commented Dec 9, 2021

sgibson91 commented Dec 9, 2021

sgibson91 commented Dec 10, 2021 • edited Loading

jhamman commented Dec 10, 2021

sgibson91 commented Dec 17, 2021

jhamman commented Dec 17, 2021

choldgraf commented Dec 17, 2021

jhamman commented Oct 31, 2021 •

edited

Loading

jhamman commented Dec 2, 2021 •

edited

Loading

sgibson91 commented Dec 7, 2021 •

edited

Loading

sgibson91 commented Dec 7, 2021 •

edited

Loading

sgibson91 commented Dec 10, 2021 •

edited

Loading