New Hub: CarbonPlan / Azure #800
Confirming that @jhamman has given me access to the Azure subscription to be used 🎉
I have opened a PR to deploy a new cluster on Azure here: #833
PR for new hubs is here: #838
I opened #840 to track the object storage connection
Hey @jhamman the hubs are up!
Woohoo! The main thing we need to change to get us functional is to swap out the user image. From there, we'll be able to start using the hub in earnest and can give further feedback as it comes up.
@choldgraf I haven't installed them yet... Whoops!
@choldgraf the grafana charts now exist
Done!
Question for @sgibson91 - we've been mostly running "Small" profile instances but we'd like to scale up to the larger instances configured for our hub. However, none of the larger instances seem to successfully launch. Here's an example log after attempting to launch a "Large" instance:
What is curious here is that k8s does add a node to the cluster, but afterward the pod still does not fit (insufficient memory). cc @orianac
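As background on why a freshly added node can still be too small: the scheduler checks the pod's memory request against the node's *allocatable* memory, i.e. the machine's capacity minus kubelet/system reservations and the eviction threshold. A minimal sketch of that arithmetic (the reservation numbers below are illustrative, not AKS's actual defaults):

```python
def pod_fits(mem_request_gib: float, node_capacity_gib: float,
             reserved_gib: float = 2.0, eviction_gib: float = 0.75) -> bool:
    """Return True if a pod's memory request fits in the node's allocatable memory.

    Allocatable = capacity - (system + kube reservations) - eviction threshold.
    The default reservation values here are illustrative only.
    """
    allocatable = node_capacity_gib - reserved_gib - eviction_gib
    return mem_request_gib <= allocatable

# A 16 GiB request does not fit on a nominally 16 GiB node:
print(pod_fits(16.0, 16.0))  # False
# 16 - 2 - 0.75 = 13.25 GiB allocatable, so a 13 GiB request fits:
print(pod_fits(13.0, 16.0))  # True
```

This is why a profile whose `mem_guarantee` equals the VM's nominal RAM will trigger exactly this "node added, pod still pending" pattern.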
Sorry @jhamman - I have not forgotten about this, I have just been swamped with the incredibly tight deadlines for the AGU meeting next week. I suspect that we are running into this issue on the other nodepools, and I will need to play with the memory settings in `infrastructure/config/hubs/carbonplan-azure.cluster.yaml` (lines 59 to 105 at 26752d0) until the spawned pods fit on the nodes.
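For context, the settings being tuned look roughly like the profile list below. This is a hypothetical sketch in the style of a daskhub config, not the actual contents of lines 59 to 105; the machine type and numbers are illustrative:

```yaml
singleuser:
  profileList:
    - display_name: "Large"
      kubespawner_override:
        # The guarantee must fit within the node's *allocatable* memory,
        # which is noticeably less than the machine's nominal RAM.
        mem_guarantee: 24G
        mem_limit: 32G
        node_selector:
          hub.jupyter.org/node-size: Standard_E4s_v3  # illustrative VM size
```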
Also, porting conversation around not being able to create a Service Principal from this PR
I think just giving me "Owner" role on the subscription wasn't enough, and there will also be some permissions in the linked Azure AD tenancy around registering applications that will also need to be enabled.
Quick update on the nodepool scalings: I have the […] nodepool working, but I'm still having problems with the […] one.
These are also different machine types though (M-series vs E-series for the other nodes). So I'm wondering if we're hitting a quota of "cores per cluster" or something there.
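One way to check whether a regional quota is the limiting factor (assuming the Azure CLI is installed and logged in; the region name and grep pattern are illustrative):

```shell
# Compare current usage against limits for compute resources in the region.
# Look for the M-series family row and the "Total Regional vCPUs" row.
az vm list-usage --location westeurope --output table \
  | grep -iE "standard ms|total regional"
```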
Also referring to #871, we think updating the Azure Fileshare to use NFS v4.1 protocol would resolve this issue, and this is supported in terraform by the […] argument.
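In terraform's azurerm provider, that switch is the `enabled_protocol` argument on the storage share; NFS shares also require a premium `FileStorage` account. A sketch with illustrative names, not our actual resource definitions:

```terraform
resource "azurerm_storage_account" "homes" {
  name                     = "examplestorageacct" # illustrative
  resource_group_name      = "example-rg"         # illustrative
  location                 = "westeurope"         # illustrative
  account_tier             = "Premium"            # NFS shares require Premium
  account_kind             = "FileStorage"
  account_replication_type = "LRS"
}

resource "azurerm_storage_share" "homes" {
  name                 = "homes"
  storage_account_name = azurerm_storage_account.homes.name
  quota                = 100
  enabled_protocol     = "NFS" # NFS v4.1 instead of the default SMB
}
```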
Thanks @sgibson91!
I can help check on this. Just having the […]
Our quota for […]
Great, but we're using the […]
Wow! I should have caught this much sooner 🤦. We are currently in […]. This actually explains some performance/latency issues I've been seeing when using the cluster.
Whoops! I'll get on it first thing tomorrow! Hopefully, transferring shouldn't be too disruptive 🤞🏻
@jhamman I think moving the location of the resources will be very disruptive, as I get this error message when I try to change the location input in terraform: […]
So I suspect I'll have to destroy and recreate everything. Good news is, I can probably resolve #871 at the same time. Bad news is that home directories will need to be saved locally, as they'll be wiped. How do you want to proceed?
@sgibson91 - we can download our own home directories today. That should make starting over much easier. I'll update here when that is done.
Thank you @jhamman!
@sgibson91 - we're all buttoned up. Feel free to destroy / recreate whenever you need to.
This is great, thanks! Will probably do it tomorrow :)
I will get it right this time! 782dc63
@jhamman the hubs are up and running again in the correct location and I can confirm that the NFS protocol switch fixed #871!!! 🎉🎉🎉 However, I am still seeing this error on the […]
Rather than a "number of M-series machines in the location" quota, I wonder if there's a "maximum amount of CPU/Memory available in the cluster" quota we're hitting, as this Discourse thread suggests: https://discourse.jupyter.org/t/backoff-after-failed-scale-up/3331
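For pinning down the backoff reason, the scheduler and cluster autoscaler record their complaints as events on the pending pod (assuming `kubectl` access to the cluster; the namespace and pod name below are placeholders):

```shell
# Show the events for the stuck user pod: messages like "Insufficient memory"
# or "max node group size reached" appear at the bottom of the output.
kubectl --namespace prod describe pod jupyter-example-user | tail -n 20

# Or list recent scheduling failures across the whole namespace.
kubectl --namespace prod get events --field-selector reason=FailedScheduling
```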
@sgibson91 - this is fantastic. We'll start testing now. I'll also look at the quota issue again now that we have a hub in the correct region.
Hey @jhamman - if it's ok with you, I will close this issue and we can use […]
Sounds good to me @sgibson91! Thanks for all the work so far.
thanks so much @sgibson91 for all of your work getting this set up, and improving our Azure deployment infrastructure in the process ✨ |
Hub Description
CarbonPlan is a non-profit organization that builds open data and tools to accelerate the deployment of quality and transparent climate solutions. We are working on a new project that is utilizing the Microsoft Azure cloud platform. The hub we'd like to see deployed is a "Pangeo-style" hub, including spot instances for Dask workloads and GPU instances for some single-user sessions.
Community Representative(s)
Important dates
No specific dates. We'll be using this hub for exploratory data analysis and prototyping in November, data production in December, and continued research through at least April 2022.
Target start date
11/5/2021
Preferred Cloud Provider
Microsoft Azure
Do you have your own billing account?
Hub Authentication Type
GitHub Authentication (e.g., @MyGitHubHandle)
Hub logo
No response
Hub logo URL
https://carbonplan.org/
Hub image service
Dockerhub / quay.io
Hub image
tbd, for now: pangeo/pangeo-notebook:latest
Extra features you'd like to enable
Other relevant information
No response
Hub ID
carbonplan-azure
Hub Cluster
(New!)
carbonplan-azure
Hub URL
azure.carbonplan.2i2c.cloud
Hub Template
daskhub
Tasks to deploy the hub