ocean.pangeo.io maintenance hack session #622
Comments
I'll be around for most of the session, but will have to pop out for a couple calls. |
I'm looking forward to this hack session today. |
Let's jump in https://whereby.com/pangeo to kick things off. |
Some working notes here: https://hackmd.io/@U4W-olO3TX-hc-cvbjNe4A/r13p_PRaL/edit |
For "write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway" we can pull content from https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105, specifically https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105#af22, which explains how to transition. |
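For readers following along, here is a minimal sketch of what the transition might look like in a user's notebook. It assumes the hub pre-configures the gateway address and authentication so that `Gateway()` can be called with no arguments; the helper name `start_gateway_cluster` and the default worker count are illustrative only and not taken from the thread.

```python
def start_gateway_cluster(n_workers=4):
    """Start a Dask cluster through Dask Gateway instead of dask-kubernetes.

    Assumption: the hub supplies the gateway address and auth settings,
    so Gateway() needs no arguments.
    """
    # Imported lazily so defining this helper does not require dask-gateway.
    from dask_gateway import Gateway

    gateway = Gateway()              # connect using the hub-provided defaults
    cluster = gateway.new_cluster()  # takes the place of creating a KubeCluster
    cluster.scale(n_workers)         # request a fixed number of workers
    client = cluster.get_client()    # dask.distributed Client bound to the cluster
    return cluster, client
```

The main difference users would notice is the intermediate `Gateway` object: clusters are requested from it rather than created directly.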
I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics, and mybinder has details on capturing & visualizing them: https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html |
That's awesome! What we would like most is to be able to run a query to find how much time an individual user has accumulated over a given period on both jupyter and dask. |
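A rough sketch of the accounting being asked for, assuming the metrics have already been exported from Prometheus as `(timestamp, username)` samples taken at a fixed scrape interval. The sample format, the function name `hours_per_user`, and the 60-second interval are all hypothetical; this is not a Prometheus API.

```python
from collections import defaultdict

def hours_per_user(samples, start, end, scrape_interval_s=60):
    """Estimate the hours each user was active in the window [start, end).

    samples: iterable of (unix_timestamp, username) pairs, one per scrape
             in which that user's pod was seen running (hypothetical format).
    Each sample is counted as one full scrape interval of usage.
    """
    seconds = defaultdict(int)
    for ts, user in samples:
        if start <= ts < end:
            seconds[user] += scrape_interval_s
    # Convert accumulated seconds to hours.
    return {user: total / 3600 for user, total in seconds.items()}
```

The same function could be run once over Jupyter pod samples and once over Dask worker samples to get the two totals separately.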
My update from day 1:
Still to do:
|
This will be useful for just pointing to existing images on DockerHub. Or, if images continue to be built in this repo, it makes sense to put them on DockerHub rather than AWS or GCP registries, which are harder for people to get to. So we could revisit berkeley-dsep-infra/hubploy#24 |
The account migration is in progress. Those with credentials can see the backed up homedirs here: https://console.cloud.google.com/storage/browser/pangeo-homedir-backup There is a long tail of very large home directories on ocean that will take a very long time to complete. |
For reference, the backup scripts are here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9 |
I think we're hoping to still upload to GCP / AWS to keep the startup times as small as possible when an image does need to be downloaded. |
Today I'll work on standing up a test cluster and testing that Linux hack to enforce user storage limits. |
@salvis2 @rabernat - before you dive into the storage limits, do you have a solution for dealing with the fact that every user has the same uid and gid (1000,1000)? This has come up a few times before #384 (comment) #25 |
My idea was to try to do the quota-ing from within the user's jupyter pod. Basically, this pod is a unix system with one user--jovyan (1000,1000)--whose home directory is mounted from an nfs server. Is it possible to make this unix instance enforce a quota on that one user? It doesn't have to know about all the other users or address the challenge of duplicated uid / gid. It just has to prevent jovyan from creating more than 10GB of files in /home/jovyan. Seems like it should be possible to me, but I have likely overlooked something. |
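One low-tech variant of this idea is to measure usage from inside the pod. The sketch below only detects an overrun; it does not enforce a quota the way a filesystem-level limit would. The function names are hypothetical, and the limit approximates the 10GB figure mentioned above.

```python
import os

QUOTA_BYTES = 10 * 1024**3  # 10 GiB, approximating the 10GB limit mentioned above

def directory_usage(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so files outside the tree are not counted.
            if not os.path.islink(full):
                try:
                    total += os.path.getsize(full)
                except OSError:
                    pass  # file vanished or is unreadable
    return total

def over_quota(path, quota_bytes=QUOTA_BYTES):
    """Return True if the files under path exceed quota_bytes."""
    return directory_usage(path) > quota_bytes
```

For example, `over_quota("/home/jovyan")` could be checked at pod start-up to warn the user, though walking a large NFS-mounted home directory may be slow.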
ok. definitely sounds like something worth exploring! One more idea/request on the topic of "update hubploy / circleci configs". I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions (for example https://github.com/ICESAT-2HackWeek/jupyterhub-2020). And we could make use of organization level secrets to reduce scattering in various places. https://github.blog/changelog/2020-05-14-organization-secrets/. |
💯 x 👍 |
Home directory backup is complete. Should I just |
@rabernat - let's leave it for a few days. I actually think we'll want to create a new (smaller) nfs service so we may just remove the existing one altogether. |
@jhamman -- let me know when you're ready for me to transfer the migrated ocean.pangeo.io users to the new NFS server. |
What's the status today? Are we ready to start bringing up the new cluster? For DNS, I suggest we go with the region-based names, i.e. |
Update... @TomAugspurger and I have been working on standing up the new hub. This is going well and we should be ready for the user home directories now at the following NFS location:
@rabernat - we're also ready to configure Auth0 and the DNS record. I can't do this because my access to the Pangeo Auth0 account is still broken. The branch to work off right now is: #626 |
Do the GCP clusters use NFS Provisioner for making new user home directories? Apparently there is a way to run the binary that can enforce user quotas: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/docs/deployment.md#outside-of-kubernetes---binary This doesn't appear to be an option in NFS-Client Provisioner. I'm a little fuzzy on the distinction between the two, but the first link is the only thing I could find on quotas. Linux hacking has yet to yield anything useful. |
I'm not sure. All I know is that they use NFS for home directories. The chart is in #262
On it.
Do we have an IP address for the DNS record? |
I have hit a challenge with the NFS server permissions, described in #627. Any ideas would be appreciated. |
Home directories are now (or will soon be) working. |
The dask side of things is up now. I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244. |
Telemetry stuff seems to work at a glance. We'll need to talk about what if anything should be public. If you want to mess with grafana the steps currently are
Then login with username: |
There is some way to convert this to a permanent IP address. |
Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this? |
HTTPS may just be a matter of uncommenting
https://github.com/pangeo-data/pangeo-cloud-federation/blob/0c33675fa235fdf4a9c88f8daf6ec00ee01d22ad/deployments/gcp-uscentral1b/config/staging.yaml#L4-L8?
Maybe updating the email?
|
I believe you are supposed to first get the hub up-and-running without HTTPS, do some DNS pointing, then enable HTTPS. https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#https It looks like prod had the HTTPS block always enabled: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/prod.yaml#L5-L8 |
If HTTPS doesn't configure itself properly, you may need to delete a secret named something like hub-proxy-tls and then delete the autohttps pod. |
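Since the suggested order is "DNS first, then HTTPS", a small check that the hostname already resolves to the hub's address can be useful before flipping the HTTPS block on. This is only a sketch: the function name `dns_points_to` is hypothetical, and the hostname and IP in the usage note are the ones mentioned in this thread, used purely as an illustration.

```python
import socket

def dns_points_to(hostname, expected_ip, resolve=socket.gethostbyname):
    """Return True if hostname currently resolves to expected_ip.

    resolve is injectable so the check can be exercised without network access.
    """
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # socket.gaierror is a subclass of OSError: the name does not resolve yet.
        return False
```

For instance, `dns_points_to("staging.us-central1-b.gcp.pangeo.io", "34.69.173.244")` would be expected to return True once the DNS record has propagated, assuming that IP is still the one in use.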
My update from today: staging (https://staging.us-central1-b.gcp.pangeo.io/) is now live and is using Pangeo's Auth0 account. For the staging hub, the main thing to sort out is the dask gateway service. @rabernat and I were getting the following error when we took the hub for a test drive:
prod: I added the config for https://us-central1-b.gcp.pangeo.io/ to the
@TomAugspurger - this looks similar to what we saw yesterday, no? |
I thought I fixed the 503 error for gateway. Can you make sure you pulled staging before helm deploying? |
I redeployed from staging. Things seem to be OK. Not sure about prod right now. |
Is there a public endpoint for the grafana dashboards? |
Grafana should have an External-IP / service. I know you can put a DNS address to point to it but I'm still fuzzy on doing HTTPS with it through JupyterHub. @consideRatio could probably speak to that more if you are curious. You can enable anonymous logins for Grafana and configure what anonymous users are able to see via settings on their organization role. |
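As an illustration of the anonymous-login settings, the sketch below renders the relevant `grafana.ini` section. `[auth.anonymous]` with `enabled` and `org_role` are Grafana configuration options; the function name `anonymous_viewer_ini` is hypothetical, and how this text reaches the deployment (for example through the Helm chart's values) is not shown and depends on the chart in use.

```python
import configparser
import io

def anonymous_viewer_ini(org_role="Viewer"):
    """Render a grafana.ini fragment that enables anonymous access.

    org_role controls what anonymous users can do; "Viewer" is read-only.
    """
    config = configparser.ConfigParser()
    config["auth.anonymous"] = {"enabled": "true", "org_role": org_role}
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()

print(anonymous_viewer_ini())
```

Keeping the role at `Viewer` means anonymous visitors could look at dashboards without being able to edit them, which still leaves open the question of which dashboards should be public at all.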
Ah ok, I just figured out how to see grafana locally (actually read @TomAugspurger's comment in #622 (comment)). I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that? |
No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.
Right now the dashboards seem to be lost on each helm deploy. Haven’t figured out how to persist them yet.
|
I think you need to build the dashboards into the Helm release. It's not super clear, but this seems to be somewhere to start: https://github.com/helm/charts/tree/master/stable/grafana#import-dashboards |
https://us-central1-b.gcp.pangeo.io is now up
@consideRatio - do you know if it is possible (or what it would take) to put grafana behind the admin permissions of a jupyterhub service? |
Tomorrow morning I plan to send an email to the users of the new cluster to let them know it's on. |
@jhamman do you know what's left to do for getting things hooked up to hubploy? |
I think we just need to:
|
I'm about to push a big update to pangeo.io with documentation about the new setup. |
@TomAugspurger - any idea what is up with these Pending pods:
|
Not sure. Probably safe to just delete? |
tried that. they just come back in the same state. |
See pangeo-data/pangeo#780 for documentation update. I'd appreciate a review there. |
Another question: the dask widget is still set up to launch kubeclusters. I think we should not allow kubecluster on the new cluster. So what do we do about the widget? Can we make it launch dask_gateway clusters? |
I believe that's coming from the |
So we need to open an issue in dask-labextension? |
Opened dask/dask-labextension#135
…On Mon, Jun 29, 2020 at 1:12 PM Ryan Abernathey ***@***.***> wrote:
But dask-gateway needs to create the intermediate Gateway object.
So we need to open an issue in dask-labextension?
|
Thanks for your work, everyone! The new cluster is launched. Whenever you get time @TomAugspurger, I would love it if you could explain to me how to use grafana / prometheus to gather the information I need about usage. |
Hi pangeo folks, apologies for stalking but found this issue while googling for whether there was some way to configure storage quotas when using NFS on GCP. If anyone found a solution to that I'd be very grateful for a pointer. |
As discussed in #616 and https://discourse.pangeo.io/t/migration-of-ocean-pangeo-io-user-accounts/644/15, we will be doing maintenance on ocean.pangeo.io and other GCP clusters next week. @jhamman and I have blocked off Monday, June 22, 2-5pm EDT for a sprint on this. I invite everyone, and in particular @TomAugspurger, @scottyhq, @salvis2, @consideRatio, and @yuvipanda to help us out with this.
Some of the things we need to do are:
What am I missing from this list?