ocean.pangeo.io maintenance hack session #622

rabernat · 2020-06-17T19:38:22Z

As discussed in #616 and https://discourse.pangeo.io/t/migration-of-ocean-pangeo-io-user-accounts/644/15, we will be doing maintenance on ocean.pangeo.io and other GCP clusters next week. @jhamman and I have blocked off Monday, June 22, 2-5pm EDT for a sprint on this. I invite everyone, and in particular @TomAugspurger, @scottyhq, @salvis2, @consideRatio, and @yuvipanda to help us out with this.

Some of the things we need to do are:

review the results of the account migration form and decide which user accounts will be migrated
write a script to migrate ORCID to github user IDs
delete non-migrated home directories
switch ocean.pangeo.io authentication to auth0
write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway
set up some sort of logging (see logging for production clusters #72) so we can better track usage statistics. Pinging @yuvipanda to get the latest on what is the best practice here
change image specification, building to use upstream pieces from pangeo-docker-images

What am I missing from this list?

TomAugspurger · 2020-06-22T13:15:50Z

I'll be around for most of the session, but will have to pop out for a couple calls.

rabernat · 2020-06-22T13:16:33Z

I'm looking forward to this hack session today.

jhamman · 2020-06-22T18:00:58Z

Let's jump in https://whereby.com/pangeo to kick things off.

jhamman · 2020-06-22T18:01:46Z

Some working notes here: https://hackmd.io/@U4W-olO3TX-hc-cvbjNe4A/r13p_PRaL/edit

TomAugspurger · 2020-06-22T20:46:35Z

For " write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway" we can pull content from https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105, specifically https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105#af22 for explaining how to transition.

TomAugspurger · 2020-06-22T20:52:33Z

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics, and mybinder has details on capturing & visualizing them: https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html

rabernat · 2020-06-22T20:54:23Z

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics

That's awesome! What we would like most is to be able to run a query to find a how much time an individual user has accumulated over a given period on both jupyter and dask.

jhamman · 2020-06-22T22:24:55Z

My update from day 1:

Tear down of all three JupyterHub's in GCP
Reorg this repository to include a single GCP hub (renamed ocean to gcp-uscentral1b), see Refactor part 1 #625
Develop simple Makefile for standing up new k8s cluster that uses auto nodepool provisioning, see Refactor part 1 #625

Still to do:

sort out some details related to rbac/service accounts in Refactor part 1 #625
test deployment of pangeo helm chart
update hubploy / circleci configs

scottyhq · 2020-06-22T22:48:29Z

update hubploy / circleci configs

This will be useful for just pointing to existing images on DockerHub
berkeley-dsep-infra/hubploy#75

Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to. So could revisit berkeley-dsep-infra/hubploy#24

rabernat · 2020-06-23T11:16:08Z

The account migration is in progress. Those with credentials can see the backed up homedirs here: https://console.cloud.google.com/storage/browser/pangeo-homedir-backup

There is a long tail of very large home directories on ocean that will take a very long time to complete.

rabernat · 2020-06-23T11:17:39Z

For reference, the backup scripts are here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9

TomAugspurger · 2020-06-23T13:04:29Z

Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to.

I think we're hoping to still upload to GCP / AWS to keep the startup times as small as possible when an image does need to be downloaded.

salvis2 · 2020-06-23T17:19:52Z

Today I'll work on standing up a test cluster and testing that Linux hack to enforce user storage limits.

scottyhq · 2020-06-23T17:34:14Z

@salvis2 @rabernat - before you dive into the storage limits, do you have a solution for dealing with the fact that every user has the same uid and gid (1000,1000)? This has come up a few times before #384 (comment) #25

rabernat · 2020-06-23T17:38:07Z

My idea was to try to do the quota-ing from within the user's jupyter pod. Basically, this pod is a unix system with one user--jovyan (1000,1000)--whose home directory is mounted from an nfs server.

Is is possible to make this unix instance enforce a quota on that one user? It doesn't have to know about all the other users or address the challenge of duplicated uid / gid. It just has to prevent jovyan from creating more than 10GB of files in /home/jovyan.

Seems like it should be possible to me, but I have likely overlooked something.

scottyhq · 2020-06-23T18:24:03Z

ok. definitely sounds like something worth exploring!

One more idea/request on the topic of "update hubploy / circleci configs". I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions (for example https://github.com/ICESAT-2HackWeek/jupyterhub-2020). And we could make use of organization level secrets to reduce scattering in various places. https://github.blog/changelog/2020-05-14-organization-secrets/.

rabernat · 2020-06-23T18:26:22Z

I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions

💯 x 👍

rabernat · 2020-06-24T01:16:08Z

Home directory backup is complete. Should I just rm -rf * the NFS volume?

jhamman · 2020-06-24T05:14:06Z

Should I just rm -rf * the NFS volume?

@rabernat - let's leave it for a few days. I actually think we'll want to create a new (smaller) nfs service so we may just remove the existing one all together.

rabernat · 2020-06-24T16:50:45Z

@jhamman -- let me know when you're ready for me to transfer the migrated ocean.pangeo.io users to the new NFS server.

rabernat · 2020-06-25T17:11:00Z

What's the status today? Are we ready to starting bringing up the new cluster?

For DNS, I suggest we go with the region-based names, i.e. us-central-1b.gcp.pangeo.io.

jhamman · 2020-06-25T17:48:58Z

Update...

@TomAugspurger and I have been working on standing up the new hub. This is going well and we should be ready for the user home directories now at the following NFS location:

10.126.142.50:/home/uscentral1b/{GITHUB_USER}

@rabernat - we're also ready to configure Auth0 and the DNS record. I can't do this because my access to the Pangeo Auth0 account is still broken.

The branch to work off right now is: #626

salvis2 · 2020-06-25T17:58:44Z

Do the GCP clusters use NFS Provisioner for making new user home directories? There is a way to run the binary apparently that can enforce user quotas: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/docs/deployment.md#outside-of-kubernetes---binary

This doesn't appear to be an option in NFS-Client Provisioner. I'm a little fuzzy on the distinction between the two, but the first link is the only thing I could find on quotas. Linux hacking has yet to yield anything useful.

rabernat · 2020-06-26T10:59:54Z

Do the GCP clusters use NFS Provisioner for making new user home directories

I'm not sure. All I know is that they use NFS for home directories. The chart is in #262

we should be ready for the user home directories now at the following NFS location:

On it.

we're also ready to configure Auth0 and the DNS record

Do we have an IP address for the DNS record?

rabernat · 2020-06-26T16:27:13Z

I have hit a challenge with the NFS server permissions, described in #627. Any ideas would be appreciated.

rabernat · 2020-06-26T16:46:42Z

Home directories are now (or will soon be) working.

TomAugspurger · 2020-06-26T17:39:50Z

The dask side of things is up now.

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? RIght now the hub's IP is 34.69.173.244.

TomAugspurger · 2020-06-26T17:45:54Z

Telemetry stuff seems to work at a glance. We'll need to talk about what if anything should be public.

If you want to mess with grafana the steps currently are

cd deployments/gcp-uscentral1b
# get the password
# remove the | pbcopy if you aren't on a Mac
make print-grafana-password | pbcopy
# tunnel into the grafana server
make forward-grafana

Then login with username: admin and the password that should be on your clipboard. I think we'll eventually hook grafana up to some auth system like GitHub.

rabernat · 2020-06-26T17:57:15Z

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? RIght now the hub's IP is 34.69.173.244.

There is some way to convert this to a permanent IP address.

rabernat · 2020-06-26T19:19:09Z

Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this?

TomAugspurger · 2020-06-26T19:43:06Z

HTTPs may just be a matter of uncommenting https://github.com/pangeo-data/pangeo-cloud-federation/blob/0c33675fa235fdf4a9c88f8daf6ec00ee01d22ad/deployments/gcp-uscentral1b/config/staging.yaml#L4-L8? Maybe updating the email?

…

On Fri, Jun 26, 2020 at 2:19 PM Ryan Abernathey ***@***.***> wrote: Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#622 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOIVZX3NHRPEGNTQHMN3RYTYDXANCNFSM4OA475YA> .

salvis2 · 2020-06-26T20:55:21Z

I believe you are supposed to first get the hub up-and-running without HTTPS, do some DNS pointing, then enable HTTPS. https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#https

It looks like prod had the HTTPS block always enabled: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/prod.yaml#L5-L8

consideRatio · 2020-06-26T22:42:09Z

If HTTPS doesnt configure itself properly, I know that it could be needed to delete a secret named something like hub-proxy-tls and then delete the autohttps pod.

jhamman · 2020-06-26T23:42:25Z

My update from today:

staging

https://staging.us-central1-b.gcp.pangeo.io/ is now live and is using Pangeo's Auth0 account.

For the staging hub, the main thing to sort out is the dask gateway service. @rabernat and I were getting the following error when we took the hub for a test drive:

ClientResponseError: 503, message='Service Unavailable', url=URL('https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/api/v1/clusters/')

prod

I added the config for https://us-central1-b.gcp.pangeo.io/ to the staging branch but I didn't manage to get it deployed. Currently running into an issue with lingering k8s resources:

Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: us-central1b-prod-grafana, existing_kind: policy/v1beta1, Kind=PodSecurityPolicy, new_kind: policy/v1beta1, Kind=PodSecurityPolicy

@TomAugspurger - this looks familiar to what we saw yesterday, no?

TomAugspurger · 2020-06-26T23:49:47Z

I thought I fixed the 503 error for gateway. Can you make sure you pulled staging before helm deploying?

TomAugspurger · 2020-06-27T00:46:32Z

I redeployed from staging. Things seem to be OK.

Not sure about prod right now.

rabernat · 2020-06-27T18:27:46Z

Is there a public endpoint for the grafana dashboards?

salvis2 · 2020-06-27T20:29:59Z

Is there a public endpoint for the grafana dashboards?

Grafana should have an External-IP / service. I know you can put a DNS address to point to it but I'm still fuzzy on doing HTTPS with it through JupyterHub. @consideRatio could probably speak to that more if you are curious.

You can enable anonymous logins for Grafana and configure what anonymous users are able to see via settings on their organization role.

rabernat · 2020-06-27T22:02:02Z

Ah ok I I just figured out how to see grafana locally (actually read @TomAugspurger's comment in . #622 (comment)).

I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that?

TomAugspurger · 2020-06-27T22:33:13Z

No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public. Right now the dashboards seem to be lost on each helm deploy. Haven’t figured out how to persist them yet.

…

On Jun 27, 2020, at 17:02, Ryan Abernathey ***@***.***> wrote: Ah ok I I just figured out how to see grafana locally (actually read @TomAugspurger's comment in . #622 (comment)). I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

salvis2 · 2020-06-27T23:33:03Z

I think you need to build the dashboards into the Helm release. It's not super clear, but this seems to be somewhere to start: https://github.com/helm/charts/tree/master/stable/grafana#import-dashboards

jhamman · 2020-06-28T05:50:49Z

https://us-central1-b.gcp.pangeo.io is now up

No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.

@consideRatio - do you know if it is possible (or what it would take) to put grafana behind the admin permissions of a jupyterhub service?

rabernat · 2020-06-29T00:11:24Z

Tomorrow morning I plan to send an email to the users of the new cluster to let them know it's on.

TomAugspurger · 2020-06-29T13:42:28Z

@jhamman do you know what's left to do for getting things hooked up to hubploy?

jhamman · 2020-06-29T15:04:30Z

I think we just need to:

uncomment the various pieces of the circleci config, e.g.

pangeo-cloud-federation/.circleci/config.yml

Lines 177 to 182 in 461606e

    
               # Temporarily disabled 
        
               #   - run: 
        
               #       name: Deploy gcp-uscentral1b.pangeo.io 
        
               #       when: always 
        
               #       command: | 
        
               #         hubploy deploy gcp-uscentral1b pangeo-deploy ${CIRCLE_BRANCH} --cleanup-on-fail

repopulate the gcloud service key: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/secrets/gcloud-service-key.json

rabernat · 2020-06-29T15:12:51Z

I'm about to push a big update to pangeo.io with documentation about the new setup.

jhamman · 2020-06-29T15:53:12Z

@TomAugspurger - any idea what is up with these Pending pods:

$ kubectl get pod -n prod | grep Pending
us-central1b-prod-prometheus-node-exporter-2n97g                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-dw689                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-j42ms                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-wnsjv                  0/1     Pending   0          42h

TomAugspurger · 2020-06-29T16:01:00Z

Not sure. Probably safe to just delete?

jhamman · 2020-06-29T16:19:44Z

Not sure. Probably safe to just delete?

tried that. they just come back in the same state.

rabernat · 2020-06-29T16:32:22Z

See pangeo-data/pangeo#780 for documentation update. I'd appreciate a review there.

rabernat · 2020-06-29T17:15:31Z

Another question: the dask widget is still set up to launch kubeclusters. I think we should not allow kubecluster on the new cluster. So what do we do about the widget? Can we make it launch dask_gateway clusters?

TomAugspurger · 2020-06-29T18:08:19Z

I believe that's coming from the dask_config.yaml that's baked into the docker images at https://github.com/pangeo-data/pangeo-docker-images/blob/6ba7997b5246440c0f1b92512cb133b98c6b976d/base-image/dask_config.yml#L58-L63. Just switching that to dask-gateway won't work out of the box since the labextension is only set up to create a cluster like class(*args, **kwargs). But dask-gateway needs to create the intermediate Gateway object.

rabernat · 2020-06-29T18:12:24Z

But dask-gateway needs to create the intermediate Gateway object.

So we need to open an issue in dask-labextension?

TomAugspurger · 2020-06-29T21:49:47Z

Opened dask/dask-labextension#135

…

On Mon, Jun 29, 2020 at 1:12 PM Ryan Abernathey ***@***.***> wrote: But dask-gateway needs to create the intermediate Gateway object. So we need to open an issue in dask-labextension? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#622 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOIRRBLG4DDNOOAWU6FLRZDKRNANCNFSM4OA475YA> .

rabernat · 2020-07-02T13:17:32Z

Thanks for your work everyone! The new cluster is launched.

Whenever you get time @TomAugspurger, I would love if you could explain to me how to use grafana / prometheus to gather the information I need about usage.

alimanfoo · 2021-07-08T15:48:30Z

Hi pangeo folks, apologies for stalking but found this issue while googling for whether there was some way to configure storage quotas when using NFS on GCP. If anyone found a solution to that I'd be very grateful for a pointer.

rabernat mentioned this issue Jun 26, 2020

Auth0 for Authentication on all hubs? #611

Open

rabernat mentioned this issue Jun 29, 2020

Mega update of website content pangeo-data/pangeo#780

Merged

TomAugspurger mentioned this issue Jul 1, 2020

hubploy #635

Merged

rabernat closed this as completed Jul 2, 2020

ocean.pangeo.io maintenance hack session #622

ocean.pangeo.io maintenance hack session #622

Comments

rabernat commented Jun 17, 2020 • edited Loading

TomAugspurger commented Jun 22, 2020

rabernat commented Jun 22, 2020

jhamman commented Jun 22, 2020

jhamman commented Jun 22, 2020

TomAugspurger commented Jun 22, 2020

TomAugspurger commented Jun 22, 2020

rabernat commented Jun 22, 2020

jhamman commented Jun 22, 2020

scottyhq commented Jun 22, 2020 • edited Loading

rabernat commented Jun 23, 2020

rabernat commented Jun 23, 2020

TomAugspurger commented Jun 23, 2020

salvis2 commented Jun 23, 2020

scottyhq commented Jun 23, 2020

rabernat commented Jun 23, 2020

scottyhq commented Jun 23, 2020

rabernat commented Jun 23, 2020

rabernat commented Jun 24, 2020

jhamman commented Jun 24, 2020

rabernat commented Jun 24, 2020

rabernat commented Jun 25, 2020

jhamman commented Jun 25, 2020

salvis2 commented Jun 25, 2020

rabernat commented Jun 26, 2020

rabernat commented Jun 26, 2020

rabernat commented Jun 26, 2020

TomAugspurger commented Jun 26, 2020

TomAugspurger commented Jun 26, 2020 • edited Loading

rabernat commented Jun 26, 2020

rabernat commented Jun 26, 2020

TomAugspurger commented Jun 26, 2020 via email

salvis2 commented Jun 26, 2020 • edited Loading

consideRatio commented Jun 26, 2020

jhamman commented Jun 26, 2020

staging

prod

TomAugspurger commented Jun 26, 2020

TomAugspurger commented Jun 27, 2020

rabernat commented Jun 27, 2020

salvis2 commented Jun 27, 2020

rabernat commented Jun 27, 2020

TomAugspurger commented Jun 27, 2020 via email

salvis2 commented Jun 27, 2020

jhamman commented Jun 28, 2020 • edited Loading

rabernat commented Jun 29, 2020

TomAugspurger commented Jun 29, 2020

jhamman commented Jun 29, 2020

rabernat commented Jun 29, 2020

jhamman commented Jun 29, 2020

TomAugspurger commented Jun 29, 2020

jhamman commented Jun 29, 2020

rabernat commented Jun 29, 2020

rabernat commented Jun 29, 2020

TomAugspurger commented Jun 29, 2020

rabernat commented Jun 29, 2020

TomAugspurger commented Jun 29, 2020 via email

rabernat commented Jul 2, 2020

alimanfoo commented Jul 8, 2021

rabernat commented Jun 17, 2020 •

edited

Loading

scottyhq commented Jun 22, 2020 •

edited

Loading

TomAugspurger commented Jun 26, 2020 •

edited

Loading

salvis2 commented Jun 26, 2020 •

edited

Loading

jhamman commented Jun 28, 2020 •

edited

Loading