
ocean.pangeo.io maintenance hack session #622

Closed
7 tasks done
rabernat opened this issue Jun 17, 2020 · 64 comments

@rabernat
Copy link
Member

rabernat commented Jun 17, 2020

As discussed in #616 and https://discourse.pangeo.io/t/migration-of-ocean-pangeo-io-user-accounts/644/15, we will be doing maintenance on ocean.pangeo.io and other GCP clusters next week. @jhamman and I have blocked off Monday, June 22, 2-5pm EDT for a sprint on this. I invite everyone, and in particular @TomAugspurger, @scottyhq, @salvis2, @consideRatio, and @yuvipanda to help us out with this.

Some of the things we need to do are:

  • review the results of the account migration form and decide which user accounts will be migrated
  • write a script to migrate ORCID IDs to GitHub user IDs (rough sketch below)
  • delete non-migrated home directories
  • switch ocean.pangeo.io authentication to auth0
  • write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway
  • set up some sort of logging (see logging for production clusters #72) so we can better track usage statistics. Pinging @yuvipanda to get the latest on what is the best practice here
  • change image specification, building to use upstream pieces from pangeo-docker-images

What am I missing from this list?
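
For the ORCID-to-GitHub migration item, this is roughly the kind of script I have in mind. It's a minimal sketch, not the real thing: the orcid_to_github.csv export, its column names, and the /export/home layout are all assumptions.

# migrate_homedirs.py -- hypothetical sketch, not the actual migration script.
# Assumes a CSV of form responses with columns "orcid" and "github_username",
# and home directories currently named by ORCID iD under /export/home.
import csv
import pathlib
import shutil

HOME_ROOT = pathlib.Path("/export/home")   # assumed NFS mount point
MAPPING_FILE = "orcid_to_github.csv"        # assumed export of the form results

with open(MAPPING_FILE, newline="") as f:
    for row in csv.DictReader(f):
        src = HOME_ROOT / row["orcid"]                    # e.g. /export/home/0000-0001-2345-6789
        dst = HOME_ROOT / row["github_username"].lower()
        if not src.exists():
            print(f"skipping {row['orcid']}: no home directory found")
            continue
        if dst.exists():
            print(f"skipping {row['orcid']}: {dst} already exists")
            continue
        print(f"moving {src} -> {dst}")
        shutil.move(str(src), str(dst))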

@TomAugspurger
Copy link
Member

I'll be around for most of the session, but will have to pop out for a couple calls.

@rabernat
Copy link
Member Author

I'm looking forward to this hack session today.

@jhamman
Copy link
Member

jhamman commented Jun 22, 2020

Let's jump in https://whereby.com/pangeo to kick things off.

@jhamman
Copy link
Member

jhamman commented Jun 22, 2020

Some working notes here: https://hackmd.io/@U4W-olO3TX-hc-cvbjNe4A/r13p_PRaL/edit

@TomAugspurger
Copy link
Member

For " write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway" we can pull content from https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105, specifically https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105#af22 for explaining how to transition.

@TomAugspurger
Copy link
Member

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics, and mybinder has details on capturing & visualizing them: https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html

@rabernat
Copy link
Member Author

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics

That's awesome! What we would like most is to be able to run a query to find out how much time an individual user has accumulated over a given period, on both jupyter and dask.
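
To make that concrete, here is the kind of query I mean, as a rough sketch. It assumes kube-state-metrics is being scraped, user pods are named jupyter-<username>, Prometheus is reachable at the URL below, and the scrape interval is about 60s; none of that is confirmed for this deployment, and a similar query over the dask scheduler/worker pods would cover the dask side.

# usage_report.py -- hypothetical sketch of per-user accumulated pod time.
import requests

PROMETHEUS = "http://localhost:9090"   # assumed port-forwarded Prometheus service

# Each scraped sample where the pod is Running counts ~60 seconds of runtime.
query = (
    'sum by (pod) ('
    'sum_over_time(kube_pod_status_phase{phase="Running", pod=~"jupyter-.*"}[30d])'
    ') * 60'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    pod = series["metric"]["pod"]
    seconds = float(series["value"][1])
    print(f"{pod}: {seconds / 3600:.1f} hours over the last 30 days")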

@jhamman
Copy link
Member

jhamman commented Jun 22, 2020

My update from day 1:

  • Tear down of all three JupyterHubs in GCP
  • Reorg of this repository to include a single GCP hub (renamed ocean to gcp-uscentral1b), see Refactor part 1 #625
  • Develop a simple Makefile for standing up a new k8s cluster that uses auto nodepool provisioning, see Refactor part 1 #625

Still to do:

  • sort out some details related to rbac/service accounts in Refactor part 1 #625
  • test deployment of pangeo helm chart
  • update hubploy / circleci configs

@scottyhq
Copy link
Member

scottyhq commented Jun 22, 2020

update hubploy / circleci configs

This will be useful for just pointing to existing images on DockerHub
berkeley-dsep-infra/hubploy#75

Or, if images continue to be built in this repo, it makes sense to put them on DockerHub rather than AWS or GCP registries, which are harder for people to access. So we could revisit berkeley-dsep-infra/hubploy#24

@rabernat
Copy link
Member Author

The account migration is in progress. Those with credentials can see the backed up homedirs here: https://console.cloud.google.com/storage/browser/pangeo-homedir-backup

There is a long tail of very large home directories on ocean that will take a very long time to complete.

@rabernat
Copy link
Member Author

For reference, the backup scripts are here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9

@TomAugspurger
Copy link
Member

Or, if images continue to be built in this repo, it makes sense to put them on DockerHub rather than AWS or GCP registries, which are harder for people to access.

I think we're hoping to still upload to GCP / AWS to keep the startup times as small as possible when an image does need to be downloaded.

@salvis2
Copy link
Member

salvis2 commented Jun 23, 2020

Today I'll work on standing up a test cluster and testing that Linux hack to enforce user storage limits.

@scottyhq
Copy link
Member

@salvis2 @rabernat - before you dive into the storage limits, do you have a solution for dealing with the fact that every user has the same uid and gid (1000,1000)? This has come up a few times before #384 (comment) #25

@rabernat
Copy link
Member Author

My idea was to try to do the quota-ing from within the user's jupyter pod. Basically, this pod is a unix system with one user--jovyan (1000,1000)--whose home directory is mounted from an nfs server.

Is it possible to make this unix instance enforce a quota on that one user? It doesn't have to know about all the other users or address the challenge of duplicated uid / gid. It just has to prevent jovyan from creating more than 10GB of files in /home/jovyan.

Seems like it should be possible to me, but I have likely overlooked something.
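
As a strawman, the softest version of this would be a periodic usage check inside the pod rather than a real quota (hard enforcement would probably have to happen on the NFS server itself, e.g. via filesystem project quotas). A hypothetical sketch; the 10GB limit and the idea of running it from a postStart hook or a cron-style loop are assumptions:

# check_home_usage.py -- hypothetical soft-quota check, not real enforcement.
import os

LIMIT_BYTES = 10 * 1024**3   # assumed 10 GB soft limit
HOME = "/home/jovyan"

total = 0
for dirpath, dirnames, filenames in os.walk(HOME):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            total += os.lstat(path).st_size
        except OSError:
            pass   # file vanished or is unreadable; skip it

print(f"{HOME}: {total / 1024**3:.2f} GB used")
if total > LIMIT_BYTES:
    print("Over the 10 GB soft limit -- warn the user or block new work here.")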

@scottyhq
Copy link
Member

ok. definitely sounds like something worth exploring!

One more idea/request on the topic of "update hubploy / circleci configs": I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions (for example https://github.com/ICESAT-2HackWeek/jupyterhub-2020). And we could make use of organization-level secrets to keep secrets from being scattered in various places: https://github.blog/changelog/2020-05-14-organization-secrets/.

@rabernat
Copy link
Member Author

I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions

💯 x 👍

@rabernat
Copy link
Member Author

Home directory backup is complete. Should I just rm -rf * the NFS volume?

@jhamman
Copy link
Member

jhamman commented Jun 24, 2020

Should I just rm -rf * the NFS volume?

@rabernat - let's leave it for a few days. I actually think we'll want to create a new (smaller) NFS service, so we may just remove the existing one altogether.

@rabernat
Copy link
Member Author

@jhamman -- let me know when you're ready for me to transfer the migrated ocean.pangeo.io users to the new NFS server.

@rabernat
Copy link
Member Author

What's the status today? Are we ready to start bringing up the new cluster?

For DNS, I suggest we go with the region-based names, i.e. us-central-1b.gcp.pangeo.io.

@jhamman
Copy link
Member

jhamman commented Jun 25, 2020

Update...

@TomAugspurger and I have been working on standing up the new hub. This is going well and we should be ready for the user home directories now at the following NFS location:

10.126.142.50:/home/uscentral1b/{GITHUB_USER}

@rabernat - we're also ready to configure Auth0 and the DNS record. I can't do this because my access to the Pangeo Auth0 account is still broken.

The branch to work off right now is: #626

@salvis2
Copy link
Member

salvis2 commented Jun 25, 2020

Do the GCP clusters use NFS Provisioner for making new user home directories? Apparently there is a way to run the binary that can enforce user quotas: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/docs/deployment.md#outside-of-kubernetes---binary

This doesn't appear to be an option in NFS-Client Provisioner. I'm a little fuzzy on the distinction between the two, but the first link is the only thing I could find on quotas. Linux hacking has yet to yield anything useful.

@rabernat
Copy link
Member Author

Do the GCP clusters use NFS Provisioner for making new user home directories

I'm not sure. All I know is that they use NFS for home directories. The chart is in #262

we should be ready for the user home directories now at the following NFS location:

On it.

we're also ready to configure Auth0 and the DNS record

Do we have an IP address for the DNS record?

@rabernat
Copy link
Member Author

I have hit a challenge with the NFS server permissions, described in #627. Any ideas would be appreciated.

@rabernat
Copy link
Member Author

Home directories are now (or will soon be) working.

@TomAugspurger
Copy link
Member

The dask side of things is up now.

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 26, 2020

Telemetry stuff seems to work at a glance. We'll need to talk about what, if anything, should be public.

If you want to mess with grafana, the steps currently are:

cd deployments/gcp-uscentral1b
# get the password
# remove the | pbcopy if you aren't on a Mac
make print-grafana-password | pbcopy
# tunnel into the grafana server
make forward-grafana

Then login with username: admin and the password that should be on your clipboard. I think we'll eventually hook grafana up to some auth system like GitHub.

@rabernat
Copy link
Member Author

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.

There is a way to convert this to a permanent IP address: GCP lets you promote an in-use ephemeral IP to a reserved static address (something like gcloud compute addresses create staging-hub-ip --addresses 34.69.173.244 --region us-central1), so the hub keeps the same IP across redeploys.

@rabernat
Copy link
Member Author

Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this?

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 26, 2020 via email

@salvis2
Copy link
Member

salvis2 commented Jun 26, 2020

I believe you are supposed to first get the hub up-and-running without HTTPS, do some DNS pointing, then enable HTTPS. https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#https

It looks like prod had the HTTPS block always enabled: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/prod.yaml#L5-L8

@consideRatio
Copy link
Member

If HTTPS doesn't configure itself properly, you may need to delete a secret named something like hub-proxy-tls and then delete the autohttps pod.

@jhamman
Copy link
Member

jhamman commented Jun 26, 2020

My update from today:

staging

https://staging.us-central1-b.gcp.pangeo.io/ is now live and is using Pangeo's Auth0 account.

For the staging hub, the main thing to sort out is the dask gateway service. @rabernat and I were getting the following error when we took the hub for a test drive:

ClientResponseError: 503, message='Service Unavailable', url=URL('https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/api/v1/clusters/')

prod

I added the config for https://us-central1-b.gcp.pangeo.io/ to the staging branch but I didn't manage to get it deployed. Currently running into an issue with lingering k8s resources:

Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: us-central1b-prod-grafana, existing_kind: policy/v1beta1, Kind=PodSecurityPolicy, new_kind: policy/v1beta1, Kind=PodSecurityPolicy

@TomAugspurger - this looks similar to what we saw yesterday, no?

@TomAugspurger
Copy link
Member

I thought I fixed the 503 error for gateway. Can you make sure you pulled staging before helm deploying?

@TomAugspurger
Copy link
Member

I redeployed from staging. Things seem to be OK.

Not sure about prod right now.

@rabernat
Copy link
Member Author

Is there a public endpoint for the grafana dashboards?

@salvis2
Copy link
Member

salvis2 commented Jun 27, 2020

Is there a public endpoint for the grafana dashboards?

Grafana should have a service with an external IP. I know you can point a DNS record at it, but I'm still fuzzy on setting up HTTPS for it through JupyterHub. @consideRatio could probably speak to that more if you are curious.

You can enable anonymous logins for Grafana and configure what anonymous users are able to see via settings on their organization role.

@rabernat
Copy link
Member Author

Ah ok, I just figured out how to see grafana locally (I actually read @TomAugspurger's comment in #622 (comment)).

I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that?

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 27, 2020 via email

@salvis2
Copy link
Member

salvis2 commented Jun 27, 2020

I think you need to build the dashboards into the Helm release. It's not super clear, but this seems to be somewhere to start: https://github.com/helm/charts/tree/master/stable/grafana#import-dashboards

@jhamman
Copy link
Member

jhamman commented Jun 28, 2020

https://us-central1-b.gcp.pangeo.io is now up

No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.

@consideRatio - do you know if it is possible (or what it would take) to put grafana behind the admin permissions of a jupyterhub service?

@rabernat
Copy link
Member Author

Tomorrow morning I plan to send an email to the users of the new cluster to let them know it's on.

@TomAugspurger
Copy link
Member

@jhamman do you know what's left to do for getting things hooked up to hubploy?

@jhamman
Copy link
Member

jhamman commented Jun 29, 2020

I think we just need to:

@rabernat
Copy link
Member Author

I'm about to push a big update to pangeo.io with documentation about the new setup.

@jhamman
Copy link
Member

jhamman commented Jun 29, 2020

@TomAugspurger - any idea what is up with these Pending pods:

$ kubectl get pod -n prod | grep Pending
us-central1b-prod-prometheus-node-exporter-2n97g                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-dw689                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-j42ms                  0/1     Pending   0          42h
us-central1b-prod-prometheus-node-exporter-wnsjv                  0/1     Pending   0          42h

@TomAugspurger
Copy link
Member

Not sure. Probably safe to just delete?

@jhamman
Copy link
Member

jhamman commented Jun 29, 2020

Not sure. Probably safe to just delete?

tried that. they just come back in the same state.

@rabernat
Copy link
Member Author

See pangeo-data/pangeo#780 for documentation update. I'd appreciate a review there.

@rabernat
Copy link
Member Author

Another question: the dask widget is still set up to launch kubeclusters. I think we should not allow kubecluster on the new cluster. So what do we do about the widget? Can we make it launch dask_gateway clusters?

@TomAugspurger
Copy link
Member

I believe that's coming from the dask_config.yml that's baked into the docker images at https://github.com/pangeo-data/pangeo-docker-images/blob/6ba7997b5246440c0f1b92512cb133b98c6b976d/base-image/dask_config.yml#L58-L63. Just switching that to dask-gateway won't work out of the box since the labextension is only set up to create a cluster like class(*args, **kwargs). But dask-gateway needs to create the intermediate Gateway object.
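
One possible angle, though I haven't verified it against the labextension: dask_gateway also ships a GatewayCluster class that can be constructed directly, which looks closer to the class(*args, **kwargs) pattern the extension expects:

# Possible fit for the labextension's factory pattern (unverified):
from dask_gateway import GatewayCluster

cluster = GatewayCluster()   # reads the gateway address from the dask config baked into the image
cluster.scale(2)
client = cluster.get_client()

If that works, the factory in dask_config.yml could potentially point at dask_gateway.GatewayCluster instead of dask_kubernetes.KubeCluster, but that needs testing.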

@rabernat
Copy link
Member Author

But dask-gateway needs to create the intermediate Gateway object.

So we need to open an issue in dask-labextension?

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 29, 2020 via email

TomAugspurger mentioned this issue Jul 1, 2020
@rabernat
Copy link
Member Author

rabernat commented Jul 2, 2020

Thanks for your work everyone! The new cluster is launched.

Whenever you get time @TomAugspurger, I would love if you could explain to me how to use grafana / prometheus to gather the information I need about usage.

rabernat closed this as completed Jul 2, 2020
@alimanfoo
Copy link

Hi pangeo folks, apologies for stalking, but I found this issue while googling for whether there is some way to configure storage quotas when using NFS on GCP. If anyone has found a solution to that, I'd be very grateful for a pointer.
