Staging Hub deployment for Pangeo #599

Closed
4 of 7 tasks
Tracked by #482
choldgraf opened this issue Aug 10, 2021 · 19 comments · Fixed by #597 or #651
@choldgraf
Member

choldgraf commented Aug 10, 2021

Description

We should deploy a staging hub for Pangeo that has the same infrastructure setup on less-costly infrastructure. This may also generate some other tasks that we need to accomplish in order to get the base infrastructure running.

Benefit

This will help us iterate more quickly and get feedback from the Pangeo team. It will also be a place where we can stage changes in the future without affecting prod, since Pangeo is a more complex and dynamic setup than most of our community hubs.

Tasks to complete


Updates

  • 2021-08-24: We've got a hub deployed but the user servers aren't being created properly due to some NFS errors. We've agreed to try fixing this for two weeks (@yuvipanda will give this a shot). If we cannot fix it after that time, we will:
    • Re-deploy the Pangeo hub using Google Filestore
    • Track the deployment of in-cluster NFS as a separate enhancement here: Run NFS servers in-cluster #50
  • 2021-08-31: In today's sprint planning meeting we agreed that, now that NFS is ready to go (Run NFS servers in-cluster #50), we should review this PR, merge it, and then ask the Pangeo folks to check that the hub looks good. As a later step we will finish and deploy Authenticate users with GitHub Teams membership in Pangeo Hub #598, but that is not necessary for the initial deployment.
@sgibson91
Member

I've deployed a hub... sort of. k8s isn't able to mount the NFS server and I'm not sure if it's because I missed a step or because of the private cluster #597 (comment)

@choldgraf
Member Author

I believe this is no longer blocked. We now need a team discussion about whether the NFS strategy used in the PR is the right strategy to use in general. I've updated this issue to mark it as such. Check out @sgibson91's main question here:

#597 (comment)

@sgibson91
Member

A staging hub exists: https://staging.pangeo.2i2c.cloud/

But spawning the user server fails, which means the NFS setup still needs some tweaking. Not sure if that needs to happen in #597 or #613.

@choldgraf choldgraf changed the title from "Initial staging Hub deployment for Pangeo" to "Staging Hub deployment for Pangeo" Aug 23, 2021
@choldgraf choldgraf removed the blocked label Aug 23, 2021
@choldgraf
Member Author

choldgraf commented Aug 23, 2021

congrats @sgibson91 :-) 🚀

Could we define a hand-off plan for this issue while you're away? I tried updating the top comment so the next steps are clear. What information would make it easiest for somebody else to finish up the NFS work?

@sgibson91
Member

sgibson91 commented Aug 24, 2021

The first thing that needs to be done is fixing the spawn failure #597 (comment)

There's some discussion going on here about behaviour, but I think that needs a decision before it can be implemented #597 (comment)

We should also figure out if that work needs to happen in #597 or #613. If it can go in #597, then I think #613 could be merged. Or maybe at this point it's just better to open up a new PR and start afresh anyway.

@choldgraf
Member Author

update: in the sprint planning meeting today, we discussed that, now that NFS is ready to go (#50) we should be ready to review this PR and merge it in, and then ask Pangeo folks to take a look at the hub and make sure it looks good.

In a future step, we will finish up #598 and deploy it, but that's not necessary for the initial deployment

@yuvipanda
Member

This actually fails (it didn't use to!)

2021-09-01T21:38:01Z [Warning] MountVolume.SetUp failed for volume "pvc-958d82c1-5383-45e1-85e8-011091b2ae0f" : mount failed: exit status 1
Mounting command: /home/kubernetes/containerized_mounter/mounter
Mounting arguments: mount -t nfs -o noatime,soft,vers=4.2 10.12.4.188:/export/pvc-958d82c1-5383-45e1-85e8-011091b2ae0f /var/lib/kubelet/pods/faeae25f-2156-4322-b6d4-cd00c438821b/volumes/kubernetes.io~nfs/pvc-958d82c1-5383-45e1-85e8-011091b2ae0f
Output: Mount failed: mount failed: exit status 32
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft,vers=4.2 10.12.4.188:/export/pvc-958d82c1-5383-45e1-85e8-011091b2ae0f /var/lib/kubelet/pods/faeae25f-2156-4322-b6d4-cd00c438821b/volumes/kubernetes.io~nfs/pvc-958d82c1-5383-45e1-85e8-011091b2ae0f]
Output: mount.nfs: mounting 10.12.4.188:/export/pvc-958d82c1-5383-45e1-85e8-011091b2ae0f failed, reason given by server: No such file or directory

Trying to deploy-support fails with:

Error: UPGRADE FAILED: cannot patch "support-nfs-server-provisioner" with kind StatefulSet: StatefulSet.apps "support-nfs-server-provisioner" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden

I think our timebox for getting in-cluster NFS working has expired, so I'm going to abandon in-cluster NFS and switch to Google Filestore, as we agreed.
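For readers puzzling over the deploy-support error above: the message itself says that only `replicas`, `template`, and `updateStrategy` may change when a StatefulSet spec is updated. The sketch below only illustrates that rule as stated in the error message; it is not Kubernetes' actual validation code, and the two specs are hypothetical, not taken from the real chart.

```python
# Fields the error message says may change in a StatefulSet spec update.
MUTABLE_SPEC_FIELDS = {"replicas", "template", "updateStrategy"}


def forbidden_spec_changes(old_spec, new_spec):
    """Return the sorted top-level spec fields that differ but are not mutable."""
    changed = {
        key
        for key in set(old_spec) | set(new_spec)
        if old_spec.get(key) != new_spec.get(key)
    }
    return sorted(changed - MUTABLE_SPEC_FIELDS)


# Hypothetical specs, for illustration only.
old = {"replicas": 1, "serviceName": "nfs", "volumeClaimTemplates": [{"size": "10Gi"}]}
new = {"replicas": 2, "serviceName": "nfs", "volumeClaimTemplates": [{"size": "100Gi"}]}

# The replica change is fine; the volume claim change is what gets rejected.
print(forbidden_spec_changes(old, new))
```

If a chart upgrade touches any field this check reports, the patch will be refused, and the StatefulSet generally has to be recreated rather than patched. Whether recreating it is safe depends on the data it holds.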

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 1, 2021
@rabernat
Contributor

rabernat commented Sep 2, 2021

Thanks so much for all the hard work here!

After Yuvi's ping on Slack, I just tried logging in. I clicked login and was redirected to authorize a new GitHub app (iam-login-something). Once redirected back to https://staging.pangeo.2i2c.cloud/hub/oauth_callback?code=..., I was met with

403 : Forbidden

If your email address has NOT been added to the list of allowed users for this hub, please contact the hub administrators.


Our previous cluster was configured to allow all users from the group https://github.com/orgs/pangeo-data/teams/us-central1-b-gcp to be able to log in. It would be great to use that same group here.

Let me know how I can help.
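A minimal sketch of the team-based check requested above, assuming allowed teams are written as `<org>:<team>` strings. That format, and both helper names, are illustrative assumptions; the authenticator configuration the hub actually uses is tracked in #598 and may look different.

```python
from urllib.parse import urlparse


def team_spec_from_url(team_url):
    """Turn https://github.com/orgs/<org>/teams/<team> into '<org>:<team>'."""
    parts = urlparse(team_url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "orgs" or parts[2] != "teams":
        raise ValueError(f"not a GitHub team URL: {team_url}")
    return f"{parts[1]}:{parts[3]}"


def is_allowed(user_teams, allowed_teams):
    """True if the user belongs to at least one allowed '<org>:<team>' entry."""
    return bool(set(user_teams) & set(allowed_teams))


# The team URL comes from the comment above; the user memberships are made up.
allowed = {team_spec_from_url("https://github.com/orgs/pangeo-data/teams/us-central1-b-gcp")}
print(allowed)
print(is_allowed({"pangeo-data:us-central1-b-gcp"}, allowed))
print(is_allowed({"some-other-org:some-team"}, allowed))
```

In a real deployment the user's team memberships would come from the GitHub API after login, which requires the OAuth app to request permission to read organization membership.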

@choldgraf
Member Author

@rabernat just a note that we are tracking the GitHub teams auth here: #598

@choldgraf
Member Author

I think we need to add @rabernat here

https://github.com/2i2c-org/pilot-hubs/blob/ffceb3d397bdd76a3ae9b9fc8ecfd1811da71ef4/config/hubs/pangeo-hubs.cluster.yaml#L70

He can then add other admins as needed, just until we get the GitHub teams auth working.

@rabernat
Contributor

rabernat commented Sep 2, 2021

Ah ok, thanks for clarifying. No worries.

@yuvipanda
Member

@rabernat try now

@rabernat
Contributor

rabernat commented Sep 2, 2021

Ok so I will continue to post feedback on this issue, as suggested by Chris.

Item 1: There are no choices of machine type on startup. Compare this to the Profile List on https://us-central1-b.gcp.pangeo.io/. This is important because some users (like my class) just need a small machine while others (like researchers) need lots of memory.
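For illustration, a machine-type menu like the one described above is usually expressed as a KubeSpawner profile list: a list of dicts with a `display_name` and a `kubespawner_override` of spawner settings. The sketch below builds such a list and sanity-checks it. The profile names and sizes are hypothetical, not the values used on either hub.

```python
def build_profile_list():
    """Return a KubeSpawner-style profile list; all sizes are hypothetical."""
    return [
        {
            "display_name": "Small",
            "description": "Enough for a class notebook",
            "default": True,
            "kubespawner_override": {"cpu_limit": 2, "mem_limit": "8G"},
        },
        {
            "display_name": "Large",
            "description": "Memory-heavy research work",
            "kubespawner_override": {"cpu_limit": 8, "mem_limit": "64G"},
        },
    ]


def check_profiles(profiles):
    """Raise ValueError unless names are unique and at most one profile is default."""
    names = [p["display_name"] for p in profiles]
    if len(names) != len(set(names)):
        raise ValueError("duplicate display_name")
    if sum(1 for p in profiles if p.get("default")) > 1:
        raise ValueError("more than one default profile")
    return profiles


profiles = check_profiles(build_profile_list())
print([p["display_name"] for p in profiles])
```

In a `jupyterhub_config.py` the resulting list would be assigned to `c.KubeSpawner.profile_list`; in a Helm-based deployment the same structure would live in the chart's values instead.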

@rabernat
Contributor

rabernat commented Sep 2, 2021

Item 2: My home directory is not there. It would be great if we could migrate over the home directories from the old cluster. Since both clusters are using GC Filestore, perhaps this is trivial: just mount the same volume on the new cluster. But since it lives in a different project, maybe that doesn't work.

@rabernat
Contributor

rabernat commented Sep 2, 2021

Item 3: Hub is not configured for requester-pays access to cloud data.

I discovered this by running the first few cells of this notebook, specifically

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds  = cat["sea_surface_height"].to_dask()

raises

OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/pangeo-cmems-duacs/o/.zmetadata?alt=media
Caller does not have serviceusage.services.use access to the Google Cloud project.
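As background on the error above: for a requester-pays bucket, Google Cloud Storage expects each request to name a billing project through the `userProject` query parameter, and the caller needs the `serviceusage.services.use` permission on that project. The sketch below only shows what such a request URL looks like. The billing project name is a placeholder, and this is not how the hub or gcsfs builds its requests internally.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_user_project(url, billing_project):
    """Return url with userProject set to billing_project, replacing any old value."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "userProject"
    ]
    query.append(("userProject", billing_project))
    return urlunsplit(parts._replace(query=urlencode(query)))


# URL from the traceback above; "my-billing-project" is a placeholder.
failing_url = (
    "https://storage.googleapis.com/download/storage/v1/b/"
    "pangeo-cmems-duacs/o/.zmetadata?alt=media"
)
print(with_user_project(failing_url, "my-billing-project"))
```

Since the error mentions `serviceusage.services.use` rather than a missing project, the likely fix is on the IAM side: the identity that user pods run as needs that permission on the billing project.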

@choldgraf
Member Author

choldgraf commented Sep 2, 2021

I'll try to capture some of @rabernat's suggestions in subsequent issues so that we don't lose track of them.

Note that when I try to log in I run into a "scale-up" error (screenshot not preserved). I had selected the smallest machine type.

@choldgraf
Member Author

Another note - if I go to Services -> Dask Gateway (https://staging.pangeo.2i2c.cloud/services/dask-gateway/) then I get a blank page with 404 Not Found.

@yuvipanda
Member

@choldgraf ah, the second smallest one works for me. Let's isolate and tweak the sizes until they all work. Can we use #652 to track and close this?

@choldgraf
Member Author

choldgraf commented Sep 2, 2021

@yuvipanda sounds good - I think that once #651 is merged we can consider this one closed (actually it should close automatically), and can then focus on specific improvements to the staging hub in separate issues

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 3, 2021