[Solved] Pangeo Hub has several critical issues #815

Closed
4 tasks done
choldgraf opened this issue Nov 8, 2021 · 13 comments
@choldgraf
Member

choldgraf commented Nov 8, 2021

Summary

There are a variety of critical issues that have been reported on the Pangeo JupyterHub.

Certificate errors

Some users reported a certificate error when connecting to the hub. Here's an example of the error message:

image

FreshDesk tickets:

JupyterLab and kernel usability errors

Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:

  • Starting Kernels
  • Opening Terminals

FreshDesk tickets:

Grafana is not reachable

I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.


After-action report

What went wrong

We set up a redirect between two URLs where one was a CNAME for the other. This turned out to be a Very Bad Idea ™️. In PR #812, we replaced the pangeo.2i2c.cloud address with us-central1-b.gcp.pangeo.io, meaning that cert-manager was no longer issuing certificates for pangeo.2i2c.cloud and our load balancer would no longer accept traffic from pangeo.2i2c.cloud. All issues were resolved by undoing the redirect and visiting the us-central1-b.gcp.pangeo.io address instead.

Pangeo has been special-cased in that it had active users before setup development was complete, and I think the switch in URL is what confused people. Normally we would only invite users after the DNS has been set, so I don't see the issue arising again.

Action items

Documentation improvements

  1. Pull grafana links from hub config files and add them to documentation sites so it's clear which grafana URL goes with which hub: Add grafana links to hubs list #817

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@sgibson91
Member

sgibson91 commented Nov 9, 2021

Grafana is not reachable

I tried to go to grafana.us-central1-b.gcp.pangeo.io but received a "your connection is not private" error. So I haven't been able to look at any dashboards to understand what might be going on.

That is not the correct grafana URL - this is the one listed in the config and is reachable https://pangeo-grafana.pangeo.2i2c.cloud

@sgibson91
Member

I suspect the certificate issues are due to the redirect Ryan asked me to set up last night.

Setup

We have two DNS zones: 2i2c.cloud, managed by us through Namecheap, and pangeo.io, managed by the Pangeo community through Hurricane Electric (though I have access).

In 2i2c.cloud, we have a pangeo.2i2c.cloud A record that points to our LoadBalancer IP address.

In pangeo.io, we have us-central1-b.gcp.pangeo.io that is a CNAME for pangeo.2i2c.cloud.

It is set up this way so that if our LoadBalancer IP changes, we only need to edit the A record in 2i2c.cloud and pangeo.io will inherit the change through the CNAME.
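To illustrate why only the A record needs editing, here is a minimal sketch that models the two records as a dictionary and follows the CNAME to an address. The IP addresses are placeholders from the documentation range (not the real load balancer IPs), and `resolve` is a hypothetical helper, not how a real resolver is implemented.

```python
# Toy model of the two DNS records described above.
# The IP addresses are placeholders, not the real LoadBalancer IPs.
RECORDS = {
    "pangeo.2i2c.cloud": ("A", "203.0.113.10"),
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
}


def resolve(name, records, max_hops=8):
    """Follow CNAME records until an A record is reached."""
    for _ in range(max_hops):
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise RuntimeError("too many CNAME hops")


print(resolve("us-central1-b.gcp.pangeo.io", RECORDS))  # 203.0.113.10

# If the LoadBalancer IP changes, only the A record is edited...
RECORDS["pangeo.2i2c.cloud"] = ("A", "203.0.113.20")

# ...and the pangeo.io name picks up the change through the CNAME.
print(resolve("us-central1-b.gcp.pangeo.io", RECORDS))  # 203.0.113.20
```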

The Redirect

We only assign one domain name to each of our hubs to avoid confusion. This means that once the CNAME for us-central1-b.gcp.pangeo.io was set up, pangeo.2i2c.cloud began returning a 404, since ingress-nginx now only accepts traffic from the pangeo.io domain.

Hence Ryan asked me to set up a redirect from pangeo.2i2c.cloud to us-central1-b.gcp.pangeo.io, which I did here: #482 (comment)
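A simplified way to picture the 404: both names reach the same load balancer, but the ingress only has a rule for one host name. The sketch below is a toy model of host-based routing, not ingress-nginx's actual implementation; `route` and `accepted_hosts` are hypothetical names.

```python
def route(host_header, accepted_hosts):
    """Return the HTTP status a host-based ingress rule would roughly give."""
    # Requests whose Host header has no matching rule fall through to a 404.
    return 200 if host_header in accepted_hosts else 404


# After the switch, only the pangeo.io name is configured for the hub.
accepted_hosts = {"us-central1-b.gcp.pangeo.io"}

print(route("us-central1-b.gcp.pangeo.io", accepted_hosts))  # 200
print(route("pangeo.2i2c.cloud", accepted_hosts))            # 404
```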

What I suspect is happening

I don't think the certificates are able to resolve properly because they're trying to get a response from ...pangeo.io which is a CNAME for pangeo.2i2c.cloud which is then redirecting back to ...pangeo.io --> vicious loop of nothing giving a correct response.
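The suspected loop can be sketched as a tiny graph walk. This is only a toy model of the suspicion above: in reality the CNAME lives at the DNS layer and the redirect at the HTTP layer, so the real interaction is more subtle. The `follow` helper and the `hops` mapping are hypothetical.

```python
def follow(start, hops):
    """Walk CNAME/redirect hops from `start`; report the path and whether it loops."""
    seen = []
    name = start
    while name in hops:
        if name in seen:
            return seen, True  # we came back to a name we already visited
        seen.append(name)
        _kind, name = hops[name]
    seen.append(name)
    return seen, False


# With the redirect in place, each name points at the other.
with_redirect = {
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
    "pangeo.2i2c.cloud": ("REDIRECT", "us-central1-b.gcp.pangeo.io"),
}
print(follow("us-central1-b.gcp.pangeo.io", with_redirect)[1])  # True

# Without the redirect, the walk ends at pangeo.2i2c.cloud.
without_redirect = {
    "us-central1-b.gcp.pangeo.io": ("CNAME", "pangeo.2i2c.cloud"),
}
print(follow("us-central1-b.gcp.pangeo.io", without_redirect)[1])  # False
```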

What I'm going to try

  • Remove the redirect in Namecheap
  • Wait some time to allow the DNS to resolve itself
  • Redeploy support on the pangeo-hubs cluster to trigger any certificate reissues
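Once the steps above are done, one way to confirm the certificates is to attempt a verified TLS handshake against each host name. This is a minimal sketch using Python's standard `ssl` and `socket` modules; `has_valid_certificate` and `report` are hypothetical helpers, and connection errors other than certificate failures (DNS failures, timeouts) are deliberately left to propagate.

```python
import socket
import ssl


def has_valid_certificate(host, port=443, timeout=5.0):
    """Return True if a TLS handshake with chain and hostname checks succeeds."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        # Raised for expired, untrusted, or mismatched certificates.
        return False


def report(hosts, check=has_valid_certificate):
    """Map each host name to the result of the certificate check."""
    return {host: check(host) for host in hosts}
```

Calling `report(["us-central1-b.gcp.pangeo.io", "pangeo-grafana.pangeo.2i2c.cloud"])` would then give a quick yes/no per host, which is roughly what opening the hub in a private browser window checks by hand.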

@sgibson91 sgibson91 moved this from Todo 👍 to In Progress ⚡ in Sprint Board Nov 9, 2021
@sgibson91
Member

I did the above and logged into the production hub in a private browser. All certificates were present and the connection was private. So the certificates issue is now resolved.

@sgibson91
Member

sgibson91 commented Nov 9, 2021

JupyterLab and kernel usability errors

Many users on the Pangeo hub are reporting sluggish behavior. In particular, the following two actions take upwards of 30 seconds to complete, or do not complete at all:

  • Starting Kernels
  • Opening Terminals

I could not replicate this, so I suspect it was all a certificates/traffic problem, but I'm happy to be proven wrong if someone can provide concrete steps to demonstrate the problem.

@sgibson91
Member

sgibson91 commented Nov 9, 2021

I think there has been some confusion regarding the certificates on the pangeo.2i2c.cloud URL.

We stopped supporting multiple domains for a single hub to reduce complexity. See these PRs: #460 and #496

Hence when #812 was merged, we stopped issuing certificates for pangeo.2i2c.cloud and the load balancer stopped accepting traffic from there. Instead, we issue certificates for us-central1-b.gcp.pangeo.io and accept traffic from there. As mentioned above, the pangeo.2i2c.cloud address is only used so we can update the IP address of the load balancer if required in the cases where we don't have access to the desired domain.

There are no certificate issues if folks use the us-central1-b.gcp.pangeo.io address, which I mentioned here: #482 (comment). But instead we got waylaid by redirects.

I think the only reason we've had this confusion is because the hub had users throughout the setup process. Normally, we would not have users until after this point.

@choldgraf
Member Author

Just a note that the following works as expected for me:

  • Go to https://us-central1-b.gcp.pangeo.io , start a session, and start a kernel. This was much snappier than when I tried yesterday.
  • Go to pangeo-grafana.pangeo.2i2c.cloud - this correctly brought up the grafana so that I could log in!

Quick thoughts:

  • Do we anticipate moving pangeo-grafana.pangeo.2i2c.cloud to pangeo-grafana.us-central1-b.gcp.pangeo.io ? Or will the pangeo.2i2c.cloud only exist in order to serve pangeo-grafana.pangeo.2i2c.cloud?
  • Another hub we could look to for inspiration is utoronto.2i2c.cloud, which redirects to https://jupyter.utoronto.ca/hub/login?next=%2Fhub%2F - not sure if that's the same setup or not, but just noting it in case it helps with redirection

@sgibson91
Member

sgibson91 commented Nov 9, 2021

Do we anticipate moving pangeo-grafana.pangeo.2i2c.cloud to pangeo-grafana.us-central1-b.gcp.pangeo.io ? Or will the pangeo.2i2c.cloud only exist in order to serve pangeo-grafana.pangeo.2i2c.cloud?

There's a bit of a name-clash for grafana atm since I wasn't very clever when setting up the COESSING hub.

# This domain should be updated to just grafana.pangeo.2i2c.cloud
# after the COESSING hub has been brought down and no longer requires
# grafana.pangeo.2i2c.cloud
- pangeo-grafana.pangeo.2i2c.cloud

So my plan was:

  • Tear down the COESSING hub
  • Move pangeo-grafana.pangeo.2i2c.cloud to grafana.pangeo.2i2c.cloud
  • Update our A record in Namecheap to be a wildcard record (*.pangeo) since pangeo-grafana.pangeo.2i2c.cloud, staging.pangeo.2i2c.cloud and pangeo.2i2c.cloud all point to the same IP address. This will be simpler to maintain.
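One detail worth checking before switching to a wildcard: under standard DNS wildcard semantics (RFC 4592), `*.pangeo.2i2c.cloud` covers names below `pangeo.2i2c.cloud` but not `pangeo.2i2c.cloud` itself, so that name would likely still need its own record. The sketch below is a simplified matcher for illustration only; `wildcard_matches` is a hypothetical helper and ignores the rule that explicit records take precedence over wildcards.

```python
def wildcard_matches(wildcard, name):
    """True if `name` is covered by a DNS wildcard such as '*.pangeo.2i2c.cloud'."""
    if not wildcard.startswith("*."):
        raise ValueError("expected a wildcard of the form '*.<parent>'")
    parent = wildcard[2:]
    # At least one extra label must stand in for '*', so the parent
    # name itself is not covered.
    return name.endswith("." + parent)


wildcard = "*.pangeo.2i2c.cloud"
for name in (
    "pangeo-grafana.pangeo.2i2c.cloud",
    "staging.pangeo.2i2c.cloud",
    "pangeo.2i2c.cloud",
):
    print(name, wildcard_matches(wildcard, name))
```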

I had no intention of pointing grafana at grafana.us-central1-b.gcp.pangeo.io unless the community specifically needs it or we consider it best practice.

If we move forward with #427 at some point, I had also considered making these URLs *.pangeo-gcp so we could have *.pangeo-aws in the future if need be.

@sgibson91
Member

sgibson91 commented Nov 9, 2021

  • Another hub we could look to for inspiration is utoronto.2i2c.cloud, which redirects to jupyter.utoronto.ca/hub/login?next=%2Fhub%2F - not sure if that's the same setup or not, but just noting it in case it helps with redirection

Just checked out Namecheap for this. We have an A record utoronto in the 2i2c.cloud domain that points at an IP address (I assume the load balancer), and that's it. The redirect setup must be happening on the utoronto.ca end.

In which case, I wonder if I took the wrong approach by trying to set up the redirect from Namecheap instead of in Hurricane Electric? Update: Had a quick look through Hurricane Electric and it wasn't obvious to me how to do this.

@yuvipanda
Member

I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.

@choldgraf
Member Author

@yuvipanda LOL that is amazing

@sgibson91 sgibson91 changed the title from "[Incident] Pangeo Hub has several critical issues" to "[Solved] Pangeo Hub has several critical issues" Nov 10, 2021
@sgibson91
Member

I've tidied up the top comment. I don't think there's anything actionable left here so I'm going to close this.

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 10, 2021
@damianavila
Contributor

I'm simultaneously proud of and ashamed of how the utoronto redirect works - it works via JS in the homepage! https://github.com/utoronto-2i2c/homepage/blob/master/extra-assets/js/login.js. It only works if you land on the homepage - if you're already logged in it has no effect. Do not recommend.

It is fun to realize I actually thought about something along these lines when I was thinking about possible workarounds 😜

@choldgraf
Member Author

Many thanks @sgibson91 for being awesome!
