Status page for our clusters #609

Closed
3 tasks
yuvipanda opened this issue Feb 11, 2021 · 11 comments

Comments

@yuvipanda
Member

Summary

There are many cases where our clusters might be down for one reason or another (e.g. upgrades, outages, etc). In those cases, it's helpful if there is a source of truth for "is 2i2c's infrastructure down, or is it just me?". We should have a place to point users to so that they can quickly answer this question.

Important information

The most common service I've seen for this is statuspage.io, which even has a non-profit discount.

Tasks to complete

  • Decide what kind of service we'd like to use for a status page
  • ...figure out steps to implement this
  • Document the new status page in our user and team documentation
@yuvipanda
Member Author

yuvipanda commented Feb 11, 2021

Actually, maybe let's just get a twitter account? statuspage.io and stuff are expensive! '2i2cstatus' or something like that? What do you think, @choldgraf?

@choldgraf
Member

choldgraf commented Feb 11, 2021

@yuvipanda wanna close this one as a dupe of #52 ?

I agree this would be great! A Twitter account seems fine, though probably less accessible to many. What if we just made a Grafana plot that was public and linked it in our docs somewhere?

@yuvipanda
Member Author

Yep, that's a dupe!

Grafana is also running on our infrastructure. I just want it to be something we don't maintain, and contain human readable messages we explicitly set. So you can say things like 'maintenance today between 9pm and 11pm UTC, possible disruption expected', and then 'DONE' or something like that. See https://sal.toolforge.org/production for an example
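As a rough illustration of the "human readable messages we explicitly set" idea, here is a minimal Python sketch of a status log rendered newest first. The `StatusEntry` and `render_log` names and the timestamps are hypothetical, made up for this example; they are not taken from sal.toolforge.org or from any existing 2i2c tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusEntry:
    """One human-written status message and the UTC time it was posted."""
    posted_at: datetime
    message: str


def render_log(entries):
    """Render entries newest first, one line per entry."""
    ordered = sorted(entries, key=lambda e: e.posted_at, reverse=True)
    return "\n".join(
        f"{e.posted_at:%Y-%m-%d %H:%M} UTC  {e.message}" for e in ordered
    )


# Hypothetical entries; the dates are invented for illustration.
log = [
    StatusEntry(
        datetime(2021, 2, 11, 20, 30, tzinfo=timezone.utc),
        "Maintenance today between 9pm and 11pm UTC, possible disruption expected",
    ),
    StatusEntry(datetime(2021, 2, 11, 23, 5, tzinfo=timezone.utc), "DONE"),
]
print(render_log(log))
```

The point of the sketch is only that each entry is free text written by a person; where the log is hosted is a separate question.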

@choldgraf
Member

Could that be something as simple as a programmatically generated Sphinx site that gets built and auto-deployed on a GitHub Action CRON job? Something that could both be manually edited to update a status, or something that could automatically update itself according to a status API or something? I'm thinking about this for example:

https://mybinder-sre.readthedocs.io/en/latest/operation_guide/federation.html

OK maybe not that simple but I don't think it'd be much work!

@yuvipanda
Member Author

I'm curious to hear what you think of as the complexities of using Twitter for this. Users won't need to log in - they can just view it without logging in, and that should be good enough. This is a fairly common pattern - see https://twitter.com/githubstatus, https://twitter.com/cratesiostatus, https://twitter.com/SpotifyStatus, https://twitter.com/SlackStatus, https://twitter.com/DOStatus, https://twitter.com/gitlabstatus and many others. They all seem to contain explicit messages human beings have written, and it's completely independent of our infrastructure.

@choldgraf
Member

choldgraf commented Feb 12, 2021

I guess I'm just assuming most people have never used Twitter. So I'm thinking that, rather than having a webpage with a link on it that redirects to a twitter bot w/ a status, why not just embed that status directly on a web-page? (unless it became ubiquitous enough that everybody thought to go to the twitter bot page before going to something like status.2i2c.org...the pattern I usually assume is that status.foo.org exists)

@yuvipanda
Member Author

Hmm, so the general approach is that if there's an issue, anything that runs on our infrastructure should not be relied upon to communicate with our users. Hence even big organizations like Dropbox use statuspage.io for this. Basically, when you are dealing with an outage or issue, you don't also want to deal with the possibility that the thing you are using to communicate with your users is down as part of the same outage. So it needs to be as uncomplicated as possible. I guess I think of a status site we build ourselves as extra infrastructure we need to maintain, and I want to avoid that as much as possible for this particular thing. Plus, if there's an outage, there's a difference between posting something that takes 1 min vs. making a git commit, pushing, waiting for CI, etc.

The goal is to put the link for status checks everywhere, not something that's only shown during outages. We can link to it from our error pages, from home pages, etc. We can also redirect status.2i2c.org to it, so we can switch it out later if we want to.

I don't think we expect users to know how twitter works - I don't expect them to 'follow us' there, for instance. However, I think the content there is open to the public, can be accessed without login, and has enough context to be useful.

The ideal would actually be something like statuspage.io, since it can integrate with Grafana, etc. when needed. However, it's $29 a month minimum, though there's a 75% discount for non-profits. Maybe $29 a month is insignificant in the long run. Should we just use that instead?

@choldgraf
Member

choldgraf commented Feb 13, 2021

I was imagining something like:

  • Use a Sphinx website that runs a simple JavaScript check for uptime of something in 2i2c's infrastructure. That way, any time someone visited the page, they'd get a notice about whether the service was up. Serve it from GH-pages.

or

  • Use the same process but with a cron job on GitHub Actions rebuilding the website. That way we could do more complex checks and interfaces if need be.

then redirect status.2i2c.org to that page running on gh-pages.
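The second option above could be sketched roughly as follows: a small Python script, run on a schedule, that probes each hub and writes a reStructuredText page for Sphinx to build. This is only a sketch under assumptions. The function names, the `hub.example.org` URL, and the page layout are all hypothetical, and probing `/hub/health` assumes the hub exposes JupyterHub's health endpoint at that path.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error codes all count as down.
        return False


def render_status_page(results):
    """Build a reStructuredText page from a {name: is_up} mapping."""
    lines = ["Service status", "==============", ""]
    for name, is_up in sorted(results.items()):
        lines.append(f"* {name}: {'up' if is_up else 'DOWN'}")
    return "\n".join(lines) + "\n"


def build_page(targets, checker=check_url):
    """Probe every {name: url} target and render the results as one page."""
    return render_status_page(
        {name: checker(url) for name, url in targets.items()}
    )


# Stand-in checker so the example runs without network access.
# The hub name and URL are hypothetical.
targets = {"example-hub": "https://hub.example.org/hub/health"}
print(build_page(targets, checker=lambda url: True))
```

In a scheduled job, the default `check_url` would be used instead of the stand-in, and the returned text would be written to a source file before the Sphinx build. Note that this only reports automated checks; it does not cover the human-written messages discussed earlier in the thread.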

I think that in the long run, $29/mo is definitely worth it if we get a useful service across all of our hubs, so maybe worth looking into that.

@yuvipanda
Member Author

yuvipanda commented Feb 17, 2021

utoronto-2i2c/jupyterhub-deploy#83 (comment) is an example of the kinda stuff I think we should have. Check out https://www.systemstatus.utoronto.ca/

@choldgraf choldgraf transferred this issue from 2i2c-org/docs Aug 16, 2021
@choldgraf choldgraf changed the title Create a 'status page' Status page for our clusters Aug 16, 2021
@abkfenris
Contributor

I've spent some time poking around status page options for some non JupyterHub infrastructure that my team runs, looking for free & low management options.

  • Instatus has a pretty full featured free plan, and can integrate with various monitoring services (and is adding their own monitoring soon)
  • Upptime is open source and powered by GitHub Actions.
  • I've been using UptimeRobot for monitoring. It can do a very simple status page, but you can't add incidents to it.

I've been using UptimeRobot mainly as an internal dashboard with my team, but seeing this issue inspired me to get around to configuring Instatus as a more public page with both UptimeRobot and Healthchecks.io reporting in to it.

@choldgraf
Member

choldgraf commented May 2, 2023

Is it possible to get a minimal version of this with Grafana?

I'm noting that this issue hasn't had movement in quite a long time, even though it remains an important part of making our communities confident that the infrastructure we run is reliable (as well as helping them debug whether there's something wrong with just them or with everybody).

I note that for the mybinder.org service, we simply use a Grafana graph that shows uptime over the last several hours:

[Image: Grafana graph of mybinder.org uptime over the last several hours]

Could we design a similar kind of plot for either our main 2i2c cluster, or somehow as an aggregate across all clusters, that just showed something like "% user session launch successes"? I know it's not quite as good of a metric as it is for Binder, which has lots of shorter launches, but something like this could serve as a placeholder until we have a more robust system in place...
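For concreteness, a "% user session launch successes" figure could be computed along these lines. This is a hedged Python sketch: the function names and the per-cluster counts are hypothetical, and how the success and failure counts would be collected from each cluster is left open.

```python
def launch_success_percent(successes, failures):
    """Percentage of launch attempts that succeeded, or None with no attempts."""
    total = successes + failures
    if total == 0:
        return None
    return 100.0 * successes / total


def aggregate(per_cluster):
    """Pool (successes, failures) counts across clusters into one percentage.

    Pooling the raw counts means busy clusters weigh more than idle ones.
    """
    successes = sum(s for s, _ in per_cluster.values())
    failures = sum(f for _, f in per_cluster.values())
    return launch_success_percent(successes, failures)


# Hypothetical counts for illustration only.
counts = {"cluster-a": (95, 5), "cluster-b": (0, 0)}
print(aggregate(counts))
```

Returning `None` when there were no launch attempts keeps a quiet cluster from being reported as either 0% or 100%, which matters more here than for Binder because launches are less frequent.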

Status: Needs Shaping / Refinement