Status page for our clusters #609
Comments
Actually, maybe let's just get a Twitter account? statuspage.io and stuff are expensive! '2i2cstatus' or something like that? What do you think, @choldgraf?
@yuvipanda wanna close this one as a dupe of #52? I agree this would be great! Twitter account seems fine, though probably less accessible to many. What if we just made a Grafana plot that was public and linked it in our docs somewhere?
Yep, that's a dupe! Grafana is also running on our infrastructure. I just want it to be something we don't maintain, and that contains human-readable messages we explicitly set. So you can say things like 'maintenance today between 9pm and 11pm UTC, possible disruption expected', and then 'DONE' or something like that. See https://sal.toolforge.org/production for an example
Could that be something as simple as a programmatically generated Sphinx site that gets built and auto-deployed on a GitHub Action cron job? Something that could both be manually edited to update a status, or something that could automatically update itself according to a status API or something? I'm thinking about this for example: https://mybinder-sre.readthedocs.io/en/latest/operation_guide/federation.html OK maybe not that simple but I don't think it'd be much work!
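For what it's worth, here is a minimal sketch of the "programmatically generated page" idea, assuming statuses are kept as hand-edited (timestamp, message) pairs. `render_status_page` and the entry shape are hypothetical names made up for illustration; nothing here is an existing 2i2c tool or a Sphinx API. It only produces reStructuredText that a Sphinx build could pick up.

```python
from datetime import datetime, timezone


def render_status_page(entries, title="Cluster status"):
    """Render (timestamp, message) pairs as a reStructuredText page, newest first.

    Hypothetical helper: `entries` is assumed to be a list of
    (timezone-aware datetime, str) tuples maintained by hand or by a script.
    """
    # RST section title: the underline must be at least as long as the title.
    lines = [title, "=" * len(title), ""]
    for when, message in sorted(entries, key=lambda e: e[0], reverse=True):
        # Normalise every timestamp to UTC so readers don't have to guess the zone.
        stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        lines.append(f"- **{stamp}**: {message}")
    lines.append("")
    return "\n".join(lines)
```

A scheduled job could write the returned string to something like `status.rst` before building the docs. Note that this keeps the page dependent on a commit-and-build cycle, so it would be slower to update than posting a short message by hand.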
I'm curious to hear what you think of as the complexities of using Twitter for this. Users won't need to log in - they can just view it without logging in, and that should be good enough. This is a fairly common pattern - see https://twitter.com/githubstatus, https://twitter.com/cratesiostatus, https://twitter.com/SpotifyStatus, https://twitter.com/SlackStatus, https://twitter.com/DOStatus, https://twitter.com/gitlabstatus and many others. They all seem to contain explicit messages human beings have written, and it's completely independent of our infrastructure.
I guess I'm just assuming most people have never used Twitter. So I'm thinking that, rather than having a webpage with a link on it that redirects to a Twitter bot w/ a status, why not just embed that status directly on a web page? (unless it became ubiquitous enough that everybody thought to go to the Twitter bot page before going to something like
Hmm, so the general approach is that if there's an issue, anything that runs on our infrastructure should not be relied upon to communicate to our users. Hence even big organizations like Dropbox use statuspage.io for this. Basically, when you are dealing with an outage or issue, you don't also want to deal with the possibility that the thing you are using to communicate about the issue with your users is also down, as part of the outage you are dealing with. So it needs to be as uncomplicated as possible. I guess I think of this as extra infrastructure we need to maintain, and I want to avoid that as much as possible for this particular thing. Plus, if there's an outage, there's a difference between posting something that takes 1min vs making a git commit, pushing, waiting for CI, etc. The goal is to put the link for status checks everywhere, not something that's only shown during outages. We can link to it from our error pages, from home pages, etc. We can also redirect status.2i2c.org to it, so we can switch it out later if we want to. I don't think we expect users to know how Twitter works - I don't expect them to 'follow us' there, for instance. However, I think the content there is open to the public, can be accessed without login, and has enough context to be useful. The ideal would actually be something like statuspage.io, since it can integrate with Grafana, etc when needed. However, it's $29 a month minimum - but there's a 75% discount for non-profits. Maybe $29 a month is insignificant in the long run - should we just use that instead?
I was imagining something like:
or
then redirect
I think that in the long run, $29/mo is definitely worth it if we get a useful service across all of our hubs, so maybe worth looking into that.
utoronto-2i2c/jupyterhub-deploy#83 (comment) is an example of the kinda stuff I think we should have. Check out https://www.systemstatus.utoronto.ca/
I've spent some time poking around status page options for some non JupyterHub infrastructure that my team runs, looking for free & low management options.
I've been using UptimeRobot mainly as an internal dashboard with my team, but seeing this issue inspired me to get around to configuring Instatus as a more public page with both UptimeRobot and Healthchecks.io reporting in to it.
Is it possible to get a minimal version of this with Grafana?
I'm noting that this issue hasn't had movement in quite a long time, even though it remains an important part of making our communities confident that the infrastructure we run is reliable (as well as helping them debug whether there's something wrong with just them or with everybody). I note that for the mybinder.org service, we simply use a Grafana graph that shows uptime over the last several hours. Could we design a similar kind of plot for either our main 2i2c cluster, or somehow as an aggregate across all clusters, that just showed something like "% user session launch successes"? I know it's not quite as good a metric as it is for Binder, which has lots of shorter launches, but something like this could serve as a placeholder until we have a more robust system in place...
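To make the "% user session launch successes" idea concrete, here is a rough sketch of how such a number could be computed over a trailing window. The `LaunchEvent` record and `launch_success_percent` function are hypothetical, and the six-hour default window is an arbitrary assumption. In practice this would more likely be a query against whatever metrics Grafana already reads, not Python code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LaunchEvent:
    """One user session launch attempt (hypothetical record shape)."""
    timestamp: datetime
    succeeded: bool


def launch_success_percent(events, now, window=timedelta(hours=6)):
    """Return the % of launches in the trailing window that succeeded.

    Returns None when there were no launches in the window, since
    "no data" should not be shown as either 0% or 100%.
    """
    recent = [e for e in events if now - window <= e.timestamp <= now]
    if not recent:
        return None
    successes = sum(1 for e in recent if e.succeeded)
    return 100.0 * successes / len(recent)
```

The `None` case matters for hubs with few launches: unlike Binder, a quiet hub may see no launches for hours, so an empty window needs to be displayed differently from an outage.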
Summary
There are many cases where our clusters might be down for one reason or another (e.g. upgrades, outages, etc). In those cases, it's helpful if there is a source of truth for "is 2i2c's infrastructure down, or is it just me?". We should have a place to point users to so that they can quickly answer this question.
Important information
The most common service I've seen for this is statuspage.io, which even has a non-profit discount.
Tasks to complete