[alerting] Alerting for consistent server spawn failures on a hub #2267
Open
Tracked by #1804
damianavila
changed the title
Alerts for consistent server start failures on a hub. If a JupyterHub is somehow failing to start user pods consistently, we should have an alert for that that hopefully reaches us before a user does. Note that this is fairly rare. The use case is to just provide enough evidence that this rare thing isn't happening right now
Alerts for consistent server start failures on a hub
Feb 27, 2023
damianavila
moved this to Needs Shaping / Refinement
in DEPRECATED Engineering and Product Backlog
Feb 27, 2023
pnasrat
changed the title
Alerts for consistent server start failures on a hub
Q2 2023: Goal Alerting for consistent server spawn failures on a hub
Mar 28, 2023
pnasrat
changed the title
Q2 2023: Goal Alerting for consistent server spawn failures on a hub
[alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub
Apr 3, 2023
@pnasrat and I discussed moving this to next quarter, given the team's capacity this quarter. I think we should! Thoughts, @damianavila?
I concur with the move, although we should evaluate it alongside the other potential goals for Q3 and decide whether we really want to include it next quarter.
choldgraf
changed the title
[alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub
[alerting] Alerting for consistent server spawn failures on a hub
Jun 16, 2023
Automated alerts for server spawn failures
If a JupyterHub is consistently failing to start user pods, we should have an alert for it that hopefully reaches us before a user does. Note that this is fairly rare. The use case is simply to provide enough evidence that this rare thing isn't happening right now.
Parent issue: #1804
Rationale: Spawn failures are the most common way people report that ‘the hub is not working’ in a way that affects all users and requires immediate attention. Prometheus already collects this metric, so setting up the alert also lays the groundwork for future alerts based on Prometheus metrics. It is also a useful alert in its own right, because it fires on a specific user-facing symptom rather than on causes whose symptoms may differ from user to user.
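As a rough illustration of what "consistent" spawn failures could mean, here is a minimal sketch of the decision the alert would make. It assumes we can get per-hub counts of successful and failed spawns over some time window from the metrics Prometheus already collects. The names `SpawnWindow` and `should_alert` and both thresholds are hypothetical placeholders, not an agreed design.

```python
from dataclasses import dataclass


@dataclass
class SpawnWindow:
    """Spawn outcomes observed for one hub over a fixed time window (hypothetical shape)."""

    successes: int
    failures: int


def should_alert(
    window: SpawnWindow,
    min_attempts: int = 5,           # assumed: ignore windows with too few spawns
    max_failure_ratio: float = 0.5,  # assumed: fire when at least half of spawns fail
) -> bool:
    """Return True when the hub looks like it is consistently failing to spawn servers."""
    attempts = window.successes + window.failures
    if attempts < min_attempts:
        # Too little traffic to call the failures "consistent".
        return False
    return window.failures / attempts >= max_failure_ratio


if __name__ == "__main__":
    print(should_alert(SpawnWindow(successes=0, failures=6)))
    print(should_alert(SpawnWindow(successes=10, failures=1)))
```

Requiring a minimum number of attempts is one way to keep a single failed spawn on a quiet hub from paging anyone. Whether that trade-off is right for us is something to settle in the proposal.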
Proposal
Definition of done:
Out of scope:
Updates and actions
2023-03-28: @pnasrat updated this to be a Q2 goal