[alerting] Alerting for consistent server spawn failures on a hub #2267

damianavila · 2023-02-27T14:00:21Z

Automated alerts for server spawn failures

If a JupyterHub is consistently failing to start user pods, we should have an alert that reaches us before a user report does. Note that such failures are fairly rare, so the main purpose of the alert is to give us enough evidence that this is not happening right now.

Parent issue: #1804

Rationale: Spawn failure is the most common form of the ‘the hub is not working’ report that affects all users and requires immediate attention. The metric is already collected by Prometheus. Setting this up will also lay the groundwork for future alerts based on Prometheus metrics. It is also a useful alert because it fires on a specific user-facing symptom, rather than on causes that may have user-specific symptoms.

Proposal

Definition of done:

Define a way to add alerts for hubs and clusters
Create tunable alerts for ‘server failed to start’, with configurable thresholds
Write runbooks describing what people responding to these alerts should try, and how to navigate possible outages
Provide a method to intentionally cause failed spawns on a hub, so that the alert can be triggered for testing and training
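
To make the ‘tunable thresholds’ item concrete, here is a minimal sketch of the decision logic such an alert might encode. It is illustrative only: in practice this would more likely be written as a Prometheus alerting rule, and every name below (`SpawnAlertThresholds`, `should_alert`, `consistent_failure`, and the default values) is a hypothetical placeholder, not something decided in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnAlertThresholds:
    """Per-hub tunables. Names and defaults are assumptions for illustration."""
    min_attempts: int = 5           # ignore windows with too little spawn traffic
    max_failure_ratio: float = 0.5  # breach when more than this fraction fails


def should_alert(failed: int, total: int, thresholds: SpawnAlertThresholds) -> bool:
    """Return True when one time window breaches the failure-ratio threshold."""
    if total < thresholds.min_attempts:
        # Too few attempts to distinguish a real outage from noise.
        return False
    return failed / total > thresholds.max_failure_ratio


def consistent_failure(windows, thresholds, required_windows=3):
    """Fire only if each of the last `required_windows` windows breaches.

    `windows` is a list of (failed, total) spawn counts, oldest first.
    """
    recent = windows[-required_windows:]
    if len(recent) < required_windows:
        return False
    return all(should_alert(failed, total, thresholds) for failed, total in recent)
```

Requiring several consecutive breaching windows is one way to capture ‘consistent’ failures rather than a single bad spawn. Feeding synthetic `(failed, total)` counts into `consistent_failure` is also a cheap way to rehearse the alert condition before causing real failed spawns on a hub.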

Out of scope:

Paging people: for now, alerts should probably go to Slack, but nobody should be getting called or actively paged.

Updates and actions

2023-03-28: @pnasrat updated this to be a Q2 goal

yuvipanda · 2023-05-22T06:33:17Z

@pnasrat and I discussed moving this to next quarter, based on the team's capacity this quarter. I think we should! Thoughts, @damianavila?

damianavila · 2023-06-07T16:13:00Z

> Thoughts, @damianavila?

I concur with the move, although we should evaluate it alongside the other potential goals for Q3 and decide if we really want to include it for the next quarter.

damianavila mentioned this issue Feb 27, 2023

Q4 2022 Goal: Monitoring & alerting for our infrastructure #1804

Closed

6 tasks

damianavila added this to DEPRECATED Engineering and Product Backlog Feb 27, 2023

damianavila moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Feb 27, 2023

pnasrat self-assigned this Mar 28, 2023

pnasrat changed the title ~~Alerts for consistent server start failures on a hub~~ Q2 2023: Goal Alerting for consistent server spawn failures on a hub Mar 28, 2023

pnasrat changed the title ~~Q2 2023: Goal Alerting for consistent server spawn failures on a hub~~ [alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub Apr 3, 2023

damianavila added this to Organizational Backlog Apr 4, 2023

damianavila assigned yuvipanda Apr 12, 2023

choldgraf added this to Quarterly goals board Apr 24, 2023

pnasrat mentioned this issue May 23, 2023

Q2 goals checkpoint #2543

Closed

choldgraf moved this to Active goal in Quarterly goals board Jun 7, 2023

choldgraf changed the title ~~[alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub~~ [alerting] Alerting for consistent server spawn failures on a hub Jun 16, 2023

choldgraf moved this from Active goal to Proto-goal in Quarterly goals board Jun 16, 2023

yuvipanda removed their assignment Jun 20, 2023

pnasrat removed their assignment Jul 11, 2023
