
[alerting] Alerting for consistent server spawn failures on a hub #2267

Open · 4 tasks · Tracked by #1804
damianavila opened this issue Feb 27, 2023 · 2 comments

damianavila (Contributor) commented Feb 27, 2023

Automated alerts for server spawn failures

If a JupyterHub is somehow failing to start user pods consistently, we should have an alert that hopefully reaches us before a user does. Note that this is fairly rare; the use case is to provide enough evidence that this rare failure isn't happening right now.

Parent issue: #1804

Rationale: This is the most common way people report ‘the hub is not working’ in a way that affects all users and requires immediate attention. The relevant metric is already collected by Prometheus, so setting this up will also lay the groundwork for future alerts based on Prometheus metrics. It is also a useful alert because it fires on a specific user-facing symptom, rather than on causes that may produce user-specific symptoms.
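As a rough sketch of what such an alert could look like (not part of the original proposal), the failure counter from JupyterHub's built-in Prometheus metrics could be turned into an alerting rule. The metric name `jupyterhub_server_spawn_duration_seconds_count` and its `status="failure"` label are my reading of JupyterHub's exposed metrics, and the window and threshold below are placeholders to be tuned per hub:

```yaml
# Illustrative Prometheus alerting rule, not an agreed implementation.
# Metric name, labels, threshold, and window are assumptions to be verified
# against a hub's actual /metrics output.
groups:
  - name: jupyterhub-spawn-failures
    rules:
      - alert: ConsistentServerSpawnFailures
        # Fire when more than 5 spawns failed in the last 30 minutes,
        # sustained for at least 5 minutes.
        expr: |
          increase(jupyterhub_server_spawn_duration_seconds_count{status="failure"}[30m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JupyterHub in {{ $labels.namespace }} is failing to spawn user servers"
          description: "More than 5 server spawns failed in the last 30 minutes."
```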

Proposal

Definition of done:

  • Define a way to add alerts for hubs and clusters
  • Create tuneable alerts for ‘server failed to start’, with configurable thresholds
  • Write runbooks covering what people responding to these alerts should try, and how to navigate possible outages
  • Provide a method to intentionally cause failed spawns on a hub so that the alert fires, for testing and training (one possible approach is sketched after this list)
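For the last item, one possible approach (assuming a Zero to JupyterHub style deployment; this is an illustration, not a decided mechanism) is to point a test hub's single-user image at a tag that does not exist, so every spawn attempt fails with an image pull error and the alert above should fire:

```yaml
# Illustrative Helm values for a test hub only (assumes Zero to JupyterHub
# chart values; key nesting may differ depending on how the chart is wrapped).
singleuser:
  image:
    name: jupyter/base-notebook
    # A nonexistent tag makes every user pod fail to pull its image, so each
    # spawn attempt times out and is recorded as a failure by JupyterHub.
    tag: this-tag-does-not-exist
```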

Out of scope:

  • Paging people - for now, alerts should probably go to Slack; definitely nobody should be getting called or actively paged.

Updates and actions

2023-03-28: @pnasrat updated this to be a Q2 goal

@damianavila damianavila changed the title from the original long-form description to Alerts for consistent server start failures on a hub Feb 27, 2023
@damianavila damianavila moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Feb 27, 2023
@pnasrat pnasrat self-assigned this Mar 28, 2023
@pnasrat pnasrat changed the title Alerts for consistent server start failures on a hub Q2 2023: Goal Alerting for consistent server spawn failures on a hub Mar 28, 2023
@pnasrat pnasrat changed the title Q2 2023: Goal Alerting for consistent server spawn failures on a hub [alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub Apr 3, 2023
yuvipanda (Member) commented

@pnasrat and I discussed moving this to next quarter, based on the team's capacity this quarter. I think we should! Thoughts, @damianavila?

@choldgraf choldgraf moved this to Active goal in Quarterly goals board Jun 7, 2023
damianavila (Contributor, Author) commented

> Thoughts, @damianavila?

I concur with the move, although we should evaluate it alongside the other potential goals for Q3 and decide if we really want to include it for the next quarter.

@choldgraf choldgraf changed the title [alerting] Q2 2023: Goal Alerting for consistent server spawn failures on a hub [alerting] Alerting for consistent server spawn failures on a hub Jun 16, 2023
@choldgraf choldgraf moved this from Active goal to Proto-goal in Quarterly goals board Jun 16, 2023
@yuvipanda yuvipanda removed their assignment Jun 20, 2023
@pnasrat pnasrat removed their assignment Jul 11, 2023