Determine and expose cluster health #2029

Closed
8 tasks
vishalbollu opened this issue Mar 30, 2021 · 0 comments · Fixed by #2313
vishalbollu commented Mar 30, 2021

Add a new command and/or a new section to cortex cluster info that aggregates the health of Cortex processes.

A user might assume that everything is okay with the cluster when in fact a specific component is failing silently. For example, if Prometheus is not deployed correctly, the autoscaler and Grafana cannot work correctly either.

Here are a few resources that can be scanned to determine overall cluster health.

  • verify that all of the critical Cortex pods are running
  • batch and task crons should be running as expected
  • operator
  • prometheus
  • grafana
  • autoscaler
  • cluster autoscaler
  • events in Istio resources such as the service and load balancer

API autoscaler crons can be rolled into their respective API statuses.
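
As a rough illustration of the pod checks listed above, here is a minimal Python sketch that turns observed pod phases into a per-component health state. It is not Cortex's implementation: the `aggregate_health` function, the `CRITICAL_COMPONENTS` identifiers, the `{"component", "phase"}` input shape, and the "down" state are all assumptions made for the example. Only the phase name `Running` comes from Kubernetes itself.

```python
from collections import defaultdict

# Components named in this issue. The identifiers are illustrative and are
# not necessarily the labels Cortex uses for its pods.
CRITICAL_COMPONENTS = [
    "operator",
    "prometheus",
    "grafana",
    "autoscaler",
    "cluster-autoscaler",
]


def aggregate_health(pods):
    """Map each critical component to "live" or "down".

    `pods` is a list of {"component": str, "phase": str} dicts, a
    hypothetical shape standing in for data read from the Kubernetes API.
    """
    phases = defaultdict(list)
    for pod in pods:
        phases[pod["component"]].append(pod["phase"])

    health = {}
    for component in CRITICAL_COMPONENTS:
        observed = phases.get(component, [])
        # A component with no pods at all counts as down, so a deployment
        # that never came up is reported instead of being silently skipped.
        is_live = bool(observed) and all(p == "Running" for p in observed)
        health[component] = "live" if is_live else "down"
    return health
```

Treating a missing component as down is a deliberate choice in this sketch, since the failure mode described in the issue is a component that is absent or broken without anyone noticing.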

One potential design can be:

cortex cluster status

# operator: live
# prometheus: live
# grafana: live
# autoscaler: live
# (...)
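
The output proposed above could be rendered along these lines. This is only a sketch under assumptions: `format_status` and `exit_code` are hypothetical helper names, the "down" state is invented for the example, and the non-zero exit code on failure is a suggestion rather than part of the proposed design.

```python
def format_status(health):
    """Render one "<component>: <state>" line per component.

    `health` is a dict such as {"operator": "live", "prometheus": "down"}.
    Lines follow the dict's insertion order.
    """
    return "\n".join(f"{component}: {state}" for component, state in health.items())


def exit_code(health):
    """Return 0 when every component is live, 1 otherwise.

    A non-zero code would let scripts detect an unhealthy cluster without
    parsing the printed text. This is an assumption, not part of the issue.
    """
    return 0 if all(state == "live" for state in health.values()) else 1
```

For example, `print(format_status({"operator": "live", "prometheus": "down"}))` would print the two components on separate lines, and `exit_code` would return 1 for the same input.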
@RobertLucian RobertLucian added this to the v0.33 milestone Apr 2, 2021
@deliahu deliahu removed this from the v0.33 milestone Apr 13, 2021
@vishalbollu vishalbollu added research Determine technical constraints timecapped Assigned a limited amount of time labels Apr 28, 2021
@miguelvr miguelvr self-assigned this Jul 1, 2021
@miguelvr miguelvr mentioned this issue Jul 7, 2021
@deliahu deliahu added this to the v0.39 milestone Jul 20, 2021