Determine and expose cluster health #2029

vishalbollu · 2021-03-30T22:13:04Z

Add a new command and/or a new section to cortex cluster info that aggregates the health of Cortex processes.

A user might assume that everything is okay with the cluster when in fact a specific component is failing silently. For example, if Prometheus is not deployed correctly, the autoscaler and Grafana will not work correctly either.

Here are a few resources that can be scanned to determine overall cluster health.

API autoscaler crons can be rolled into their respective API statuses.

One potential design can be:

cortex cluster status

# operator: live
# prometheus: live
# grafana: live
# autoscaler: live
# (...)
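
To make the design above a bit more concrete, here is a minimal sketch of how per-component checks could be aggregated into that output. This is not the Cortex implementation: the names `collect_status`, `format_status`, and `broken_probe` are hypothetical, the checks are stand-in callables rather than real probes of the cluster, and the "not live" state is an assumption, since only "live" appears in the proposed output.

```python
from typing import Callable, Dict

# Hypothetical: each check returns True when its component is healthy.
HealthCheck = Callable[[], bool]


def collect_status(checks: Dict[str, HealthCheck]) -> Dict[str, str]:
    """Run every check; a check that raises counts as not live."""
    status = {}
    for name, check in checks.items():
        try:
            status[name] = "live" if check() else "not live"
        except Exception:
            # A failing probe should surface as unhealthy, not crash the command.
            status[name] = "not live"
    return status


def format_status(status: Dict[str, str]) -> str:
    """Render one 'component: state' line per component."""
    return "\n".join(f"{name}: {state}" for name, state in status.items())


def broken_probe() -> bool:
    # Stand-in for a component that fails silently, e.g. a bad deployment.
    raise ConnectionError("prometheus unreachable")


checks = {
    "operator": lambda: True,
    "prometheus": broken_probe,
    "grafana": lambda: True,
}
report = format_status(collect_status(checks))
print(report)
```

Catching exceptions per check means a single unreachable component is reported as unhealthy while the remaining components are still listed, which is the silent-failure case described above.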

RobertLucian added this to the v0.33 milestone Apr 2, 2021

deliahu removed this from the v0.33 milestone Apr 13, 2021

vishalbollu added the research (Determine technical constraints) and timecapped (Assigned a limited amount of time) labels Apr 28, 2021

miguelvr self-assigned this Jul 1, 2021

miguelvr mentioned this issue Jul 7, 2021: Cluster health command #2313 (Merged, 2 tasks)

miguelvr closed this as completed in #2313 Jul 9, 2021

deliahu added this to the v0.39 milestone Jul 20, 2021
