Add useful Operator metrics #690

Open
janhoy opened this issue Mar 6, 2024 · 0 comments

Since #307 we have generic Go runtime metrics, such as memory, GC, and threads.

Let's add application-level metrics for the operator itself that could be useful for Grafana dashboards and alerts. Suggestions:

  • Gauge of the number of currently managed CRD instances for SolrClouds, SolrBackups, and SolrPrometheusExporters
  • Gauge for CRDs currently in a failure state
  • Reconcile stats
    • Successful vs. failed reconcile events, broken down by kind of event
    • Number of pending operations in the reconcile queue (if there is such a thing)
  • Operation stats
    • For each operation type (install, upgrade, delete, backup, etc.), counts and status

The goal would be a simple Grafana dashboard where you can filter on namespace etc. to see raw operator health and, at a glance, whether any operations are in a failure state. Further filtering by labels such as SolrCloud name would show the number of failed operations against each cluster, and when they happened.
