Fine performance metrics: Meter currently-executing tasks #7677

Open
crusaderky opened this issue Mar 17, 2023 · 0 comments
Comments

@crusaderky
Collaborator

When looking at a real-time plot in Grafana or similar tools, it's useful to know what the cluster is currently doing.

In the Worker heartbeat, in addition to the metrics collected from Worker.digests_total_new, add one more metric: the coarse time of the tasks that are currently executing, as of the moment the heartbeat fired. Record it under the label ("execute", <prefix>, "currently-running", "seconds").

When a task finishes, subtract all time logged so far for that task. This may be less than the total task runtime, because only the time up to the latest heartbeat has been logged, so you'll need to keep track of the timestamp of the latest heartbeat. Whenever the worker is idle, "currently-running" must be zero by construction.
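The bookkeeping described above can be sketched roughly as follows. This is a minimal standalone illustration, not the actual Worker code: the class `CurrentlyRunningMeter` and its methods `task_started`, `heartbeat` and `task_finished` are hypothetical names, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from collections import defaultdict


class CurrentlyRunningMeter:
    """Hypothetical helper, not part of distributed's API."""

    def __init__(self):
        self.running = {}  # task key -> (prefix, start timestamp)
        self.last_heartbeat = None  # timestamp of the latest heartbeat
        self.gauge = defaultdict(float)  # label tuple -> seconds

    @staticmethod
    def _label(prefix):
        return ("execute", prefix, "currently-running", "seconds")

    def _logged(self, start):
        """Seconds already reported for a task that started at `start`."""
        if self.last_heartbeat is None or self.last_heartbeat <= start:
            return 0.0
        return self.last_heartbeat - start

    def task_started(self, key, prefix, now):
        self.running[key] = (prefix, now)

    def heartbeat(self, now):
        # Add only the time elapsed since the previous heartbeat
        # (or since the task started, whichever is later).
        for prefix, start in self.running.values():
            self.gauge[self._label(prefix)] += (now - start) - self._logged(start)
        self.last_heartbeat = now
        return dict(self.gauge)

    def task_finished(self, key):
        # Subtract everything logged so far for this task; time after the
        # latest heartbeat was never logged, so it is not subtracted.
        prefix, start = self.running.pop(key)
        self.gauge[self._label(prefix)] -= self._logged(start)
```

For example, a task that starts at t=2 contributes 3 seconds at a heartbeat at t=5 and 8 seconds at t=10; when it finishes, those 8 seconds are subtracted and the value returns to zero, which is why this behaves like a gauge rather than a counter.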

Note that, unlike all other metrics in Worker.digests_total, this is not monotonically increasing; in Prometheus terms, it's a Gauge, not a Counter.

This ticket solves a problem in Prometheus metrics where the cluster utilization plot over time looks "spiky" whenever you have tasks that start before a scrape point and finish after it. On Coiled, this means any task that lasts longer than 5s (the Prometheus scraping interval).

Of course, this fix is perfect only as long as you do not display the detail of the activity. Once you break down by activity (task-cpu, etc.) in Grafana you will see, each time a task finishes, a large positive spike in the actual activities (task-cpu, etc.) for that task, which may send the plot well above the number of threads on the cluster, and a matching negative spike in "currently-running".
This ticket does not aim to make the by-activity Grafana plot nicer than this.
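A small worked example may make the effect on the plots concrete. It is only a sketch under simplifying assumptions that are not in the ticket: a single task runs from t=2s to t=14s, scrapes happen at t=5, 10 and 15, a heartbeat fires exactly at each scrape, and the whole runtime is attributed to one activity counter (standing in for task-cpu).

```python
SCRAPES = [5, 10, 15]  # assumed scrape times, seconds
TASK_START, TASK_END = 2, 14  # assumed single task, 12 s long


def samples(t):
    """Return (activity counter, currently-running gauge) as scraped at t."""
    finished = t >= TASK_END
    # The activity counter only moves once the task has finished.
    counter = TASK_END - TASK_START if finished else 0
    # The gauge holds the time logged so far and drops to 0 on completion.
    gauge = 0 if finished or t < TASK_START else t - TASK_START
    return counter, gauge


def deltas(values):
    """Per-scrape increase, which is what a rate plot is built from."""
    return [b - a for a, b in zip([0] + values, values)]


counter_series = [samples(t)[0] for t in SCRAPES]
gauge_series = [samples(t)[1] for t in SCRAPES]
total_series = [c + g for c, g in zip(counter_series, gauge_series)]

print(deltas(counter_series))  # [0, 0, 12]  spike without the new metric
print(deltas(gauge_series))    # [3, 5, -8]  negative spike on completion
print(deltas(total_series))    # [3, 5, 4]   smooth, never above 5 s per scrape
```

The sum of the two series is smooth, while each series on its own shows the positive and negative spikes described above.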
