When looking at a real-time plot in Grafana or similar tools, it's useful to know what the cluster is currently doing.
In the Worker heartbeat, add one more metric to those collected from `Worker.digests_total_new`: the coarse time of the tasks that are currently executing, as of the moment the heartbeat fired, recorded under the label `("execute", <prefix>, "currently-running", "seconds")`.
When a task finishes, subtract all time logged so far for the task (which may be less than the total task runtime; you'll need to keep track of the timestamp of the latest heartbeat). Whenever the worker is idle, "currently-running" must be zero by construction.
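The bookkeeping described above could be sketched roughly as follows. This is a hypothetical standalone helper, not the actual Worker implementation; the class and method names are made up for illustration:

```python
import time


class RunningTaskTracker:
    """Sketch of the proposed bookkeeping: track per-task start times and
    the timestamp of the latest heartbeat, so that each heartbeat reports
    the time accrued by still-running tasks, and task completion emits a
    negative correction for everything already reported."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.task_started = {}  # task key -> start time
        self.last_heartbeat = clock()

    def start(self, key):
        self.task_started[key] = self.clock()

    def heartbeat_delta(self):
        """Increment for ("execute", <prefix>, "currently-running", "seconds"):
        time accrued since the previous heartbeat by tasks still running.
        Zero by construction when no tasks are executing."""
        now = self.clock()
        delta = sum(
            now - max(t0, self.last_heartbeat)
            for t0 in self.task_started.values()
        )
        self.last_heartbeat = now
        return delta

    def finish(self, key):
        """Negative correction on task completion: subtract all time logged
        so far for this task (start -> latest heartbeat), which may be less
        than the total task runtime."""
        t0 = self.task_started.pop(key)
        return -max(0.0, self.last_heartbeat - t0)
```

Note that the positive heartbeat increments and the negative correction on completion net out to zero, so once the task's full runtime lands in the regular `execute` counters, `currently-running` carries no residual time.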
Note that, unlike all other metrics in Worker.digests_total, this is not monotonically increasing; in Prometheus terms, it's a Gauge, not a Counter.
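To illustrate the Gauge-vs-Counter distinction, here is a minimal sketch using `prometheus_client` (metric names here are invented for the example, not the ones exported by distributed):

```python
from prometheus_client import REGISTRY, Counter, Gauge

# Counters only ever go up: suitable for cumulative task runtime.
task_seconds = Counter(
    "worker_task_seconds", "Cumulative task runtime", ["prefix", "activity"]
)
task_seconds.labels(prefix="inc", activity="thread-cpu").inc(1.5)

# Gauges can go up *and* down: suitable for currently-running seconds,
# which drops back when a task finishes and is zero on an idle worker.
currently_running = Gauge(
    "worker_tasks_currently_running_seconds",
    "Time accrued so far by currently executing tasks",
    ["prefix"],
)
currently_running.labels(prefix="inc").inc(3.0)
currently_running.labels(prefix="inc").dec(3.0)  # task finished
```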
This ticket solves a problem in Prometheus metrics where the cluster utilization plot over time looks "spikey" whenever you have tasks that start before a scrape point and finish after it. On Coiled, this means any task that lasts longer than 5s (the Prometheus scraping interval).
Of course, this fix is perfect only as long as you do not display the detail of the activity. Once you break down by activity (task-cpu, etc.) in Grafana you will see, for each task that just finished, a large positive spike in the actual activities (task-cpu, etc.), which may send the plot well above the number of threads on the cluster, and a matching negative spike in `currently-running`.
This ticket does not aim to make the by-activity Grafana plot nicer than this.