When looking at a real-time plot in Grafana or similar tools, it's useful to know what the cluster is currently doing.
In the Worker heartbeat, add one more metric to those collected from `Worker.digests_total_new`: the coarse time of the tasks that are currently executing, as of the moment the heartbeat fired, recorded under the label `("execute", <prefix>, "currently-running", "seconds")`.
When a task finishes, subtract all time logged so far for the task (which may be less than the total task runtime; you'll need to keep track of the timestamp of the latest heartbeat). Whenever the worker is idle, "currently-running" must be zero by construction.
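The bookkeeping described above could be sketched roughly as follows. This is a hypothetical standalone helper, not the actual Worker implementation; the class and method names are made up for illustration:

```python
import time


class RunningTaskTracker:
    """Sketch of the proposed bookkeeping: track per-task start times and
    the timestamp of the latest heartbeat, so that each heartbeat reports
    the time accrued by still-running tasks, and task completion emits a
    negative correction for everything already reported."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.task_started = {}  # task key -> start time
        self.last_heartbeat = clock()

    def start(self, key):
        self.task_started[key] = self.clock()

    def heartbeat_delta(self):
        """Increment for ("execute", <prefix>, "currently-running", "seconds"):
        time accrued since the previous heartbeat by tasks still running.
        Zero by construction when no tasks are executing."""
        now = self.clock()
        delta = sum(
            now - max(t0, self.last_heartbeat)
            for t0 in self.task_started.values()
        )
        self.last_heartbeat = now
        return delta

    def finish(self, key):
        """Negative correction on task completion: subtract all time logged
        so far for this task (start -> latest heartbeat), which may be less
        than the total task runtime."""
        t0 = self.task_started.pop(key)
        return -max(0.0, self.last_heartbeat - t0)
```

Note that the positive heartbeat increments and the negative correction on completion net out to zero, so once the task's full runtime lands in the regular `execute` counters, `currently-running` carries no residual time.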
Note that, unlike all other metrics in Worker.digests_total, this is not monotonically increasing; in Prometheus terms, it's a Gauge, not a Counter.
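To illustrate the Gauge-vs-Counter distinction, here is a minimal sketch using `prometheus_client` (metric names here are invented for the example, not the ones exported by distributed):

```python
from prometheus_client import REGISTRY, Counter, Gauge

# Counters only ever go up: suitable for cumulative task runtime.
task_seconds = Counter(
    "worker_task_seconds", "Cumulative task runtime", ["prefix", "activity"]
)
task_seconds.labels(prefix="inc", activity="thread-cpu").inc(1.5)

# Gauges can go up *and* down: suitable for currently-running seconds,
# which drops back when a task finishes and is zero on an idle worker.
currently_running = Gauge(
    "worker_tasks_currently_running_seconds",
    "Time accrued so far by currently executing tasks",
    ["prefix"],
)
currently_running.labels(prefix="inc").inc(3.0)
currently_running.labels(prefix="inc").dec(3.0)  # task finished
```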
This ticket solves a problem in Prometheus metrics where the cluster utilization plot over time looks "spikey" whenever you have tasks that start before a scrape point and finish after it. On Coiled, this means any task that lasts longer than 5s (the Prometheus scraping interval).
Of course, this fix is perfect only as long as you do not display the detail of the activity. Once you break down by activity (task-cpu, etc.) in Grafana you will see, for each task that just finished, a large positive spike in the actual activities (task-cpu, etc.), which may send the plot well above the number of threads on the cluster, and a matching negative spike in `currently-running`.
This ticket does not aim to make the by-activity Grafana plot nicer than this.