Fine performance metrics: Break down idle time on the Worker #7671

crusaderky · 2023-03-17T16:39:03Z

Part of Fine performance metrics meta-issue #7665

As of #7586, the sum of Worker.digests_total[("execute", *, *, "seconds")] is equal to the time spent executing tasks, multiplied by the number of threads on the worker.

There's a big chunk of extra time that is not counted, which is:

Time the worker spent with idle threads because it was paused.

Worker.digests_total["execute", "n/a", "paused", "seconds"] = min(
   number of threads - number of tasks in executing/cancelled/resumed state,
   number of tasks in ready or constrained state,
) * T

where T is the time between changes in any of the variables in the formula.
Note that we would not record the task prefix here, unlike in other execute digests.
Note that a worker may be paused and not accrue any paused time, because there are tasks still running.

Time the worker spent with idle threads because of resource limits.

Worker.digests_total["execute", "n/a", "constrained", "seconds"] = max(0, min(
   number of threads - number of tasks in executing/cancelled/resumed state,
   number of tasks in constrained state
) * T - paused time)

Time the worker spent with idle threads because it was fetching data. This is different from the sum of Worker.digests_total[("gather-dep", *, "seconds")], as it should exclude the time when dependency gathering and execution were properly pipelined. In other words, this time should be defined as

Worker.digests_total["execute", "n/a", "gather-dep", "seconds"] = max(0, min(
   number of threads - number of tasks in executing/cancelled/resumed state,
   number of tasks in waiting state,
) * T
- paused time 
- constrained time
)

Time the worker spent with idle threads because it was waiting for more content from the scheduler.
This should be defined as

Worker.digests_total["execute", "n/a", "idle", "seconds"] = max(0,
    (number of threads - number of tasks in executing/cancelled/resumed state) * T
    - paused time
    - constrained time
    - gather time
)
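The four draft formulas above can be sketched for a single interval T as follows. This is only an illustration, not code from distributed: `Snapshot` and `idle_breakdown` are hypothetical names, and the sketch assumes that paused time accrues only while the worker is actually paused, which the paused formula implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical worker state that stays constant for `seconds` (T)."""

    nthreads: int
    n_running: int      # tasks in executing/cancelled/resumed state
    n_ready: int
    n_constrained: int
    n_waiting: int
    paused: bool
    seconds: float


def idle_breakdown(s: Snapshot) -> dict:
    """Split the idle thread-seconds of one interval into the four buckets."""
    free = max(0, s.nthreads - s.n_running)  # threads not running a task
    # Assumption: no paused time is recorded unless the worker is paused.
    paused = min(free, s.n_ready + s.n_constrained) * s.seconds if s.paused else 0.0
    constrained = max(0.0, min(free, s.n_constrained) * s.seconds - paused)
    gather = max(0.0, min(free, s.n_waiting) * s.seconds - paused - constrained)
    idle = max(0.0, free * s.seconds - paused - constrained - gather)
    return {
        "paused": paused,
        "constrained": constrained,
        "gather-dep": gather,
        "idle": idle,
    }


example = Snapshot(nthreads=4, n_running=1, n_ready=0, n_constrained=1,
                   n_waiting=3, paused=False, seconds=2.0)
print(idle_breakdown(example))
# {'paused': 0.0, 'constrained': 2.0, 'gather-dep': 4.0, 'idle': 0.0}
```

Because each bucket subtracts the ones before it, the four values always add up to (idle threads * T) in this sketch. When reviewing the draft, note that the subtraction is not per thread: with one constrained task and one waiting task, the constrained bucket absorbs the first thread's worth of time and the gather-dep bucket gets nothing.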

With the above additions, the sum of Worker.digests_total[("execute", *, *, "seconds")] should accumulate to (number of threads * worker uptime) by construction when there are no tasks currently running.
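The claim that the totals add up to (number of threads * worker uptime) can be checked on a toy timeline. The sketch below is a hypothetical simulation, not the Worker implementation: the timeline, `NTHREADS`, and `accumulate` are made up for illustration, execute time is approximated as (running tasks * T), and paused time is assumed to accrue only while the worker is paused.

```python
from collections import defaultdict

NTHREADS = 2  # hypothetical worker with two threads

# Hypothetical timeline of state changes:
# (timestamp, paused, running, ready, constrained, waiting).
# Each row holds from its timestamp until the next row's timestamp,
# so T is the gap between consecutive rows.
TIMELINE = [
    (0.0, False, 0, 0, 0, 0),  # nothing to do
    (1.0, False, 1, 0, 0, 2),  # one task running, two waiting on dependencies
    (3.0, False, 2, 0, 0, 0),  # both threads busy
    (6.0, True, 1, 1, 0, 0),   # paused: one task still running, one ready
    (7.0, False, 0, 0, 0, 0),  # end marker
]


def accumulate(timeline, nthreads):
    """Sum thread-seconds per activity across all intervals of the timeline."""
    totals = defaultdict(float)
    for cur, nxt in zip(timeline, timeline[1:]):
        t0, paused, running, ready, constrained, waiting = cur
        T = nxt[0] - t0
        free = max(0, nthreads - running)
        p = min(free, ready + constrained) * T if paused else 0.0
        c = max(0.0, min(free, constrained) * T - p)
        g = max(0.0, min(free, waiting) * T - p - c)
        i = max(0.0, free * T - p - c - g)
        totals["execute"] += running * T
        totals["paused"] += p
        totals["constrained"] += c
        totals["gather-dep"] += g
        totals["idle"] += i
    return dict(totals)


totals = accumulate(TIMELINE, NTHREADS)
uptime = TIMELINE[-1][0] - TIMELINE[0][0]
print(totals)
# {'execute': 9.0, 'paused': 1.0, 'constrained': 0.0, 'gather-dep': 2.0, 'idle': 2.0}
assert sum(totals.values()) == NTHREADS * uptime
```

In the paused interval only one thread-second is recorded as paused, because the other thread is still running a task. This matches the note that a paused worker does not necessarily accrue paused time on all of its threads.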

The above formulas are a very quick draft and should be reviewed for correctness.

crusaderky · 2023-06-21T15:04:11Z

#7938 introduces a metric on the spans that measures the total of the idle time. This is superior to just beaming the same information from the workers as it is aware of tasks running everywhere on the scheduler. This issue should be reviewed/redesigned in light of this.

crusaderky added the diagnostics label Mar 17, 2023

This was referenced Mar 17, 2023

Fine performance metrics: Break down idle time on the Scheduler #7672

Open

Fine performance metrics: client context manager #7667

Open

Fine performance metrics meta-issue #7665

Open

This was referenced May 22, 2023

Worker crash causes computations to overlap #7825

Open

Fine performance metrics: apportion to Computations #7776

Closed
