
Meter how long each task prefix stays in each state #7560

Merged (8 commits) Feb 22, 2023

Conversation

@crusaderky (Collaborator) commented Feb 20, 2023

Add task metrics:

  1. current number of tasks, per task prefix, per task state (on worker), per worker
  2. cumulative elapsed time, per task prefix, per task state (on worker), per worker

This differs from similar metrics on the scheduler because it offers much higher granularity of states, e.g. it differentiates between fetch, flight, and no-worker and between waiting and executing.
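The two metrics above can be sketched as a small meter object. This is a minimal illustration, not the actual implementation: it assumes a `transition(prefix, old_state, new_state)` hook called on every task state change, and the class and attribute names are invented for the example.

```python
import time
from collections import defaultdict


class TaskStateMeter:
    """Sketch of per-(prefix, state) task metering.

    Assumes a ``transition()`` hook fires on every task state change;
    names are illustrative, not the actual distributed WorkerState API.
    """

    def __init__(self):
        # metric 1: current number of tasks per (prefix, state)
        self.current_count = defaultdict(int)
        # metric 2: cumulative seconds spent per (prefix, state)
        self.cumulative_elapsed = defaultdict(float)
        self._previous_ts = None

    def transition(self, prefix, old_state=None, new_state=None):
        now = time.monotonic()
        if self._previous_ts is not None:
            # Every task that sat in a state since the previous transition
            # contributes the elapsed wall time to that state's total
            elapsed = now - self._previous_ts
            for k, n_tasks in self.current_count.items():
                self.cumulative_elapsed[k] += elapsed * n_tasks
        self._previous_ts = now

        if old_state is not None:
            self.current_count[prefix, old_state] -= 1
            if not self.current_count[prefix, old_state]:
                del self.current_count[prefix, old_state]
        if new_state is not None:
            self.current_count[prefix, new_state] += 1
```

Because elapsed time is charged to *all* currently tracked (prefix, state) pairs before the counts are updated, a single pass suffices on each transition.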

This should allow us, for example, to:

  • identify tasks that sit for an abnormal amount of time in waiting state
  • identify tasks that take an abnormal amount of time to transfer over the network
  • spot tasks that should be transitory but turn out to sit in memory for a long time (and could potentially benefit from dask.graph_manipulation.clone)
  • notice when the network is saturated (tasks sit in fetch state for a long time)
  • measure how long it takes for the cluster to heal after the loss of a worker (tasks sit in no-worker state)
  • measure how long tasks sit in cancelled state before they are cleaned up, releasing resources

As discussed offline, these new metrics are not exported to prometheus for the time being over concerns about cardinality. An opt-in switch is likely in the future.

This PR also introduces cancelled, resumed, released and error task states in Prometheus worker metrics and removes the "other" state.
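The Prometheus side of that change might look roughly like the following sketch using prometheus_client. The metric and label names here are assumptions for illustration, not necessarily those exported by distributed.http.worker.prometheus.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Hypothetical metric name; the point is that the per-state gauge now
# reports cancelled/resumed/released/error explicitly instead of
# lumping them into a catch-all "other" state.
tasks = Gauge(
    "dask_worker_tasks",
    "Number of tasks at this worker, by state",
    ["state"],
    registry=registry,
)

for state, n in {
    "executing": 3,
    "waiting": 5,
    "cancelled": 1,
    "resumed": 0,
    "released": 2,
    "error": 0,
}.items():
    tasks.labels(state=state).set(n)
```

A fixed, explicit set of states keeps the label cardinality bounded, which is why the per-prefix variants of these metrics are not exported yet.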

@crusaderky crusaderky self-assigned this Feb 20, 2023
# This happens exclusively on a transition from cancelled(flight) to
# resumed(flight->waiting) of a task with dependencies; the dependencies
# will remain in released state and never transition to anything else.
self._current_count[ts.prefix, ts.state] += 1
@crusaderky (author): I find it quite scary that this code path is tripped by exactly one test in the whole test suite (test_deadlock_cancelled_after_inflight_before_gather_from_worker).

Reply (Member): Sounds like we're not handling that edge case well.

@github-actions bot commented Feb 20, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 24 files (±0), 24 suites (±0) — 10h 4m 50s ⏱️ (−14m 38s)
  • 3,337 tests (−1): 3,235 passed (−2), 100 skipped (±0), 2 failed (+1)
  • 39,343 runs (−12): 37,516 passed (−12), 1,825 skipped (−1), 2 failed (+1)

For more details on these failures, see this check.

Results for commit 47a3822. ± Comparison against base commit 1924e65.


@crusaderky crusaderky force-pushed the prometheus-gather-dep branch from 6d9b8c5 to 4e542c9 Compare February 20, 2023 15:37
@crusaderky crusaderky force-pushed the prometheus-gather-dep branch from b9743e0 to bad7b42 Compare February 20, 2023 15:57
@hendrikmakait hendrikmakait self-requested a review February 20, 2023 17:13
@hendrikmakait (Member) left a comment:

Generally LGTM, thanks @crusaderky! As discussed offline, exposing the times in each state as histograms might also be very interesting and potentially more valuable. That definitely wouldn't help with cardinality, though.

distributed/http/worker/prometheus/core.py (outdated; resolved)
Comment on lines 3829 to 3831
if self._previous_ts is not None:
for k, n_tasks in self._current_count.items():
self._cumulative_elapsed[k] += elapsed * n_tasks
@hendrikmakait (Member) commented Feb 21, 2023:

Nit for simplicity: how about we move this before the previous for-loop, which would then allow us to resolve count deltas in a single loop? I.e., we could skip https://github.com/dask/distributed/pull/7560/files#diff-ec0688ae38a83ef9dbd910985d5ea7f4a890630c2e8482dccfe781c50c0d94c6R3819-R3820 and merge the previous for-loop with the following one (I may be missing something here).

@crusaderky (author): Thanks for the tip! Now it's much simpler.

@crusaderky crusaderky merged commit 10fb727 into dask:main Feb 22, 2023
@crusaderky crusaderky deleted the prometheus-gather-dep branch February 22, 2023 13:18
Successfully merging this pull request may close these issues:

  • Better instrumentation for Worker.gather_dep
  • Prometheus: "other" tasks count is confusing