
Meter how long each task prefix stays in each state #7560

Merged (8 commits) Feb 22, 2023

Conversation

@crusaderky (Collaborator) commented Feb 20, 2023

Add task metrics:

  1. current number of tasks, per task prefix, per task state (on worker), per worker
  2. cumulative elapsed time, per task prefix, per task state (on worker), per worker

This differs from similar metrics on the scheduler because it offers much higher granularity of states, e.g. it differentiates between fetch, flight, and no-worker and between waiting and executing.
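The two metrics above can be sketched as a small meter object. This is a minimal illustration, not the actual implementation: it assumes a `transition(prefix, old_state, new_state)` hook called on every task state change, and the class and attribute names are invented for the example.

```python
import time
from collections import defaultdict


class TaskStateMeter:
    """Sketch of per-(prefix, state) task metering.

    Assumes a ``transition()`` hook fires on every task state change;
    names are illustrative, not the actual distributed WorkerState API.
    """

    def __init__(self):
        # metric 1: current number of tasks per (prefix, state)
        self.current_count = defaultdict(int)
        # metric 2: cumulative seconds spent per (prefix, state)
        self.cumulative_elapsed = defaultdict(float)
        self._previous_ts = None

    def transition(self, prefix, old_state=None, new_state=None):
        now = time.monotonic()
        if self._previous_ts is not None:
            # Every task that sat in a state since the previous transition
            # contributes the elapsed wall time to that state's total
            elapsed = now - self._previous_ts
            for k, n_tasks in self.current_count.items():
                self.cumulative_elapsed[k] += elapsed * n_tasks
        self._previous_ts = now

        if old_state is not None:
            self.current_count[prefix, old_state] -= 1
            if not self.current_count[prefix, old_state]:
                del self.current_count[prefix, old_state]
        if new_state is not None:
            self.current_count[prefix, new_state] += 1
```

Because elapsed time is charged to *all* currently tracked (prefix, state) pairs before the counts are updated, a single pass suffices on each transition.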

This should allow us, for example, to:

  • identify tasks that sit for an abnormal amount of time in waiting state
  • identify tasks that take an abnormal amount of time to transfer over the network
  • spot tasks that should be transitory but turn out to sit in memory for a long time (and could potentially benefit from dask.graph_manipulation.clone)
  • notice when the network is saturated (tasks sit in fetch state for a long time)
  • measure how long it takes for the cluster to heal after the loss of a worker (tasks sit in no-worker state)
  • measure how long tasks sit in cancelled state before they are cleaned up, releasing resources

As discussed offline, these new metrics are not exported to prometheus for the time being over concerns about cardinality. An opt-in switch is likely in the future.

This PR also introduces cancelled, resumed, released and error task states in Prometheus worker metrics and removes the "other" state.
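The Prometheus side of that change might look roughly like the following sketch using prometheus_client. The metric and label names here are assumptions for illustration, not necessarily those exported by distributed.http.worker.prometheus.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Hypothetical metric name; the point is that the per-state gauge now
# reports cancelled/resumed/released/error explicitly instead of
# lumping them into a catch-all "other" state.
tasks = Gauge(
    "dask_worker_tasks",
    "Number of tasks at this worker, by state",
    ["state"],
    registry=registry,
)

for state, n in {
    "executing": 3,
    "waiting": 5,
    "cancelled": 1,
    "resumed": 0,
    "released": 2,
    "error": 0,
}.items():
    tasks.labels(state=state).set(n)
```

A fixed, explicit set of states keeps the label cardinality bounded, which is why the per-prefix variants of these metrics are not exported yet.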

@crusaderky crusaderky self-assigned this Feb 20, 2023
# This happens exclusively on a transition from cancelled(flight) to
# resumed(flight->waiting) of a task with dependencies; the dependencies
# will remain in released state and never transition to anything else.
self._current_count[ts.prefix, ts.state] += 1
@crusaderky (author): I find it quite scary that this code path is tripped by exactly one test in the whole test suite (test_deadlock_cancelled_after_inflight_before_gather_from_worker).

Reply (Member): Sounds like we're not handling that edge case well.

@github-actions bot commented Feb 20, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 24 files (±0), 24 suites (±0) — 10h 4m 50s ⏱️ (−14m 38s)
  • 3,337 tests (−1): 3,235 passed (−2), 100 skipped (±0), 2 failed (+1)
  • 39,343 runs (−12): 37,516 passed (−12), 1,825 skipped (−1), 2 failed (+1)

For more details on these failures, see this check.

Results for commit 47a3822. ± Comparison against base commit 1924e65.


@crusaderky crusaderky force-pushed the prometheus-gather-dep branch from 6d9b8c5 to 4e542c9 Compare February 20, 2023 15:37
@crusaderky crusaderky force-pushed the prometheus-gather-dep branch from b9743e0 to bad7b42 Compare February 20, 2023 15:57
@hendrikmakait hendrikmakait self-requested a review February 20, 2023 17:13
@hendrikmakait (Member) left a comment:

Generally LGTM, thanks @crusaderky! As discussed offline, exposing the times in each state as histograms might also be very interesting and potentially more valuable. That definitely wouldn't help with cardinality, though.

distributed/http/worker/prometheus/core.py (outdated; resolved)
Comment on lines 3829 to 3831
if self._previous_ts is not None:
for k, n_tasks in self._current_count.items():
self._cumulative_elapsed[k] += elapsed * n_tasks
@hendrikmakait (Member) commented Feb 21, 2023:

Nit for simplicity: how about we move this before the previous for-loop, which would then allow us to resolve count deltas in a single loop? I.e., we could skip https://github.com/dask/distributed/pull/7560/files#diff-ec0688ae38a83ef9dbd910985d5ea7f4a890630c2e8482dccfe781c50c0d94c6R3819-R3820 and merge the previous for-loop with the following one (I may be missing something here).

@crusaderky (author): Thanks for the tip! Now it's much simpler.

@crusaderky crusaderky merged commit 10fb727 into dask:main Feb 22, 2023
@crusaderky crusaderky deleted the prometheus-gather-dep branch February 22, 2023 13:18
Successfully merging this pull request may close these issues:

  • Better instrumentation for Worker.gather_dep
  • Prometheus: "other" tasks count is confusing