Meter how long each task prefix stays in each state #7560
# This happens exclusively on a transition from cancelled(flight) to
# resumed(flight->waiting) of a task with dependencies; the dependencies
# will remain in released state and never transition to anything else.
self._current_count[ts.prefix, ts.state] += 1
I find it quite scary that this code path is tripped by exactly one test in the whole test suite (test_deadlock_cancelled_after_inflight_before_gather_from_worker).
Sounds like we're not handling that edge case well.
Unit Test Results: see the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 24 files ±0, 24 suites ±0, 10h 4m 50s ⏱️ -14m 38s. For more details on these failures, see this check. Results for commit 47a3822. ± Comparison against base commit 1924e65. ♻️ This comment has been updated with latest results.
Generally LGTM, thanks @crusaderky! As discussed offline, exposing the times in each state as histograms might also be very interesting and potentially more valuable. This definitely wouldn't help with cardinality though.
distributed/worker_state_machine.py
Outdated
if self._previous_ts is not None:
    for k, n_tasks in self._current_count.items():
        self._cumulative_elapsed[k] += elapsed * n_tasks
nit for simplicity: How about we move this before the previous for-loop, which would then allow us to resolve count deltas in a single loop? I.e., we could skip https://github.com/dask/distributed/pull/7560/files#diff-ec0688ae38a83ef9dbd910985d5ea7f4a890630c2e8482dccfe781c50c0d94c6R3819-R3820 and merge the previous for-loop with the following one (I may be missing something here).
Thanks for the tip! Now it's much simpler
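For readers following along, the accumulation pattern under review can be sketched roughly as below. This is only an illustration, not the code from the PR: the class name `StateTimeMeter`, the `transition` and `cumulative_elapsed` methods, and the injectable `clock` parameter are all hypothetical. Only the attribute names `_previous_ts`, `_current_count`, and `_cumulative_elapsed` mirror the snippet quoted above. The idea is to first charge the time elapsed since the previous update to every (prefix, state) pair, weighted by the number of tasks currently in it, and only then apply the count changes.

```python
from collections import defaultdict
from time import monotonic


class StateTimeMeter:
    """Hypothetical sketch; not the actual distributed implementation."""

    def __init__(self, clock=monotonic):
        self._clock = clock  # injectable so the sketch can be tested
        self._previous_ts = None
        self._current_count = defaultdict(int)        # (prefix, state) -> n tasks
        self._cumulative_elapsed = defaultdict(float)  # (prefix, state) -> task-seconds

    def _tick(self):
        """Charge the time since the last update to every occupied state."""
        now = self._clock()
        if self._previous_ts is not None:
            elapsed = now - self._previous_ts
            for key, n_tasks in self._current_count.items():
                self._cumulative_elapsed[key] += elapsed * n_tasks
        self._previous_ts = now

    def transition(self, prefix, start, finish):
        """Record one task of `prefix` moving from `start` to `finish`.

        Pass start=None for a new task and finish=None for a forgotten one.
        """
        # Account for elapsed time *before* changing the counts.
        self._tick()
        if start is not None:
            self._current_count[prefix, start] -= 1
            if self._current_count[prefix, start] == 0:
                del self._current_count[prefix, start]
        if finish is not None:
            self._current_count[prefix, finish] += 1

    def cumulative_elapsed(self):
        """Return a snapshot of task-seconds per (prefix, state)."""
        self._tick()
        return dict(self._cumulative_elapsed)
```

Because the elapsed time is charged before the counts change, a single pass over the counts is enough per update, which is the simplification suggested in the review comment.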
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Worker.gather_dep #7217
Add task metrics:
This differs from similar metrics on the scheduler because it offers much higher granularity of states, e.g. it differentiates between fetch, flight, and no-worker and between waiting and executing.
This should allow us, for example, to:
- [...] (... dask.graph_manipulation.clone)
- [...] (... fetch state for a long time)
- [...] (... no-worker state)
- [...] (... cancelled state before they are cleaned up, releasing resources)

As discussed offline, these new metrics are not exported to Prometheus for the time being over concerns about cardinality. An opt-in switch is likely in the future.
This PR also introduces cancelled, resumed, released and error task states in Prometheus worker metrics and removes the "other" state.