Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose transfer-related metrics in Worker.get_metrics and WorkerMetricCollector #6936

Merged
merged 8 commits into from
Aug 31, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Aug 23, 2022

Closes #6892

  • Adds transfer-related metrics in Worker.get_metrics and WorkerMetricCollector
  • Adds testing for transfer_incoming_* metrics

Out of scope:

  • Dashboard component for transfer_incoming_bytes and transfer_outgoing_bytes to be added in a follow-up PR

Notes:

  • outgoing metrics appear to be rather awkward to test since the timing needs to be "just right".
  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Copy link
Contributor

github-actions bot commented Aug 23, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  3 files   -        12    3 suites   - 12   0s ⏱️ - 6h 37m 28s
21 tests  -   3 032  18 ✔️  -   2 950  1 💤  -   83  2 +1 
63 runs   - 22 522  54 ✔️  - 21 547  3 💤  - 980  6 +5 

For more details on these failures, see this check.

Results for commit f247b24. ± Comparison against base commit 817ead3.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait marked this pull request as ready for review August 24, 2022 10:03
@hendrikmakait hendrikmakait self-assigned this Aug 25, 2022
@@ -974,6 +974,7 @@ async def get_metrics(self) -> dict:
"memory": spilled_memory,
"disk": spilled_disk,
},
comm_reserved_bytes=self.state.comm_nbytes,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that on line 967 above,

in_flight=self.state.in_flight_tasks_count,

This is also strictly the number of incoming in-flight tasks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outgoing traffic can be measured:

  • Worker.outgoing_current_count for number of outgoing connections
  • there is no measure for outgoing number of bytes. However it's straightforward to add it to Worker.get_data:
    nbytes = {k: self.state.tasks[k].nbytes for k in data if k in self.state.tasks}

    (add to a total counter after the above line and remove the same amount in a finally clause)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: it's OK to leave metering of outgoing traffic to a different PR.
However, within the scope of this PR, make it clear through naming that the "in flight tasks" and nbytes are specifically about incoming traffic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added transfer_outgoing_bytes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding in_flight: Can we safely rename this?

@crusaderky
Copy link
Collaborator

crusaderky commented Aug 26, 2022

@hendrikmakait

To recap our meeting:

  • rename WorkerState.comm_nbytes to comm_incoming_bytes.
    No deprecation in WorkerState needed; just change the already existing DeprecatedWorkerStateAttribute in Worker
  • new property WorkerState.comm_incoming_count: return len(self.in_flight_workers)
  • rename Worker.outgoing_current_count to comm_outgoing_count. Add a deprecated property (read-only).
  • new attribute Worker.comm_outgoing_bytes
  • rename Worker.incoming_count to comm_incoming_cumulative_count with a deprecated accessor
  • rename Worker.outgoing_count to comm_outgoing_cumulative_count with a deprecated accessor
  • heartbeat will now return
comm:
  incoming_bytes: #
  incoming_count: #
  incoming_cumulative_count: #
  outgoing_bytes: #
  outgoing_count: #
  outgoing_cumulative_count: #

@crusaderky
Copy link
Collaborator

[EDIT] let's not drop the cumulative counter; they're free

@crusaderky
Copy link
Collaborator

crusaderky commented Aug 29, 2022

Updated naming after conversation on #6933 (comment)
This supersedes my previous post.

main new notes
WorkerState.comm_nbytes WorkerState.transfer_incoming_bytes_current
(new property) WorkerState.transfer_incoming_count_current len(self.in_flight_workers)
WorkerState.total_out_connections WorkerState.transfer_incoming_count_max
WorkerState.comm_threshold_bytes WorkerState.transfer_incoming_bytes_max
WorkerState.target_message_size (do not change? please discuss)
Worker.outgoing_current_count Worker.transfer_outgoing_count_current
(new attribute) Worker.transfer_outgoing_bytes_current
Worker.incoming_count WorkerState.transfer_incoming_count_cumulative Moved to WorkerState
Worker.outgoing_count Worker.transfer_outgoing_count_cumulative
Worker.max_connections Worker.transfer_outgoing_count_max

Heartbeat will return

transfer:
    incoming_bytes_current: ...
    etc.

@crusaderky
Copy link
Collaborator

Note that the above list does not include

  • transfer_incoming_bytes_cumulative
  • transfer_outgoing_bytes_cumulative

They would be new attributes, but I'd rather leave them for a future PR if and when we realize they're actually useful.
Their meaning would also be non-trivial - are they

  • the bytes requested, or
  • the bytes actually received, or
  • the bytes actually received so far + incoming_current + an adjustment of actual vs. expected at the end of every transfer?

@hendrikmakait
Copy link
Member Author

I will WorkerState.comm_threshold_bytes to WorkerState.transfer_incoming_size_throttle_threshold, as it is used as a minimum size of individual transfers to be affected by throttling.

@crusaderky
Copy link
Collaborator

I will WorkerState.comm_threshold_bytes to WorkerState.transfer_incoming_size_throttle_threshold, as it is used as a minimum size of individual transfers to be affected by throttling.

did you mean transfer_incoming_bytes_throttle_threshold? +1 from me

@hendrikmakait
Copy link
Member Author

Regarding WorkerState.target_message_size: We could rename this to WorkerState.transfer_(incoming_)message_target_bytes but I don't have strong opinions on that particular variable.

@crusaderky
Copy link
Collaborator

Regarding WorkerState.target_message_size: We could rename this to WorkerState.transfer_(incoming_)message_target_bytes but I don't have strong opinions on that particular variable.

this makes sense

@hendrikmakait
Copy link
Member Author

Regarding WorkerState.target_message_size: We could rename this to WorkerState.transfer_(incoming_)message_target_bytes but I don't have strong opinions on that particular variable.

this makes sense

Then let's go with WorkerState.transfer_message_target_bytes for brevity as there is currently no conflicting attribute for outgoing messages.

@hendrikmakait
Copy link
Member Author

hendrikmakait commented Aug 30, 2022

Waiting for #6933 to get merged before giving this a once-over with fresh names.

@hendrikmakait hendrikmakait marked this pull request as draft August 31, 2022 07:48
@hendrikmakait hendrikmakait changed the title Expose comm_nbytes in Worker.get_metrics and WorkerMetricCollector Expose transfer-related metrics in Worker.get_metrics and WorkerMetricCollector Aug 31, 2022
@fjetter
Copy link
Member

fjetter commented Aug 31, 2022

Waiting for #6936 to get merged before giving this a once-over with fresh names.

Damn deadlocks everywhere...

you're self referencing this issue ;)

@hendrikmakait
Copy link
Member Author

Damn deadlocks everywhere...

you're self referencing this issue ;)

I guess we need to start linting comments, I don't see another way.

@hendrikmakait hendrikmakait marked this pull request as ready for review August 31, 2022 11:45
distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
hendrikmakait and others added 3 commits August 31, 2022 18:52
Co-authored-by: crusaderky <crusaderky@gmail.com>
@crusaderky crusaderky merged commit bfc5cfe into dask:main Aug 31, 2022
@crusaderky
Copy link
Collaborator

ty!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics for worker communications
3 participants