-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: expand on runtime metrics #4373
Conversation
26e60bd
to
60c7152
Compare
This patch does some refactoring to the current-thread scheduler bringing it closer to the structure of the multi-threaded scheduler. More specifically, the core scheduler data is stored in a Core struct and that struct is passed around as a "token" indicating permission to do work. The Core structure is also stored in the thread-local context. This refactor is intended to support #4373, making it easier to track counters in more locations in the current-thread scheduler.
Re-applies #4377 and fixes the bug resulting in Hyper's double panic. Revert: #4394 Original PR: This PR does some refactoring to the current-thread scheduler bringing it closer to the structure of the multi-threaded scheduler. More specifically, the core scheduler data is stored in a Core struct and that struct is passed around as a "token" indicating permission to do work. The Core structure is also stored in the thread-local context. This refactor is intended to support #4373, making it easier to track counters in more locations in the current-thread scheduler. I tried to keep commits small, but the "set Core in thread-local context" is both the biggest commit and the key one.
04230c9
to
7c71ee1
Compare
@Darksonn @hawkw @LucioFranco I'm marking this as ready to review. There is more work to do on docs, but we can do that in follow-up PRs. The feature is still marked as unstable and others are waiting for this PR to land to do work in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see all of those tests.
# Technically, removing this is a breaking change even though it only ever did | ||
# anything with the unstable flag on. It is probably safe to get rid of it after | ||
# a few releases. | ||
stats = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why shouldn't this continue to be a feature flag? For machines without 64-bit atomics, the stats involve locking a bunch of mutexes quite a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If we rename to
metrics
we will need to change the feature flag. While unstable, I'm not worried about it since technically adding a feature flag is part of the public API. - We can always do runtime enabling / disabling of metrics collection.
/// Returns the number of times the given worker thread stole tasks from | ||
/// another worker thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So not the number of tasks it has stolen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the number of times it successfully steals is more interesting than the number of tasks, but I could be wrong.
|
||
cfg_metrics! { | ||
impl Runtime { | ||
/// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few questions inline nothing blocking but a few typos
} | ||
|
||
pub(crate) fn worker_metrics(&self, worker: usize) -> &WorkerMetrics { | ||
assert_eq!(0, worker); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the assert here? Is there a better way to express this invariant maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index is a low level API. Do you have a suggestion for a better way to express it?
tokio/src/runtime/basic_scheduler.rs
Outdated
&self.shared.scheduler_metrics | ||
} | ||
|
||
pub(crate) fn remote_queue_depth(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible for a user to contend this by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, yes w/ the current-thread runtime, but it isn't inherent. The injection queue should be improved in a later PR.
|
||
cfg_rt_multi_thread! { | ||
impl MetricsBatch { | ||
pub(crate) fn incr_steal_count(&mut self, by: u16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why u16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because that is the type used to track queue size internally. It isn't a public fn.
/// | ||
/// [`Runtime::metrics`]: crate::runtime::Runtime::metrics() | ||
#[derive(Clone, Debug)] | ||
pub struct RuntimeMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes more sense to have a Worker
struct that contains Handle
and worker_id: usize
instead of having each api panic if the index is incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried that initially. It is a pain to wire it all up w/o adding a bunch of structs w/ lifetimes in the public API. It also seems less future-proof.
.load(Relaxed) | ||
} | ||
|
||
/// Returns the number of tasks currently scheduled in the runtime's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider not using internal language in docs like this? We also probably need some place that defines what an injection queue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you call the queue? The details of the injection queue vs. local queue are relevant when understanding the runtime characteristics of the runtime.
|
||
cfg_metrics! { | ||
impl Runtime { | ||
/// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving it to a follow-up PR. Still unstable.
Co-authored-by: Alice Ryhl <alice@ryhl.io> Co-authored-by: Lucio Franco <luciofranco14@gmail.com>
I'm going to merge once CI passes. If there any other points you want to continue discussing or track before stabilizing, please add them here. I already tracked that docs (TODO) should be handled. |
Work in progress
This PR expands on previously implemented runtime metrics adding more detail around runtime no-op ticks and schedule counts.
Depends on #4377
Refs: #4373
Todo