Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: expand on runtime metrics #4373

Merged
merged 24 commits into from
Jan 22, 2022
Merged

runtime: expand on runtime metrics #4373

merged 24 commits into from
Jan 22, 2022

Conversation

carllerche
Copy link
Member

@carllerche carllerche commented Jan 3, 2022

Work in progress

This PR expands on previously implemented runtime metrics adding more detail around runtime no-op ticks and schedule counts.

Depends on #4377
Refs: #4373

Todo

  • Tests
  • Number of tasks stolen per-worker
  • Local-queue overflow count
  • Naming: metrics, stats, or "performance counters"

@github-actions github-actions bot added the R-loom Run loom tests on this PR label Jan 3, 2022
@carllerche carllerche marked this pull request as draft January 3, 2022 22:08
carllerche added a commit that referenced this pull request Jan 7, 2022
This patch does some refactoring to the current-thread scheduler bringing it closer to the
structure of the multi-threaded scheduler. More specifically, the core scheduler data is stored
in a Core struct and that struct is passed around as a "token" indicating permission to do
work. The Core structure is also stored in the thread-local context.

This refactor is intended to support #4373, making it easier to track counters in more locations
in the current-thread scheduler.
carllerche added a commit that referenced this pull request Jan 12, 2022
Re-applies #4377 and fixes the bug resulting in Hyper's double panic.

Revert: #4394

Original PR:

This PR does some refactoring to the current-thread scheduler bringing it closer to the structure of the
multi-threaded scheduler. More specifically, the core scheduler data is stored in a Core struct and that
struct is passed around as a "token" indicating permission to do work. The Core structure is also stored
in the thread-local context.

This refactor is intended to support #4373, making it easier to track counters in more locations in the
current-thread scheduler.

I tried to keep commits small, but the "set Core in thread-local context" is both the biggest commit and
the key one.
@carllerche carllerche marked this pull request as ready for review January 20, 2022 04:07
@carllerche
Copy link
Member Author

@Darksonn @hawkw @LucioFranco I'm marking this as ready to review. There is more work to do on docs, but we can do that in follow-up PRs. The feature is still marked as unstable and others are waiting for this PR to land to do work in parallel.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see all of those tests.

Comment on lines +87 to +90
# Technically, removing this is a breaking change even though it only ever did
# anything with the unstable flag on. It is probably safe to get rid of it after
# a few releases.
stats = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why shouldn't this continue to be a feature flag? For machines without 64-bit atomics, the stats involve locking a bunch of mutexes quite a lot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If we rename to metrics we will need to change the feature flag. While unstable, I'm not worried about it since technically adding a feature flag is part of the public API.
  • We can always do runtime enabling / disabling of metrics collection.

tokio/src/runtime/basic_scheduler.rs Outdated Show resolved Hide resolved
tokio/src/runtime/metrics/batch.rs Outdated Show resolved Hide resolved
tokio/src/runtime/metrics/batch.rs Outdated Show resolved Hide resolved
Comment on lines +163 to +164
/// Returns the number of times the given worker thread stole tasks from
/// another worker thread.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So not the number of tasks it has stolen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the number of times it successfully steals is more interesting than the number of tasks, but I could be wrong.

tokio/src/runtime/metrics/runtime.rs Outdated Show resolved Hide resolved
tokio/src/runtime/metrics/runtime.rs Outdated Show resolved Hide resolved
tokio/src/runtime/metrics/runtime.rs Outdated Show resolved Hide resolved

cfg_metrics! {
impl Runtime {
/// TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo

tokio/src/runtime/queue.rs Outdated Show resolved Hide resolved
Copy link
Member

@LucioFranco LucioFranco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few questions inline nothing blocking but a few typos

}

pub(crate) fn worker_metrics(&self, worker: usize) -> &WorkerMetrics {
assert_eq!(0, worker);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the assert here? Is there a better way to express this invariant maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index is a low level API. Do you have a suggestion for a better way to express it?

&self.shared.scheduler_metrics
}

pub(crate) fn remote_queue_depth(&self) -> usize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for a user to contend this by accident?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, yes w/ the current-thread runtime, but it isn't inherent. The injection queue should be improved in a later PR.

tokio/src/runtime/metrics/batch.rs Show resolved Hide resolved

cfg_rt_multi_thread! {
impl MetricsBatch {
pub(crate) fn incr_steal_count(&mut self, by: u16) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why u16?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because that is the type used to track queue size internally. It isn't a public fn.

tokio/src/runtime/metrics/runtime.rs Show resolved Hide resolved
tokio/src/runtime/metrics/runtime.rs Outdated Show resolved Hide resolved
///
/// [`Runtime::metrics`]: crate::runtime::Runtime::metrics()
#[derive(Clone, Debug)]
pub struct RuntimeMetrics {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes more sense to have a Worker struct that contains Handle and worker_id: usize instead of having each api panic if the index is incorrect?

Copy link
Member Author

@carllerche carllerche Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried that initially. It is a pain to wire it all up w/o adding a bunch of structs w/ lifetimes in the public API. It also seems less future-proof.

tokio/src/runtime/metrics/runtime.rs Outdated Show resolved Hide resolved
.load(Relaxed)
}

/// Returns the number of tasks currently scheduled in the runtime's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider not using internal language in docs like this? We also probably need some place that defines what an injection queue is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you call the queue? The details of the injection queue vs. local queue are relevant when understanding the runtime characteristics of the runtime.


cfg_metrics! {
impl Runtime {
/// TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving it to a follow-up PR. Still unstable.

carllerche and others added 3 commits January 21, 2022 13:01
Co-authored-by: Alice Ryhl <alice@ryhl.io>
Co-authored-by: Lucio Franco <luciofranco14@gmail.com>
@carllerche
Copy link
Member Author

I'm going to merge once CI passes. If there any other points you want to continue discussing or track before stabilizing, please add them here. I already tracked that docs (TODO) should be handled.

@carllerche carllerche merged commit 24f4ee3 into master Jan 22, 2022
@carllerche carllerche deleted the more-rt-metrics branch January 22, 2022 06:17
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics labels Aug 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics R-loom Run loom tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants