Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rt: do not trace tasks while locking OwnedTasks #6036

Merged
merged 9 commits into from
Oct 6, 2023

Conversation

jswrenn
Copy link
Contributor

@jswrenn jswrenn commented Sep 27, 2023

If a polled tasks completes while OwnedTasks is locked, it will not be able to remove itself from OwnedTasks, resulting in a deadlock.

Fixes #6035.

If a polled tasks completes while `OwnedTasks` is locked, it will
not be able to remove itself from `OwnedTasks`, resulting in a
deadlock.

Fixes tokio-rs#6035.
@jswrenn jswrenn requested a review from Darksonn September 27, 2023 18:59
@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Sep 27, 2023
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-taskdump --cfg tokio_taskdump labels Sep 29, 2023
Comment on lines 325 to 327
task.as_raw().state().transition_to_notified_for_tracing();
// store the raw tasks into a vec
tasks.push(task.as_raw());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be taking a refcount to the task instead of using as_raw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the API for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add this:

impl<S: 'static> Clone for Task<S> {
    fn clone(&self) -> Task<S> {
        self.raw.ref_inc();
        Task::new(self.raw)
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. I realized that transition_to_notified_for_tracing actually performs a refcount increment. Because of that, you don't need the additional refcount increment. However, I would probably prefer to see this refcount management a bit more wrapped up in a less error-prone api.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Unfortunately, now I'm quite confused by how refcount management is supposed to work. I removed only the increment in transition_to_notified_for_tracing, but doing so causes the tests to SIGABRT:

free(): invalid pointer

What's the relationship between notifying a task, polling a task, and its refcount?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of each function is documented in harness.rs. There's also a bunch of documentation at the top of mod.rs.

Basically, each Task (or Notified) object owns a refcount. Various methods will consume the refcount. We want the refcount that transition_to_notified_for_tracing creates to be owned by a Task somehow, rather than implicitly via a RawTask. This is because it generally makes code much more clear, and it also makes the code more robust towards error paths due to RAII cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks! So, to summarize, the crash occurred because:

  • cloning Task incremented the refcount
  • polling the task decremented the refcount
  • dropping Task decremented the refcount

...which is one too many decrements. transition_to_notified_for_tracing needs to increment the refcount (much like transition_to_notified_by_ref), because polling is going to consume that count.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Edit: I confused myself...

@jswrenn jswrenn force-pushed the taskdump-deadlock-fix branch from 2849fb3 to c0d5cf7 Compare October 2, 2023 17:40
@jswrenn jswrenn requested a review from Darksonn October 2, 2023 17:41
@jswrenn jswrenn force-pushed the taskdump-deadlock-fix branch from c0d5cf7 to 2849fb3 Compare October 3, 2023 19:13
Comment on lines 331 to 334
// set the notified bit
task.as_raw().state().transition_to_notified_for_tracing();
// store the tasks into a vec
tasks.push(task.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so, I've figured out how I think we should handle this refcount business:

  • First, we wrap transition_to_notified_for_tracing with a method on Task that returns a Notified. This Notified will own the refcount that we created inside of transition_to_notified_for_tracing.
  • Next, when you want to poll the task, instead of calling poll directly, you instead use OwnedTasks::assert_owner to convert the Notified into a LocalNotified.
  • To poll the task, you call LocalNotified::run.

This way, we only increment the refcount once (so there's no call to Task::clone), and all refcounts we create are owned by Notified object, so their destructor will clear it if it gets dropped before it is polled, e.g. due to a panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@jswrenn jswrenn requested a review from Darksonn October 4, 2023 20:36
cfg_taskdump! {
pub(super) fn notify_for_tracing(&self) -> Notified<S> {
self.as_raw().state().transition_to_notified_for_tracing();
Notified(self.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since transition_to_notified_for_tracing already increments the refcount, you do not also need to clone it. You probably have a memory leak right now.

Suggested change
Notified(self.clone())
Notified(self)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a leak here. Remember, when we're calling notify_for_tracing, we're iterating over OwnedTasks, and need to get notified copies of those tasks into a separate vector. A clone has to occur somewhere. Either we can do it here, or we can modified this method to consume self and do the clone in trace_helper instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, you increment twice (in transition_to_notified_for_tracing and clone), but only decrement once (in run)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yep. I didn't catch that run mem::forgets the LocalNotified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of removing the ref_inc from transition_to_notified_for_tracing, it makes more sense to me to remove the call to clone. This way, you only touch the atomic once instead of twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4264f61, but note that there's now a need for cloning in trace_owned, because we have to stash the Tasks (or Notified)s outside of OwnedTasks.

`notify_for_tracing` already increments refcount with its `clone`,
and `LocalNotified::run` forgets (not drops) the underlying task.
@jswrenn jswrenn requested a review from Darksonn October 6, 2023 14:49
@@ -361,6 +361,13 @@ impl<S: 'static> Task<S> {
fn header_ptr(&self) -> NonNull<Header> {
self.raw.header_ptr()
}

cfg_taskdump! {
pub(super) fn notify_for_tracing(self) -> Notified<S> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to fix that like this:

Suggested change
pub(super) fn notify_for_tracing(self) -> Notified<S> {
pub(super) fn notify_for_tracing(&self) -> Notified<S> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a compile error. Notified wraps a Task, not a &Task.

error[E0308]: mismatched types
   --> tokio/src/runtime/task/mod.rs:368:22
    |
368 |             Notified(self)
    |             -------- ^^^^ expected `Task<S>`, found `&Task<S>`
    |             |
    |             arguments to this struct are incorrect
    |
    = note: expected struct `runtime::task::Task<S>`
            found reference `&runtime::task::Task<S>`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you created a new refcount, you can create a new Task using the raw field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks! Done.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit remaining.

tokio/src/runtime/task/mod.rs Outdated Show resolved Hide resolved
tokio/src/runtime/task/raw.rs Outdated Show resolved Hide resolved
@jswrenn jswrenn requested a review from Darksonn October 6, 2023 19:22
@Darksonn Darksonn merged commit 2bd4376 into tokio-rs:master Oct 6, 2023
74 checks passed
@jswrenn jswrenn linked an issue Oct 9, 2023 that may be closed by this pull request
18 tasks
@jswrenn jswrenn removed a link to an issue Oct 9, 2023
18 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-taskdump --cfg tokio_taskdump R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task dumps can deadlock.
2 participants