fix: missing waker.wake #79

Open
wants to merge 1 commit into
base: master

Conversation

winksaville

My analysis of the problem is that using the current `Reactor::ticker` value in `ticks` is not always correct. It is my understanding that when placing a ticker value into `Direction::ticks` there is a guarantee that the waker.wake call has completed and the associated task has been run. The problem is that the increment of the ticker value in `ReactorLock::react` typically runs on a different CPU thread than `Source::poll_ready` (called by `poll_readable`, `poll_writable`), which saves the ticker value in `ticks`:

  state[dir].ticks = Some((Reactor::get().ticker(), state[dir].tick));

Because the increment and the use of the ticker value happen on different threads, the required guarantee cannot be fulfilled, so I'm proposing the following fix in this PR, which only uses the current `state[dir].tick` and not the ticker value:

  state[dir].ticks = Some((state[dir].tick, 0));

fix smol-rs#78
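
For context, a paraphrased sketch of the readiness check that this assignment feeds into. `DirectionState` and `poll_ready_sketch` are hypothetical names standing in for the reactor's per-direction state and `Source::poll_ready`; the logic is a simplified reading, not the actual reactor.rs code:

  // Paraphrased sketch, not the actual async-io code.
  struct DirectionState {
      // Bumped by the reactor thread whenever it delivers an event.
      tick: usize,
      // Ticks the poller had already "seen" when it last registered interest.
      ticks: Option<(usize, usize)>,
  }

  fn poll_ready_sketch(state: &mut DirectionState, reactor_ticker: usize) -> bool {
      // If an event arrived at a tick other than the remembered ones,
      // report readiness.
      if let Some((a, b)) = state.ticks {
          if state.tick != a && state.tick != b {
              state.ticks = None;
              return true;
          }
      }
      // Original: remember the global ticker plus the last seen tick.
      //   state.ticks = Some((reactor_ticker, state.tick));
      // Proposed in this PR: remember only the last seen tick.
      let _ = reactor_ticker;
      state.ticks = Some((state.tick, 0));
      false
  }
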
@dignifiedquire

@taiki-e any chance you can take a look at this?

@winksaville
Author

@dignifiedquire, would it be better to discuss this on issue #78, as there may be better solutions and/or other related issues?

@taiki-e
Collaborator

taiki-e commented Jun 15, 2022

Sorry, I don't have time to investigate this problem at this time.

@taiki-e
Collaborator

taiki-e commented Jun 15, 2022

r? @smol-rs/admins

@fogti
Member

fogti commented Jun 17, 2022

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which irritates me, because the outer code should call it again if the expected events didn't occur. What also irritates me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then either not call the waker at all, or call it too late), but we do call it when we replace it, which appears a bit weird... but calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

@winksaville
Author

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which irritates me, because the outer code should call it again if the expected events didn't occur. What also irritates me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then either not call the waker at all, or call it too late), but we do call it when we replace it, which appears a bit weird... but calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

I might be mistaken, but it seems you're saying this change will cause additional missed wakes. But in the particular case I've seen, it fixes a missed wake. There certainly may be better ways to fix this, but it is my belief that this decreases the likelihood of missed wakes.

@fogti
Member

fogti commented Jun 20, 2022

Theoretically, the problem shouldn't happen, because Reactor::get().ticker() uses SeqCst, which enforces a global ordering across threads for that value, and because self.state is locked (it's a Mutex) at that time, which means [dir].tick can't race either. A call to the waker only schedules the task, which means there is no guarantee when the associated task itself gets polled, or in which order; only that the Poll::Pending will eventually bubble up and cause the futures executor queue to get processed.

The problem with this "fix" is that it basically removes an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(()))), and probably hides the original source of the error, which I would suspect to be a missing waker call somewhere else, or a duplicate readiness check (which would usually cause a missing or delayed wake call, because it consumes a ready event, and might cause a deadlock). This "fix" replaces the tick optimization with much more polling afaik, because every second poll of the containing Future would cause a ready event when ticks is set (that is, after a poll_ready call) and tick was updated (the source received an event). I suspect this might cause additional spurious wakeups... (because it would also count events delivered by the previous reactor tick).
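
As a minimal, self-contained sketch of that ordering argument (hypothetical structure and names, not the reactor code): the "reactor" thread bumps a SeqCst ticker and then publishes the per-direction tick under a Mutex; a "poller" thread that reads the tick under the same Mutex can never observe the ticker lagging behind it.

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::{Arc, Mutex};
  use std::thread;

  fn main() {
      let ticker = Arc::new(AtomicUsize::new(0));
      let tick = Arc::new(Mutex::new(0usize));

      let reactor = {
          let (ticker, tick) = (ticker.clone(), tick.clone());
          thread::spawn(move || {
              for _ in 0..1_000 {
                  let t = ticker.fetch_add(1, Ordering::SeqCst) + 1;
                  *tick.lock().unwrap() = t; // event delivered for tick `t`
              }
          })
      };

      let poller = {
          let (ticker, tick) = (ticker.clone(), tick.clone());
          thread::spawn(move || {
              for _ in 0..1_000 {
                  let seen_tick = *tick.lock().unwrap();
                  let seen_ticker = ticker.load(Ordering::SeqCst);
                  // The ticker can be ahead of the tick observed under the
                  // lock, but never behind it.
                  assert!(seen_ticker >= seen_tick);
              }
          })
      };

      reactor.join().unwrap();
      poller.join().unwrap();
  }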

Also, even if this PR fixes the problem, or at least its symptoms (with the downside of probably reduced performance), it basically obscures the code, because the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used.

Don't let this discourage you, I'm mostly just trying to understand where this problem stems from, and how to keep the code legible.

@winksaville
Author

In my initial comment on this PR I say:

It is my understanding that when
placing a ticker value into Direction::ticks there is a guarantee that
the waker.wake call has completed and the associated task has been run.

Is this a valid statement or am I mistaken?

@fogti
Member

fogti commented Jun 21, 2022

when placing a ticker value into Direction::ticks there is a guarantee that
the waker.wake call has completed

afaik correct (but I'm not really familiar with this code base, and have only looked at reactor.rs)

and the associated task has been run.

This doesn't hold (⇒ afaik it is generally not true), because in many cases calling the waker only leads to "marking" a task as to-be-scheduled in the associated executor, which often also involves a queue, and thus gives no guarantee when (or whether at all!) the future-of-interest will be polled again (because of "any-future-of" constructs and such). That is, even if the task gets rescheduled, the future-of-interest might get discarded if an alternative future is polled earlier in that task execution. (This is sort-of theoretical in this case, because it shouldn't cause deadlocks or delays.) And we usually don't know when the (now assumed to be scheduled) task gets polled again.
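
As a toy illustration of that point (a hypothetical mini run queue built on std::task::Wake, not async-io's machinery): wake only enqueues the task; nothing here polls the future, and nothing says when an executor would.

  use std::collections::VecDeque;
  use std::sync::{Arc, Mutex};
  use std::task::{Wake, Waker};

  // Hypothetical toy task: waking it only pushes its id onto a run queue.
  struct Task {
      id: usize,
      run_queue: Arc<Mutex<VecDeque<usize>>>,
  }

  impl Wake for Task {
      fn wake(self: Arc<Self>) {
          // Mark the task as runnable; nothing is polled here, and there is no
          // guarantee about when the executor will get around to polling it.
          self.run_queue.lock().unwrap().push_back(self.id);
      }
  }

  fn main() {
      let run_queue = Arc::new(Mutex::new(VecDeque::new()));
      let task = Arc::new(Task { id: 7, run_queue: run_queue.clone() });
      let waker = Waker::from(task);

      waker.wake_by_ref();
      // The task is merely queued; the future it drives has not been polled.
      assert_eq!(run_queue.lock().unwrap().pop_front(), Some(7));
  }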

@winksaville
Author

I agree with you: the task has been queued, but there is no guarantee it ran. So the act of sending the wake event only says that at some point in the future the other thread will run, but on the async-io side there is, as you say:

"an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(())))"

That "optimization" is the cause of the bug, as now a Poll::Ready(Ok(())) will not be returned when it should have been.

I believe you probably are correct when you say:

"the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used."

I will remove the second usize if desired.

@fogti
Member

fogti commented Jun 21, 2022

I will remove the second usize if desired.

Please do that.

@winksaville
Author

@zseri, Ok, I'll remove the second usize. Is it your opinion I have identified a bug?

@fogti
Member

fogti commented Jun 21, 2022

Could you push (or force-push, if you like) it to this branch (since the original fix is embedded in the associated issue, no info would be lost)? That would make it easier to observe the interaction with the remaining parts of reactor.rs. It would also be nice if we could then construct a test case that attempts to trigger it, even if it isn't deterministic.

@winksaville
Author

I'll try to do that tomorrow. Can you point me to a test that might be appropriate to use as a model for the new test?

@fogti
Member

fogti commented Jun 21, 2022

hm, tbh I don't really know how to write a test for that, because it involves many complex parts;
afaik we can't use a mock network stream or such (which usually wouldn't contain much async stuff at all), because we need to reach the interesting bits of async-io, i.e. the reactor; it might be possible to simulate it using a pipe or so, though...;
the original scenario in which this was triggered seems complex too, and I'm not sure how a "good" MWE for it could be constructed even when allowing all kinds of hard-coded network stuff...
getting this to hit the issue most of the time, instead of just sometimes, is probably even harder (especially because it involves potential (?) race conditions, or at least appears to).
(other than that, the currently existing integration tests in tests/ are probably a good starting point)
TL;DR: idk; if you manage to get lucky and write something which hits this sometimes, then add a test which does that, otherwise... "just" don't. But if you're not able to write a test, at least try to document the reasoning behind this in the commit message, so that if this introduces a perf regression, it's easier to tackle while knowing why this optimization was thrown out, and what might need to be fixed elsewhere to be able to reintroduce it without also reintroducing this issue.

@winksaville
Author

Txs for the info. It was fairly difficult to narrow down the problem, but one good thing was that as I added more debug logging the problem didn't go away!

@winksaville
Author

As we both suspect, creating a test is likely going to be non-trivial and take significant effort on my part. It would be nice if someone with commit rights could provide initial feedback on whether this could be merged. Who might be interested in doing that quick initial review?

@fogti
Member

fogti commented Jun 21, 2022

@yoshuawuyts @zeenix I'm not understanding the impact of this well enough, and wasn't able to fully understand the initial scenario in which the issue was noticed.
This PR doesn't look like it should break anything, just impact performance negatively, but I'm not completely sure...
My most important concerns are listed in #79 (comment).
Can someone else review this?

@notgull
Member

notgull commented Sep 7, 2022

Is there still an interest in fixing this? You could probably resolve this by moving ticker into the Mutex and repopulating an atomic variable used by the Sources when the ReactorLock is dropped.

Either way, I'd like a test case of some kind before we move too far in this direction. Maybe it would be advantageous to incorporate loom.
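
One possible reading of that suggestion, sketched with hypothetical types (not the actual async-io API): keep the authoritative ticker inside the mutex-protected state and refresh a cached atomic copy when the reactor lock is dropped, so sources only ever observe a ticker value for a tick the reactor has finished processing.

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::{Mutex, MutexGuard};

  // Hypothetical sketch, not the real async-io types.
  struct ReactorState {
      ticker: usize,
  }

  struct Reactor {
      state: Mutex<ReactorState>,
      cached_ticker: AtomicUsize,
  }

  struct ReactorLock<'a> {
      reactor: &'a Reactor,
      state: MutexGuard<'a, ReactorState>,
  }

  impl Reactor {
      fn lock(&self) -> ReactorLock<'_> {
          ReactorLock { reactor: self, state: self.state.lock().unwrap() }
      }

      // Sources read the cached copy without taking the lock.
      fn ticker(&self) -> usize {
          self.cached_ticker.load(Ordering::SeqCst)
      }
  }

  impl ReactorLock<'_> {
      fn react(&mut self) {
          // The increment happens while the lock is held (and wakes for this
          // tick would be issued here as well).
          self.state.ticker += 1;
      }
  }

  impl Drop for ReactorLock<'_> {
      fn drop(&mut self) {
          // Refresh the cached copy as the lock is released.
          self.reactor.cached_ticker.store(self.state.ticker, Ordering::SeqCst);
      }
  }

  fn main() {
      let reactor = Reactor {
          state: Mutex::new(ReactorState { ticker: 0 }),
          cached_ticker: AtomicUsize::new(0),
      };
      {
          let mut lock = reactor.lock();
          lock.react();
          // Mid-tick, sources still see the previous ticker value.
          assert_eq!(reactor.ticker(), 0);
      }
      // After the lock is dropped, the cached ticker reflects the new tick.
      assert_eq!(reactor.ticker(), 1);
  }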

@fogti
Member

fogti commented Sep 7, 2022

Regarding #79 (comment), maybe we can move the ticker into the Mutex as mentioned. idk how exactly that would impact the rest of the codebase, tho. But keeping an atomic value and a value in a Mutex synchronized is a maintenance hazard; if it is possible to move the ticker behind the mutex, we should imo do that (probably not much worse performance-wise than using SeqCst).

But I'm not really sure we understand the originating problem correctly. Can we maybe try to temporarily add tracing to the deps of this crate in a separate branch, or behind a feature flag, add debug messages at every place where the affected variables are used, and apply this to a minimal test case to get some state logs? This looks like some internal state transition sometimes going wrong (or maybe some racing off-by-one thingy). I don't really understand the problem well enough yet, and this "fix" basically nullifies an existing optimization in the code base.
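
For instance (a purely hypothetical instrumentation sketch: the feature name and helper are made up, and it assumes tracing is added as an optional dependency), the tick/ticks transitions could be logged like this:

  // Hypothetical sketch, not actual async-io code.
  #[cfg(feature = "reactor-trace")]
  macro_rules! reactor_trace {
      ($($arg:tt)*) => { tracing::trace!($($arg)*) };
  }
  #[cfg(not(feature = "reactor-trace"))]
  macro_rules! reactor_trace {
      ($($arg:tt)*) => {};
  }

  // Illustrative call site: one trace line per state transition makes it
  // possible to reconstruct the interleaving of the reactor thread and the
  // polling thread from the logs.
  fn on_set_ticks(dir: usize, ticker: usize, tick: usize) {
      reactor_trace!(dir, ticker, tick, "setting state[dir].ticks");
  }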

@fogti
Member

fogti commented Sep 7, 2022

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

@winksaville
Author

winksaville commented Sep 7, 2022

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

Here is a branch with a bunch of debug logging; it needs to be rebased on top of master.

@winksaville
Author

I ran into this problem while following the hole_punching tutorial for rust-libp2p. This has been discussed in the post "relay_v2 performance anomaly on small machines".

One other thing that I found helpful while working on this was a custom logger which adds the time in nanoseconds, the line number and the thread id to each log entry. This did/does require a nightly build because as_u64() is gated behind the unstable thread_id_value feature, but it might be worth it.
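
For anyone who wants to reproduce that setup, a rough sketch of such a logger (hypothetical code, not the author's; it assumes the log crate and, for the thread id, a nightly toolchain with the thread_id_value feature):

  #![feature(thread_id_value)] // nightly-only, needed for ThreadId::as_u64()

  use std::time::{SystemTime, UNIX_EPOCH};

  // Hypothetical logger sketch: prefixes each record with nanoseconds since
  // the epoch, the thread id, and the file/line of the log call.
  struct NanoLogger;

  static LOGGER: NanoLogger = NanoLogger;

  impl log::Log for NanoLogger {
      fn enabled(&self, _metadata: &log::Metadata) -> bool {
          true
      }

      fn log(&self, record: &log::Record) {
          let nanos = SystemTime::now()
              .duration_since(UNIX_EPOCH)
              .unwrap()
              .as_nanos();
          let tid = std::thread::current().id().as_u64();
          println!(
              "{nanos} tid={tid} {}:{} {} {}",
              record.file().unwrap_or("?"),
              record.line().unwrap_or(0),
              record.level(),
              record.args()
          );
      }

      fn flush(&self) {}
  }

  fn main() {
      log::set_logger(&LOGGER).unwrap();
      log::set_max_level(log::LevelFilter::Trace);
      log::info!("reactor tick observed");
  }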

@winksaville winksaville closed this Sep 7, 2022
@winksaville
Author

Oops accidentally closed.

@winksaville winksaville reopened this Sep 7, 2022
gnoliyil pushed a commit to gnoliyil/fuchsia that referenced this pull request Jan 27, 2024
We want to add smol-rs/async-io#79 to our
distribution of async-io, so move it to forks.

Change-Id: Iccdab4f4a6aed7b773dbe200f07ce4e4f0bc9a05
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871499
Reviewed-by: Steven Grady <slgrady@google.com>
Reviewed-by: Adam Perry <adamperry@google.com>
Fuchsia-Auto-Submit: Casey Dahlin <sadmac@google.com>
Commit-Queue: Casey Dahlin <sadmac@google.com>
gnoliyil pushed a commit to gnoliyil/fuchsia that referenced this pull request Jan 27, 2024
This just adds smol-rs/async-io#79 to our
distribution of async-io, as we've seen issues in infra like the one
reported there.

Bug: 47531
Bug: 126159
Change-Id: I393f35c131b9737ee8f5043b72162b6c3bf98dfb
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871506
Reviewed-by: Adam Perry <adamperry@google.com>

Successfully merging this pull request may close these issues.

Missing waker.wake events
5 participants