fix: missing waker.wake #79

Open
wants to merge 1 commit into
base: master

Conversation

winksaville

My analysis of the problem is that using the current `Reactor::ticker` value in `ticks` is not always correct. It is my understanding that when placing a ticker value into `Direction::ticks` there is a guarantee that the waker.wake call has completed and the associated task has been run. The problem is that the increment of the ticker value in `ReactorLock::react` typically runs on a different CPU thread than `Source::poll_ready` (called by `poll_readable`, `poll_writable`), which saves the ticker value in `ticks`:

  state[dir].ticks = Some((Reactor::get().ticker(), state[dir].tick));

Because the increment and the use of the ticker value happen on different threads, the required guarantee cannot be fulfilled, so I'm proposing the following fix in this PR, which only uses the current `state[dir].tick` and not the ticker value:

  state[dir].ticks = Some((state[dir].tick, 0));

fix smol-rs#78
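
For context, a paraphrased sketch of the readiness check that this assignment feeds into. `DirectionState` and `poll_ready_sketch` are hypothetical names standing in for the reactor's per-direction state and `Source::poll_ready`; the logic is a simplified reading, not the actual reactor.rs code:

  // Paraphrased sketch, not the actual async-io code.
  struct DirectionState {
      // Bumped by the reactor thread whenever it delivers an event.
      tick: usize,
      // Ticks the poller had already "seen" when it last registered interest.
      ticks: Option<(usize, usize)>,
  }

  fn poll_ready_sketch(state: &mut DirectionState, reactor_ticker: usize) -> bool {
      // If an event arrived at a tick other than the remembered ones,
      // report readiness.
      if let Some((a, b)) = state.ticks {
          if state.tick != a && state.tick != b {
              state.ticks = None;
              return true;
          }
      }
      // Original: remember the global ticker plus the last seen tick.
      //   state.ticks = Some((reactor_ticker, state.tick));
      // Proposed in this PR: remember only the last seen tick.
      let _ = reactor_ticker;
      state.ticks = Some((state.tick, 0));
      false
  }
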
@dignifiedquire

@taiki-e any chance you can take a look at this?

@winksaville
Author

@dignifiedquire, would it be better to discuss this on issue #78, as there may be better solutions and/or other related issues?

@taiki-e
Collaborator

taiki-e commented Jun 15, 2022

Sorry, I don't have time to investigate this problem at this time.

@taiki-e
Collaborator

taiki-e commented Jun 15, 2022

r? @smol-rs/admins

@fogti
Member

fogti commented Jun 17, 2022

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which irritates me, because the outer code should call it again if the expected events didn't occur. What also irritates me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then either not call the waker at all, or call it too late), but we do call it when we replace it, which appears a bit weird... but calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

@winksaville
Author

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which irritates me, because the outer code should call it again if the expected events didn't occur. What also irritates me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then either not call the waker at all, or call it too late), but we do call it when we replace it, which appears a bit weird... but calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

I might be mistaken, but it seems you're saying this change will cause additional missed wakes. But in the particular case I've seen, it fixes a missed wake. There certainly may be better ways to fix this, but it is my belief that this decreases the likelihood of missed wakes.

@fogti
Member

fogti commented Jun 20, 2022

Theoretically, the problem shouldn't happen, because Reactor::get().ticker() uses SeqCst, which enforces a global ordering across threads for that value, and because self.state is locked (it's a Mutex) at that time, which means [dir].tick can't race either. A call to the waker only schedules the task, which means there is no guarantee when the associated task itself gets polled, or in which order; only that the Poll::Pending will eventually bubble up and cause the futures executor queue to get processed.

The problem with this "fix" is that it basically removes an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(()))), and probably hides the original source of the error, which I would suspect to be a missing waker call somewhere else, or a duplicate readiness check (which would usually cause a missing or delayed wake call, because it consumes a ready event, and might cause a deadlock). This "fix" replaces the tick optimization with much more polling afaik, because every second poll of the containing Future would cause a ready event when ticks is set (that is, after a poll_ready call) and tick was updated (the source received an event). I suspect this might cause additional spurious wakeups... (because it would also count events delivered by the previous reactor tick).
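
As a minimal, self-contained sketch of that ordering argument (hypothetical structure and names, not the reactor code): the "reactor" thread bumps a SeqCst ticker and then publishes the per-direction tick under a Mutex; a "poller" thread that reads the tick under the same Mutex can never observe the ticker lagging behind it.

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::{Arc, Mutex};
  use std::thread;

  fn main() {
      let ticker = Arc::new(AtomicUsize::new(0));
      let tick = Arc::new(Mutex::new(0usize));

      let reactor = {
          let (ticker, tick) = (ticker.clone(), tick.clone());
          thread::spawn(move || {
              for _ in 0..1_000 {
                  let t = ticker.fetch_add(1, Ordering::SeqCst) + 1;
                  *tick.lock().unwrap() = t; // event delivered for tick `t`
              }
          })
      };

      let poller = {
          let (ticker, tick) = (ticker.clone(), tick.clone());
          thread::spawn(move || {
              for _ in 0..1_000 {
                  let seen_tick = *tick.lock().unwrap();
                  let seen_ticker = ticker.load(Ordering::SeqCst);
                  // The ticker can be ahead of the tick observed under the
                  // lock, but never behind it.
                  assert!(seen_ticker >= seen_tick);
              }
          })
      };

      reactor.join().unwrap();
      poller.join().unwrap();
  }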

Also, even if this PR fixes the problem, or at least its symptoms (with the downside of probably reduced performance), it basically obscures the code, because the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used.

Don't let this discourage you, I'm mostly just trying to understand where this problem stems from, and how to keep the code legible.

@winksaville
Author

In my initial comment on this PR I say:

It is my understanding that when
placing a ticker value into Direction::ticks there is a guarantee that
the waker.wake call has completed and the associated task has been run.

Is this a valid statement or am I mistaken?

@fogti
Member

fogti commented Jun 21, 2022

when placing a ticker value into Direction::ticks there is a guarantee that
the waker.wake call has completed

afaik correct (but I'm not really familiar with this code base, and have only looked at reactor.rs)

and the associated task has been run.

This doesn't hold (⇒ afaik it is generally not true), because in many cases calling the waker only leads to "marking" a task as to-be-scheduled in the associated executor, which often also involves a queue, and thus gives no guarantee when (or whether at all!) the future-of-interest will be polled again (because of "any-future-of" constructs and such). That is, even if the task gets rescheduled, the future-of-interest might get discarded if an alternative future is polled earlier in that task execution. (This is sort-of theoretical in this case, because it shouldn't cause deadlocks or delays.) And we usually don't know when the (now assumed to be scheduled) task gets polled again.
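
As a toy illustration of that point (a hypothetical mini run queue built on std::task::Wake, not async-io's machinery): wake only enqueues the task; nothing here polls the future, and nothing says when an executor would.

  use std::collections::VecDeque;
  use std::sync::{Arc, Mutex};
  use std::task::{Wake, Waker};

  // Hypothetical toy task: waking it only pushes its id onto a run queue.
  struct Task {
      id: usize,
      run_queue: Arc<Mutex<VecDeque<usize>>>,
  }

  impl Wake for Task {
      fn wake(self: Arc<Self>) {
          // Mark the task as runnable; nothing is polled here, and there is no
          // guarantee about when the executor will get around to polling it.
          self.run_queue.lock().unwrap().push_back(self.id);
      }
  }

  fn main() {
      let run_queue = Arc::new(Mutex::new(VecDeque::new()));
      let task = Arc::new(Task { id: 7, run_queue: run_queue.clone() });
      let waker = Waker::from(task);

      waker.wake_by_ref();
      // The task is merely queued; the future it drives has not been polled.
      assert_eq!(run_queue.lock().unwrap().pop_front(), Some(7));
  }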

@winksaville
Author

I agree with you: the task has been queued, but there is no guarantee it ran. So the act of sending the wake event only says that at some point in the future the other thread will run, but on the async-io side there is, as you say:

"an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(())))"

That "optimization" is the cause of the bug, as now a Poll::Ready(Ok(())) will not be returned when it should have been.

I believe you probably are correct when you say:

"the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used."

I will remove the second usize if desired.

@fogti
Member

fogti commented Jun 21, 2022

I will remove the second usize if desired.

Please do that.

@winksaville
Author

@zseri, Ok, I'll remove the second usize. Is it your opinion I have identified a bug?

@fogti
Member

fogti commented Jun 21, 2022

Could you push (or force-push, if you like) it to this branch (since the original fix is embedded in the associated issue, no info would be lost)? That would make it easier to observe the interaction with the remaining parts of reactor.rs. It would also be nice if we could then construct a test case that attempts to trigger it, even if it isn't deterministic.

@winksaville
Author

I'll try to do that tomorrow. Can you point me to a test that might be appropriate to use as a model for the new test?

@fogti
Member

fogti commented Jun 21, 2022

hm, tbh I don't really know how to write a test for that, because it involves many complex parts;
afaik we can't use a mock network stream or such (which usually wouldn't contain much async stuff at all), because we need to reach the interesting bits of async-io, i.e. the reactor; it might be possible to simulate it using a pipe or so, though...;
the original scenario in which this was triggered seems complex too, and I'm not sure how a "good" MWE for it could be constructed even when allowing all kinds of hard-coded network stuff...
getting this to hit the issue most of the time, instead of just sometimes, is probably even harder (especially because it involves potential (?) race conditions, or at least appears to).
(other than that, the currently existing integration tests in tests/ are probably a good starting point)
TL;DR: idk; if you manage to get lucky and write something which hits this sometimes, then add a test which does that, otherwise... "just" don't. But if you're not able to write a test, at least try to document the reasoning behind this in the commit message, so that if this introduces a perf regression, it's easier to tackle while knowing why this optimization was thrown out, and what might need to be fixed elsewhere to be able to reintroduce it without also reintroducing this issue.

@winksaville
Author

Txs for the info. It was fairly difficult to narrow down the problem, but one good thing was that as I added more debug logging the problem didn't go away!

@winksaville
Author

As we both suspect, creating a test is likely going to be non-trivial and take significant effort on my part. It would be nice if someone with commit rights could provide initial feedback on whether this could be merged. Who might be interested in doing that quick initial review?

@fogti
Member

fogti commented Jun 21, 2022

@yoshuawuyts @zeenix I'm not understanding the impact of this well enough, and wasn't able to fully understand the initial scenario in which the issue was noticed.
This PR doesn't look like it should break anything, just impact performance negatively, but I'm not completely sure...
My most important concerns are listed in #79 (comment).
Can someone else review this?

@notgull
Member

notgull commented Sep 7, 2022

Is there still an interest in fixing this? You could probably resolve this by moving ticker into the Mutex and repopulating an atomic variable used by the Sources when the ReactorLock is dropped.

Either way, I'd like a test case of some kind before we move too far in this direction. Maybe it would be advantageous to incorporate loom.
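
One possible reading of that suggestion, sketched with hypothetical types (not the actual async-io API): keep the authoritative ticker inside the mutex-protected state and refresh a cached atomic copy when the reactor lock is dropped, so sources only ever observe a ticker value for a tick the reactor has finished processing.

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::{Mutex, MutexGuard};

  // Hypothetical sketch, not the real async-io types.
  struct ReactorState {
      ticker: usize,
  }

  struct Reactor {
      state: Mutex<ReactorState>,
      cached_ticker: AtomicUsize,
  }

  struct ReactorLock<'a> {
      reactor: &'a Reactor,
      state: MutexGuard<'a, ReactorState>,
  }

  impl Reactor {
      fn lock(&self) -> ReactorLock<'_> {
          ReactorLock { reactor: self, state: self.state.lock().unwrap() }
      }

      // Sources read the cached copy without taking the lock.
      fn ticker(&self) -> usize {
          self.cached_ticker.load(Ordering::SeqCst)
      }
  }

  impl ReactorLock<'_> {
      fn react(&mut self) {
          // The increment happens while the lock is held (and wakes for this
          // tick would be issued here as well).
          self.state.ticker += 1;
      }
  }

  impl Drop for ReactorLock<'_> {
      fn drop(&mut self) {
          // Refresh the cached copy as the lock is released.
          self.reactor.cached_ticker.store(self.state.ticker, Ordering::SeqCst);
      }
  }

  fn main() {
      let reactor = Reactor {
          state: Mutex::new(ReactorState { ticker: 0 }),
          cached_ticker: AtomicUsize::new(0),
      };
      {
          let mut lock = reactor.lock();
          lock.react();
          // Mid-tick, sources still see the previous ticker value.
          assert_eq!(reactor.ticker(), 0);
      }
      // After the lock is dropped, the cached ticker reflects the new tick.
      assert_eq!(reactor.ticker(), 1);
  }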

@fogti
Member

fogti commented Sep 7, 2022

Regarding #79 (comment), maybe we can move the ticker into the Mutex as mentioned. idk how exactly that would impact the rest of the codebase, tho. But keeping an atomic value and a value in a Mutex synchronized is a maintenance hazard; if it is possible to move the ticker behind the mutex, we should imo do that (probably not much worse performance-wise than using SeqCst).

But I'm not really sure we understand the originating problem correctly. Can we maybe try to temporarily add tracing to the deps of this crate in a separate branch, or behind a feature flag, add debug messages at every place where the affected variables are used, and apply this to a minimal test case to get some state logs? This looks like some internal state transition sometimes going wrong (or maybe some racing off-by-one thingy). I don't really understand the problem well enough yet, and this "fix" basically nullifies an existing optimization in the code base.
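
For instance (a purely hypothetical instrumentation sketch: the feature name and helper are made up, and it assumes tracing is added as an optional dependency), the tick/ticks transitions could be logged like this:

  // Hypothetical sketch, not actual async-io code.
  #[cfg(feature = "reactor-trace")]
  macro_rules! reactor_trace {
      ($($arg:tt)*) => { tracing::trace!($($arg)*) };
  }
  #[cfg(not(feature = "reactor-trace"))]
  macro_rules! reactor_trace {
      ($($arg:tt)*) => {};
  }

  // Illustrative call site: one trace line per state transition makes it
  // possible to reconstruct the interleaving of the reactor thread and the
  // polling thread from the logs.
  fn on_set_ticks(dir: usize, ticker: usize, tick: usize) {
      reactor_trace!(dir, ticker, tick, "setting state[dir].ticks");
  }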

@fogti
Member

fogti commented Sep 7, 2022

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

@winksaville
Author

winksaville commented Sep 7, 2022

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

Here is a branch with a bunch of debug logging; it needs to be rebased on top of master.

@winksaville
Author

I ran into this problem while following the hole_punching tutorial for rust-libp2p. This has been discussed in the post "relay_v2 performance anomaly on small machines".

One other thing that I found helpful while working on this was a custom logger which adds the time in nanoseconds, the line number and the thread id to each log entry. This did/does require a nightly build because as_u64() is gated behind the unstable thread_id_value feature, but it might be worth it.
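
For anyone who wants to reproduce that setup, a rough sketch of such a logger (hypothetical code, not the author's; it assumes the log crate and, for the thread id, a nightly toolchain with the thread_id_value feature):

  #![feature(thread_id_value)] // nightly-only, needed for ThreadId::as_u64()

  use std::time::{SystemTime, UNIX_EPOCH};

  // Hypothetical logger sketch: prefixes each record with nanoseconds since
  // the epoch, the thread id, and the file/line of the log call.
  struct NanoLogger;

  static LOGGER: NanoLogger = NanoLogger;

  impl log::Log for NanoLogger {
      fn enabled(&self, _metadata: &log::Metadata) -> bool {
          true
      }

      fn log(&self, record: &log::Record) {
          let nanos = SystemTime::now()
              .duration_since(UNIX_EPOCH)
              .unwrap()
              .as_nanos();
          let tid = std::thread::current().id().as_u64();
          println!(
              "{nanos} tid={tid} {}:{} {} {}",
              record.file().unwrap_or("?"),
              record.line().unwrap_or(0),
              record.level(),
              record.args()
          );
      }

      fn flush(&self) {}
  }

  fn main() {
      log::set_logger(&LOGGER).unwrap();
      log::set_max_level(log::LevelFilter::Trace);
      log::info!("reactor tick observed");
  }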

@winksaville winksaville closed this Sep 7, 2022
@winksaville
Author

Oops accidentally closed.

@winksaville winksaville reopened this Sep 7, 2022
gnoliyil pushed a commit to gnoliyil/fuchsia that referenced this pull request Jan 27, 2024
We want to add smol-rs/async-io#79 to our
distribution of async-io, so move it to forks.

Change-Id: Iccdab4f4a6aed7b773dbe200f07ce4e4f0bc9a05
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871499
Reviewed-by: Steven Grady <slgrady@google.com>
Reviewed-by: Adam Perry <adamperry@google.com>
Fuchsia-Auto-Submit: Casey Dahlin <sadmac@google.com>
Commit-Queue: Casey Dahlin <sadmac@google.com>
gnoliyil pushed a commit to gnoliyil/fuchsia that referenced this pull request Jan 27, 2024
This just adds smol-rs/async-io#79 to our
distribution of async-io, as we've seen issues in infra like the one
reported there.

Bug: 47531
Bug: 126159
Change-Id: I393f35c131b9737ee8f5043b72162b6c3bf98dfb
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871506
Reviewed-by: Adam Perry <adamperry@google.com>

Successfully merging this pull request may close these issues.

Missing waker.wake events
5 participants