fix: missing waker.wake #79
base: master
Conversation
My analysis of the problem is that using the current `Reactor::ticker` value in `ticks` is not always correct. It is my understanding that placing a ticker value into `Direction::ticks` assumes a guarantee that the `waker.wake()` call has completed and the associated task has been run. The problem is that the increment of the ticker value in `ReactorLock::react` typically runs on a different CPU thread than `Source::poll_ready` (called by `poll_readable` and `poll_writable`), which saves the ticker value in `ticks`:

`state[dir].ticks = Some((Reactor::get().ticker(), state[dir].tick));`

Because the increment and the use of the ticker value happen on different threads, the required guarantee cannot be fulfilled. I'm proposing the following fix in this PR, which uses only the current `state[dir].tick` and not the ticker value:

`state[dir].ticks = Some((state[dir].tick, 0));`

fix smol-rs#78
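To make the claimed ordering problem concrete, here is a minimal, self-contained model; this is not async-io code, and `ticker` and `woken` only stand in for the reactor's tick counter and the wake that is supposed to follow it:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ticker = Arc::new(AtomicUsize::new(0));
    let woken = Arc::new(AtomicBool::new(false));

    // "Reactor" thread: bump the tick counter first, deliver the wake later.
    let reactor = {
        let (ticker, woken) = (Arc::clone(&ticker), Arc::clone(&woken));
        thread::spawn(move || {
            ticker.fetch_add(1, Ordering::SeqCst);
            // ... an arbitrary delay can sit here (syscalls, scheduling) ...
            woken.store(true, Ordering::SeqCst);
        })
    };

    // "poll_ready" thread: observing the incremented ticker does NOT imply
    // that the corresponding wake has already been delivered.
    if ticker.load(Ordering::SeqCst) == 1 {
        println!(
            "tick observed, wake delivered yet? {}",
            woken.load(Ordering::SeqCst)
        );
    }

    reactor.join().unwrap();
}
```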
@taiki-e any chance you can take a look at this?
@dignifiedquire, would it be better to discuss this on issue #78, as there may be better solutions and/or other related issues?
Sorry, I don't have time to investigate this problem at this time.
r? @smol-rs/admins
This appears to reduce the amount of completions (e.g.
I might be mistaken, but it seems you're saying this change will cause additional missed wakes. But in the particular case I've seen, it fixes a missed wake. There certainly may be better ways to fix this, but it is my belief this decreases the likelihood of missed wakes.
Theoretically, the problem shouldn't happen, because … Also, even if this PR fixes the problem, or at least its symptoms (with the downside of probably reduced performance), this PR basically obscures the code, because the … Don't let this discourage you, I'm mostly just trying to understand where this problem stems from, and how to keep the code legible.
In my initial comment on this PR I say:
Is this a valid statement or am I mistaken?
afaik correct (but I'm not really familiar with this code base, and have only looked at
This doesn't hold (⇒ afaik is generally not true), because in many cases calling the waker only leads to "marking" of a task as to-be-scheduled in the associated reactor, which often also involves a queue, and thus gives no guarantee when (or if at all!) the future-of-interest will be polled again (because of "any-future-of" constructs and such). That is, even if the task gets rescheduled, the future-of-interest might get discarded if an alternative future is polled earlier in that task-execution. (This is sort-of theoretical in this case, because that shouldn't cause dead-locks or delays). And we usually don't know when the (now assumed to be scheduled) task gets polled again.
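To make that point concrete, here is a toy sketch of such a waker (not smol's executor): `wake()` only pushes the task onto a run queue and says nothing about when, or whether, the future that registered it gets polled again.

```rust
use std::sync::mpsc::Sender;
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

// Toy task: waking it only records its id on a run queue.
struct Task {
    id: usize,
    run_queue: Mutex<Sender<usize>>,
}

impl Wake for Task {
    fn wake(self: Arc<Self>) {
        // All that "waking" does here is enqueue the task; actually polling
        // it happens whenever the executor thread gets around to it.
        let _ = self.run_queue.lock().unwrap().send(self.id);
    }
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    let task = Arc::new(Task { id: 7, run_queue: Mutex::new(tx) });
    let waker = Waker::from(task);

    // wake() returns immediately; nothing has been polled yet.
    waker.wake();

    // The "executor" (here just the receiver) drains the queue later.
    assert_eq!(rx.recv().unwrap(), 7);
}
```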
I agree with you, the code has been queued, but there is no guarantee it ran. So the act of sending the wake event only says that at some point in the future the other thread will run, but on the async-io side there is, as you say:
That "optimization" is the cause of the bug, as now a `Poll::Ready(Ok(()))` will not be returned when it should have been. I believe you probably are correct when you say:
I will remove the second usize if desired.
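For context, the "second usize" is part of the pair stored in `ticks`. The check being discussed looks roughly like the following; this is a paraphrased, self-contained sketch, so names and exact logic are assumptions rather than verbatim async-io code:

```rust
// Paraphrased sketch of the "ticks" optimization: `ticks` remembers the
// reactor ticker and the per-direction tick at registration time, and
// readiness is assumed once the per-direction tick has moved past both
// remembered values, without touching the OS again.
fn event_assumed_delivered(ticks: Option<(usize, usize)>, current_tick: usize) -> bool {
    match ticks {
        Some((a, b)) => current_tick != a && current_tick != b,
        None => false,
    }
}

fn main() {
    // Registered while the reactor ticker was 5 and the per-direction tick was 3.
    let ticks = Some((5, 3));
    assert!(!event_assumed_delivered(ticks, 3)); // nothing delivered yet
    assert!(!event_assumed_delivered(ticks, 5)); // tick equals the old ticker value
    assert!(event_assumed_delivered(ticks, 6)); // tick moved past both values
    println!("ok");
}
```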
Please do that.
@zseri, Ok, I'll remove the second usize. Is it your opinion I have identified a bug?
could you push (or force-push, if you like) it to this branch (as the original fix is embedded in the associated issue, no info would be lost)? That would make it easier to observe the interaction with the remaining parts of
I'll try to do that tomorrow. Can you point me to a test that might be appropriate to use as a model for the new test?
hm tbh I don't really know how to write a test for that, because it involves many complex parts;
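For illustration only, one possible shape for a stress-style test against async-io's public API; this is a sketch under assumptions (echo over a local TCP socket, with a `Timer` as a hang detector), and it is not known whether anything like it would actually reproduce the missed wake:

```rust
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

use async_io::{Async, Timer};
use futures_lite::{future, prelude::*};

fn main() -> std::io::Result<()> {
    future::block_on(async {
        let listener = Async::<TcpListener>::bind(([127, 0, 0, 1], 0))?;
        let addr = listener.get_ref().local_addr()?;

        // Writer: connect and send many small messages.
        let writer = async {
            let mut stream = Async::<TcpStream>::connect(addr).await?;
            for _ in 0..10_000 {
                stream.write_all(b"x").await?;
            }
            std::io::Result::Ok(())
        };

        // Reader: accept and read everything back. If a readiness wake were
        // lost, this side would stall and the timer below would fire instead.
        let reader = async {
            let (mut stream, _) = listener.accept().await?;
            let mut buf = [0u8; 256];
            let mut total = 0;
            while total < 10_000 {
                let n = stream.read(&mut buf).await?;
                assert!(n > 0, "unexpected EOF");
                total += n;
            }
            std::io::Result::Ok(())
        };

        // Hang detector: the exchange should finish long before this fires.
        let timeout = async {
            Timer::after(Duration::from_secs(30)).await;
            Err(std::io::Error::new(
                std::io::ErrorKind::TimedOut,
                "possible missed wake: test timed out",
            ))
        };

        future::or(
            async {
                let (r, w) = future::zip(reader, writer).await;
                r?;
                w?;
                Ok(())
            },
            timeout,
        )
        .await
    })
}
```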
Txs for the info, it was fairly difficult to narrow down the problem, but one good thing was that as I added more debug logging the problem didn't go away!
As we both suspect, creating a test is likely going to be non-trivial and take significant effort on my part. It would be nice if someone with commit rights could provide initial feedback that this could be merged. Who might be interested in doing that quick initial review?
@yoshuawuyts @zeenix I'm not understanding the impact of this well enough, and wasn't able to fully understand the initial scenario in which the issue was noticed.
Is there still an interest in fixing this? You could probably resolve this by moving … Either way, I'd like a test case of some kind before we move too far in this direction. Maybe it would be advantageous to incorporate
Regarding #79 (comment), maybe we can move the
@winksaville you said you added copious amounts of logging to the library stack, do you still have the patches for that lying around? I would like to try debugging this myself, but don't want to start at "square zero".
Here is a branch with a bunch of debug logging, it needs to be rebased on top of
I ran into this problem while following the … One other thing that I have found helpful after working on this was to have a custom logger which adds time in nanos, line numbers, and the thread id to each log entry. This did/does require an unstable build because `as_u64()` is behind the unstable `thread_id_value` feature, but it might be worth it.
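A minimal sketch of that kind of logger, assuming the `log` crate and a nightly toolchain for the `thread_id_value` feature; the crate choice and the exact output format are assumptions, not necessarily what was actually used:

```rust
#![feature(thread_id_value)] // nightly-only: enables ThreadId::as_u64()

use std::time::{SystemTime, UNIX_EPOCH};

struct DebugLogger;

impl log::Log for DebugLogger {
    fn enabled(&self, _metadata: &log::Metadata) -> bool {
        true
    }

    fn log(&self, record: &log::Record) {
        // Nanoseconds since the epoch, for ordering events across threads.
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos())
            .unwrap_or(0);
        // Thread id as a plain integer; this is what needs `thread_id_value`.
        let tid = std::thread::current().id().as_u64();
        eprintln!(
            "{} [tid {}] {}:{} {}",
            nanos,
            tid,
            record.file().unwrap_or("?"),
            record.line().unwrap_or(0),
            record.args()
        );
    }

    fn flush(&self) {}
}

static LOGGER: DebugLogger = DebugLogger;

fn main() {
    log::set_logger(&LOGGER).unwrap();
    log::set_max_level(log::LevelFilter::Trace);
    log::trace!("logger installed");
}
```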
Oops accidentally closed. |
We want to add smol-rs/async-io#79 to our distribution of async-io, so move it to forks.
Change-Id: Iccdab4f4a6aed7b773dbe200f07ce4e4f0bc9a05
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871499
Reviewed-by: Steven Grady <slgrady@google.com>
Reviewed-by: Adam Perry <adamperry@google.com>
Fuchsia-Auto-Submit: Casey Dahlin <sadmac@google.com>
Commit-Queue: Casey Dahlin <sadmac@google.com>

This just adds smol-rs/async-io#79 to our distribution of async-io, as we've seen issues in infra like the one reported there.
Bug: 47531
Bug: 126159
Change-Id: I393f35c131b9737ee8f5043b72162b6c3bf98dfb
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/871506
Reviewed-by: Adam Perry <adamperry@google.com>