Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ready-cache: Avoid panic on strange race #420

Merged
merged 2 commits into from
Feb 24, 2020
Merged

Conversation

jonhoo
Copy link
Collaborator

@jonhoo jonhoo commented Feb 24, 2020

It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in cancel_pending_txs
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to evict would send on
the CancelTx for a service, yet that service would be yielded back
from pending in poll_pending in a non-Canceled state. This
is equivalent to saying that this code may panic:

async {
  let (tx, rx) = oneshot::channel();
  tx.send(42).unwrap();
  yield_once().await;
  rx.try_recv().unwrap(); // <- may occasionally panic
}

I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with mpsc::channel(1) still produced the bug, which indicates
that the bug is not with the implementation of oneshot. Logs also
indicate that the ChannelTx we send on in evict() truly is the same
one associated with the ChannelRx polled in Pending::poll, so we're
not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a
ready/errored service's CancelTx as another signal that a service has
been removed. Specifically, if poll_pending finds a service that
returns Ok or Err, but does not find its CancelTx, then it
assumes that it must be because the service was canceled, but did not
observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It may be that a canceled
service could remain in the pending state for a very long time if it
does not become ready and does not see the cancellation signal (so it
returns Poll::Pending and is not removed). That, in turn, might cause
an issue if the driver of the ReadyCache then chooses to re-use a key
they believe they have evicted. However, any such case must first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.

Fixes #415.

It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in `cancel_pending_txs`
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to `evict` would send on
the `CancelTx` for a service, yet that service would be yielded back
from `pending` in `poll_pending` in a non-`Canceled` state. This
is equivalent to saying that this code may panic:

```rust
async {
  let (tx, rx) = oneshot::channel();
  tx.send(42).unwrap();
  yield_once().await;
  rx.try_recv().unwrap(); // <- may occasionally panic
}
```

I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with `mpsc::channel(1)` still produced the bug, which indicates
that the bug is not with the implementation of `oneshot`. Logs also
indicate that the `ChannelTx` we send on in `evict()` truly is the same
one associated with the `ChannelRx` polled in `Pending::poll`, so we're
not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a
ready/errored service's `CancelTx` as another signal that a service has
been removed. Specifically, if `poll_pending` finds a service that
returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it
assumes that it must be because the service _was_ canceled, but did not
observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It _may_ be that a canceled
service could remain in the pending state for a very long time if it
does not become ready _and_ does not see the cancellation signal (so it
returns `Poll::Pending` and is not removed). That, in turn, might cause
an issue if the driver of the `ReadyCache` then chooses to re-use a key
they believe they have evicted. However, any such case _must_ first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.

Fixes #415.
@jonhoo jonhoo requested a review from olix0r February 24, 2020 15:19
tower-ready-cache/src/cache.rs Show resolved Hide resolved
@jonhoo jonhoo merged commit 414e3b0 into master Feb 24, 2020
@jonhoo jonhoo deleted the jonhoo/ready-cache-read-race branch February 24, 2020 18:03
jonhoo added a commit to mit-pdos/noria-mysql that referenced this pull request Feb 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panicked at 'missing cancelation' in tower-ready-cache
2 participants