ready-cache: Avoid panic on strange race #420
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in
cancel_pending_txs
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to
evict
would send onthe
CancelTx
for a service, yet that service would be yielded backfrom
pending
inpoll_pending
in a non-Canceled
state. Thisis equivalent to saying that this code may panic:
I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with
mpsc::channel(1)
still produced the bug, which indicatesthat the bug is not with the implementation of
oneshot
. Logs alsoindicate that the
ChannelTx
we send on inevict()
truly is the sameone associated with the
ChannelRx
polled inPending::poll
, so we'renot getting our wires crossed somewhere. It truly is bizarre.
This patch resolves the issue by considering a failure to find a
ready/errored service's
CancelTx
as another signal that a service hasbeen removed. Specifically, if
poll_pending
finds a service thatreturns
Ok
orErr
, but does not find itsCancelTx
, then itassumes that it must be because the service was canceled, but did not
observe that cancellation signal.
As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It may be that a canceled
service could remain in the pending state for a very long time if it
does not become ready and does not see the cancellation signal (so it
returns
Poll::Pending
and is not removed). That, in turn, might causean issue if the driver of the
ReadyCache
then chooses to re-use a keythey believe they have evicted. However, any such case must first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.
Fixes #415.