rpc/transport: fix temporary dispatch loop stalls #11023
src/v/rpc/transport.cc
Outdated
from_now(timing.memory_reserved_at),
from_now(timing.enqueued_at),
Are enqueued_at and memory_reserved_at out of order?
src/v/rpc/transport.cc
Outdated
@@ -204,6 +205,7 @@ transport::make_response_handler(
_correlations.size(),
from_now(
timing.timeout.timeout_at() - timing.timeout.timeout_period),
from_now(timing.memory_reserved_at),
This should go after enqueued_at, just noticed. Will fix.
In case RPCs are timing out due to memory pressure (unlikely, given that we do not set a small limit on the semaphore, but just in case).
.. to track the sequence of dispatches.
Moving to draft as I do more ci-repeat runs. I folded the dispatch loop stall fix into this PR.
Currently, the way the code is structured can result in temporary dispatch stalls that time out a bunch of requests.

Consider a request queue of sequence numbers [5, 6, 7, 8]. Let's assume the dispatch of seq=6 got delayed. This can happen if the timeout is low or there is an unknown delay by the scheduler (e.g. debug builds).

_last_seq=4, queue = [5, 7, 8] - 5 is dispatched right away
_last_seq=5, queue = [7, 8] -- out of order.

The dispatch of 7 and 8 doesn't happen right away because of the out-of-order sequence, and we never do a dispatch with seq=6 because we detect that it already timed out.

Now it takes a new RPC request to clear the stalled queue, but that may not happen for a while since most of these are timer based, and meanwhile seq=7 and seq=8 time out through no fault of their own.

This patch does two main things.

* Consolidates _last_seq tracking. Currently it is not monotonic and can jump all over the place, which is confusing. Now we only update it in one centralized place, dispatch_send().
* ^^ Requires that dispatch_send() is called in all cases, which also avoids dispatch loop stalls. For example, a timed out request will now clear the stalled queue right away (which was not the case before).
/ci-repeat 5
The repeat-5 debug failures are all known flaky issues currently happening with debug builds.
Following up here from an out-of-band conversation. Q: do we have any RPC users that depend on ordered delivery? I guess the important thing is to fully understand whether there are material differences in delivery order.
I think there is no change. Out-of-order messages are always possible with timeouts, e.g. the head of the queue times out and then the next RPC that depends on it is successfully dispatched. AIUI, in-order delivery is done on a best-effort basis as a performance optimization, but we have checks in the raft layer for correctness.
Thanks @bharathv |
/ci-repeat 1
/backport v23.1.x
Did we see this issue reported in 23.1.x?
No, not really. I closed the backport; we can revisit it later.
Currently the way the code is structured can result in temporary
dispatch stalls timing out a bunch of requests.
Consider a request queue of sequence numbers [5, 6, 7, 8]
Let's assume the dispatch of seq=6 got delayed. This can happen if the
timeout is low or there is an unknown delay by the scheduler (e.g. debug builds).
_last_seq=4, queue = [5, 7, 8] - 5 is dispatched right away
_last_seq=5, queue = [7,8] -- out of order.
Dispatch of 7, 8 doesn't happen right away due to out of order sequence
and we never do a dispatch with seq=6 because we detect it already
timed out.
Now it takes a new RPC request to clear the stalled queue, but
that may not happen for a while (e.g. until the queued RPCs time out), and
meanwhile seq=7 and seq=8 time out through no fault of their own.
This patch does two main things.
* Consolidates _last_seq tracking. Currently it is not monotonic and can
  jump all over the place, which is confusing. Now we only update it in
  one centralized place, dispatch_send().
* ^^ Requires that dispatch_send() is called in all cases, which also
  avoids dispatch loop stalls. For example, a timed out request will now
  clear the stalled queue right away (which was not the case before).
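
The stall and the fix described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in src/v/rpc/transport.cc: the class toy_transport, its `fixed` flag, the tombstone entry used for a timed out request, and the helper run_scenario() are all hypothetical names invented for the illustration. The only ideas taken from the description are that _last_seq is advanced in a single place (dispatch_send()) and that a timed out request should also run the dispatch loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy model of a sequence-ordered dispatch queue (hypothetical, for
// illustration only; the real transport is asynchronous and more involved).
class toy_transport {
public:
    toy_transport(bool fixed, std::uint64_t last_seq)
      : _fixed(fixed), _last_seq(last_seq) {}

    // A request is ready to be sent: queue it and try to drain the queue.
    void enqueue(std::uint64_t seq) {
        _queue.emplace(seq, true);
        dispatch_send();
    }

    // A request timed out before it was ever queued for sending.
    void timed_out(std::uint64_t seq) {
        if (!_fixed) {
            // Modelled old behaviour: nothing is queued and the dispatch
            // loop is not run, so later sequence numbers stay stuck.
            return;
        }
        // Modelled fixed behaviour: record a tombstone (assumption) so the
        // sequence can advance, and run the dispatch loop right away.
        _queue.emplace(seq, false);
        dispatch_send();
    }

    const std::vector<std::uint64_t>& sent() const { return _sent; }
    std::size_t pending() const { return _queue.size(); }
    std::uint64_t last_seq() const { return _last_seq; }

private:
    // The only place that advances _last_seq, so it stays monotonic.
    void dispatch_send() {
        auto it = _queue.begin();
        while (it != _queue.end() && it->first == _last_seq + 1) {
            assert(it->first > _last_seq); // monotonic by construction
            if (it->second) {
                _sent.push_back(it->first); // live request: "send" it
            }
            _last_seq = it->first; // tombstones advance the sequence too
            it = _queue.erase(it);
        }
    }

    bool _fixed;
    std::uint64_t _last_seq;
    std::map<std::uint64_t, bool> _queue; // seq -> true if still live
    std::vector<std::uint64_t> _sent;
};

// Replays the example from the description: _last_seq=4, requests 5, 7, 8
// are queued, and seq=6 is detected as timed out afterwards.
toy_transport run_scenario(bool fixed) {
    toy_transport t(fixed, 4);
    t.enqueue(5);
    t.enqueue(7);
    t.enqueue(8);
    t.timed_out(6);
    return t;
}
```

In this model, run_scenario(false) sends only 5 and leaves 7 and 8 pending until something else triggers a dispatch, while run_scenario(true) sends 5, 7 and 8 and ends with _last_seq at 8. Whether the real patch records timed out requests this way is an assumption; the sketch only shows why calling dispatch_send() on the timeout path unblocks the queue.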