[v23.1.x] rpc/transport: fix temporary dispatch loop stalls #11184

vbotbuildovich
Collaborator

Backport of PR #11023

bharathv added 4 commits June 3, 2023 17:32
In case RPCs are timing out due to memory pressure (unlikely, given that
we do not set a small limit on the semaphore, but just in case).

(cherry picked from commit d014caf)
.. to track the sequence of dispatches.

(cherry picked from commit 3cbb63b)
The way the code is currently structured can cause temporary dispatch
stalls that time out a batch of requests.

Consider a request queue with sequence numbers [5, 6, 7, 8].

Let's assume the dispatch of seq=6 is delayed. This can happen if the
timeout is low or the scheduler adds an unexpected delay (e.g. in debug
builds).

_last_seq=4, queue = [5, 7, 8] -- 5 is dispatched right away
_last_seq=5, queue = [7, 8]    -- out of order

The dispatch of 7 and 8 doesn't happen right away because the sequence
is out of order, and we never dispatch seq=6 because we detect that it
has already timed out.

Now it takes a new RPC request to clear the stalled queue, but one may
not arrive for a while because most of these requests are timer based.
Meanwhile seq=7 and seq=8 time out through no fault of their own.
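
The stall described above can be reproduced with a small toy model. This is a sketch in Python, not Redpanda's C++ code: the class `StallingDispatcher`, its `skipped` set, and the method names are hypothetical. It assumes that the pre-patch code notes a timed-out sequence number without re-running the dispatch loop, which is one way to get the behaviour the commit message describes.

```python
class StallingDispatcher:
    """Toy model of the pre-patch behaviour (hypothetical, not Redpanda code)."""

    def __init__(self, last_seq):
        self.last_seq = last_seq   # last sequence number handled in order
        self.queue = set()         # requests waiting for in-order dispatch
        self.skipped = set()       # sequence numbers known to have timed out
        self.sent = []             # dispatch order, for inspection

    def enqueue(self, seq):
        # A new request is the only event that runs the dispatch loop.
        self.queue.add(seq)
        self._drain()

    def on_timeout(self, seq):
        # Assumption: the timeout is recorded, but the loop is NOT re-run.
        self.queue.discard(seq)
        self.skipped.add(seq)

    def _drain(self):
        # Dispatch strictly in sequence order, stepping over timed-out entries.
        while True:
            nxt = self.last_seq + 1
            if nxt in self.skipped:
                self.skipped.remove(nxt)
            elif nxt in self.queue:
                self.queue.remove(nxt)
                self.sent.append(nxt)
            else:
                break
            self.last_seq = nxt


d = StallingDispatcher(last_seq=4)
d.enqueue(5)                  # dispatched right away
d.enqueue(7)                  # out of order: waits for 6
d.enqueue(8)                  # waits behind 7
d.on_timeout(6)               # 6 timed out before it was ever queued

stalled = sorted(d.queue)     # 7 and 8 are still stuck
sent_during_stall = list(d.sent)

d.enqueue(9)                  # only a new request clears the stall
```

In this model, `stalled` holds 7 and 8 even though nothing is blocking them any more; they are dispatched only when request 9 arrives.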

This patch does two main things.

* Consolidates _last_seq tracking. Currently it is not monotonic and can
jump all over the place. This is confusing. Now we only update it in a
centralized place in dispatch_send().

* The above requires that dispatch_send() is called in all cases, which
also avoids dispatch loop stalls. For example, a timed-out request will
clear the stalled queue right away (which was not the case before).

(cherry picked from commit 453d340)
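
The two changes listed in the commit message above can be sketched as a toy model. This is Python written for illustration, not the patch itself: the class `Dispatcher` and its `timed_out` set are hypothetical, and only the names `dispatch_send` and `last_seq` echo identifiers from the commit message. The sketch assumes that a timed-out sequence number can simply be stepped over by the dispatch loop.

```python
class Dispatcher:
    """Toy model of the post-patch behaviour (hypothetical, not Redpanda code)."""

    def __init__(self, last_seq):
        self.last_seq = last_seq   # updated only inside dispatch_send()
        self.queue = set()         # requests waiting for in-order dispatch
        self.timed_out = set()     # sequence numbers that will never be sent
        self.sent = []             # dispatch order, for inspection

    def enqueue(self, seq):
        self.queue.add(seq)
        self.dispatch_send()

    def on_timeout(self, seq):
        self.queue.discard(seq)
        if seq > self.last_seq:    # ignore timeouts of already-handled requests
            self.timed_out.add(seq)
        self.dispatch_send()       # called in all cases, including timeouts

    def dispatch_send(self):
        # The single place where last_seq changes; it only moves forward.
        while True:
            nxt = self.last_seq + 1
            if nxt in self.timed_out:
                self.timed_out.remove(nxt)
            elif nxt in self.queue:
                self.queue.remove(nxt)
                self.sent.append(nxt)
            else:
                break
            self.last_seq = nxt


d = Dispatcher(last_seq=4)
d.enqueue(5)       # dispatched right away
d.enqueue(7)       # out of order: waits for 6
d.enqueue(8)       # waits behind 7
d.on_timeout(6)    # the timeout itself now drains 7 and 8
```

After `on_timeout(6)` the queue is empty and `d.sent` is `[5, 7, 8]`, without waiting for another request to arrive.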
@vbotbuildovich vbotbuildovich added this to the v23.1.x-next milestone Jun 3, 2023
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jun 3, 2023
@bharathv
Contributor

bharathv commented Jun 5, 2023

The issue has not been observed so far in this branch, and the fix is very fundamental. We can revisit this backport once the original fix has had more bake time.

@bharathv bharathv closed this Jun 5, 2023
@vshtokman vshtokman modified the milestones: v23.1.x-next, v23.1.12 Jun 6, 2023
Labels
area/redpanda kind/backport PRs targeting a stable branch
3 participants