
PubSub: Fix streaming pull incorrectly handling FlowControl max_messages setting #7948

Merged
plamut merged 5 commits into googleapis:master from plamut:iss-7677 on May 24, 2019

Conversation

plamut
Contributor

@plamut plamut commented May 10, 2019

Closes #7677.

This PR fixes how the subscriber client handles the FlowControl.max_messages setting. It ensures that the subscriber delivers (i.e. invokes callbacks for) at most max_messages messages at a time, and resumes invoking callbacks only when the user code acknowledges some of the previously delivered messages.
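
For context, here is a minimal sketch of how the setting is applied on the client side (the project and subscription names are placeholders, not taken from this PR):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder names -- substitute a real project and subscription.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    # Process the message, then ACK it; ACK-ing is what frees up flow control
    # capacity so that further messages can be delivered.
    print("Received:", message.data)
    message.ack()

# Deliver at most 10 messages at a time to the callbacks.
flow_control = pubsub_v1.types.FlowControl(max_messages=10)
streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

try:
    streaming_pull_future.result()  # block the main thread while messages arrive
except KeyboardInterrupt:
    streaming_pull_future.cancel()
```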

How to test

Steps to reproduce:

  • Prerequisite: Have a working service account, Pub/Sub topic, and subscription created.
  • Publish 50-100 messages to that topic in batches of random size - example script (a rough publisher sketch is also shown after this list).
  • Run a test subscriber script (example) that subscribes to the topic using a streaming pull.
  • (optional) Apply a patch that adds colors to the log output, which makes it easier to read.
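
The publisher step above could look roughly like the following sketch (this is not the linked example script; the topic name and batch sizes are illustrative placeholders):

```python
import random

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

total = random.randint(50, 100)
published = 0

while published < total:
    # Publish a batch of random size, then wait for the whole batch to be sent.
    batch_size = min(random.randint(1, 10), total - published)
    futures = [
        publisher.publish(topic_path, data=("message %d" % (published + i)).encode())
        for i in range(batch_size)
    ]
    for future in futures:
        future.result()
    published += batch_size

print("Published %d messages in random-sized batches." % published)
```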

Actual result (before the fix):

  • The background consumer does not always get paused in time and pulls an additional batch of messages from the server:
    (screenshot: log_incorrectly_paused)
  • The client subscriber script often receives too many messages simultaneously, i.e. more than FlowControl.max_messages.

Expected result (after the fix):

  • The background consumer always gets paused in time when the leased messages load hits the limit (max_messages):
    (screenshot: log_correctly_paused)
  • The client subscriber script never has more than max_messages pending messages (received, but not yet acknowledged) - the "pending ACK" figure in the logs. The streaming pull process waits until some of the messages are ACK-ed before delivering new messages.
  • Excess messages pulled from the server (i.e. those that would cause an overload) are held in an internal buffer and are released to the client script before any new messages get pulled from the server. The streaming pull process does not temporarily hold any more messages than necessary (see the sketch after this list).
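
To illustrate the holding-buffer behaviour described in the last point, here is a purely schematic sketch; the class and method names are hypothetical and do not mirror the actual StreamingPullManager internals:

```python
import collections

class HoldingBufferSketch:
    """Schematic model of flow-controlled delivery with an on-hold buffer."""

    def __init__(self, max_messages, deliver):
        self._max_messages = max_messages
        self._deliver = deliver          # callable that invokes the user callback
        self._outstanding = 0            # delivered but not yet ACK-ed
        self._on_hold = collections.deque()

    def on_response(self, messages):
        # Park everything first, then release only as much as flow control allows.
        self._on_hold.extend(messages)
        self._maybe_release()

    def on_ack(self, count=1):
        # ACKs free up capacity, so buffered messages can be released.
        self._outstanding -= count
        self._maybe_release()

    def _maybe_release(self):
        while self._on_hold and self._outstanding < self._max_messages:
            self._deliver(self._on_hold.popleft())
            self._outstanding += 1
```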

Footnotes

For max_messages settings greater than 10, one might observe that a maximum of 10 (or fewer) messages are delivered to the client code at once, especially if message callbacks are slow, e.g. by artificially sleeping for prolonged periods of time.

This is caused by the default callback thread pool size, which has max_workers set to 10.

This can be mitigated by creating the subscriber client with a custom scheduler that uses a thread pool with a higher cap on the worker thread count, or by using batch callbacks for received messages. The latter feature is almost done, and a PR for it will be submitted soon.
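
A sketch of the first mitigation, assuming the ThreadScheduler API available in the library around the time of this PR (the worker count of 50 is an arbitrary example; project and subscription names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler

# A larger thread pool lets more than the default 10 callbacks run concurrently.
executor = ThreadPoolExecutor(max_workers=50)
scheduler = ThreadScheduler(executor=executor)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
flow_control = pubsub_v1.types.FlowControl(max_messages=50)

streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=lambda message: message.ack(),  # trivial callback for illustration
    flow_control=flow_control,
    scheduler=scheduler,
)
```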

@plamut plamut added the api: pubsub Issues related to the Pub/Sub API. label May 10, 2019
@plamut plamut requested a review from crwilcox as a code owner May 10, 2019 16:31
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 10, 2019
@plamut
Contributor Author

plamut commented May 15, 2019

Rebased because of a merge conflict when #7954 was merged.

@plamut plamut added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 15, 2019
@plamut
Contributor Author

plamut commented May 15, 2019

Putting this on hold, because the change is a non-trivial one and we want to make a PubSub release first. We will unblock the PR once that is done (reviews are still welcome, though).

cc: @sduskis

Contributor

@sduskis sduskis left a comment


Overall looks good, but I want to look over this more. I have a few documentation comments.

Review comments on pubsub/tests/system.py (resolved; 2 outdated)
@plamut plamut force-pushed the iss-7677 branch 2 times, most recently from 1b8a212 to 7e5b9d7 on May 17, 2019 10:03
@plamut
Contributor Author

plamut commented May 17, 2019

@sduskis Made the suggested changes; the diff can be seen in the first force push.

(The second force push was just a rebase on top of the latest master.)

Contributor

@sduskis sduskis left a comment


I keep on looking through this and don't see anything wrong, per se. Given the size and complexity, I'm personally not comfortable approving. @tseaver, would you be able to lend a hand and review this PR?

@sduskis sduskis requested a review from tseaver May 17, 2019 22:16
In certain cases automatically leasing Message instances upon creation
might not be desired, thus an optional parameter is added to Message
initializer that allows skipping that.

The default behavior is not changed, new Message instances *are*
automatically leased upon creation.

Leasing messages through a request queue in the dispatcher causes a race
condition with the ConsumeBidirectionalStream thread. A request to
pause the background consumer can arrive when the Bidi consumer is just
about to fetch the next batch of messages, and thus the latter gets
paused only *after* fetching those messages.

This commit synchronously leases received messages in the streaming
pull manager callback. If that hits the lease management load limit,
the background consumer is paused synchronously, and will correctly
pause *before* pulling another batch of messages.

If the PubSub backend sends so many messages in a single response that
adding all of them to the leaser would overload it, the
StreamingPullManager now puts the excess messages into an internal
holding buffer.

The messages are released from the buffer when the leaser again has
enough capacity (as defined by the FlowControl settings), and the
message received callback is invoked then as well.

With the StreamingPullManager._on_response() callback adding received
messages to the leaser synchronously (in the background consumer
thread), a race condition can happen with the dispatcher thread, which
can asynchronously add (remove) messages to (from) lease management,
e.g. on ack() and nack() requests.

The same is the case with the related operations of maybe pausing/resuming
the background consumer. This commit thus adds locks in key places,
ensuring that these operations are atomic and not subject to race
conditions.
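
For readers unfamiliar with the codebase, here is a purely illustrative sketch of the synchronization pattern this commit describes; all names below are hypothetical and are not the library's actual internals:

```python
import threading

class SynchronizedFlowControlSketch:
    """Schematic model: one lock guards leasing and consumer pause/resume."""

    def __init__(self, consumer, leaser, max_load=1.0):
        self._consumer = consumer   # background streaming pull consumer
        self._leaser = leaser       # tracks leased (not yet ACK-ed) messages
        self._max_load = max_load
        self._pause_resume_lock = threading.Lock()

    def lease(self, messages):
        # Called synchronously from the consumer thread when a response arrives.
        with self._pause_resume_lock:
            self._leaser.add(messages)
            if self._leaser.load >= self._max_load:
                self._consumer.pause()  # pause *before* the next batch is pulled

    def drop(self, messages):
        # Called from the dispatcher thread, e.g. on ack()/nack() requests.
        with self._pause_resume_lock:
            self._leaser.remove(messages)
            if self._leaser.load < self._max_load:
                self._consumer.resume()
```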
@plamut
Contributor Author

plamut commented May 24, 2019

@tseaver Addressed the comments, please have another look. The diff is small enough, so I changed a few existing commits and force-pushed the changes.

Also removing the do not merge label, since PubSub version 0.41.0 has been successfully released in the meantime.

@plamut plamut removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 24, 2019
@plamut plamut merged commit 8a23a04 into googleapis:master May 24, 2019
Successfully merging this pull request may close the following issue:

Pub/Sub: Flow Control does not work as expected.
5 participants