
PubSub: Add an option to streaming pull to invoke callbacks for batches of messages #8022

Closed
wants to merge 8 commits

Conversation

plamut
Contributor

@plamut plamut commented May 17, 2019

Closes #4994.


Do not merge yet!

This has been removed from GA for now (comment), because other implementations currently do not have this feature.


This PR adds an option to receive asynchronously pulled messages in callbacks in batches, rather than a single message per callback invocation.

NOTE: This PR branch is based on #7948, because it needs to modify several parts of the code added by the latter.

How to test

Steps to reproduce:

  • Prerequisite: Have a working service account, Pub/Sub topic, and subscription created.
  • Rapidly publish several messages to the topic - they must be sent to the PubSub server in the same batch (or several batches); see the publishing sketch after this list.
  • Create a subscriber client and start the streaming pull with the argument batch=True:
subscriber.subscribe(SUBSCRIPTION_NAME, callback, batch=True)
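
For the publishing step, a quick burst like the following is usually enough for the publisher client to batch the messages together (the project and topic names are placeholders):

from google.cloud import pubsub_v1

# Placeholder project/topic names - substitute your own.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Publish a burst of messages without blocking on each future, so the
# publisher client batches them before sending them to the server.
publish_futures = [
    publisher.publish(topic_path, "message {}".format(i).encode("utf-8"))
    for i in range(10)
]
for publish_future in publish_futures:
    publish_future.result()  # wait until all messages are actually sent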

Actual result (before the fix):
The provided callback is invoked once for each message.

Expected result (after the fix):
The provided callback is invoked once for each batch of published messages (subject to FlowControl.max_messages limits). The callback is invoked with a list of messages as the argument, as illustrated below.
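
For illustration, a callback compatible with the batch=True option proposed here would accept a list of messages, e.g.:

def callback(messages):
    # With batch=True, the callback receives a list of messages instead
    # of a single message per invocation.
    print("Received a batch of {} messages.".format(len(messages)))
    for message in messages:
        message.ack()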

TODO list

  • Fix docs (format of some of the changed docstrings).

In certain cases, automatically leasing Message instances upon creation
might not be desired, so an optional parameter is added to the Message
initializer that allows skipping it.

The default behavior is unchanged: new Message instances *are*
automatically leased upon creation.

Leasing messages through a request queue in the dispatcher causes a race
condition with the ConsumeBidirectionalStream thread. A request to
pause the background consumer can arrive when the Bidi consumer is just
about to fetch the next batch of messages, and thus the latter gets
paused only *after* fetching those messages.

This commit synchronously leases received messages in the streaming
pull manager callback. If that hits the lease management load limit,
the background consumer is paused synchronously, and will correctly
pause *before* pulling another batch of messages.

If the PubSub backend sends so many messages in a single response that
adding all of them to the leaser would overload it, the
StreamingPullManager now puts the excess messages into an internal
holding buffer.

The messages are released from the buffer when the leaser again has
enough capacity (as defined by the FlowControl settings), and the
message-received callback is invoked for them at that point as well.
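
A rough sketch of the holding-buffer idea (the class and attribute names here are illustrative only, not the actual StreamingPullManager code):

import collections

class HoldingBufferSketch(object):
    """Illustrative only - not the real StreamingPullManager."""

    def __init__(self, max_leased):
        self._max_leased = max_leased     # e.g. FlowControl.max_messages
        self._leased = []                 # messages under lease management
        self._held = collections.deque()  # excess messages wait here

    def on_response(self, messages):
        for message in messages:
            if len(self._leased) >= self._max_leased:
                # Adding this message would overload the leaser, so it
                # goes into the internal holding buffer instead.
                self._held.append(message)
            else:
                self._leased.append(message)

    def release_held(self):
        # Called when the leaser has spare capacity again; messages
        # released here are also dispatched to the user callback.
        while self._held and len(self._leased) < self._max_leased:
            self._leased.append(self._held.popleft())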

With the StreamingPullManager._on_response() callback adding received
messages to the leaser synchronously (in the background consumer
thread), a race condition can occur with the dispatcher thread, which
can asynchronously add (remove) messages to (from) lease management,
e.g. on ack() and nack() requests.

The same applies to the related operations of pausing/resuming the
background consumer. This commit thus adds locks in key places,
assuring that these operations are atomic and not subject to race
conditions.
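
The locking pattern can be sketched roughly as follows (hypothetical names; the real code guards the leaser state and the consumer pause/resume decisions):

import threading

class LeaseLoadSketch(object):
    """Illustrative sketch of atomic pause/resume decisions."""

    def __init__(self, capacity):
        self._lock = threading.Lock()
        self._capacity = capacity
        self._leased = 0
        self.paused = False

    def add(self, n_messages):
        # Called from the background consumer thread on each response.
        with self._lock:
            self._leased += n_messages
            if self._leased >= self._capacity:
                self.paused = True   # pause *before* the next pull

    def remove(self, n_messages):
        # Called from the dispatcher thread on ack()/nack() requests;
        # the lock keeps it from racing with add() above.
        with self._lock:
            self._leased -= n_messages
            if self.paused and self._leased < self._capacity:
                self.paused = False  # safe to resume the consumer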
@plamut plamut added do not merge Indicates a pull request not ready for merge, due to either quality or timing. api: pubsub Issues related to the Pub/Sub API. labels May 17, 2019
@plamut plamut requested a review from crwilcox as a code owner May 17, 2019 19:43
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 17, 2019
@sduskis sduskis requested review from sduskis and removed request for crwilcox May 17, 2019 22:10
@sduskis
Contributor

sduskis commented May 17, 2019

lint failed: google/cloud/pubsub_v1/subscriber/client.py:202:15: W291 trailing whitespace

@sduskis
Contributor

sduskis commented May 17, 2019

We should discuss this feature with the Pub/Sub team before merging. They want to limit the differences between implementations in different languages, if possible.

Subscriber client's async streaming pull can now be configured to
invoke a callback for batches of received messages, as opposed to
invoking a callback for each received message.
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jun 4, 2019
@sduskis sduskis closed this Jun 4, 2019
@c0rv4x

c0rv4x commented Jul 29, 2019

Hi, is there any ETA for merging this? This is a pretty vital decision I have to make in my project, and the decision depends on the ETA. Thanks!

@plamut
Contributor Author

plamut commented Jul 29, 2019

@c0rv4x I can only give an unofficial answer, as I am not the decision maker here, but AFAIK this PR is unlikely to be merged in the near future.

PubSub clients for other languages currently lack this feature, and consistency is desired across various implementations.

@c0rv4x

c0rv4x commented Aug 2, 2019

@plamut

Thanks for your answer, it helps a lot!

@szeitlin

It would be nice to have this feature, though, since Java, Ruby, and Go all have the streaming pull/concurrency feature, and Python does not.

@plamut
Contributor Author

plamut commented Mar 25, 2020

@szeitlin The Python client also supports asynchronous (streaming) pull, and it's actually the preferred way of pulling messages (docs). The difference here is just that each received message is dispatched individually, i.e. one message per callback invocation, as opposed to multiple messages per invocation.

Or did I misunderstand your comment?

@szeitlin

We're trying the asynchronous approach, but ultimately we still have a use case where we'll need to write out messages in batches. That will be a bit of work to implement with the streaming pull method, since we have to check when we've accumulated enough messages. I get the impression a few other people would appreciate this feature for the same reason, unless I'm misunderstanding how it's supposed to work?

@plamut
Contributor Author

plamut commented Mar 25, 2020

The batch callback, at least how it was implemented here, simply means that if N messages are available to dispatch to the user callback, all of them are passed to the callback in a list, as opposed to N callback invocations, one for each individual message.

The concern raised in internal discussions was that it is not well defined what a "batch" actually is, which can thus be confusing to users. Publishing several messages in a single batch does not necessarily mean that the backend, too, will actually deliver these messages in a single batch. The "received" batch can also contain messages from several different publish batches.

Since the number of messages received in a single batch might vary, you would still have to implement the aggregation logic to know when enough messages have been accumulated.

Just to avoid any misunderstanding, what would be your expectation of such a feature, i.e. what messages would you expect to receive in a particular batch?

@szeitlin

Good question. I guess we were just hoping we could set a minimum batch size (in our case it's ok if things arrive out of order).

@plamut
Contributor Author

plamut commented Mar 27, 2020

The backend does not support that, unfortunately, thus any "minimum size" batching logic needs to be implemented on the client side. While the client library could do that internally, it cannot decide for the user what to do in edge cases, e.g. when no new messages arrive for a substantial period of time, but it does not yet hold enough messages to release the next batch.

One way to achieve batching is to put received messages into a queue that a different thread processes, batching them as needed, e.g. writing them out when enough of them have accumulated in the queue.

import queue

received_queue = queue.Queue()

...

# Each received message is simply enqueued; a separate worker thread
# can then drain the queue and process the messages in batches.
future = subscriber.subscribe("my/subscription/path", callback=received_queue.put)

try:
    future.result()
except KeyboardInterrupt:
    future.cancel()
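
To build on that, a worker thread could drain the queue and flush whenever enough messages have accumulated, or after a quiet period (the batch size, timeout, and the write_out() function below are arbitrary placeholders):

import threading

BATCH_SIZE = 100      # arbitrary minimum batch size
FLUSH_TIMEOUT = 30.0  # flush a partial batch after this many idle seconds

def batch_worker(received_queue, write_out):
    """Drain the queue and hand messages to write_out() in batches."""
    batch = []
    while True:
        try:
            batch.append(received_queue.get(timeout=FLUSH_TIMEOUT))
            timed_out = False
        except queue.Empty:
            timed_out = True

        if batch and (len(batch) >= BATCH_SIZE or timed_out):
            write_out(batch)   # e.g. persist the whole batch at once
            for message in batch:
                message.ack()  # ack only after a successful write
            batch = []

worker = threading.Thread(
    target=batch_worker, args=(received_queue, print), daemon=True
)
worker.start()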

Would that help?

@szeitlin

This is an interesting point, we will try that if the way we have it isn't fast enough. Thanks!

Labels
api: pubsub Issues related to the Pub/Sub API. cla: yes This human has signed the Contributor License Agreement. do not merge Indicates a pull request not ready for merge, due to either quality or timing. 🚨 This issue needs some love.
6 participants