
PubSub: Add an option to streaming pull to invoke callbacks for batches of messages #8022

Closed
wants to merge 8 commits

Conversation

plamut
Contributor

@plamut plamut commented May 17, 2019

Closes #4994.


Do not merge yet!

This has been removed from GA for now (comment), because other implementations currently do not have this feature.


This PR adds an option to receive asynchronously pulled messages in callbacks in batches, rather than a single message per callback invocation.

NOTE: This PR branch is based on #7948, because it needs to modify several parts of the code added by the latter.

How to test

Steps to reproduce:

  • Prerequisite: Have a working service account, Pub/Sub topic, and subscription created.
  • Rapidly publish several messages to the topic - they must be sent to the PubSub server in the same batch (or several batches); see the publishing sketch after this list.
  • Create a subscriber client and start the streaming pull with the argument batch=True:
subscriber.subscribe(SUBSCRIPTION_NAME, callback, batch=True)
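
For the publishing step, a quick burst like the following is usually enough for the publisher client to batch the messages together (the project and topic names are placeholders):

from google.cloud import pubsub_v1

# Placeholder project/topic names - substitute your own.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Publish a burst of messages without blocking on each future, so the
# publisher client batches them before sending them to the server.
publish_futures = [
    publisher.publish(topic_path, "message {}".format(i).encode("utf-8"))
    for i in range(10)
]
for publish_future in publish_futures:
    publish_future.result()  # wait until all messages are actually sent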

Actual result (before the fix):
The provided callback is invoked once for each message.

Expected result (after the fix):
The provided callback is invoked once for each batch of published messages (subject to FlowControl.max_messages limits). The callback is invoked with a list of messages as the argument, as illustrated below.
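
For illustration, a callback compatible with the batch=True option proposed here would accept a list of messages, e.g.:

def callback(messages):
    # With batch=True, the callback receives a list of messages instead
    # of a single message per invocation.
    print("Received a batch of {} messages.".format(len(messages)))
    for message in messages:
        message.ack()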

TODO list

  • Fix docs (format of some of the changed docstrings).

In certain cases, automatically leasing Message instances upon creation
might not be desired, so an optional parameter is added to the Message
initializer that allows skipping it.

The default behavior is unchanged: new Message instances *are*
automatically leased upon creation.

Leasing messages through a request queue in the dispatcher causes a race
condition with the ConsumeBidirectionalStream thread. A request to
pause the background consumer can arrive when the Bidi consumer is just
about to fetch the next batch of messages, and thus the latter gets
paused only *after* fetching those messages.

This commit synchronously leases received messages in the streaming
pull manager callback. If that hits the lease management load limit,
the background consumer is paused synchronously, and will correctly
pause *before* pulling another batch of messages.

If the PubSub backend sends so many messages in a single response that
adding all of them to the leaser would overload it, the
StreamingPullManager now puts the excess messages into an internal
holding buffer.

The messages are released from the buffer when the leaser again has
enough capacity (as defined by the FlowControl settings), and the
message-received callback is invoked for them at that point as well.
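
A rough sketch of the holding-buffer idea (the class and attribute names here are illustrative only, not the actual StreamingPullManager code):

import collections

class HoldingBufferSketch(object):
    """Illustrative only - not the real StreamingPullManager."""

    def __init__(self, max_leased):
        self._max_leased = max_leased     # e.g. FlowControl.max_messages
        self._leased = []                 # messages under lease management
        self._held = collections.deque()  # excess messages wait here

    def on_response(self, messages):
        for message in messages:
            if len(self._leased) >= self._max_leased:
                # Adding this message would overload the leaser, so it
                # goes into the internal holding buffer instead.
                self._held.append(message)
            else:
                self._leased.append(message)

    def release_held(self):
        # Called when the leaser has spare capacity again; messages
        # released here are also dispatched to the user callback.
        while self._held and len(self._leased) < self._max_leased:
            self._leased.append(self._held.popleft())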

With the StreamingPullManager._on_response() callback adding received
messages to the leaser synchronously (in the background consumer
thread), a race condition can occur with the dispatcher thread, which
can asynchronously add (remove) messages to (from) lease management,
e.g. on ack() and nack() requests.

The same applies to the related operations of pausing/resuming the
background consumer. This commit thus adds locks in key places,
assuring that these operations are atomic and not subject to race
conditions.
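
The locking pattern can be sketched roughly as follows (hypothetical names; the real code guards the leaser state and the consumer pause/resume decisions):

import threading

class LeaseLoadSketch(object):
    """Illustrative sketch of atomic pause/resume decisions."""

    def __init__(self, capacity):
        self._lock = threading.Lock()
        self._capacity = capacity
        self._leased = 0
        self.paused = False

    def add(self, n_messages):
        # Called from the background consumer thread on each response.
        with self._lock:
            self._leased += n_messages
            if self._leased >= self._capacity:
                self.paused = True   # pause *before* the next pull

    def remove(self, n_messages):
        # Called from the dispatcher thread on ack()/nack() requests;
        # the lock keeps it from racing with add() above.
        with self._lock:
            self._leased -= n_messages
            if self.paused and self._leased < self._capacity:
                self.paused = False  # safe to resume the consumer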
@plamut plamut added do not merge Indicates a pull request not ready for merge, due to either quality or timing. api: pubsub Issues related to the Pub/Sub API. labels May 17, 2019
@plamut plamut requested a review from crwilcox as a code owner May 17, 2019 19:43
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 17, 2019
@sduskis sduskis requested review from sduskis and removed request for crwilcox May 17, 2019 22:10
@sduskis
Contributor

sduskis commented May 17, 2019

lint failed: google/cloud/pubsub_v1/subscriber/client.py:202:15: W291 trailing whitespace

@sduskis
Contributor

sduskis commented May 17, 2019

We should discuss this feature with the Pub/Sub team before merging. They want to limit the differences between implementations in different languages, if possible.

Subscriber client's async streaming pull can now be configured to
invoke a callback for batches of received messages, as opposed to
invoking a callback for each received message.
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jun 4, 2019
@sduskis sduskis closed this Jun 4, 2019
@c0rv4x

c0rv4x commented Jul 29, 2019

Hi, is there any ETA for merging this? This is a pretty vital decision I have to make in my project, and the decision depends on the ETA. Thanks!

@plamut
Contributor Author

plamut commented Jul 29, 2019

@c0rv4x I can only give an unofficial answer, as I am not the decision maker here, but AFAIK this PR is unlikely to be merged in the near future.

PubSub clients for other languages currently lack this feature, and consistency is desired across various implementations.

@c0rv4x

c0rv4x commented Aug 2, 2019

@plamut

Thanks for your answer, it helps a lot!

@szeitlin

It would be nice to have this feature, though, since Java, Ruby, and Go all have the streaming pull/concurrency feature, and Python does not.

@plamut
Contributor Author

plamut commented Mar 25, 2020

@szeitlin The Python client also supports asynchronous (streaming) pull, and it's actually the preferred way of pulling messages (docs). The difference here is just that each received message is dispatched individually, i.e. one message per callback invocation, as opposed to multiple messages per invocation.

Or did I misunderstand your comment?

@szeitlin

We're trying the asynchronous approach, but ultimately we still have a use case where we'll need to write out messages in batches. That will be a bit of work to implement with the streaming pull method, since we have to check when we've accumulated enough messages. I get the impression a few other people would appreciate this feature for the same reason, unless I'm misunderstanding how it's supposed to work?

@plamut
Contributor Author

plamut commented Mar 25, 2020

The batch callback, at least how it was implemented here, simply means that if N messages are available to dispatch to the user callback, all of them are passed to the callback in a list, as opposed to N callback invocations, one for each individual message.

The concern raised in internal discussions was that it is not well defined what a "batch" actually is, which can thus be confusing to users. Publishing several messages in a single batch does not necessarily mean that the backend, too, will actually deliver these messages in a single batch. The "received" batch can also contain messages from several different publish batches.

Since the number of messages received in a single batch might vary, you would still have to implement the aggregation logic to know when enough messages have been accumulated.

Just to avoid any misunderstanding, what would be your expectation of such a feature, i.e. what messages would you expect to receive in a particular batch?

@szeitlin

Good question. I guess we were just hoping we could set a minimum batch size (in our case it's ok if things arrive out of order).

@plamut
Contributor Author

plamut commented Mar 27, 2020

The backend does not support that, unfortunately, thus any "minimum size" batching logic needs to be implemented on the client side. While the client library could do that internally, it cannot decide for the user what to do in edge cases, e.g. when no new messages arrive for a substantial period of time, but it does not yet hold enough messages to release the next batch.

One way to achieve batching is to put received messages into a queue that a different thread processes, batching them as needed, e.g. writing them out when enough of them have accumulated in the queue.

import queue

received_queue = queue.Queue()

...

# Each received message is simply enqueued; a separate worker thread
# can then drain the queue and process the messages in batches.
future = subscriber.subscribe("my/subscription/path", callback=received_queue.put)

try:
    future.result()
except KeyboardInterrupt:
    future.cancel()
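
To build on that, a worker thread could drain the queue and flush whenever enough messages have accumulated, or after a quiet period (the batch size, timeout, and the write_out() function below are arbitrary placeholders):

import threading

BATCH_SIZE = 100      # arbitrary minimum batch size
FLUSH_TIMEOUT = 30.0  # flush a partial batch after this many idle seconds

def batch_worker(received_queue, write_out):
    """Drain the queue and hand messages to write_out() in batches."""
    batch = []
    while True:
        try:
            batch.append(received_queue.get(timeout=FLUSH_TIMEOUT))
            timed_out = False
        except queue.Empty:
            timed_out = True

        if batch and (len(batch) >= BATCH_SIZE or timed_out):
            write_out(batch)   # e.g. persist the whole batch at once
            for message in batch:
                message.ack()  # ack only after a successful write
            batch = []

worker = threading.Thread(
    target=batch_worker, args=(received_queue, print), daemon=True
)
worker.start()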

Would that help?

@szeitlin

This is an interesting point, we will try that if the way we have it isn't fast enough. Thanks!

Labels
api: pubsub Issues related to the Pub/Sub API. cla: yes This human has signed the Contributor License Agreement. do not merge Indicates a pull request not ready for merge, due to either quality or timing. 🚨 This issue needs some love.
6 participants