Pubsub: Pull Subscriber unable to re-connect after a while #7910
I came across #7709 which might be the root cause. When the new release comes out I'll post an update, but will also try to dig deeper in the meanwhile.
@daaain Thank you for reporting this and doing the initial research! The linked issue could indeed be the root cause, and the fix for it that was merged recently makes sure that a clean shutdown of background threads is triggered if the underlying gRPC channel remains in an error state for too long. The log records posted here are produced by these very threads, so shutting them down should get rid of this problem. Looking forward to more info after the next release! (Alternatively, if the nature of your application allows for it, i.e. it is not mission-critical, you could also experiment with the current development version of google-cloud-pubsub.)
Update:
It appears that the streaming_pull() method is being repeatedly called in rapid succession, despite the Retry settings that should result in exponentially increasing delays between successive calls. The issue is reproducible on the current latest version with a minimal subscriber:

```python
subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)

while True:
    try:
        time.sleep(60)
    except KeyboardInterrupt:
        break
```

This busy "re-open the stream" loop consuming a lot of CPU is definitely not expected behavior.
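For context, the exponential backoff that such Retry settings are supposed to provide would look roughly like the sketch below. This is only an illustration built directly on google.api_core.retry, not the subscriber's actual internal configuration; open_stream and the numeric values are invented for the example.

```python
from google.api_core import exceptions, retry

# Illustration only: sleep roughly initial, initial*multiplier, ... seconds
# (capped at `maximum`) between attempts whenever the wrapped call raises a
# retryable error such as ServiceUnavailable (HTTP 503 / gRPC UNAVAILABLE).
backoff = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
    initial=1.0,      # first delay, in seconds
    maximum=60.0,     # cap on any single delay
    multiplier=2.0,   # exponential growth factor
    deadline=600.0,   # give up after this much total elapsed time
)

@backoff
def open_stream():
    """Hypothetical stand-in for the call that (re)opens the streaming pull."""
```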
I think that the "definition of done" for this issue will be to definitively prove that this is a grpc-level issue. If so, we can open a grpc issue and work with the Pub/Sub team to see if this feature is truly GA blocking.
Hi, I've been experiencing something similar, but with publishing. When the connection to the cloud is disrupted (error 503), gRPC in the background continues to attempt to re-establish connectivity with "exponentially increasing delays" as described above. Despite using an async request (see the code sample below), this essentially hijacks the normal execution course of the program until the connection is re-established. The error can easily be replicated by manually removing/adding routes in the routing table to simulate disconnects from and reconnects to the cloud. I've managed to replicate the issue with a sample from the pubsub documentation, as follows:

```python
import datetime
from functools import partial

# PUBLISHER, TOPIC_PATH and TOPIC_NAME are assumed to be defined earlier.

def callback(message_future, testOne, testTwo):
    # When timeout is unspecified, the exception method waits indefinitely.
    if message_future.exception(timeout=3):
        print('Publishing message on {} threw an Exception {}.'.format(
            TOPIC_NAME, message_future.exception()))
    else:
        print("The number is: {} and the refOne: {} plus refTwo {}".format(
            message_future.result(), testOne, testTwo))

for n in range(1, 6):
    data = u'Message number {}'.format(n)
    # Data must be a bytestring
    data = data.encode('utf-8')
    # When you publish a message, the client returns a Future.
    message_future = PUBLISHER.publish(TOPIC_PATH, data=data)
    message_future.add_done_callback(
        partial(callback, testOne=123, testTwo=str(datetime.datetime.now())))

print('Published message IDs:')
```

I am using the latest google.cloud.pubsub libraries available to date (0.41). Any suggestions/advice? Thanks! Dan
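P.S. For completeness, a rough workaround sketch (not a fix for the underlying reconnect behaviour): collect the futures and resolve them with a bounded wait, so a dead connection cannot stall the main flow indefinitely. PUBLISHER and TOPIC_PATH are assumed to be set up as in the sample above; the 30-second bound is arbitrary.

```python
pending = []
for n in range(1, 6):
    data = u'Message number {}'.format(n).encode('utf-8')
    pending.append((n, PUBLISHER.publish(TOPIC_PATH, data=data)))

for n, future in pending:
    try:
        # result() blocks, but only up to the given timeout.
        print('Message {} published with ID {}'.format(n, future.result(timeout=30)))
    except Exception as exc:  # timed out, or the publish itself failed
        print('Message {} not confirmed: {}'.format(n, exc))
```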
Thanks for reporting this, @Dan4London. Could it be that this issue is the same as #8036, which reports a similar problem with the publisher blocking?
I investigated the issue with the subscriber reconnect attempts skyrocketing. It seems to be related to the api_core.bidi.ResumableBidiRpc helper: if there are network problems, the stream ends up being re-opened over and over again with no delay between the attempts (if the network error gets resolved in the meantime, the behavior returns to normal).

One might wonder why the Retry wrapper around the call that re-opens the stream does not kick in. Since no exception is raised in that part of the code, the call returns normally (i.e. without an exception), thus the Retry wrapper around it never triggers and exponential backoff does not happen. Instead, the aforementioned busy "re-open the stream" loop keeps spinning.
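To make that mechanism concrete, here is a minimal sketch of the failure mode using google.api_core.retry directly; reopen_stream and its behavior are invented for illustration. Retry only sleeps between attempts when the wrapped callable raises a retryable exception, so a call that swallows the error and returns normally never triggers the backoff.

```python
from google.api_core import exceptions, retry

@retry.Retry(predicate=retry.if_exception_type(exceptions.ServiceUnavailable))
def reopen_stream():
    # Hypothetical stand-in for the wrapped call: the network error is handled
    # elsewhere, so nothing is raised here and Retry sees a "successful" call.
    return None

# The surrounding loop then re-opens the stream immediately, with no backoff:
for _ in range(3):
    reopen_stream()  # returns at once; Retry never sleeps
```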
It sounds like there needs to be a backoff of some sort. This looks very much like this nodejs-pubsub issue. @busunkim96, this sounds like something we should discuss with you.
@plamut that makes sense. @busunkim96 should be involved in, or at least aware of, any api_core related changes. @crwilcox mentioned offline that
Thanks for your feedback, Plamut. #8036 describes precisely the problem I am experiencing. You can replicate the behaviour by re-executing the for loop before the previous gRPC publish() request has timed out. Do you have any idea when a fix for this might be available? Thanks!
@Dan4London Unfortunately not (yet...), because it seems that the bug will have to be fixed in one of the PubSub client dependencies, and that will have to be coordinated with other teams that could be affected by the change. The bug is high on the priority list, though.

Edit: Oh, you were probably asking about the publisher issue?

Edit 2: If nothing else comes across, I can probably have a look at it tomorrow.

Edit 3: This issue has been prioritized over the publisher issue; I will look at the latter after this.
Environment details
Steps to reproduce
The issue did happen in a different service with only one subscriber in the container, but having several subscribers where all the others keep working rules out a lot of other factors that could prevent re-connection (e.g. DNS resolution, no network, etc.).
Code example
Totally standard Pull subscription using SubscriberClient + create_subscription + subscribe. Can paste the exact code if required, but a minimal sketch of the setup is shown below.
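A minimal sketch of such a setup, with placeholder project, topic, subscription and callback names rather than the ones actually used by the service:

```python
from google.cloud import pubsub_v1

PROJECT_ID = 'my-project'  # placeholder

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(PROJECT_ID, 'my-topic')
subscription_path = subscriber.subscription_path(PROJECT_ID, 'my-subscription')

subscriber.create_subscription(subscription_path, topic_path)

def my_callback(message):
    # Process the message, then acknowledge it.
    print(message.data)
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=my_callback)
```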
StackDriver log snippet
First of all, of course, I'd be interested in helping to get to the bottom of the issue and get it resolved.
But in the meanwhile it would be great to have a workaround for detecting a lost connection on a subscriber. I went through the public API documentation and couldn't find a way to get to the underlying (gRPC?) client, but it would be great to have a clean(ish) way to do a periodic check on the connection, so the subscriber can be restarted once the issue happens (a rough sketch of what I mean follows below).
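A rough sketch of the kind of check/restart loop I mean, reusing the subscriber, subscription_path and my_callback names from the sketch above. It assumes that fatal stream errors actually surface through the future returned by subscribe(), which, given this very bug, may not be the case.

```python
import time

# Rough workaround sketch only, not a confirmed solution.
while True:
    streaming_pull_future = subscriber.subscribe(subscription_path, callback=my_callback)
    try:
        streaming_pull_future.result()  # blocks until the stream dies with an error
    except Exception as exc:
        print('Subscriber stream failed: {}; restarting in 10 s'.format(exc))
    time.sleep(10)  # crude fixed delay before re-subscribing
```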
Thanks a lot in advance 🙏