Uncaught exceptions within the streaming pull code. #7709
We're seeing this behavior with the following stack trace as well. Errors are thrown and the subscriber stops receiving messages without the main thread (future.result(timeout=x)) raising an exception (other than the timeout).
Here's what our code looks like:
The Exception branch is never taken, and the callback ceases to be called after such an error, requiring a restart from the main thread.
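For context, here is a minimal sketch of the pattern described above; the project, subscription, callback body, and timeout value are placeholders, not the reporter's actual code:

```python
import concurrent.futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project/subscription names, for illustration only.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    # Application-specific processing, then acknowledge.
    print(message.data)
    message.ack()

future = subscriber.subscribe(subscription_path, callback=callback)

while True:
    try:
        # Block on the streaming pull future; a timeout just means "still running".
        future.result(timeout=30)
    except concurrent.futures.TimeoutError:
        continue
    except Exception:
        # This is the branch that reportedly never fires, even though the
        # background stream has already stopped delivering messages.
        future.cancel()
        raise
```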
I believe this error occurs if the underlying channel enters the TRANSIENT_FAILURE state and remains in it for too long, i.e. longer than the retry deadline. I was not able to reproduce the bug with a sample Pub/Sub application running on Kubernetes, but I did manage to trigger the reported scenario locally by doing the following:
--- /home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/grpc/_channel.py	2019-04-23 17:01:39.282064676 +0200
+++ /home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/grpc/_channel.py	2019-04-25 15:49:05.220317794 +0200
@@ -456,6 +456,16 @@

 def _end_unary_response_blocking(state, call, with_call, deadline):
+    #####################
+    import datetime
+    minute = datetime.datetime.now().minute
+    if 45 <= minute <= 56:
+        state.code = grpc.StatusCode.UNAVAILABLE
+        state.details = "channel is in **fake** TRANSIENT_FAILURE state"
+        state.debug_error_string = (
+            "transient failure is faked during a fixed time window in an hour"
+        )
+    ###########################
     if state.code is grpc.StatusCode.OK:
         if with_call:
             rendezvous = _Rendezvous(state, call, None, deadline)

The patch fakes a channel error during particular minutes of the hour (adjust the time window as necessary).
Result:
What happens is that if the subscriber has been retrying for too long, a RetryError is raised in the retry wrapper. This error is considered non-retryable, so the subscriber ceasing to pull messages is actually expected behavior, IMO. Will look into it.

What should happen, however, is propagating the error to the main thread (and shutting everything down cleanly in the background), giving users a chance to catch the error and react to it as they see fit. Will discuss whether this is the expected way of handling it, and then work on a fix. Thank you for reporting the issue!
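Assuming the error does get propagated to the main thread (as proposed above), user code could then catch and react to it along these lines; this is only a sketch with placeholder names, not an official recipe:

```python
from google.api_core import exceptions
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical names, for illustration only.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

future = subscriber.subscribe(subscription_path, callback=lambda message: message.ack())

try:
    # Blocks until the streaming pull dies for a non-retryable reason.
    future.result()
except exceptions.RetryError:
    # The underlying RPC kept failing for longer than the retry deadline.
    # Restart the subscriber, alert, or exit, whatever fits the application.
    future.cancel()
    raise
```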
@plamut Thanks very much for digging into this! A bit surprised my team is the first to report it. Please do post here if we can provide more details or testing, or when a fix or an ETA is available. :)
@jakeczyz AFAIK there have been several independent reports of the same (or a similar) bug in the past, including from the non-Python clients, but it was (very) difficult to reproduce. I could not reproduce it either, thus I only suspect that this is the true cause kicking in on random occasions. The tracebacks are very similar, though, which is promising. I do not have an ETA yet, but expect to discuss this with others next week - will post more when I know more. :)
Just as a quick update, it appears to me that in order to propagate the error to the main thread, changes might also be needed in api_core. Right now the background consumer thread does not propagate any errors and assumes that all error handling is done through the underlying RPC. However, if a RetryError is raised, the underlying channel does not get terminated. The subscriber client shuts itself down when the channel terminates, but since the latter does not happen, the client shutdown does not happen either, and the future result never gets set, despite the consumer thread not running anymore.
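As a general illustration of the missing piece (not the library's actual internals): propagating a background-thread failure usually means catching the exception in the worker and recording it on a future that the main thread is waiting on, roughly like this:

```python
import concurrent.futures
import threading

def worker(future):
    try:
        raise RuntimeError("simulated non-retryable failure in the background thread")
    except Exception as exc:
        # Record the failure on the future so the waiting thread sees it,
        # instead of letting the exception die with this thread.
        future.set_exception(exc)

future = concurrent.futures.Future()
threading.Thread(target=worker, args=(future,)).start()

try:
    future.result(timeout=5)
except RuntimeError as exc:
    print("main thread saw the background failure:", exc)
```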
Update: API core changes will not be needed after all; the subscriber client can properly respond to retry errors on its own.
A fix for this issue has been merged. It makes sure that if a RetryError occurs, the subscriber client shuts down cleanly and the error is propagated to the main thread. Again, I was not able to actually reproduce the error in a production setup, but I was able to reproduce similar tracebacks locally by faking it. Should the fix prove to be insufficient, feel free to comment here with new info (and thanks in advance!).
Thanks. We'll report back if it still seems to break this way after the new code is released and available on PyPI. We see this problem 1-2 times a week, so it won't be long before we have confirmation. Thanks again for your work on fixing this!
Facing the same issue:
Is there a fix for this, and what's going wrong in the first place?
@sreetamdas The fix for the original issue was merged several releases ago, but there might be another bug that results in a similar error. Which PubSub client and library versions are you using? Any extra information could be useful, thanks!
Thanks for replying @plamut! Something to note: this error doesn't show up all the time. In fact, it hasn't shown up in the past 24 hours, while it came up about 7 out of 10 times whenever I'd try to run my Cloud Function. Additionally, I'd been clearing up my old data in PubSub. Is it possible that the ACK deadline is related to this error?
@sreetamdas A DeadlineExceeded error simply means that the backend did not respond to the request within the given deadline. The ACK deadline for a message is somewhat different - it's a server-side limit, and if the server does not receive an ACK request before that deadline, it will try to re-send the same message. That could happen if the client's ACK response gets lost, for instance. Since the network is not 100% reliable, it is kind of expected that such errors occur from time to time.

Could it be that those 7/10 Cloud Function failures all happened in a short time span? It is quite possible that the network was unreliable at that time, especially when reading that the error did not repeat in the last 24 hours.
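On the ACK deadline side, if processing takes longer than the subscription's ACK deadline, the deadline for already-pulled messages can be extended so the server does not re-send them. A minimal sketch using the older pull-style API seen in this thread (the project and subscription names and the 60-second value are made up):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical names, for illustration only.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(subscription_path, max_messages=10)
ack_ids = [received.ack_id for received in response.received_messages]

if ack_ids:
    # Ask the server for 60 more seconds before it re-sends these messages.
    subscriber.modify_ack_deadline(subscription_path, ack_ids, ack_deadline_seconds=60)
    # ... process the messages here ...
    subscriber.acknowledge(subscription_path, ack_ids)
```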
I am using Cloud Scheduler to run a cron job every hour. I've also tried invoking my Function manually, as well as trying to pull messages on my local machine. Side note: do you believe that I should contact GCP at this stage 😅 ?
@sreetamdas Hard to tell in a vacuum, i.e. without seeing the code, some additional log output, and the exact library versions used. Is pulling the messages done synchronously (a plain pull call) or through the asynchronous streaming pull?

If it happens that often and across longer time spans, a temporary network issue can probably be excluded, indeed. If you believe that the application setup and the code are both correct, contacting GCP is an option, as they have much better insight into the overall setup and what is happening behind the scenes. In any case, looking forward to any additional info that could help narrow down the issue.
@plamut Sorry, I genuinely completely forgot about that. Here are the (relevant) packages I'm using:
And here's the (relevant) code snippet:

subscriber = pubsub_v1.SubscriberClient().from_service_account_json(
    "service_account.json"
)

response = subscriber.pull(input_topic, max_messages=10)
print(">>", response)
for message in response.received_messages:
    print(message.message.data.decode("utf-8"))

I looked around and found out that in case there are no messages in the subscription, the pull call eventually fails with a DeadlineExceeded error.
@sreetamdas Thanks, I can now see that the code uses a synchronous pull method. I was actually able to reproduce the reported behavior - if there are messages available, the code snippet works fine. On the other hand, if there are no messages, a DeadlineExceeded error is eventually raised. Using the return_immediately flag makes the server respond right away, even if there are no messages to deliver.

FWIW, it seems counter-intuitive to receive a 504 instead of a successful empty response, i.e. one without messages. I'll check with the backend team if this is intended behavior.
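For illustration, a sketch of a synchronous pull that sidesteps the empty-queue timeout; it assumes the older positional-argument client API used in this thread, and the return_immediately / DeadlineExceeded handling shown here is just one possible approach, not an official recommendation:

```python
from google.api_core import exceptions
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical names, for illustration only.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

try:
    # return_immediately=True asks the server to answer right away, even if
    # there are currently no messages, instead of holding the request open.
    response = subscriber.pull(subscription_path, max_messages=10, return_immediately=True)
except exceptions.DeadlineExceeded:
    # The request itself timed out; treat it as "no messages" in this sketch.
    response = None

for received in (response.received_messages if response else []):
    print(received.message.data.decode("utf-8"))
    subscriber.acknowledge(subscription_path, [received.ack_id])
```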
@sreetamdas Still awaiting a definite answer on the early deadline exceeded error. BTW, I was informed that using the return_immediately flag is discouraged on the backend side.
@plamut I apologise for not having replied here sooner, but I was away. Funnily enough, in my then-ongoing search for alternative solutions, I stumbled upon a comment on an issue on this repo itself, which said that they'd faced similar issues (I believe their error was a different one), but only after they'd upgraded their client library version, and that downgrading fixed it for them. I was a bit skeptical that it'd work, but lo and behold, my pipelines are working (flawlessly) again. I'll dig out that comment, and thanks again for your time.

I wish I could provide you with steps to reproduce the error on my end, but it's pretty much just a standard pull setup. Thanks again!
@sreetamdas I actually did manage to reproduce the reported behavior, but I still appreciate your willingness to help! Since this is a synchronous pull (as opposed to the asynchronous streaming pull this issue is about), I will open a separate issue for easier traceability.

Update: Issue created - https://github.com/googleapis/google-cloud-python/issues/9822
File "", line 3, in raise_from |
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = current_app.config['GOOGLE_APPLICATION_CREDENTIALS']
|
@radianceltd Is this error related to the PubSub synchronous pull, or...? It seems more like an issue with Cloud IoT.
Ran into this on WAMP localhost with PHP 7.2:

Fatal error: Uncaught BadMethodCallException: Streaming calls are not supported while using the REST transport. in C:\wamp64\www\google-ads-php\vendor\google\gax\src\Transport\HttpUnaryTransportTrait.php:125
This comes from a StackOverflow question. There are internal exceptions that are not being caught, which results in the client library no longer delivering messages.
The user who reported the error was using the following versions:
python == 3.6.5
google-cloud-pubsub == 0.40.0 # but this has behaved similarly for at least the last several versions
google-api-core == 1.8.2
google-api-python-client == 1.7.8