
Pub/Sub Messages stop being Processed #1579

Closed · GEverding opened this issue Jan 30, 2017 · 17 comments
Labels: api: pubsub (Issues related to the Pub/Sub API.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@GEverding

Hi,
Over the weekend I've been experiencing abnormal behaviour with Pub/Sub. Since Friday at 4pm ET I've seen numerous cases where my services stop processing messages. I can see in Stackdriver that messages are being queued but not processed. Restarting the services allows messages to be processed until processing stops again.

This seems to be an issue with v0.8.x and less so with v0.4.0. In v0.8.x there is an infinite timeout (DefaultPubSubRpc.java), while in v0.4.0 there doesn't appear to be one (DefaultPubSubRpc.java).

In my experience, setting an infinite timeout is rarely a good idea. Is there a reason for doing this?
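(For anyone comparing the two versions: a minimal illustration of the difference, assuming the underlying calls go through gRPC stubs. Only withDeadlineAfter is standard gRPC; the helper itself is hypothetical and not part of the Pub/Sub client.)

```java
import io.grpc.stub.AbstractStub;
import java.util.concurrent.TimeUnit;

public class DeadlineExample {
  // Returns a stub whose RPCs fail with DEADLINE_EXCEEDED after 30 seconds
  // instead of hanging indefinitely when the server stops responding.
  static <T extends AbstractStub<T>> T withFiniteDeadline(T stub) {
    return stub.withDeadlineAfter(30, TimeUnit.SECONDS);
  }
}
```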

@nhoughto

nhoughto commented Feb 2, 2017

Possibly seeing the same problem: we're doing subscription.pullAsync(..) and sometimes messages just stop being processed and build up until the consumer is restarted. We're considering moving to pull() and managing threading / timeouts ourselves (roughly as sketched below).
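A rough sketch only; pullMessages() is a placeholder for whatever synchronous pull call the client exposes, not an actual API:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ManualPullLoop {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // One iteration of the pull loop, with our own timeout around the blocking call.
  void pullOnce() throws Exception {
    Future<List<String>> future = executor.submit(this::pullMessages);
    try {
      List<String> messages = future.get(30, TimeUnit.SECONDS);
      for (String message : messages) {
        process(message);
      }
    } catch (TimeoutException e) {
      future.cancel(true); // give up on the hung pull and try again next iteration
    }
  }

  private List<String> pullMessages() {
    return Collections.emptyList(); // placeholder for the real pull call
  }

  private void process(String message) {
    // handle the message
  }
}
```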

@GEverding
Author

GEverding commented Feb 2, 2017 via email

@nhoughto

nhoughto commented Feb 2, 2017

Good to know, thanks.

It's interesting that the MessageConsumer object doesn't expose an isAlive() method or something similar to detect failure, so the assumption is that it manages this for you. Presumably, even if the gRPC load balancers are misbehaving, the gRPC client should handle that failure mode itself.
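In the absence of such a hook, one option is an external liveness check. A hedged sketch (the 10-minute threshold and the restart hook are arbitrary, application-specific choices):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConsumerWatchdog {
  private final AtomicLong lastMessageMillis = new AtomicLong(System.currentTimeMillis());
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  // Call this from the message handler every time a message is processed.
  public void recordMessage() {
    lastMessageMillis.set(System.currentTimeMillis());
  }

  // Invoke restartConsumer if no message has been seen for more than 10 minutes.
  public void start(Runnable restartConsumer) {
    scheduler.scheduleAtFixedRate(() -> {
      long silentMillis = System.currentTimeMillis() - lastMessageMillis.get();
      if (silentMillis > TimeUnit.MINUTES.toMillis(10)) {
        restartConsumer.run();
        lastMessageMillis.set(System.currentTimeMillis());
      }
    }, 1, 1, TimeUnit.MINUTES);
  }
}
```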

@pongad
Contributor

pongad commented Feb 3, 2017

Setting the timeout to infinity does indeed sound like a problem. The motivation for the infinite timeout seems to be #1157.

We are working on a higher-performance Pub/Sub client, currently in the pubsub-hp branch. In the new implementation the timeout is properly set.

The difference is that the RPC now returns immediately if there is no message available instead of waiting, and then backs off exponentially (up to a 10s maximum) before pulling again, roughly as sketched below. You might end up making more RPCs with the new client, though I don't think the difference will be big enough to be a problem. Once streaming pull becomes available, the point should become moot.
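Not the client's actual code, just a sketch of that backoff behaviour; the starting delay and reset logic are assumptions, and only the 10s cap comes from the description above:

```java
import java.util.concurrent.TimeUnit;

public class EmptyPullBackoff {
  private static final long MAX_DELAY_MILLIS = 10_000; // 10s cap
  private static final long INITIAL_DELAY_MILLIS = 100; // assumed starting delay

  private long delayMillis = INITIAL_DELAY_MILLIS;

  // Called after a pull that returned zero messages: wait, then double the delay.
  void onEmptyPull() throws InterruptedException {
    TimeUnit.MILLISECONDS.sleep(delayMillis);
    delayMillis = Math.min(delayMillis * 2, MAX_DELAY_MILLIS);
  }

  // Called after a pull that returned messages: reset the delay.
  void onMessages() {
    delayMillis = INITIAL_DELAY_MILLIS;
  }
}
```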

@GEverding @nhoughto Do you think this will help your use case?

@nhoughto

nhoughto commented Feb 3, 2017

Yep, that sounds good. Happy with an immediate return plus backoff (seems like a good tradeoff).

Any idea on timelines for landing the pubsub-hp or streaming pull changes? We are going to halt our Pub/Sub rollout to new components until we sort this out.

@GEverding
Author

That seems like a much better idea. Is there an ETA for when that branch will be ready?

@garrettjonesgoogle
Member

We expect to release it next week (sometime Feb 7-Feb 10).

@BodySplash

I'm still encountering this issue. By changing the way I poll (calling pullAsync but with a limit and retrying automatically), I finally got the exception pasted below.
I don't know if it's connected to the current problem, though.

```
java.util.concurrent.ExecutionException: com.google.cloud.pubsub.PubSubException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
max_age
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
	at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
	at com.google.api.gax.core.ForwardingRpcFuture.get(ForwardingRpcFuture.java:51)
```

@pongad
Contributor

pongad commented Feb 7, 2017

@BodySplash I don't think this is immediately related. According to Pub/Sub's error code documentation, UNAVAILABLE is an error that "happens sometimes". That said, the new client will do the retry for you.
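For illustration, roughly what that retry looks like if done by hand; the released client handles this internally, so treat this as a sketch of the failure mode rather than a recommended pattern:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public class UnavailableRetry {
  interface PullAttempt {
    void run() throws StatusRuntimeException;
  }

  // Retry the attempt on UNAVAILABLE (e.g. a GOAWAY during connection recycling),
  // rethrowing any other status or after maxRetries failed attempts.
  static void pullWithRetry(PullAttempt attempt, int maxRetries) throws InterruptedException {
    for (int i = 0; i <= maxRetries; i++) {
      try {
        attempt.run();
        return;
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.UNAVAILABLE || i == maxRetries) {
          throw e; // non-retryable, or out of attempts
        }
        Thread.sleep((long) (500 * Math.pow(2, i))); // simple exponential backoff
      }
    }
  }
}
```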

shinfan added the "api: pubsub" label on Feb 23, 2017
@garrettjonesgoogle
Member

The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?

garrettjonesgoogle added the "type: bug" and "triaged for beta" labels on Feb 23, 2017
@GEverding
Author

GEverding commented Feb 23, 2017 via email

@nhoughto

Looking at 0.9.3-alpha, I assume the new high-performance changes live in the com.google.cloud.pubsub.spi.v1 classes rather than the now-deprecated ones? Are there any examples of how the new classes should be used? The README and examples all reference the deprecated API.

@pongad
Contributor

pongad commented Feb 26, 2017

@nhoughto That is correct. For an example of using Subscriber, could you see if this helps you?

We seem to have missed the README. I have reported this at #1664.
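In case the linked example is hard to find, here is a minimal sketch of the new spi.v1 Subscriber usage. Class, builder, and method names follow the general shape of the API and may differ slightly between 0.9.3-alpha and later releases, so treat it as an approximation:

```java
import com.google.cloud.pubsub.spi.v1.AckReplyConsumer;
import com.google.cloud.pubsub.spi.v1.MessageReceiver;
import com.google.cloud.pubsub.spi.v1.Subscriber;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.SubscriptionName;

public class SubscriberExample {
  public static void main(String[] args) throws Exception {
    SubscriptionName subscription = SubscriptionName.create("my-project", "my-subscription");

    // Callback invoked for each received message; ack it once it has been handled.
    MessageReceiver receiver = new MessageReceiver() {
      @Override
      public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
        System.out.println("Got message: " + message.getData().toStringUtf8());
        consumer.ack(); // or nack() to have the message redelivered
      }
    };

    Subscriber subscriber = Subscriber.defaultBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();

    // ... keep the process alive while messages are handled, then shut down:
    subscriber.stopAsync().awaitTerminated();
  }
}
```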

@nhoughto

nhoughto commented Feb 26, 2017 via email

@pongad
Contributor

pongad commented Feb 27, 2017

Subscriber uses Guava's Service behind the scenes. This document should provide an overview.

To answer your question specifically, Subscriber does attempt some retries on its own. For example, the Pub/Sub service replying with error code "UNAVAILABLE" usually signifies a transient condition; Subscriber knows this and will sleep for a bit and then try again.

If it encounters a non-retryable error (or sees retryable errors too many times in a row), it will call the SubscriberListener::failed method. If you want, you can use that to detect the failure and "restart" the Subscriber by creating a new Subscriber object (a Service cannot go back to the running state after failing). Does this answer your question?
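A hedged sketch of that recreate-on-failure pattern. It is written against the ApiService.Listener hook found in later versions of the client (in 0.9.3-alpha the equivalent callback is the SubscriberListener::failed method mentioned above); the package names and the rebuild policy are assumptions:

```java
import com.google.api.core.ApiService;
import com.google.cloud.pubsub.spi.v1.MessageReceiver;
import com.google.cloud.pubsub.spi.v1.Subscriber;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.pubsub.v1.SubscriptionName;

public class RestartingSubscriber {
  private final SubscriptionName subscription;
  private final MessageReceiver receiver;
  private volatile Subscriber subscriber;

  RestartingSubscriber(SubscriptionName subscription, MessageReceiver receiver) {
    this.subscription = subscription;
    this.receiver = receiver;
  }

  void start() {
    subscriber = Subscriber.defaultBuilder(subscription, receiver).build();
    subscriber.addListener(new ApiService.Listener() {
      @Override
      public void failed(ApiService.State from, Throwable failure) {
        // A failed Service cannot be restarted; build a fresh Subscriber instead.
        start();
      }
    }, MoreExecutors.directExecutor());
    subscriber.startAsync();
  }
}
```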

@nhoughto

nhoughto commented Feb 27, 2017 via email

@garrettjonesgoogle
Member

I'm going to go ahead and resolve this now.
