
Pub/Sub Messages stop being Processed #1579

Closed · GEverding opened this issue Jan 30, 2017 · 17 comments
Labels: api: pubsub (Issues related to the Pub/Sub API.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@GEverding

Hi,
Over the weekend I've been experiencing abnormal behaviour with Pub/Sub. Since Friday at 4pm ET I've seen numerous cases where my services stop processing messages. I can see in Stackdriver that messages are being queued but not processed. Restarting the services allows messages to be processed until processing stops again.

This seems to be an issue with v0.8.x and less so with v0.4.0. In v0.8.x there is an infinite timeout (DefaultPubSubRpc.java), while in v0.4.0 there doesn't appear to be one (DefaultPubSubRpc.java).

In my experience, setting an infinite timeout is rarely a good idea. Is there a reason for doing this?
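(For anyone comparing the two versions: a minimal illustration of the difference, assuming the underlying calls go through gRPC stubs. Only withDeadlineAfter is standard gRPC; the helper itself is hypothetical and not part of the Pub/Sub client.)

```java
import io.grpc.stub.AbstractStub;
import java.util.concurrent.TimeUnit;

public class DeadlineExample {
  // Returns a stub whose RPCs fail with DEADLINE_EXCEEDED after 30 seconds
  // instead of hanging indefinitely when the server stops responding.
  static <T extends AbstractStub<T>> T withFiniteDeadline(T stub) {
    return stub.withDeadlineAfter(30, TimeUnit.SECONDS);
  }
}
```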

@nhoughto

nhoughto commented Feb 2, 2017

Possibly seeing the same problem: we're doing subscription.pullAsync(..) and sometimes messages just stop being processed and build up until the consumer is restarted. We're considering moving to pull() and managing threading / timeouts ourselves (roughly as sketched below).
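A rough sketch only; pullMessages() is a placeholder for whatever synchronous pull call the client exposes, not an actual API:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ManualPullLoop {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // One iteration of the pull loop, with our own timeout around the blocking call.
  void pullOnce() throws Exception {
    Future<List<String>> future = executor.submit(this::pullMessages);
    try {
      List<String> messages = future.get(30, TimeUnit.SECONDS);
      for (String message : messages) {
        process(message);
      }
    } catch (TimeoutException e) {
      future.cancel(true); // give up on the hung pull and try again next iteration
    }
  }

  private List<String> pullMessages() {
    return Collections.emptyList(); // placeholder for the real pull call
  }

  private void process(String message) {
    // handle the message
  }
}
```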

@GEverding
Author

GEverding commented Feb 2, 2017 via email

@nhoughto

nhoughto commented Feb 2, 2017

Good to know, thanks.

It's interesting that the MessageConsumer object doesn't expose an isAlive() method or something similar to detect failure, so the assumption is that it manages this for you. Presumably, even if the gRPC load balancers are misbehaving, the gRPC client should handle that failure mode itself.
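In the absence of such a hook, one option is an external liveness check. A hedged sketch (the 10-minute threshold and the restart hook are arbitrary, application-specific choices):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConsumerWatchdog {
  private final AtomicLong lastMessageMillis = new AtomicLong(System.currentTimeMillis());
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  // Call this from the message handler every time a message is processed.
  public void recordMessage() {
    lastMessageMillis.set(System.currentTimeMillis());
  }

  // Invoke restartConsumer if no message has been seen for more than 10 minutes.
  public void start(Runnable restartConsumer) {
    scheduler.scheduleAtFixedRate(() -> {
      long silentMillis = System.currentTimeMillis() - lastMessageMillis.get();
      if (silentMillis > TimeUnit.MINUTES.toMillis(10)) {
        restartConsumer.run();
        lastMessageMillis.set(System.currentTimeMillis());
      }
    }, 1, 1, TimeUnit.MINUTES);
  }
}
```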

@pongad
Contributor

pongad commented Feb 3, 2017

Setting the timeout to infinity does indeed sound like a problem. The motivation for the infinite timeout seems to be #1157.

We are working on a higher-performance Pub/Sub client, currently in the pubsub-hp branch. In the new implementation the timeout is properly set.

The difference is that the RPC now returns immediately if there is no message available instead of waiting, and then backs off exponentially (up to a 10s maximum) before pulling again, roughly as sketched below. You might end up making more RPCs with the new client, though I don't think the difference will be big enough to be a problem. Once streaming pull becomes available, the point should become moot.
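Not the client's actual code, just a sketch of that backoff behaviour; the starting delay and reset logic are assumptions, and only the 10s cap comes from the description above:

```java
import java.util.concurrent.TimeUnit;

public class EmptyPullBackoff {
  private static final long MAX_DELAY_MILLIS = 10_000; // 10s cap
  private static final long INITIAL_DELAY_MILLIS = 100; // assumed starting delay

  private long delayMillis = INITIAL_DELAY_MILLIS;

  // Called after a pull that returned zero messages: wait, then double the delay.
  void onEmptyPull() throws InterruptedException {
    TimeUnit.MILLISECONDS.sleep(delayMillis);
    delayMillis = Math.min(delayMillis * 2, MAX_DELAY_MILLIS);
  }

  // Called after a pull that returned messages: reset the delay.
  void onMessages() {
    delayMillis = INITIAL_DELAY_MILLIS;
  }
}
```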

@GEverding @nhoughto Do you think this will help your use case?

@nhoughto

nhoughto commented Feb 3, 2017

Yep, that sounds good. Happy with an immediate return plus backoff (seems like a good tradeoff).

Any idea on timelines for landing the pubsub-hp or streaming pull changes? We are going to halt our Pub/Sub rollout to new components until we sort this out.

@GEverding
Author

That seems like a much better idea. Is there an ETA for when that branch will be ready?

@garrettjonesgoogle
Member

We expect to release it next week (sometime Feb 7-Feb 10).

@BodySplash

I'm still encountering this issue. By changing the way I poll (calling pullAsync but with a limit and retrying automatically), I finally got the exception pasted below.
I don't know if it's connected to the current problem, though.

```
java.util.concurrent.ExecutionException: com.google.cloud.pubsub.PubSubException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
max_age
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
	at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
	at com.google.api.gax.core.ForwardingRpcFuture.get(ForwardingRpcFuture.java:51)
```

@pongad
Contributor

pongad commented Feb 7, 2017

@BodySplash I don't think this is immediately related. According to Pub/Sub's error code documentation, UNAVAILABLE is an error that "happens sometimes". That said, the new client will do the retry for you.
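For illustration, roughly what that retry looks like if done by hand; the released client handles this internally, so treat this as a sketch of the failure mode rather than a recommended pattern:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public class UnavailableRetry {
  interface PullAttempt {
    void run() throws StatusRuntimeException;
  }

  // Retry the attempt on UNAVAILABLE (e.g. a GOAWAY during connection recycling),
  // rethrowing any other status or after maxRetries failed attempts.
  static void pullWithRetry(PullAttempt attempt, int maxRetries) throws InterruptedException {
    for (int i = 0; i <= maxRetries; i++) {
      try {
        attempt.run();
        return;
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.UNAVAILABLE || i == maxRetries) {
          throw e; // non-retryable, or out of attempts
        }
        Thread.sleep((long) (500 * Math.pow(2, i))); // simple exponential backoff
      }
    }
  }
}
```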

shinfan added the "api: pubsub" label on Feb 23, 2017
@garrettjonesgoogle
Member

The high-performance rewrite was merged and released. I recommend trying version 0.9.3-alpha. @nhoughto and @GEverding can you let us know how it works?

garrettjonesgoogle added the "type: bug" and "triaged for beta" labels on Feb 23, 2017
@GEverding
Author

GEverding commented Feb 23, 2017 via email

@nhoughto

Looking at 0.9.3-alpha, I assume the new high-performance changes live in the com.google.cloud.pubsub.spi.v1 classes rather than the now-deprecated ones? Are there any examples of how the new classes should be used? The README and examples all reference the deprecated API.

@pongad
Contributor

pongad commented Feb 26, 2017

@nhoughto That is correct. For an example of using Subscriber, could you see if this helps you?

We seem to have missed the README. I have reported this at #1664.
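In case the linked example is hard to find, here is a minimal sketch of the new spi.v1 Subscriber usage. Class, builder, and method names follow the general shape of the API and may differ slightly between 0.9.3-alpha and later releases, so treat it as an approximation:

```java
import com.google.cloud.pubsub.spi.v1.AckReplyConsumer;
import com.google.cloud.pubsub.spi.v1.MessageReceiver;
import com.google.cloud.pubsub.spi.v1.Subscriber;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.SubscriptionName;

public class SubscriberExample {
  public static void main(String[] args) throws Exception {
    SubscriptionName subscription = SubscriptionName.create("my-project", "my-subscription");

    // Callback invoked for each received message; ack it once it has been handled.
    MessageReceiver receiver = new MessageReceiver() {
      @Override
      public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
        System.out.println("Got message: " + message.getData().toStringUtf8());
        consumer.ack(); // or nack() to have the message redelivered
      }
    };

    Subscriber subscriber = Subscriber.defaultBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();

    // ... keep the process alive while messages are handled, then shut down:
    subscriber.stopAsync().awaitTerminated();
  }
}
```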

@nhoughto

nhoughto commented Feb 26, 2017 via email

@pongad
Contributor

pongad commented Feb 27, 2017

Subscriber uses Guava's Service behind the scenes. This document should provide an overview.

To answer your question specifically, Subscriber does attempt some retries on its own. For example, the Pub/Sub service replying with error code "UNAVAILABLE" usually signifies a transient condition; Subscriber knows this and will sleep for a bit and then try again.

If it encounters a non-retryable error (or sees retryable errors too many times in a row), it will call the SubscriberListener::failed method. If you want, you can use that to detect the failure and "restart" the Subscriber by creating a new Subscriber object (a Service cannot go back to the running state after failing). Does this answer your question?
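A hedged sketch of that recreate-on-failure pattern. It is written against the ApiService.Listener hook found in later versions of the client (in 0.9.3-alpha the equivalent callback is the SubscriberListener::failed method mentioned above); the package names and the rebuild policy are assumptions:

```java
import com.google.api.core.ApiService;
import com.google.cloud.pubsub.spi.v1.MessageReceiver;
import com.google.cloud.pubsub.spi.v1.Subscriber;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.pubsub.v1.SubscriptionName;

public class RestartingSubscriber {
  private final SubscriptionName subscription;
  private final MessageReceiver receiver;
  private volatile Subscriber subscriber;

  RestartingSubscriber(SubscriptionName subscription, MessageReceiver receiver) {
    this.subscription = subscription;
    this.receiver = receiver;
  }

  void start() {
    subscriber = Subscriber.defaultBuilder(subscription, receiver).build();
    subscriber.addListener(new ApiService.Listener() {
      @Override
      public void failed(ApiService.State from, Throwable failure) {
        // A failed Service cannot be restarted; build a fresh Subscriber instead.
        start();
      }
    }, MoreExecutors.directExecutor());
    subscriber.startAsync();
  }
}
```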

@nhoughto

nhoughto commented Feb 27, 2017 via email

@garrettjonesgoogle
Member

I'm going to go ahead and resolve this now.
