
Changing over pub sub from google client libraries to Google cloud java has increased tail latencies #1338

Closed
robin-anil opened this issue Oct 26, 2016 · 12 comments
Labels: api: pubsub (Issues related to the Pub/Sub API.), performance

Comments

@robin-anil

robin-anil commented Oct 26, 2016

Hi team, I wasn't sure where to file this issue.

Here is a chart from our production environment showing round-trip latencies on messages sent through pub-sub, as measured by the same server.

As you can see, before the big spike we were on the old Google client libraries and performance was stellar, with p99 close to p50 at around 120ms.
The big spike you see is when we moved over to the Google Cloud Java version of the pubsub library.
The lower period after that is us moving to using the gRPC stubs directly.

The take-away is that gRPC-based pub-sub (with returnImmediately=false) is much worse than what we used to have when we were on the Google client library version of pubsub.

Even today in production I can consistently see roundtrip latencies in the 120-180ms range that spike to 4-6s every now and then. This was solidly steady in the 120ms range previously.

[Screenshot: round-trip latency chart, 2016-10-26 1:12 PM]

@mziccard
Contributor

It's hard to reason about a graph without the actual numbers, your machine/network setup, and the code.

The take-away is that gRPC-based pub-sub (with returnImmediately=false) is much worse

Since google-cloud-java's pullAsync(String subscription, int maxMessages) and pull(String subscription, int maxMessages) use returnImmediately=true, the latencies you see depend very much on the polling strategy you implement. Using MessageConsumer (i.e. pullAsync(String subscription, MessageProcessor callback, PullOption... options)) would surely perform better, as it implements continuous pulling with returnImmediately=false under the hood.
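
A minimal sketch of what that looks like (not an official example; "my-subscription" and the handle() helper are placeholders):

  // Sketch: continuous pulling via MessageConsumer; call close() on the
  // returned consumer to stop pulling.
  MessageConsumer startConsumer(PubSub pubsub) {
    return pubsub.pullAsync("my-subscription", message -> {
      handle(message);  // application-specific processing
      // As I recall, the consumer acks the message once this callback
      // completes successfully.
    });
  }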

The good news is that we will soon switch to returnImmediately=false for all our pullAsync methods. Keep an eye on #1157 for progress on this.

Even today in production I can consistently see roundtrip latencies in the 120-180ms range that spike to 4-6s every now and then.

By "today in production" you mean you see those spikes when using gRPC stubs? If this is the case I invite you to report this on the Pub/Sub google group where we can get some insights from the Pub/Sub team.

@mziccard added the api: pubsub label Oct 26, 2016
@robin-anil
Author

robin-anil commented Oct 26, 2016

Sorry, these are all 4-CPU, 5GB Google Container Engine nodes. The VM sizes have been stable for the past 6 months, and the only variable there would be the Kubernetes version.

As for the graph, the deepest color is the 50th percentile, then the 90th and 95th, and the lightest one is the p99. The graph is quite skewed, as you can see, but hopefully the overall trend in the tail latencies is visible from how tight the lines were before the switchover on Oct 5 and how much they spread afterwards when we switched to gRPC.

Thanks. I will report this on the cloud-pubsub group.

@mziccard
Contributor

@r-tock Just to confirm, which method were you using to pull messages when using google-cloud-java? Care to share some code? (So I can see whether there are performance issues compared to the gRPC stubs or it's just a matter of setting returnImmediately=false.)

@robin-anil
Author

It may be hard to pull out the actual code, but I can write down the overall architecture.

I was using pull(). We have 4 threads running, each of which pulls from the same subscription on a topic, so we assume each message will be delivered to only one of them (as long as we acknowledge within the timeout). Each thread pulls, processes, and acknowledges the message and then pulls again in a loop. The processing and acknowledging happen on a different executor.

The only difference I can think of is the returnImmediately=false part (ignoring the tail latencies we are getting with gRPC).
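
A rough sketch of the loop (not our actual code; the subscription name, batch size, and handle() are placeholders):

  ExecutorService workers = Executors.newFixedThreadPool(4);

  Runnable pullLoop = () -> {
    while (!Thread.currentThread().isInterrupted()) {
      Iterator<ReceivedMessage> messages = pubsub.pull("my-subscription", 100);
      while (messages.hasNext()) {
        ReceivedMessage message = messages.next();
        workers.submit(() -> {
          handle(message);  // application-specific processing
          message.ack();    // must complete before the ack deadline
        });
      }
      // no sleep between iterations; we relied on the old client's
      // returnImmediately=false behavior to avoid busy-polling an empty
      // subscription
    }
  };

  for (int i = 0; i < 4; i++) {
    new Thread(pullLoop, "pubsub-puller-" + i).start();
  }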

Here is how we create the channel that goes into creating the publisher and subscriber stubs. The executor here seems useless, as it does not appear to be used. ClientAuthInterceptor is marked deprecated without a replacement.

  @Provides
  Channel createChannel(AuthCredentials authCredentials) {
    // TLS channel to the Pub/Sub endpoint.
    ManagedChannel managedChannel = NettyChannelBuilder
        .forAddress("pubsub.googleapis.com", 443)
        .negotiationType(NegotiationType.TLS)
        .build();
    GoogleCredentials credentials = authCredentials.credentials().createScoped(SCOPES);
    // Attaches OAuth tokens to every call; deprecated, but we found no documented replacement.
    ClientAuthInterceptor interceptor =
        new ClientAuthInterceptor(credentials, Executors.newSingleThreadExecutor());
    return ClientInterceptors.intercept(managedChannel, interceptor);
  }
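
For completeness, that Channel then feeds the generated stubs roughly like this (sketch; channel is the Channel returned by createChannel() above, and the variable names are ours):

  SubscriberGrpc.SubscriberBlockingStub subscriberStub =
      SubscriberGrpc.newBlockingStub(channel);
  PublisherGrpc.PublisherBlockingStub publisherStub =
      PublisherGrpc.newBlockingStub(channel);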

@mziccard
Contributor

Each thread pulls, processes, and acknowledges the message and then pulls again in a loop.

Were you using some sort of timeout/sleep inside the loop?

@robin-anil
Author

Nope, no timeout or sleep, which is why we relied so heavily on returnImmediately=false.
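
For reference, with the gRPC stubs a single pull looks roughly like this (sketch; the subscription path and max-messages value are placeholders, and subscriberStub is the blocking stub from the snippet above):

  PullRequest request = PullRequest.newBuilder()
      .setSubscription("projects/my-project/subscriptions/my-subscription")
      .setReturnImmediately(false)  // let the server hold the request until messages arrive
      .setMaxMessages(100)
      .build();
  PullResponse response = subscriberStub.pull(request);
  for (ReceivedMessage received : response.getReceivedMessagesList()) {
    // hand off to the processing executor, then acknowledge with an
    // AcknowledgeRequest containing received.getAckId()
  }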

@robin-anil
Author

@mziccard More investigation is in the groups thread:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/cloud-pubsub-discuss/jE92Hbc2JbY/LEzfxlfPCAAJ

You might be interested to know the behavior differences between the old pubsub and the new pubsub.

@garrettjonesgoogle
Member

This issue is being worked on internally and may take some time to turn around.

@garrettjonesgoogle
Member

We are working on a high-perf version of Pub/Sub: #1493

@garrettjonesgoogle
Member

@r-tock, we have merged and released the high-performance rewrite of Pub/Sub. Could you take a look and see if it works better for you?
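
In case it helps, usage of the rewritten client looks roughly like this (sketch only; exact class and method names may differ between releases, and the project/subscription names and handle() are placeholders):

  Subscriber subscriber = Subscriber.defaultBuilder(
      SubscriptionName.create("my-project", "my-subscription"),
      (PubsubMessage message, AckReplyConsumer consumer) -> {
        handle(message);  // application-specific processing
        consumer.ack();
      }).build();
  subscriber.startAsync().awaitRunning();
  // ... on shutdown
  subscriber.stopAsync().awaitTerminated();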

@garrettjonesgoogle
Member

@r-tock I'm going to close out this issue now - re-open if you discover problems with the new version of the library.

@robin-anil
Author

Yeah, sorry, we just haven't had a chance to think about this; other things are trumping it in terms of priority. We will follow up when we are able to catch a breath and think about package upgrades.
In general the direction you took sounds 👍, but we will have to test it under real conditions.
