Changing over pub sub from google client libraries to Google cloud java has increased tail latencies #1338
It's hard to reason about a graph without the actual numbers, your machine/network setup, and the code.
The good news is that we will soon switch to using
By "today in production" you mean you see those spikes when using gRPC stubs? If this is the case I invite you to report this on the Pub/Sub google group where we can get some insights from the Pub/Sub team. |
Sorry, these are all 4-CPU, 5 GB Google Container Engine nodes. VM sizes have been stable for the past 6 months, and the only variable there would be the Kubernetes version. As for the graph, the deepest color is the 50th percentile, then p90 and p95, and the lightest one is the p99. The graph is quite skewed as you can see, but you can hopefully see the overall trend in the tail latencies by looking at how tight the lines were before the switch-over on Oct 5 and afterwards, when we switched to gRPC. Thanks, I will report this on the cloud-pubsub group.
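For readers unfamiliar with how the p50/p90/p95/p99 lines on such a chart are produced, here is a minimal sketch of a percentile calculation over raw latency samples using the nearest-rank method. The class and sample values are illustrative, not the reporter's actual monitoring code; it also shows why a rare 4 s spike barely moves p99 over 100 samples.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile over latency samples (ms).
public class Percentile {
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Nearest-rank: the ceil(p/100 * n)-th smallest value (1-indexed).
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 100 samples: 99 around 120 ms and one 4000 ms spike.
        double[] latencies = new double[100];
        for (int i = 0; i < 99; i++) latencies[i] = 120;
        latencies[99] = 4000;
        System.out.println(percentile(latencies, 50));  // 120.0
        System.out.println(percentile(latencies, 99));  // 120.0
        System.out.println(percentile(latencies, 100)); // 4000.0
    }
}
```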
@r-tock Just to confirm, which method were you using to pull messages when using
It may be hard to pull out the full code, but I can describe the overall architecture. I was using pull(). We have 4 threads running, each of which pulls from the same subscription on a topic, so we assume each message is delivered to only one of them (if we acknowledge within the timeout). Each thread pulls, processes, and acknowledges the message, then pulls again in a loop. The processing and acknowledging happen on a different executor. The only difference I can think of is the returnImmediately=false part (ignoring the tail latencies we are getting with gRPC). Here is how we create the channel that goes into creating the publisher and subscriber stubs. The executor here is useless as it does not seem to be used, and ClientAuthInterceptor is marked deprecated without a replacement.

```java
@Provides
Channel createChannel(AuthCredentials authCredentials) {
    // NettyChannelBuilder.build() returns a ManagedChannel
    ManagedChannel managedChannel = NettyChannelBuilder
        .forAddress("pubsub.googleapis.com", 443)
        .negotiationType(NegotiationType.TLS)
        .build();
    GoogleCredentials credentials = authCredentials.credentials().createScoped(SCOPES);
    ClientAuthInterceptor interceptor =
        new ClientAuthInterceptor(credentials, Executors.newSingleThreadExecutor());
    return ClientInterceptors.intercept(managedChannel, interceptor);
}
```
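The pull/process/ack architecture described above can be sketched as follows. This is a runnable simplification where a local BlockingQueue stands in for the Pub/Sub subscription so there is no network dependency; all class and method names here are hypothetical, not the reporter's actual code.

```java
import java.util.concurrent.*;

// Sketch of the described loop: 4 puller threads share one subscription,
// each pulls a message, hands processing + ack to a separate executor,
// then pulls again. A BlockingQueue stands in for the subscription.
public class PullLoopSketch {
    static final BlockingQueue<String> subscription = new LinkedBlockingQueue<>();
    static final ExecutorService processPool = Executors.newFixedThreadPool(4);

    // Analogous to pull() with returnImmediately=false: block until a
    // message is available instead of returning an empty response.
    static String pull() throws InterruptedException {
        return subscription.take();
    }

    static void ack(String message) { /* acknowledge upstream */ }

    static void pullLoop(int messages) {
        for (int i = 0; i < messages; i++) {
            try {
                String msg = pull();
                // Processing and acking happen on a different executor.
                processPool.submit(() -> ack(msg));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 8; i++) subscription.put("m" + i);
        // 4 puller threads, mirroring the architecture in the comment.
        ExecutorService pullers = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) pullers.submit(() -> pullLoop(2));
        pullers.shutdown();
        pullers.awaitTermination(5, TimeUnit.SECONDS);
        processPool.shutdown();
        processPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("drained=" + subscription.isEmpty());
    }
}
```

Because each message is taken from the shared queue exactly once, only one thread sees it, which matches the "each will get the message once" assumption above.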
Were you using some sort of timeout/sleep inside the loop?
Nope, no timeout or sleep, which is why we relied heavily on returnImmediately=false.
@mziccard There is more investigation in the groups thread. You might be interested in the behavior difference between the old Pub/Sub client and the new one.
This issue is being worked on internally and may take some time to turn around.
We are working on a high-perf version of Pub/Sub: #1493
@r-tock, we have merged and released the high-performance rewrite of Pub/Sub. Could you take a look and see if it works better for you?
@r-tock I'm going to close out this issue now. Re-open if you discover problems with the new version of the library.
Yeah, sorry, we just haven't had a chance to think about this. Other things are taking priority. We will get back to this when we can catch a breath to think about package upgrades.
Hi team, I wasn't sure where to file this issue.
Here is the chart from our production where we measure roundtrip latencies on messages sent through pub-sub as measured by the same server.
As you can see, before the big spike we were on the old Google client libraries and performance was stellar, with p99 close to p50 at around 120 ms.
The big spike you see is when we moved over to the Google Cloud Java version of the Pub/Sub library.
The lower period after that is us moving to using the gRPC stubs directly.
The take-away is that gRPC-based Pub/Sub (with returnImmediately=false) is much worse than what we had on the Google client library version of Pub/Sub.
Even today in production I consistently see roundtrip latencies in the 120-180 ms range that spike to 4-6 s every now and then. This was solidly steady in the 120 ms range previously.
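The roundtrip measurement described above ("as measured by the same server") can be sketched as: stamp the publish time into the message and subtract on receipt at the same machine, so no clock skew is involved. This is an illustrative sketch, not the reporter's actual instrumentation; the class and method names are hypothetical.

```java
// Sketch: roundtrip latency measured by the same server that publishes.
public class RoundtripTimer {
    // Captured at publish time and carried with the message
    // (e.g. as a message attribute).
    static long publishStampMillis() {
        return System.currentTimeMillis();
    }

    // Computed when the same server receives the message back.
    static long roundtripMillis(long publishMillis) {
        return System.currentTimeMillis() - publishMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long stamp = publishStampMillis();
        Thread.sleep(50); // stand-in for publish -> broker -> pull -> delivery
        long rtt = roundtripMillis(stamp);
        System.out.println("roundtrip at least 50 ms: " + (rtt >= 50));
    }
}
```

Since both timestamps come from the same clock, the measurement is immune to skew between publisher and broker, which makes it a clean way to compare the old client library against the gRPC stubs.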