Multicluster: Add gRPC dial timeout #1700
There is retry logic that retries 7 times to send the request to the high-priority cluster, which should take around 25s if the request fails immediately. Can you share your logs showing the type of error the service is encountering? It could be that the request is failing with DEADLINE_EXCEEDED, which suggests the default gRPC timeout is being reached. DEADLINE_EXCEEDED would add the 10s default gRPC deadline per retry, so in this case it could take 70s + 25s, which is close to what you are experiencing. Some potential solutions:
Hi @pooneh-m, I see the retry logic taking effect from the logs. I've redacted some information, but the logs are as follows:
It seems like the gRPC dial timeout here is 20 seconds, with ~6 seconds being added from the backoff, for ~146 seconds blocked in total. Sadly, I imagine there are situations where we would want to continue retrying on "Unavailable" codes, but the other solutions you offered seem reasonable to me.
This is very helpful, thanks! The formula for the retry backoff is `sum(100ms * 2^try){1, 7} = 100*2 + 100*2^2 + ... + 100*2^7 = 25.4s`. It seems that it took 14:00:18 - 13:57:52 = 146s, which is roughly 7 * 20s (timeout) plus the ~6s of backoff observed above. I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case? In the meantime, I highly recommend updating the allocation policy manually to exclude clusters that are down for maintenance.
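To make the arithmetic above easy to double-check, here is a small, self-contained Go snippet; the 100ms base, 7 tries, and 20s dial timeout are taken from the numbers in this thread, and the snippet is only a back-of-the-envelope check, not allocator code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		tries       = 7
		base        = 100 * time.Millisecond // backoff base from the retry formula
		dialTimeout = 20 * time.Second       // per-attempt gRPC dial timeout seen in the logs
	)

	// Backoff between attempts: sum(100ms * 2^try) for try = 1..7.
	var backoff time.Duration
	for try := uint(1); try <= tries; try++ {
		backoff += base << try
	}
	fmt.Println("total backoff:", backoff) // 25.4s

	// Worst case when every dial also blocks until its 20s timeout expires.
	fmt.Println("worst case:", time.Duration(tries)*dialTimeout+backoff) // 2m45.4s
}
```

In the logs above the observed backoff was only ~6s rather than the full 25.4s, which is why the total came to ~146s rather than the full worst case.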
Very reasonable, yes - introducing a total cap of 30s would be a great start, with configuration for this option being a bonus for us. I agree with your intermediate solution of adjusting the allocation policies manually where possible; we'll definitely be doing that where we can. But it's those unplanned "maintenance windows" that worry us a bit here 😉
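As a rough sketch of that 30s total cap idea (this is not the allocator's actual code; `allocateOnce`, the try count, and the exact durations are placeholders), the whole retry loop could be bounded by one parent context while each attempt keeps its own shorter deadline:

```go
package main

import (
	"context"
	"errors"
	"time"
)

// allocateOnce stands in for a single allocation attempt against a remote cluster.
func allocateOnce(ctx context.Context) error {
	return errors.New("unavailable") // placeholder result
}

// allocateWithCap retries with exponential backoff but never runs longer than totalCap overall.
func allocateWithCap(totalCap, perTry time.Duration, tries uint) error {
	ctx, cancel := context.WithTimeout(context.Background(), totalCap)
	defer cancel()

	var lastErr error
	for try := uint(1); try <= tries; try++ {
		// Each attempt gets its own deadline, bounded by whatever is left of the total cap.
		attemptCtx, attemptCancel := context.WithTimeout(ctx, perTry)
		lastErr = allocateOnce(attemptCtx)
		attemptCancel()
		if lastErr == nil {
			return nil
		}
		// Exponential backoff (100ms * 2^try), cut short once the total cap is spent.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond << try):
		}
	}
	return lastErr
}

func main() {
	_ = allocateWithCap(30*time.Second, 10*time.Second, 7)
}
```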
There could also be a solution that adds new fields to the CRD.
@aLekSer - that sounds fantastic - looking forward to it!
When using the multicluster allocator, dialing remote clusters can cause allocation requests to block for 2+ minutes. In order to control the dial timeout of allocation requests, Timeout and BackoffCap properties were added to the game server allocation policy.
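Purely as an illustration of what those properties might look like on the policy's per-cluster connection settings (the surrounding field names are an approximation of the existing CRD, a real CRD would more likely express durations as strings or metav1.Duration, and none of this is the final API):

```go
package main

import "time"

// ClusterConnectionInfo approximates the per-cluster connection settings on a
// GameServerAllocationPolicy. Timeout and BackoffCap are the hypothetical new
// fields discussed in this thread.
type ClusterConnectionInfo struct {
	ClusterName         string   `json:"clusterName"`
	AllocationEndpoints []string `json:"allocationEndpoints"`
	SecretName          string   `json:"secretName"`
	Namespace           string   `json:"namespace"`

	// Timeout bounds a single gRPC dial/allocation call to this cluster.
	Timeout time.Duration `json:"timeout,omitempty"`
	// BackoffCap bounds the total exponential backoff spent retrying this cluster.
	BackoffCap time.Duration `json:"backoffCap,omitempty"`
}

func main() {
	_ = ClusterConnectionInfo{Timeout: 10 * time.Second, BackoffCap: 30 * time.Second}
}
```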
Why do we need to add a per-cluster setting for this? Since the API is stable, every change to these fields should go through a feature stage, starting with Alpha.
What happened:
When using the multicluster allocator, dialling remote clusters which are unhealthy can cause allocation requests to block for 2 minutes+.
What you expected to happen:
A sensible default, or a configuration option, to control the dial timeout of multicluster allocation gRPC requests.
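For what it's worth, grpc-go already lets a caller bound a blocking dial with a context deadline; a minimal sketch of the kind of knob being asked for (the endpoint and credentials here are placeholders, not the allocator's real configuration):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Bound the dial attempt to 10 seconds instead of letting it block indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// WithBlock makes DialContext wait for the connection (or the deadline) before returning.
	conn, err := grpc.DialContext(ctx, "allocator.remote-cluster.example:443",
		grpc.WithTransportCredentials(insecure.NewCredentials()), // the real allocator connection uses mTLS
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	log.Println("connected to", conn.Target())
}
```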
How to reproduce it (as minimally and precisely as possible):
1. Deploy a `gameserverallocationpolicy` that allocates to some blocking endpoint. We have observed this issue occurring with a DNS name that resolves to the IP of a Google LB that has no configured upstreams (because the cluster has been taken down for maintenance, for example); a local stand-in for such an endpoint is sketched after these steps.
2. Observe that allocation requests following this policy will block for ~150 seconds before responding.
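To reproduce the blocking behaviour locally without tearing down a cluster, one option (our own stand-in, not part of the original report) is a TCP listener that accepts connections but never answers, which makes a blocking gRPC dial hang until its timeout fires:

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Accept TCP connections but never speak TLS/HTTP2 back, roughly mimicking a
	// load balancer whose upstreams have been removed.
	ln, err := net.Listen("tcp", "127.0.0.1:8443")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("blackhole endpoint listening on", ln.Addr())

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			// Read and discard whatever the client sends; never write a response,
			// so the client's dial only returns when its own deadline expires.
			_, _ = io.Copy(io.Discard, c)
		}(conn)
	}
}
```

Pointing the policy's allocation endpoint at this address (a hypothetical local setup) should show the same multi-minute blocking.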
Anything else we need to know?:
Our use-case is slightly more involved than the minimum repro scenario: we'd like to have a second, low-priority cluster configured as "failover" for when the first cluster is unhealthy or down for maintenance. The behaviour we've observed is that when the high-priority cluster blackholes requests (as is the default behaviour when a GKE cluster with GCP load balancers is taken down), allocation requests take ~150 seconds to succeed as they block waiting for the high-priority cluster to respond.
Environment:
- Agones version: 1.7.0
- Kubernetes version: 1.16.10