Datastore: _Rendezvous of RPC that terminated with StatusCode.UNAVAILABLE #2583
@Bogdanp Sorry this has been happening. gRPC support for datastore was added in 0.19.0, so the 0.20.0 upgrade wouldn't have changed anything. Your issue is essentially two issues:
@Bogdanp now that #2497 has been fixed, the retry question is the only thing that remains. We have "avoided" adding automatic retries because we don't want to surprise users with under-the-covers magic. How / where would you see retries being used (explicitly or implicitly) in the interfaces of ours that you're familiar with?
@dhermes here's what that snippet I posted in my first comment evolved into over the past week:

```python
import functools
import time

import grpc
import google.cloud.exceptions
from google.cloud import datastore

# `bread` is our internal ORM (only its logger is used here) and
# `RetriesExceeded` is our own exception type.

# The maximum number of retries that should be done per Datastore
# error code.
_MAX_RETRIES_BY_CODE = {
    grpc.StatusCode.INTERNAL: 1,
    grpc.StatusCode.ABORTED: 5,  # Only retried for non-transactional commits
    grpc.StatusCode.UNAVAILABLE: 5,
    grpc.StatusCode.DEADLINE_EXCEEDED: 5,
}


def _handle_errors(f, transactional=False):
    @functools.wraps(f)
    def handler(*args, **kwargs):
        retries = 0
        while True:
            try:
                return f(*args, **kwargs)
            # TODO: Replace w/ concrete error types once they are
            # added to gcloud. See: google-cloud-python/issues/2583
            except (
                google.cloud.exceptions.Conflict,  # gcloud catches ABORTED
                google.cloud.exceptions._Rendezvous,
            ) as e:
                if isinstance(e, google.cloud.exceptions.Conflict):
                    code = grpc.StatusCode.ABORTED
                else:
                    code = e.code()

                max_retries = _MAX_RETRIES_BY_CODE.get(code)
                if max_retries is None or (transactional and code == grpc.StatusCode.ABORTED):
                    raise

                if retries > max_retries:
                    raise RetriesExceeded(e)

                backoff = min(0.0625 * 2 ** retries, 1.0)
                bread.get_logger().debug(
                    "Sleeping for %r before retrying failed request...", backoff)

                retries += 1
                time.sleep(backoff)

    return handler


class Client(datastore.Client):
    def __init__(self, *args, **kwargs):
        super(Client, self).__init__(*args, **kwargs)
        self.delete_multi = _handle_errors(self.delete_multi)
        self.get_multi = _handle_errors(self.get_multi)
        self.put_multi = _handle_errors(self.put_multi)

    def transaction(self, *args, **kwargs):
        transaction = super(Client, self).transaction(*args, **kwargs)
        transaction.commit = _handle_errors(transaction.commit, transactional=True)
        return transaction

    def query(self, *args, **kwargs):
        query = super(Client, self).query(*args, **kwargs)
        query.fetch = _handle_errors(query.fetch)
        return query
```

We have been running this in our staging environment for about a week now and it seems to have cut the error rate down to nearly zero. This is the set of Datastore methods we use in our ORM which we've seen raise at least one of these errors: `delete_multi`, `get_multi`, `put_multi`, `Transaction.commit`, and `Query.fetch`.
I suspect
Personally, I believe number 2 does not scale in large codebases/orgs. I think the first option is perfectly reasonable, but it has the disadvantage that learned experiences are less likely to be shared between users, whereas if retries were consolidated in the library (perhaps with an option to turn them off, or to configure max retries and backoff figures) it would lead to more reliable code across the board.

Regarding your changes, I notice that in some cases multiple error codes map to the same exception and the exceptions don't keep track of the original code. This means that a retriable error code like

Regarding where and how these should be handled by the library, I think all of the user-facing methods that end up hitting Datastore endpoints should transparently handle these errors by default. An option to disable retries, either at the client level or the method level, might be desirable, but we have no use case for it currently. If you think that's too risky/high level then, alternatively, the library could expose a
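A generic helper along those lines might look roughly like the sketch below. It is purely illustrative: `with_retries`, its parameters, and `SomeTransientError` are invented names, not anything google-cloud-python actually exposes.

```python
import functools
import random
import time


def with_retries(func, retriable=(Exception,), max_retries=5,
                 base_backoff=0.0625, max_backoff=1.0):
    """Return ``func`` wrapped so that ``retriable`` exceptions are retried
    with capped exponential backoff plus a little jitter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except retriable:
                if attempt == max_retries:
                    raise
                backoff = min(base_backoff * 2 ** attempt, max_backoff)
                time.sleep(backoff + random.uniform(0, backoff / 10))

    return wrapper


# Hypothetical usage, wrapping a single method on an existing client:
# client.get_multi = with_retries(client.get_multi,
#                                 retriable=(SomeTransientError,))
```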
Sounds good!
We definitely can pack more info into the base exception class.
You could be seeing a strange race condition with token refresh / expiry.
@dhermes any updates on this?
None yet, though returning "fresh" to this issue after two months, it seems we should close it and open 1, 2 or 3 new issues with a focused goal. WDYT?
Sounds good to me!
I agree that built-in, optional client-side retry would be nice, as incorporating retries into an existing code base can be fairly cumbersome. :)
I am also experiencing this issue, and have resorted to wrapping my datastore code in a try/except: it first tries with the current datastore client, and if that fails, gets a new client and tries the query again. I am using a service account to access datastore, and this seems to happen more frequently if I use the same service account from multiple computers (i.e. I try it from my home computer at the same time as running a query with that service account from a Google Compute Engine server). As a rough illustration of the pattern, see the sketch below.
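A rough sketch of that pattern (the kind name, project id, and single-retry policy are placeholders for whatever the real code does):

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")


def fetch_tasks():
    global client
    try:
        return list(client.query(kind="Task").fetch())
    except Exception:
        # On failure, throw away the old client (and its channel /
        # credentials) and retry the query once with a fresh one.
        client = datastore.Client(project="my-project")
        return list(client.query(kind="Task").fetch())
```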
@devries I ended up writing a small retry decorator to make things easier. There are third-party libraries available that may help, as well as a recipe here.
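For example, with a third-party library such as `tenacity` (just one option; the comment doesn't name a specific library, and mapping `UNAVAILABLE` to `ServiceUnavailable` is an assumption about how the error surfaces in a given library version):

```python
import tenacity
from google.cloud import datastore
from google.cloud.exceptions import ServiceUnavailable

client = datastore.Client(project="my-project")


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ServiceUnavailable),
    wait=tenacity.wait_exponential(multiplier=0.0625, max=1),
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
def get_task(key):
    # Retries UNAVAILABLE-style failures with capped exponential backoff.
    return client.get(key)


task = get_task(client.key("Task", 1234))
```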
@kunalq thank you. I've done something similar, but your recipe is a lot more comprehensive and robust.
Hello, As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates. My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request. Thank you!
We see this fairly often on `commit` with google-cloud-datastore version 0.20. I believe these should either be retried with exponential backoff automatically by the library (according to this) or a more specific error should be raised so user code can deal w/ it (preferably one exception for every one of the cases listed on that doc).

Edit: Here's our current (somewhat tested) workaround for this issue in our internal ORM: