
Datastore: _Rendezvous of RPC that terminated with StatusCode.UNAVAILABLE #2583

Closed
Bogdanp opened this issue Oct 21, 2016 · 12 comments
Labels
api: datastore (Issues related to the Datastore API), grpc, priority: p2 (Moderately-important priority. Fix may not be included in next release.)

Comments

@Bogdanp

Bogdanp commented Oct 21, 2016

We see this fairly often on commit with google-cloud-datastore version 0.20. I believe these errors should either be retried automatically by the library with exponential backoff (according to this) or a more specific error should be raised so user code can deal with it (preferably one exception for each of the cases listed in that doc).

_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1476898717.596308747","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1476898717.596257572","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>

Edit:

Here's our current (somewhat tested) workaround for this issue in our internal ORM:

import functools
import time

import grpc
import google.cloud.exceptions
from google.cloud import datastore

# (`bread` below is our internal package, used here only for logging, and
# `RetriesExceeded` is our own exception type.)

# The maximum number of retries that should be done per Datastore
# error code.
_MAX_RETRIES_BY_CODE = {
    grpc.StatusCode.INTERNAL: 1,
    grpc.StatusCode.ABORTED: 5,  # Only retried for non-transactional commits
    grpc.StatusCode.UNAVAILABLE: 5,
    grpc.StatusCode.DEADLINE_EXCEEDED: 5,
}


def _handle_errors(f, transactional=False):
    @functools.wraps(f)
    def handler(*args, **kwargs):
        retries = 0
        while True:
            try:
                return f(*args, **kwargs)
            # TODO: Replace w/ concrete error types once/if they are
            # added to gcloud.  See: google-cloud-python/issues/2583
            except google.cloud.exceptions._Rendezvous as e:
                code = e.code()
                max_retries = _MAX_RETRIES_BY_CODE.get(code)
                # Don't retry codes we don't know about, and never retry
                # ABORTED for transactional commits.
                if max_retries is None or transactional and code == grpc.StatusCode.ABORTED:
                    raise

                if retries > max_retries:
                    raise RetriesExceeded(e)

                # Exponential backoff: 62.5ms, 125ms, 250ms, ..., capped at 1s.
                backoff = min(0.0625 * 2 ** retries, 1.0)
                bread.get_logger().debug("Sleeping for %r before retrying failed request...", backoff)

                retries += 1
                time.sleep(backoff)

    return handler


class Client(datastore.Client):
    def __init__(self, *args, **kwargs):
        super(Client, self).__init__(*args, **kwargs)

        self.delete_multi = _handle_errors(self.delete_multi)
        self.get_multi = _handle_errors(self.get_multi)
        self.put_multi = _handle_errors(self.put_multi)

    def transaction(self, *args, **kwargs):
        transaction = super(Client, self).transaction(*args, **kwargs)
        transaction.commit = _handle_errors(transaction.commit, transactional=True)
        return transaction

    def query(self, *args, **kwargs):
        query = super(Client, self).query(*args, **kwargs)
        query.fetch = _handle_errors(query.fetch)
        return query

@dhermes dhermes added the api: datastore and grpc labels on Oct 21, 2016
@dhermes
Contributor

dhermes commented Oct 21, 2016

@Bogdanp Sorry this has been happening. gRPC support for datastore was added in 0.19.0, so the 0.20.0 upgrade wouldn't have changed anything. UNAVAILABLE is essentially the same as a 503 (but we expect a retry to be just fine). Why is that relevant? Because a 503 is a sign that something bad happened on the server, not that something was wrong with the request.

So your issue is essentially two issues:

  1. surfacing a concrete, library-specific exception instead of the raw _Rendezvous (that's #2497), and
  2. retrying these errors automatically.

@dhermes
Contributor

dhermes commented Oct 24, 2016

@Bogdanp now that #2497 has been fixed, the retry question is the only thing that remains.

We have "avoided" adding automatic retries because we don't want to surprise users with under-the-covers magic. How / where would you see retries being used (explicitly or implicitly) in our interfaces that you're familiar with?

@Bogdanp
Author

Bogdanp commented Oct 25, 2016

@dhermes here's what that snippet I posted in my first comment evolved into over the past week:

# The maximum number of retries that should be done per Datastore
# error code.
_MAX_RETRIES_BY_CODE = {
    grpc.StatusCode.INTERNAL: 1,
    grpc.StatusCode.ABORTED: 5,  # Only retried for non-transactional commits
    grpc.StatusCode.UNAVAILABLE: 5,
    grpc.StatusCode.DEADLINE_EXCEEDED: 5,
}


def _handle_errors(f, transactional=False):
    @functools.wraps(f)
    def handler(*args, **kwargs):
        retries = 0
        while True:
            try:
                return f(*args, **kwargs)
            # TODO: Replace w/ concrete error types once they are
            # added to gcloud.  See: google-cloud-python/issues/2583
            except (
                google.cloud.exceptions.Conflict,  # gcloud catches ABORTED
                google.cloud.exceptions._Rendezvous
            ) as e:
                if isinstance(e, google.cloud.exceptions.Conflict):
                    code = grpc.StatusCode.ABORTED
                else:
                    code = e.code()

                max_retries = _MAX_RETRIES_BY_CODE.get(code)
                if max_retries is None or transactional and code == grpc.StatusCode.ABORTED:
                    raise

                if retries > max_retries:
                    raise RetriesExceeded(e)

                backoff = min(0.0625 * 2 ** retries, 1.0)
                bread.get_logger().debug("Sleeping for %r before retrying failed request...", backoff)

                retries += 1
                time.sleep(backoff)

    return handler


class Client(datastore.Client):
    def __init__(self, *args, **kwargs):
        super(Client, self).__init__(*args, **kwargs)

        self.delete_multi = _handle_errors(self.delete_multi)
        self.get_multi = _handle_errors(self.get_multi)
        self.put_multi = _handle_errors(self.put_multi)

    def transaction(self, *args, **kwargs):
        transaction = super(Client, self).transaction(*args, **kwargs)
        transaction.commit = _handle_errors(transaction.commit, transactional=True)
        return transaction

    def query(self, *args, **kwargs):
        query = super(Client, self).query(*args, **kwargs)
        query.fetch = _handle_errors(query.fetch)
        return query

We have been running this in our staging environment for about a week now and it seems to have cut the error rate down to nearly zero (we see a _Rendezvous with code UNAUTHENTICATED get raised every couple of days -- not sure what to make of that yet). In production, we've been running gcloud versions <=0.17 with gcloud_requests for quite a while now, which employs the same error handling model. Our error rate in production is 0: most retries succeed immediately and we've never gone over the retry limit (lowest is 1, highest is 5) for any of the error codes.

This is the set of Datastore methods we use in our ORM which we've seen raise at least one of these errors:

  • datastore.Client.{delete,get,put}_multi
  • datastore.Query.fetch
  • datastore.Iterator.next_page
  • datastore.Transaction.commit

I suspect datastore.Transaction.rollback might also need to be patched, but we do that infrequently enough in the staging environment that it hasn't affected us yet so we haven't wrapped it. At any rate, under a production workload all of these methods will definitely end up raising one of these (retriable) errors (quite frequently!) so, if the library doesn't handle them automatically, end users have two options:

  1. either they wrap the library and expose a safe interface like we have, or
  2. they handle retries at every call site

Personally, I believe number 2 does not scale in large codebases/orgs. I think the first option is perfectly reasonable, but it has the disadvantage that learned experiences are less likely to be shared between users whereas if retries were consolidated in the library (perhaps with an option to turn them off or configure max retries and backoff figures) it would lead to more reliable code across the board.

Regarding your changes, I notice that in some cases multiple error codes map to the same exception and the exceptions don't keep track of the original code. This means that a retriable error code like INTERNAL and [I believe] a non-retriable error code like DATA_LOSS both end up mapping to InternalServerError, so user code can't reliably determine if it should retry or not. The same is true of ABORTED and ALREADY_EXISTS, since they both map to Conflict.

Regarding where and how these should be handled by the library, I think all of the user-facing methods that end up hitting datastore endpoints should transparently handle these errors by default. An option to disable retries either at the client level or the method level might be desirable but we have no use case for it currently. If you think that's too risky/high level then, alternatively, the library could expose a datastore.SafeClient that does that, but I don't see much benefit there. Thinking about this from a normal library user's perspective, I'm sure most users would appreciate operations slowing down every once in a while but still succeeding rather than outright failing with what's likely going to be an obscure error to them :).
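
To make that concrete, here is roughly the shape I have in mind -- a purely hypothetical sketch, since neither a Retry class nor a retry argument exists in the library today:

from google.cloud import datastore

# Hypothetical interface: retries on by default, configurable or disabled
# entirely per client.  `datastore.Retry` and the `retry` keyword argument
# are made up for illustration; they are not part of google-cloud-datastore.
client = datastore.Client(retry=datastore.Retry(max_retries=5, max_backoff=1.0))
no_retry_client = datastore.Client(retry=None)

# Every user-facing call (get/put/delete/query/commit) would then retry
# UNAVAILABLE, DEADLINE_EXCEEDED, etc. transparently before raising.
entity = client.get(client.key("SomeKind", 1234))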

@dhermes
Contributor

dhermes commented Oct 25, 2016

if retries were consolidated in the library (perhaps with an option to turn them off or configure max retries and backoff figures) it would lead to more reliable code across the board.

Sounds good!

user code can't reliably determine if it should retry or not.

We definitely can pack more info into the base GoogleCloudError class.
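
Roughly what I have in mind (just a sketch -- the grpc_status_code attribute name here is made up, not part of the current class):

class GoogleCloudError(Exception):
    """Base error that also carries the originating gRPC status code."""

    code = None  # HTTP status code equivalent (e.g. 500, 503)

    def __init__(self, message, errors=(), grpc_status_code=None):
        super(GoogleCloudError, self).__init__(message)
        self.message = message
        self._errors = errors
        # Keep the original gRPC status so callers can tell, e.g.,
        # INTERNAL (retriable) apart from DATA_LOSS (not retriable),
        # even though both currently map to InternalServerError.
        self.grpc_status_code = grpc_status_code

User code could then check exc.grpc_status_code against its own retry policy instead of guessing from the exception class.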

UNAUTHENTICATED get raised every couple of days

You could be seeing a strange race condition with token refresh / expiry.

@Bogdanp
Author

Bogdanp commented Dec 16, 2016

@dhermes any updates on this?

@dhermes
Contributor

dhermes commented Dec 16, 2016

None yet, though returning "fresh" to this issue after two months, it seems we should close it and open 1, 2, or 3 new issues with a focused goal. WDYT?

@Bogdanp
Author

Bogdanp commented Dec 16, 2016

Sounds good to me!

@kunalq

kunalq commented Dec 17, 2016

I agree that built-in, optional client-side retry would be nice, as incorporating retries into an existing code base can be fairly cumbersome. :)

@lukesneeringer lukesneeringer added the priority: p2 label on Apr 19, 2017
@devries

devries commented Apr 28, 2017

I am also experiencing this issue, and have resorted to wrapping my datastore code in a try/except that first tries with the current datastore client and, if that fails, creates a new client and runs the query again. I am using a service account to access Datastore, and this seems to happen more frequently when I use the same service account from multiple computers (e.g. running a query from my home computer at the same time as from a Google Compute Engine server).
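
Roughly what that looks like (simplified -- run_query here is a stand-in for my actual query code):

import grpc
import google.cloud.exceptions
from google.cloud import datastore

client = datastore.Client()


def run_query(ds_client):
    # Stand-in for the real query: fetch all entities of some kind.
    return list(ds_client.query(kind="Task").fetch())


def query_with_fresh_client_on_failure():
    global client
    try:
        return run_query(client)
    except (grpc.RpcError, google.cloud.exceptions.GoogleCloudError):
        # The call failed (e.g. UNAVAILABLE): rebuild the client and retry once.
        client = datastore.Client()
        return run_query(client)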

@kunalq

kunalq commented Apr 28, 2017

@devries I ended up writing a small retry decorator to make things easier. There are 3rd-party libraries available that may help, as well as a recipe here.
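
The core of it is something like this (trimmed down; tune the status codes, limits, and backoff to taste):

import functools
import time

import grpc


def retry_on_unavailable(max_retries=5, base_delay=0.0625, max_delay=1.0):
    """Retry the wrapped call on UNAVAILABLE with capped exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except grpc.RpcError as exc:
                    # Give up on the last attempt or on any other status code.
                    if attempt == max_retries or exc.code() != grpc.StatusCode.UNAVAILABLE:
                        raise
                    time.sleep(min(base_delay * 2 ** attempt, max_delay))
        return wrapper
    return decorator


# Example usage:
# @retry_on_unavailable()
# def put_entity(client, entity):
#     client.put(entity)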

@devries

devries commented Apr 29, 2017

@kunalq thank you. I've done something similar, but your recipe is a lot more comprehensive and robust.

@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!
