
Datastore: _Rendezvous of RPC that terminated with StatusCode.UNAVAILABLE #2583

Closed
Bogdanp opened this issue Oct 21, 2016 · 12 comments
Labels
api: datastore (Issues related to the Datastore API), grpc, priority: p2 (Moderately-important priority. Fix may not be included in next release.)

Comments

@Bogdanp

Bogdanp commented Oct 21, 2016

We see this fairly often on commit with google-cloud-datastore version 0.20. I believe these errors should either be retried automatically by the library with exponential backoff (according to this) or a more specific error should be raised so user code can deal with it (preferably one exception for each of the cases listed in that doc).

_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1476898717.596308747","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1476898717.596257572","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>

Edit:

Here's our current (somewhat tested) workaround for this issue in our internal ORM:

import functools
import time

import grpc
import google.cloud.exceptions
from google.cloud import datastore

# (`bread` below is our internal package, used here only for logging, and
# `RetriesExceeded` is our own exception type.)

# The maximum number of retries that should be done per Datastore
# error code.
_MAX_RETRIES_BY_CODE = {
    grpc.StatusCode.INTERNAL: 1,
    grpc.StatusCode.ABORTED: 5,  # Only retried for non-transactional commits
    grpc.StatusCode.UNAVAILABLE: 5,
    grpc.StatusCode.DEADLINE_EXCEEDED: 5,
}


def _handle_errors(f, transactional=False):
    @functools.wraps(f)
    def handler(*args, **kwargs):
        retries = 0
        while True:
            try:
                return f(*args, **kwargs)
            # TODO: Replace w/ concrete error types once/if they are
            # added to gcloud.  See: google-cloud-python/issues/2583
            except google.cloud.exceptions._Rendezvous as e:
                code = e.code()
                max_retries = _MAX_RETRIES_BY_CODE.get(code)
                # Don't retry codes we don't know about, and never retry
                # ABORTED for transactional commits.
                if max_retries is None or transactional and code == grpc.StatusCode.ABORTED:
                    raise

                if retries > max_retries:
                    raise RetriesExceeded(e)

                # Exponential backoff: 62.5ms, 125ms, 250ms, ..., capped at 1s.
                backoff = min(0.0625 * 2 ** retries, 1.0)
                bread.get_logger().debug("Sleeping for %r before retrying failed request...", backoff)

                retries += 1
                time.sleep(backoff)

    return handler


class Client(datastore.Client):
    def __init__(self, *args, **kwargs):
        super(Client, self).__init__(*args, **kwargs)

        self.delete_multi = _handle_errors(self.delete_multi)
        self.get_multi = _handle_errors(self.get_multi)
        self.put_multi = _handle_errors(self.put_multi)

    def transaction(self, *args, **kwargs):
        transaction = super(Client, self).transaction(*args, **kwargs)
        transaction.commit = _handle_errors(transaction.commit, transactional=True)
        return transaction

    def query(self, *args, **kwargs):
        query = super(Client, self).query(*args, **kwargs)
        query.fetch = _handle_errors(query.fetch)
        return query

@dhermes dhermes added the api: datastore and grpc labels on Oct 21, 2016
@dhermes
Contributor

dhermes commented Oct 21, 2016

@Bogdanp Sorry this has been happening. gRPC support for datastore was added in 0.19.0, so the 0.20.0 upgrade wouldn't have changed anything. UNAVAILABLE is essentially the same as a 503 (but we expect a retry to be just fine). Why is that relevant? Because a 503 is a sign that something bad happened on the server, not that something was wrong with the request.

So your issue is essentially two issues:

  1. surfacing a concrete, library-specific exception instead of the raw _Rendezvous (that's #2497), and
  2. retrying these errors automatically.

@dhermes
Contributor

dhermes commented Oct 24, 2016

@Bogdanp now that #2497 has been fixed, the retry question is the only thing that remains.

We have "avoided" adding automatic retries because we don't want to surprise users with under-the-covers magic. How / where would you see retries being used (explicitly or implicitly) in our interfaces that you're familiar with?

@Bogdanp
Author

Bogdanp commented Oct 25, 2016

@dhermes here's what that snippet I posted in my first comment evolved into over the past week:

# The maximum number of retries that should be done per Datastore
# error code.
_MAX_RETRIES_BY_CODE = {
    grpc.StatusCode.INTERNAL: 1,
    grpc.StatusCode.ABORTED: 5,  # Only retried for non-transactional commits
    grpc.StatusCode.UNAVAILABLE: 5,
    grpc.StatusCode.DEADLINE_EXCEEDED: 5,
}


def _handle_errors(f, transactional=False):
    @functools.wraps(f)
    def handler(*args, **kwargs):
        retries = 0
        while True:
            try:
                return f(*args, **kwargs)
            # TODO: Replace w/ concrete error types once they are
            # added to gcloud.  See: google-cloud-python/issues/2583
            except (
                google.cloud.exceptions.Conflict,  # gcloud catches ABORTED
                google.cloud.exceptions._Rendezvous
            ) as e:
                if isinstance(e, google.cloud.exceptions.Conflict):
                    code = grpc.StatusCode.ABORTED
                else:
                    code = e.code()

                max_retries = _MAX_RETRIES_BY_CODE.get(code)
                if max_retries is None or transactional and code == grpc.StatusCode.ABORTED:
                    raise

                if retries > max_retries:
                    raise RetriesExceeded(e)

                backoff = min(0.0625 * 2 ** retries, 1.0)
                bread.get_logger().debug("Sleeping for %r before retrying failed request...", backoff)

                retries += 1
                time.sleep(backoff)

    return handler


class Client(datastore.Client):
    def __init__(self, *args, **kwargs):
        super(Client, self).__init__(*args, **kwargs)

        self.delete_multi = _handle_errors(self.delete_multi)
        self.get_multi = _handle_errors(self.get_multi)
        self.put_multi = _handle_errors(self.put_multi)

    def transaction(self, *args, **kwargs):
        transaction = super(Client, self).transaction(*args, **kwargs)
        transaction.commit = _handle_errors(transaction.commit, transactional=True)
        return transaction

    def query(self, *args, **kwargs):
        query = super(Client, self).query(*args, **kwargs)
        query.fetch = _handle_errors(query.fetch)
        return query

We have been running this in our staging environment for about a week now and it seems to have cut the error rate down to nearly zero (we see a _Rendezvous with code UNAUTHENTICATED get raised every couple of days -- not sure what to make of that yet). In production, we've been running gcloud versions <=0.17 with gcloud_requests for quite a while now, which employs the same error handling model. Our error rate in production is 0: most retries succeed immediately and we've never gone over the retry limit (lowest is 1, highest is 5) for any of the error codes.

This is the set of Datastore methods we use in our ORM which we've seen raise at least one of these errors:

  • datastore.Client.{delete,get,put}_multi
  • datastore.Query.fetch
  • datastore.Iterator.next_page
  • datastore.Transaction.commit

I suspect datastore.Transaction.rollback might also need to be patched, but we do that infrequently enough in the staging environment that it hasn't affected us yet so we haven't wrapped it. At any rate, under a production workload all of these methods will definitely end up raising one of these (retriable) errors (quite frequently!) so, if the library doesn't handle them automatically, end users have two options:

  1. either they wrap the library and expose a safe interface like we have, or
  2. they handle retries at every call site

Personally, I believe number 2 does not scale in large codebases/orgs. I think the first option is perfectly reasonable, but it has the disadvantage that learned experiences are less likely to be shared between users whereas if retries were consolidated in the library (perhaps with an option to turn them off or configure max retries and backoff figures) it would lead to more reliable code across the board.

Regarding your changes, I notice that in some cases multiple error codes map to the same exception and the exceptions don't keep track of the original code. This means that a retriable error code like INTERNAL and [I believe] a non-retriable error code like DATA_LOSS both end up mapping to InternalServerError, so user code can't reliably determine if it should retry or not. The same is true of ABORTED and ALREADY_EXISTS, since they both map to Conflict.

Regarding where and how these should be handled by the library, I think all of the user-facing methods that end up hitting datastore endpoints should transparently handle these errors by default. An option to disable retries either at the client level or the method level might be desirable but we have no use case for it currently. If you think that's too risky/high level then, alternatively, the library could expose a datastore.SafeClient that does that, but I don't see much benefit there. Thinking about this from a normal library user's perspective, I'm sure most users would appreciate operations slowing down every once in a while but still succeeding rather than outright failing with what's likely going to be an obscure error to them :).
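
To make that concrete, here is roughly the shape I have in mind -- a purely hypothetical sketch, since neither a Retry class nor a retry argument exists in the library today:

from google.cloud import datastore

# Hypothetical interface: retries on by default, configurable or disabled
# entirely per client.  `datastore.Retry` and the `retry` keyword argument
# are made up for illustration; they are not part of google-cloud-datastore.
client = datastore.Client(retry=datastore.Retry(max_retries=5, max_backoff=1.0))
no_retry_client = datastore.Client(retry=None)

# Every user-facing call (get/put/delete/query/commit) would then retry
# UNAVAILABLE, DEADLINE_EXCEEDED, etc. transparently before raising.
entity = client.get(client.key("SomeKind", 1234))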

@dhermes
Contributor

dhermes commented Oct 25, 2016

if retries were consolidated in the library (perhaps with an option to turn them off or configure max retries and backoff figures) it would lead to more reliable code across the board.

Sounds good!

user code can't reliably determine if it should retry or not.

We definitely can pack more info into the base GoogleCloudError class.
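
Roughly what I have in mind (just a sketch -- the grpc_status_code attribute name here is made up, not part of the current class):

class GoogleCloudError(Exception):
    """Base error that also carries the originating gRPC status code."""

    code = None  # HTTP status code equivalent (e.g. 500, 503)

    def __init__(self, message, errors=(), grpc_status_code=None):
        super(GoogleCloudError, self).__init__(message)
        self.message = message
        self._errors = errors
        # Keep the original gRPC status so callers can tell, e.g.,
        # INTERNAL (retriable) apart from DATA_LOSS (not retriable),
        # even though both currently map to InternalServerError.
        self.grpc_status_code = grpc_status_code

User code could then check exc.grpc_status_code against its own retry policy instead of guessing from the exception class.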

UNAUTHENTICATED get raised every couple of days

You could be seeing a strange race condition with token refresh / expiry.

@Bogdanp
Author

Bogdanp commented Dec 16, 2016

@dhermes any updates on this?

@dhermes
Contributor

dhermes commented Dec 16, 2016

None yet, though returning "fresh" to this issue after two months, it seems we should close it and open 1, 2, or 3 new issues with a focused goal. WDYT?

@Bogdanp
Author

Bogdanp commented Dec 16, 2016

Sounds good to me!

@kunalq

kunalq commented Dec 17, 2016

I agree that built-in, optional client-side retry would be nice, as incorporating retries into an existing code base can be fairly cumbersome. :)

@lukesneeringer lukesneeringer added the priority: p2 label on Apr 19, 2017
@devries

devries commented Apr 28, 2017

I am also experiencing this issue, and have resorted to wrapping my datastore code in a try/except that first tries with the current datastore client and, if that fails, creates a new client and runs the query again. I am using a service account to access Datastore, and this seems to happen more frequently when I use the same service account from multiple computers (e.g. running a query from my home computer at the same time as from a Google Compute Engine server).
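
Roughly what that looks like (simplified -- run_query here is a stand-in for my actual query code):

import grpc
import google.cloud.exceptions
from google.cloud import datastore

client = datastore.Client()


def run_query(ds_client):
    # Stand-in for the real query: fetch all entities of some kind.
    return list(ds_client.query(kind="Task").fetch())


def query_with_fresh_client_on_failure():
    global client
    try:
        return run_query(client)
    except (grpc.RpcError, google.cloud.exceptions.GoogleCloudError):
        # The call failed (e.g. UNAVAILABLE): rebuild the client and retry once.
        client = datastore.Client()
        return run_query(client)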

@kunalq

kunalq commented Apr 28, 2017

@devries I ended up writing a small retry decorator to make things easier. There are 3rd-party libraries available that may help, as well as a recipe here.
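
The core of it is something like this (trimmed down; tune the status codes, limits, and backoff to taste):

import functools
import time

import grpc


def retry_on_unavailable(max_retries=5, base_delay=0.0625, max_delay=1.0):
    """Retry the wrapped call on UNAVAILABLE with capped exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except grpc.RpcError as exc:
                    # Give up on the last attempt or on any other status code.
                    if attempt == max_retries or exc.code() != grpc.StatusCode.UNAVAILABLE:
                        raise
                    time.sleep(min(base_delay * 2 ** attempt, max_delay))
        return wrapper
    return decorator


# Example usage:
# @retry_on_unavailable()
# def put_entity(client, entity):
#     client.put(entity)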

@devries

devries commented Apr 29, 2017

@kunalq thank you. I've done something similar, but your recipe is a lot more comprehensive and robust.

@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!
