Cloud spanner client does not seem to be retrying broken streams #3775
Our retry logic is in GAX, but leaving this open for @tseaver to confirm that it is applicable here.

This needs to be done at a higher level than GAX, since we send resume tokens along with the responses. So if a stream breaks, instead of retrying the whole stream, the library should just pass the resume token back in the request, which will cause the stream to be resumed from that point onwards. It is further complicated because Cloud Spanner might not send resume tokens with all responses, so the client library needs to buffer responses until it sees one with a resume token and only then yield them to the user. This allows it to resume safely if the stream breaks. What does GAX do for streaming calls? Does it retry the whole stream if it fails? What if the stream had yielded all but the last response before failing? In that case retrying the whole stream is very wasteful. If GAX does retry streaming calls, we need to disable that.

gax is pretty naive in what it does for anything other than simple RPCs. I'm writing a doc internally on this now. :)

@lukesneeringer @vkedia Does this need to be Beta blocking?

Until this is resolved, can we fix the documentation to indicate that users would need to resume the stream themselves by passing back the resume token, if available?
We could attempt to add enough information to the returned `StreamedResultSet` to allow it to retry, e.g. from `read`:

```python
retry = functools.partial(self.read, table, columns, keyset, index, limit)
return StreamedResultSet(iterator, retry=retry)
```

or from `execute_sql`:

```python
retry = functools.partial(self.execute_sql, params, param_types, query_mode)
return StreamedResultSet(iterator, retry=retry)
```

The only other way I can see to hide such errors from the user would be to add callback-based methods which took the arguments to be passed to the underlying `read` / `execute_sql`, plus a callback. E.g.:

```python
def handle_row(row):
    ...

snapshot.read_cb(table, keyset, ..., callback=handle_row)
snapshot.execute_sql(sql, ..., callback=handle_row)
```
I don't understand how the callback solution fixes this. Can you please elaborate? How would you retry in that case? Also note this is more complex because we do not guarantee that every PartialResultSet will have a resume token. So if you yield a Row to the user but the PRS that contained that Row did not have a resume_token, you cannot retry safely. What you need to do instead is store PRSs in a buffer until you see one with a resume_token, and only then yield them to the user. That way you guarantee that every time you yield something to the user, it is at a resumable boundary.
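A minimal sketch of that buffering rule, independent of the library's real classes (the generator name and its argument are illustrative assumptions, not the actual API):

```python
def _buffered_stream(partial_result_sets):
    """Yield PartialResultSets only at resumable boundaries.

    Items are held in a buffer until one carrying a resume_token
    arrives; only then is the whole buffer flushed to the caller, so a
    broken stream can always be resumed from the last token the caller
    has already consumed.
    """
    buffer = []
    for prs in partial_result_sets:
        buffer.append(prs)
        if prs.resume_token:          # not every PRS carries one
            for item in buffer:
                yield item
            del buffer[:]
    # Flush whatever remains once the stream ends normally.
    for item in buffer:
        yield item
```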
The curried partial is the pattern we'll use, e.g.:

```python
class _RetryingStreamIterator(object):

    def __init__(self, iterator, request):
        self._iter = iterator
        self._request = request
        self._resume_token = None

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = six.next(self._iter)
            if item.resume_token:
                self._resume_token = item.resume_token
            return item
        except grpc.RpcError as exc:
            if (exc.code() == grpc.StatusCode.UNAVAILABLE
                    and self._resume_token is not None):
                # Re-issue the request from the last seen resume token.
                self._iter = self._request(resume_token=self._resume_token)
                return six.next(self._iter)
            else:
                raise

    next = __next__  # Python 2 compatibility

...

# read/execute_sql:
request = functools.partial(
    api.streaming_read,
    self._session.name, table, columns, keyset.to_pb(),
    transaction=transaction, index=index, limit=limit,
    resume_token=resume_token, options=options)
iterator = _RetryingStreamIterator(request(), request)
if self._multi_use:
    return StreamedResultSet(iterator, source=self)
else:
    return StreamedResultSet(iterator)
```
I applied the …

Great, thanks @tseaver!

I left a comment on that PR. The documentation is not completely correct and, as I mention there, it is really tricky to tell the user what they need to do in case of a broken stream (which is why we handle that in the library). So I think it is better to fix this.

I thought about this a bit more and it is really hard for users to build their own retry logic around the current API; doing so will be error-prone. It is better to just fix it in the client code.

/cc @jonparrott who has been working on the retry design.

If it is release blocking, the minimum that this will push us back is two weeks, since @tseaver will be mostly out.
I see. I also realized now that this might be a breaking change as well. That is because once we implement retries in the client, there is no reason for us to expose the concept of resume_token to the users. It would just be confusing. So we should remove the resume_token arguments from the public API.
With #3819 in, we have the core retry functionality needed for this. I do not plan on adding a general-purpose iterator retry decorator to api.core at the moment.

Also, is there a spanner-specific deadline after which retries should abort?

@jonparrott Are we doing anything about "sensible defaults" for the …
@tseaver the defaults to the high-level … @lukesneeringer can help with the gapic config part.
One possibly valid approach in the near term is to use …
@jonparrott I missed seeing the … Also, I'm unsure how GAPIC config fits in with retries for the streaming-result-set iteration, where code in this library is responsible for tracking / passing back the resume_token.
Yeah, in general I wouldn't sweat that for your purposes here; sorry to mislead you on that (it's Monday..). Basically, two levels of retries will be involved here: the "iterator" retry and the "method" retry, e.g. calling … It's turtles all the way down. I don't think you need to worry about the gapic config for the iterator retry; you can just use the default retry or customize as you feel is appropriate.
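For the "method" level, a minimal sketch using today's `google.api_core.retry` (the thread refers to its predecessor, `api.core`); `open_stream` is an illustrative placeholder, not a real client method. The "iterator" level is the resume-token handling sketched elsewhere in this thread:

```python
from google.api_core import exceptions, retry


def open_stream(resume_token=b''):
    """Placeholder for the RPC call that opens the streaming read."""
    return iter(())  # pretend this opens the gRPC stream


# "Method" retry: if opening the stream itself fails with a transient
# error, retry the whole call with exponential backoff. This is the
# level that GAPIC config normally describes.
method_retry = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
    initial=1.0, maximum=32.0, deadline=60.0)

guarded_open = method_retry(open_stream)
```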
So I need to know which 'transient' errors may be propagated from the 'iterator' retry, and which the 'method' …

Per @blowmage: the Ruby implementation appears to only do the 'restart' for UNAVAILABLE.
Yes, you should retry only for UNAVAILABLE, with exponential backoff.
@jonparrott the "stock" bits in api.core would be used roughly like so:

```python
def consume_next(self):
    """Consume the next partial result set from the stream.

    Parse the result set into new/existing rows in :attr:`_rows`.
    """
    sleep_generator = retry.exponential_sleep_generator(
        initial=retry._DEFAULT_INITIAL_DELAY,
        maximum=retry._DEFAULT_MAXIMUM_DELAY)
    for sleep in sleep_generator:
        try:
            response = six.next(self._response_iterator)
        except exceptions.ServiceUnavailable:
            if self._resume_token in (None, b''):
                raise
            # Back off, then restart the stream from the resume token.
            time.sleep(sleep)
            self._response_iterator = self._retry(
                resume_token=self._resume_token)
        else:
            break
```
@tseaver that seems pretty reasonable. (It's one of the reasons I made exponential_sleep_generator public.) You should set a deadline of some sort, though.

Does the Python client allow users to specify a deadline? If it does, then the same deadline should be set, adjusting for the time elapsed so far.
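A minimal sketch of that kind of deadline adjustment, assuming only `google.api_core` and the standard library; `resume` and `overall_deadline` are illustrative names, not part of the client's surface:

```python
import time

from google.api_core import exceptions, retry


def next_with_deadline(response_iterator, resume, resume_token,
                       overall_deadline=30.0):
    """Fetch the next item, retrying broken streams within a deadline.

    ``resume`` is assumed to be a callable that re-opens the stream from
    ``resume_token``; each retry sleeps with exponential backoff and
    gives up once ``overall_deadline`` seconds have elapsed.
    """
    start = time.monotonic()
    sleeps = retry.exponential_sleep_generator(initial=1.0, maximum=32.0)
    while True:
        try:
            return next(response_iterator)
        except exceptions.ServiceUnavailable:
            if not resume_token:
                raise
            remaining = overall_deadline - (time.monotonic() - start)
            if remaining <= 0:
                raise
            # Never sleep past the remaining time budget.
            time.sleep(min(next(sleeps), remaining))
            response_iterator = resume(resume_token=resume_token)
```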
And of course I have to add my own deadline handling. BTW, why does …

No particular reason other than being functionally equivalent and having one single source of truth for time in tests (…)

How so?

As you pointed out above, I'm doing the low-level work myself. ISTM that the …

I'm still not sure what you mean? Just pass the user's deadline into retry_target?

I can't use …
Thanks for the PR. I think I see now: it seems like if you did this:

```python
retry_target(functools.partial(six.next, self._response_iterator))
```

it wouldn't let you catch the error and replace the underlying iterator. I could imagine a hack using the predicate and some getattr magic, but that sounds gross to me. Is there a minimal change you could make to …
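If one did want to force-fit a generic helper like retry_target, one workaround (sketched here as an assumption, not something the thread adopted) is to have the retried target itself swap in a resumed stream before re-raising, so the helper only ever re-invokes the same zero-argument callable:

```python
import six

from google.api_core import exceptions


def _make_target(state):
    """Return a zero-argument callable suitable for a generic retry helper.

    ``state`` is assumed to hold ``response_iterator``, ``resume`` (a
    callable that re-opens the stream) and ``resume_token``.
    """
    def _next_with_resume():
        try:
            return six.next(state.response_iterator)
        except exceptions.ServiceUnavailable:
            # Swap in a resumed stream ourselves, then re-raise so the
            # retry helper's predicate/backoff decides whether to call
            # this target again; the next attempt reads the new stream.
            if state.resume_token:
                state.response_iterator = state.resume(
                    resume_token=state.resume_token)
            raise
    return _next_with_resume
```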
Also, I've thought about a …
* Add '_restart_on_unavailable' iterator wrapper. Tracks the 'resume_token', and issues a restart after a 503.
* Strip knowledge of 'resume_token' from 'StreamedResultSet'.
* Remove 'resume_token' args from 'Snapshot' and 'Session' API surface: retry handling will be done behind the scenes.
* Use '_restart_on_unavailable' wrapper in 'SRS.{read,execute_sql}'.

Closes #3775.
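A sketch of what an iterator wrapper along those lines can look like, written under the assumptions discussed above (buffer until a resume_token, restart only after a 503/UNAVAILABLE); this is a paraphrase, not necessarily the code that was merged:

```python
from google.api_core.exceptions import ServiceUnavailable


def _restart_on_unavailable(restart):
    """Restart iteration after a 503 (ServiceUnavailable).

    ``restart`` is a callable that (re)opens the stream, optionally from
    a ``resume_token``. Items are buffered and only yielded once a
    resume_token is seen, so a restart never replays rows the caller has
    already consumed.
    """
    resume_token = b''
    buffered = []
    iterator = restart()
    while True:
        try:
            for item in iterator:
                buffered.append(item)
                if item.resume_token:
                    resume_token = item.resume_token
                    break  # flush the buffer below
        except ServiceUnavailable:
            del buffered[:]
            iterator = restart(resume_token=resume_token)
            continue
        if not buffered:
            break  # stream exhausted
        for item in buffered:
            yield item
        del buffered[:]
```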
Woo hoo! Thanks @tseaver!
Documentation for snapshot transactions mentions:

This is the correct thing to do, but the client does not actually seem to be doing this. I looked around a little bit in the code and I could not find any place where it retries the read with the resume token.

This is the place where the stream would throw an error if it broke:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/9d9b6c0708ae6553458ad6107e5ad8efb23762e8/spanner/google/cloud/spanner/streamed.py#L132

I don't see any exception handling happening here, and it looks like the error would just be propagated back to the user.