
Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests #391

Open · wants to merge 3 commits into base `main`
Conversation

@nfx nfx commented Oct 7, 2023

Changes

The platform doesn't seem to send the `Retry-After` header correctly with this response; otherwise, the error would have been a `TimeoutError`. This PR adds retries for error responses carrying this message.

Fixes databrickslabs/ucx#401

Tests

  • `make test` run locally
  • `make fmt` applied
  • relevant integration tests applied
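The idea behind the change can be sketched as a message-based retry check: if the error text matches a known transient pattern, treat the error as retryable even though the usual `Retry-After` signal is missing. The function and pattern-list names below are assumptions for illustration, not the SDK's actual API.

```python
# Hypothetical sketch of string-matched retryability; names are illustrative,
# not the SDK's actual identifiers.
_TRANSIENT_MESSAGE_PATTERNS = [
    "exceeded the concurrent limit of",  # the SEV1 error this PR targets
]

def is_transient_message(message: str) -> bool:
    """Return True when the error message matches a known transient pattern."""
    return any(p in message for p in _TRANSIENT_MESSAGE_PATTERNS)
```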

…urrent limit of Y requests`

The platform doesn't seem to send HTTP 429 correctly with this response; otherwise, the error would have been a `TimeoutError`. This PR adds retries for error responses carrying this message.

Fixes databrickslabs/ucx#401

Signed-off-by: Serge Smertin <259697+nfx@users.noreply.github.com>
@nfx nfx added the chore anything that makes the code better label Oct 7, 2023
@nfx nfx requested a review from a team October 7, 2023 20:02
@nfx nfx enabled auto-merge October 7, 2023 20:02
@codecov-commenter commented Oct 7, 2023

Codecov Report

All modified lines are covered by tests ✅

| Files | Coverage | Δ |
| --- | --- | --- |
| databricks/sdk/core.py | 80.36% <ø> | (ø) |


@nfx nfx changed the title Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests (SEV1) Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests Oct 10, 2023
Signed-off-by: Serge Smertin <259697+nfx@users.noreply.github.com>
@pietern (Contributor) left a comment


This does return a 429.

I analyzed the code in the client and the following happens:

  • `_perform` calls `_make_nicer_error`
  • `_make_nicer_error` calls `_parse_retry_after` if the status code is 429
  • `_parse_retry_after` returns `None` if it doesn't have a `Retry-After` header

So, likely this response does have the right status, but doesn't include the `Retry-After` header, and then goes through the `_is_retryable` path.
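The flow described above can be condensed into a short sketch. The method names follow the comment; the bodies are simplified stand-ins, not the SDK's actual implementation.

```python
# Illustrative sketch of the 429 handling path described in this comment.
from typing import Optional

def _parse_retry_after(headers: dict) -> Optional[int]:
    # Returns None when the Retry-After header is missing.
    value = headers.get("Retry-After")
    return int(value) if value is not None else None

def handle_429(headers: dict) -> str:
    retry_after = _parse_retry_after(headers)
    if retry_after is not None:
        return f"sleep {retry_after}s, then retry"
    # No Retry-After header: the error falls through to _is_retryable().
    return "fall through to _is_retryable"
```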

@nfx (Contributor, Author) commented Oct 10, 2023

@pietern yes, with `Retry-After` from the platform we'd go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L35-L39

but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41

All of these transient errors eventually have to go away - https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102 - but there'll always be cases where something is not mapped out correctly.

It's a fix for our SEV1 issue - e.g. databrickslabs/ucx#401 sometimes happens 6 hours into running a workflow, which can be frustrating.
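The two retry paths being discussed can be sketched as a single loop: a server-provided `Retry-After` would drive the sleep directly, while the path this PR targets relies on an `is_retryable` callback to decide from the error itself. This is a simplified illustration under assumed names, not the actual `retries.py` code.

```python
# Condensed sketch of a retry loop with an is_retryable callback, in the
# spirit of retries.py linked above; names and timings are illustrative.
import time
from typing import Callable, Optional

def retried(fn: Callable,
            is_retryable: Callable[[BaseException], Optional[str]],
            timeout_seconds: float = 5.0,
            sleep_seconds: float = 0.01):
    """Keep calling fn() until it succeeds, the error stops being
    retryable, or the overall timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            return fn()
        except Exception as err:
            reason = is_retryable(err)
            if reason is None:
                raise  # not mapped as transient: surface immediately
            time.sleep(sleep_seconds)
    raise TimeoutError(f"timed out after {timeout_seconds}s")
```

A caller would pass a callback that string-matches the concurrent-limit message, which is exactly the kind of mapping the comment says will never be fully complete.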

@pietern (Contributor) commented Oct 10, 2023

Why not retry on all 429s? I get the other bits, but for this issue: if we simply retry on all 429s, we fix this particular error as well as the others that already produce the right status code but don't include the `Retry-After` header.

@nfx (Contributor, Author) commented Oct 10, 2023

> Why not retry on all 429s?

@pietern we can, sure, preferably after #376 is merged. It'll be straightforward to add `except TooManyRequests as err: if err.retry_after_seconds is not None: sleep_seconds = err.retry_after_seconds`. E.g. as a follow-up to merging #376, this added line can go away.

Without #376 it's going to be hairier than adding a string match 🤷

and it's a fix for our SEV1.
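The follow-up nfx describes could look roughly like the sketch below, assuming #376 introduces a `TooManyRequests` error carrying a `retry_after_seconds` attribute; both names are hypothetical here, pending that PR.

```python
# Hypothetical sketch of the post-#376 handling; the exception class and its
# attribute are assumptions, not the SDK's current API.
from typing import Optional

class TooManyRequests(Exception):
    def __init__(self, message: str, retry_after_seconds: Optional[int] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds

def sleep_seconds_for(err: Exception, default: int = 1) -> int:
    # Prefer the server-provided hint; otherwise fall back to the default.
    if isinstance(err, TooManyRequests) and err.retry_after_seconds is not None:
        return err.retry_after_seconds
    return default
```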

github-merge-queue bot pushed a commit that referenced this pull request Oct 13, 2023
## Changes
Change the HTTP client to always retry on 429 and 503, regardless of whether the `Retry-After` header is present. If absent, default to 1 second, as we do today.

Subsumes #391 and #396 while we're still discussing error architecture
in the SDK.

## Tests
Unit test to ensure that 429 and 503 work with and without Retry-After
header.
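The behavior this commit describes can be sketched as a single function: 429 and 503 are always retryable, and the delay defaults to 1 second when `Retry-After` is absent. The function name and shape are illustrative, not the SDK's actual code.

```python
# Sketch of always-retry-on-429/503 with a 1-second default, as described in
# the commit message above; names are illustrative.
from typing import Optional

def retry_delay(status_code: int, headers: dict,
                default_seconds: int = 1) -> Optional[int]:
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status_code not in (429, 503):
        return None
    value = headers.get("Retry-After")
    return int(value) if value is not None else default_seconds
```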
@mgyucht (Contributor) commented Oct 24, 2023

#402 ensures that the SDK will retry when the `Retry-After` header is missing, and I'm filing ES tickets internally with teams whose APIs are non-compliant and respond with other status codes when rate limits are reached. Would that be sufficient? I'd prefer not to add more edge cases to the SDK if possible.

auto-merge was automatically disabled November 23, 2023 15:14 (merge queue setting changed)

Labels: chore (anything that makes the code better)

Successfully merging this pull request may close these issues.

crawl_permissions fails: DatabricksError: Workspace X exceeded the concurrent limit of 20 requests.
4 participants