
Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests #391

Open · wants to merge 3 commits into base `main`
Conversation

@nfx nfx commented Oct 7, 2023

Changes

The platform doesn't seem to send the `Retry-After` header correctly with this response; otherwise, the error would have been a `TimeoutError`. This PR adds retries for error responses carrying this message.

Fixes databrickslabs/ucx#401

Tests

  • `make test` run locally
  • `make fmt` applied
  • relevant integration tests applied
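The idea behind the change can be sketched as a message-based retry check: if the error text matches a known transient pattern, treat the error as retryable even though the usual `Retry-After` signal is missing. The function and pattern-list names below are assumptions for illustration, not the SDK's actual API.

```python
# Hypothetical sketch of string-matched retryability; names are illustrative,
# not the SDK's actual identifiers.
_TRANSIENT_MESSAGE_PATTERNS = [
    "exceeded the concurrent limit of",  # the SEV1 error this PR targets
]

def is_transient_message(message: str) -> bool:
    """Return True when the error message matches a known transient pattern."""
    return any(p in message for p in _TRANSIENT_MESSAGE_PATTERNS)
```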

…urrent limit of Y requests`

The platform doesn't seem to send HTTP 429 correctly with this response; otherwise, the error would have been a `TimeoutError`. This PR adds retries for error responses carrying this message.

Fixes databrickslabs/ucx#401

Signed-off-by: Serge Smertin <259697+nfx@users.noreply.github.com>
@nfx nfx added the chore anything that makes the code better label Oct 7, 2023
@nfx nfx requested a review from a team October 7, 2023 20:02
@nfx nfx enabled auto-merge October 7, 2023 20:02
@codecov-commenter commented Oct 7, 2023

Codecov Report

All modified lines are covered by tests ✅

| Files | Coverage | Δ |
| --- | --- | --- |
| databricks/sdk/core.py | 80.36% <ø> | (ø) |


@nfx nfx changed the title Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests (SEV1) Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests Oct 10, 2023
Signed-off-by: Serge Smertin <259697+nfx@users.noreply.github.com>
@pietern (Contributor) left a comment


This does return a 429.

I analyzed the code in the client and the following happens:

  • `_perform` calls `_make_nicer_error`
  • `_make_nicer_error` calls `_parse_retry_after` if the status code is 429
  • `_parse_retry_after` returns `None` if it doesn't have a `Retry-After` header

So, likely this response does have the right status, but doesn't include the `Retry-After` header, and then goes through the `_is_retryable` path.
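The flow described above can be condensed into a short sketch. The method names follow the comment; the bodies are simplified stand-ins, not the SDK's actual implementation.

```python
# Illustrative sketch of the 429 handling path described in this comment.
from typing import Optional

def _parse_retry_after(headers: dict) -> Optional[int]:
    # Returns None when the Retry-After header is missing.
    value = headers.get("Retry-After")
    return int(value) if value is not None else None

def handle_429(headers: dict) -> str:
    retry_after = _parse_retry_after(headers)
    if retry_after is not None:
        return f"sleep {retry_after}s, then retry"
    # No Retry-After header: the error falls through to _is_retryable().
    return "fall through to _is_retryable"
```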

@nfx (Contributor, Author) commented Oct 10, 2023

@pietern yes, with `Retry-After` from the platform we'd go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L35-L39

but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41

All of these transient errors eventually have to go away - https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102 - but there'll always be cases where something is not mapped out correctly.

It's a fix for our SEV1 issue - e.g. databrickslabs/ucx#401 sometimes happens 6 hours into running a workflow, which can be frustrating.
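The two retry paths being discussed can be sketched as a single loop: a server-provided `Retry-After` would drive the sleep directly, while the path this PR targets relies on an `is_retryable` callback to decide from the error itself. This is a simplified illustration under assumed names, not the actual `retries.py` code.

```python
# Condensed sketch of a retry loop with an is_retryable callback, in the
# spirit of retries.py linked above; names and timings are illustrative.
import time
from typing import Callable, Optional

def retried(fn: Callable,
            is_retryable: Callable[[BaseException], Optional[str]],
            timeout_seconds: float = 5.0,
            sleep_seconds: float = 0.01):
    """Keep calling fn() until it succeeds, the error stops being
    retryable, or the overall timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            return fn()
        except Exception as err:
            reason = is_retryable(err)
            if reason is None:
                raise  # not mapped as transient: surface immediately
            time.sleep(sleep_seconds)
    raise TimeoutError(f"timed out after {timeout_seconds}s")
```

A caller would pass a callback that string-matches the concurrent-limit message, which is exactly the kind of mapping the comment says will never be fully complete.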

@pietern (Contributor) commented Oct 10, 2023

Why not retry on all 429s? I get the other bits, but for this issue: if we simply retry on all 429s, we fix this particular error as well as the others that already produce the right status code but don't include the `Retry-After` header.

@nfx (Contributor, Author) commented Oct 10, 2023

> Why not retry on all 429s?

@pietern we can, sure, preferably after #376 is merged. It'll be straightforward to add `except TooManyRequests as err: if err.retry_after_seconds is not None: sleep_seconds = err.retry_after_seconds`. E.g. as a follow-up to merging #376, this added line can go away.

Without #376 it's going to be hairier than adding a string match 🤷

and it's a fix for our SEV1.
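The follow-up nfx describes could look roughly like the sketch below, assuming #376 introduces a `TooManyRequests` error carrying a `retry_after_seconds` attribute; both names are hypothetical here, pending that PR.

```python
# Hypothetical sketch of the post-#376 handling; the exception class and its
# attribute are assumptions, not the SDK's current API.
from typing import Optional

class TooManyRequests(Exception):
    def __init__(self, message: str, retry_after_seconds: Optional[int] = None):
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds

def sleep_seconds_for(err: Exception, default: int = 1) -> int:
    # Prefer the server-provided hint; otherwise fall back to the default.
    if isinstance(err, TooManyRequests) and err.retry_after_seconds is not None:
        return err.retry_after_seconds
    return default
```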

github-merge-queue bot pushed a commit that referenced this pull request Oct 13, 2023
## Changes
Change the HTTP client to always retry on 429 and 503, regardless of whether the `Retry-After` header is present. If absent, default to 1 second, as we do today.

Subsumes #391 and #396 while we're still discussing error architecture
in the SDK.

## Tests
Unit test to ensure that 429 and 503 work with and without Retry-After
header.
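The behavior this commit describes can be sketched as a single function: 429 and 503 are always retryable, and the delay defaults to 1 second when `Retry-After` is absent. The function name and shape are illustrative, not the SDK's actual code.

```python
# Sketch of always-retry-on-429/503 with a 1-second default, as described in
# the commit message above; names are illustrative.
from typing import Optional

def retry_delay(status_code: int, headers: dict,
                default_seconds: int = 1) -> Optional[int]:
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status_code not in (429, 503):
        return None
    value = headers.get("Retry-After")
    return int(value) if value is not None else default_seconds
```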
@mgyucht (Contributor) commented Oct 24, 2023

#402 ensures that the SDK will retry when the `Retry-After` header is missing, and I'm filing ES tickets internally with teams whose APIs are non-compliant and respond with other status codes when rate limits are reached. Would that be sufficient? I'd prefer not to add more edge cases to the SDK if possible.

auto-merge was automatically disabled November 23, 2023 15:14 (merge queue setting changed)

Labels: chore (anything that makes the code better)

Successfully merging this pull request may close these issues.

crawl_permissions fails: DatabricksError: Workspace X exceeded the concurrent limit of 20 requests.
4 participants