Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests
#391
base: main
Conversation
Additional retries on `DatabricksError: Workspace X exceeded the concurrent limit of Y requests`

Platform doesn't seem to send HTTP 429 correctly with this response, otherwise the error would have been `TimeoutError`. This PR adds retries for error responses with this message.

Fixes databrickslabs/ucx#401

Signed-off-by: Serge Smertin <259697+nfx@users.noreply.github.com>
Codecov report: all modified lines are covered by tests ✅
(SEV1) DatabricksError: Workspace X exceeded the concurrent limit of Y requests
This does return a 429.

I analyzed the code in the client and the following happens:

- `_perform` calls `_make_nicer_error`
- `_make_nicer_error` calls `_parse_retry_after` if the status code is 429
- `_parse_retry_after` returns `None` if it doesn't have a `Retry-After` header

So, likely this response does have the right status, but doesn't include the `Retry-After` header and then goes through the `_is_retryable` path.
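For readers less familiar with the client internals, here is a minimal sketch of that flow. The helper names come from the comment above; the bodies, the `FakeResponse` type, and the exact messages are assumptions for illustration, not the SDK's actual implementation in `databricks/sdk/core.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FakeResponse:
    # Stand-in for the HTTP response object; only what the sketch needs.
    status_code: int
    headers: Dict[str, str] = field(default_factory=dict)

def _parse_retry_after(response: FakeResponse) -> Optional[int]:
    retry_after = response.headers.get('Retry-After')
    # None when the platform omits the header, even though the status code
    # itself already signals rate limiting.
    return int(retry_after) if retry_after is not None else None

def _make_nicer_error(response: FakeResponse, message: str) -> Exception:
    if response.status_code == 429:
        seconds = _parse_retry_after(response)
        if seconds is not None:
            # Header present: surfaced as a retryable TimeoutError.
            return TimeoutError(f'rate limited, retry in {seconds}s: {message}')
    # Header absent (or another status): a plain error whose retryability is
    # decided later by string-matching the message (the _is_retryable path).
    return Exception(message)

# A 429 without Retry-After does not become a TimeoutError:
err = _make_nicer_error(FakeResponse(429), 'Workspace X exceeded the concurrent limit of Y requests')
assert not isinstance(err, TimeoutError)
```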
@pietern yes, but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41. All of the transient errors eventually have to go away - https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102 - but there'll always be cases when something is not mapped out correctly. It's a fix for our SEV1 issue: e.g. databrickslabs/ucx#401 sometimes happens 6 hours into running a workflow, which can be frustrating.
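To make the "go through the retry loop" part concrete, here is a minimal sketch of a `retried`-style decorator that keeps calling a function until a deadline, as long as a callback classifies the raised error as retryable. The signature, defaults, and structure are assumptions for illustration; the SDK's actual loop lives at the `retries.py` lines linked above.

```python
import functools
import logging
import time
from typing import Callable, Optional

def retried(is_retryable: Callable[[BaseException], Optional[str]],
            timeout_seconds: float = 300,
            sleep_seconds: float = 1):
    """Sketch only: retry fn until it succeeds, the timeout passes, or the
    error is not classified as retryable by the callback."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            deadline = time.time() + timeout_seconds
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as err:
                    reason = is_retryable(err)
                    if reason is None or time.time() >= deadline:
                        # Not transient, or out of time: propagate the error.
                        raise
                    logging.debug('retrying: %s', reason)
                    time.sleep(sleep_seconds)
        return wrapper
    return decorator
```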
Why not retry on all 429s? I get the other bits, but for this issue, if we simply retry on all 429s, we fix this particular error, as well as the others that already produce the right status code but don't include the `Retry-After` header.
@pietern we can, sure, preferably after #376 is merged; it'll be straightforward to add then. Without #376 it's going to be hairier than adding a string match 🤷 and it's a fix for our SEV1.
## Changes

Change HTTP client to always retry on 429 and 503, regardless of whether the `Retry-After` header is present. If absent, default to 1 second, as we do today. Subsumes #391 and #396 while we're still discussing error architecture in the SDK.

## Tests

Unit test to ensure that 429 and 503 work with and without the `Retry-After` header.
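A small sketch of the behaviour #402 describes, not the actual patch: 429 and 503 are always treated as retryable, defaulting to a 1-second back-off when `Retry-After` is missing. The function name and return convention here are assumptions.

```python
from typing import Mapping, Optional

_DEFAULT_RETRY_AFTER_SECONDS = 1

def retry_after_seconds(status_code: int, headers: Mapping[str, str]) -> Optional[int]:
    # Only 429 (rate limited) and 503 (temporarily unavailable) force a retry.
    if status_code not in (429, 503):
        return None
    header = headers.get('Retry-After')
    if header is None:
        # Non-compliant response: retry anyway after a default 1-second pause.
        return _DEFAULT_RETRY_AFTER_SECONDS
    return int(header)

assert retry_after_seconds(429, {}) == 1
assert retry_after_seconds(503, {'Retry-After': '7'}) == 7
assert retry_after_seconds(500, {}) is None
```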
#402 ensures that the SDK will retry when the `Retry-After` header is missing, and I'm filing ES tickets internally with teams whose APIs are not compliant by responding with other status codes when rate limits are being reached. Would that be sufficient? I'd prefer not to add more edge cases into the SDK if possible.
Changes
Platform doesn't seem to send the `Retry-After` header correctly with this response, otherwise the error would have been `TimeoutError`. This PR adds retries for error responses with this message.

Fixes databrickslabs/ucx#401
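A hedged sketch of what "adds retries for error responses with this message" could look like: the message is matched by the retryability check. `_is_retryable` is the name referenced in the discussion above, but the list name, contents, and wiring here are illustrative assumptions rather than the actual diff.

```python
from typing import Optional

# Assumed names for illustration; not the PR's actual diff.
_TRANSIENT_ERROR_MESSAGES = [
    'exceeded the concurrent limit of',  # the message this PR starts retrying on
]

def _is_retryable(err: BaseException) -> Optional[str]:
    message = str(err)
    for fragment in _TRANSIENT_ERROR_MESSAGES:
        if fragment in message:
            # A non-None reason tells the retry loop to keep going.
            return f'matched transient error message: {fragment!r}'
    return None

assert _is_retryable(Exception('Workspace 123 exceeded the concurrent limit of 30 requests'))
assert _is_retryable(Exception('permission denied')) is None
```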
Tests
- `make test` run locally
- `make fmt` applied