
DBFS API 429 retries: Fix bug on computing the time until the next retry #326

Merged

Conversation

Collaborator

@bogdanghita-db bogdanghita-db commented Sep 22, 2020

In #319 we added 429 retries for the DBFS API.

Computing the time until the next retry as `retry_state.idle_for - time_for_last_retry` does not work as expected when more than one call to a function wrapped with `retry_429` happens (e.g. in the case of `put_file`). The problem is that `time_for_last_retry` is only initialised once per databricks-cli command, while it should be initialised on the first attempt of every call to a function wrapped with `retry_429`.

We fix this by initialising `time_for_last_retry` when `retry_state.attempt_number == 1`. This only works if the code is single-threaded, which is the case here.

Additional improvements:

  • Remove the unnecessary retry log, since we already have one.
  • Increase the maximum number of retries to 8, so that we get one additional retry with a long delay.
  • Use `click.echo` instead of `logger`, for consistency with the rest of the code.
  • Add a clarifying note on `retry_state.idle_for`.
  • Display seconds with only 2 decimal places.
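
A minimal sketch of the fixed callback, assuming the tenacity-based `retry_429` decorator from #319 (the names follow this PR, but the body is illustrative, not the exact merged code):

```python
import click

time_for_last_retry = 0


def before_sleep_on_429(retry_state):
    global time_for_last_retry
    # Re-initialise on the first attempt of every wrapped call, not once per
    # databricks-cli command (the bug fixed here). This is safe only because
    # the code is single-threaded.
    if retry_state.attempt_number == 1:
        time_for_last_retry = 0
    # Note: retry_state.idle_for is the *cumulative* time slept across all
    # attempts so far, so the next delay is the delta from the previous total.
    time_until_next_retry = retry_state.idle_for - time_for_last_retry
    time_for_last_retry = retry_state.idle_for
    click.echo('Received 429 REQUEST_LIMIT_EXCEEDED for attempt {}. '
               'Retrying in {:.2f} seconds.'.format(
                   retry_state.attempt_number, time_until_next_retry))
```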

```diff
@@ -113,7 +110,6 @@ def wrapped_function(*args, **kwargs):
             return func(*args, **kwargs)
         except HTTPError as e:
             if e.response.status_code == 429:
-                click.echo("Rate limit exceeded. Retrying with exponential backoff.")
```
Collaborator Author

@bogdanghita-db bogdanghita-db Sep 22, 2020


We already have the following logs; no need for this extra one.

```
Received 429 REQUEST_LIMIT_EXCEEDED for attempt 1. Retrying in 0.91 seconds.
Received 429 REQUEST_LIMIT_EXCEEDED for attempt 2. Retrying in 0.12 seconds.
Received 429 REQUEST_LIMIT_EXCEEDED for attempt 3. Retrying in 2.95 seconds.
Received 429 REQUEST_LIMIT_EXCEEDED for attempt 4. Retrying in 5.58 seconds.
Received 429 REQUEST_LIMIT_EXCEEDED for attempt 5. Retrying in 12.54 seconds.
```

```diff
@@ -40,6 +40,7 @@
     'file_size': 1
 }
 TEST_FILE_INFO = api.FileInfo(TEST_DBFS_PATH, False, 1)
+MAX_RETRY_429_ATTEMPTS = 8
```
Contributor


nit: For consistency with api.py, keep it as `MAX_RETRY_ATTEMPTS`.

```diff
@@ -94,13 +86,21 @@ class DbfsErrorCodes(object):
 def before_sleep_on_429(retry_state):
     global time_for_last_retry
```
Contributor


Is this still needed? This is using the global declaration. We can remove it, since you're now initializing locally every time.

Collaborator Author


Yes, we need it, since the function is called multiple times and `time_for_last_retry` needs to be incremented every time. Otherwise, on attempt 2 `time_for_last_retry` will be uninitialised and we'll get: `local variable 'time_for_last_retry' referenced before assignment`.
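
For illustration only (not code from this PR), here is a minimal, hypothetical reproduction of why dropping the `global` declaration fails on the second call:

```python
total_so_far = 0  # hypothetical module-level state, mirroring time_for_last_retry


def on_retry(attempt_number, idle_for):
    # Without a `global total_so_far` declaration here, the assignments below
    # make total_so_far local to the entire function body. On attempt 2 the
    # `if` branch is skipped, so the read of total_so_far raises:
    # UnboundLocalError: local variable 'total_so_far' referenced before assignment
    if attempt_number == 1:
        total_so_far = 0
    delta = idle_for - total_so_far
    total_so_far = idle_for
    return delta
```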

Contributor


Ah of course! The same logic still holds. Gotcha!

Contributor

@gotibhai gotibhai left a comment


Nice catch. Thanks for doing this.

@bogdanghita-db bogdanghita-db merged commit 37d51b1 into databricks:master Sep 23, 2020
bogdanghita-db added a commit to bogdanghita-db/databricks-cli that referenced this pull request Oct 16, 2020
bogdanghita-db added a commit that referenced this pull request Oct 20, 2020
…all databricks-cli operations (#343)

Revert the retry logic on 429 responses for DBFS API requests (#319, #326, #327) and reimplement it using [urllib3.util.Retry](https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html), for all requests made by `databricks-cli`.

* We perform a maximum of 6 retries with exponential backoff, resulting in the following delays between them: 0.5, 1, 2, 4, 8, 16 (seconds). The `Retry-After` header is respected if present.
* There is no message logged for each retry (I could not find a way to do it using `urllib3.util.Retry`).
* If all retry attempts fail, the original 429 HTTP response is forwarded by the retry utility.
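
A minimal sketch of this replacement approach, assuming a standard `requests` session (the exact parameters and mount points used by `databricks-cli` may differ):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=6,                 # maximum of 6 retries
    backoff_factor=0.5,      # exponential backoff: ~0.5, 1, 2, 4, 8, 16 seconds
    status_forcelist=[429],  # retry on 429 responses
    raise_on_status=False,   # forward the final 429 response instead of raising
)  # the Retry-After header is respected by default (respect_retry_after_header=True)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))
```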