
Reimplement retry mechanism on 429 at the http request level and for all databricks-cli operations #343

Merged

Conversation

bogdanghita-db (Collaborator) commented Oct 16, 2020

Revert the retry logic on 429 responses for DBFS API requests (#319, #326, #327) and reimplement it using urllib3.util.Retry for all requests made by databricks-cli.

  • We perform a maximum of 6 retries with exponential backoff, resulting in the following delays between them: 0.5, 1, 2, 4, 8, 16 (seconds). The Retry-After header is respected if present.
  • No message is logged for each retry (I could not find a way to do this with urllib3.util.Retry).
  • If all retry attempts fail, the original 429 http response is forwarded by the retry utility.

I tested manually by enabling debug logging and validating that retries are performed. I don't see an easy way to add automated tests, and I don't think it's worth the effort given that we are only using standard functionality.

DEBUG:urllib3.connectionpool:https://adb-1805849421037944.4.dev.azuredatabricks.net:443 "GET /api/2.0/dbfs/list?path=dbfs%3A%2F HTTP/1.1" 429 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/api/2.0/dbfs/list?path=dbfs%3A%2F'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /api/2.0/dbfs/list?path=dbfs%3A%2F
DEBUG:urllib3.connectionpool:https://adb-1805849421037944.4.dev.azuredatabricks.net:443 "GET /api/2.0/dbfs/list?path=dbfs%3A%2F HTTP/1.1" 429 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/api/2.0/dbfs/list?path=dbfs%3A%2F'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /api/2.0/dbfs/list?path=dbfs%3A%2F
DEBUG:urllib3.connectionpool:https://adb-1805849421037944.4.dev.azuredatabricks.net:443 "GET /api/2.0/dbfs/list?path=dbfs%3A%2F HTTP/1.1" 429 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/api/2.0/dbfs/list?path=dbfs%3A%2F'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /api/2.0/dbfs/list?path=dbfs%3A%2F
DEBUG:urllib3.connectionpool:https://adb-1805849421037944.4.dev.azuredatabricks.net:443 "GET /api/2.0/dbfs/list?path=dbfs%3A%2F HTTP/1.1" 429 None
Error: {"error_code":"REQUEST_LIMIT_EXCEEDED","message":"Workspace 1805849421037944 exceeded the concurrent limit of 30 requests."}
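For illustration, here is a minimal, self-contained sketch (not the exact databricks-cli code) of how such a retry policy can be mounted on a requests session via urllib3.util.Retry; it uses the allowed_methods name from urllib3 >= 1.26, whereas the PR itself uses the older method_whitelist:

```python
# Sketch: retry every request on HTTP 429 with exponential backoff
# by mounting a urllib3 Retry policy on a requests session.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_policy = Retry(
    total=6,                  # at most 6 retries
    backoff_factor=1,         # exponential backoff between attempts
    status_forcelist=[429],   # retry only on 429, overriding urllib3 defaults
    allowed_methods=Retry.DEFAULT_ALLOWED_METHODS | {"POST"},  # also retry POST
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
session.mount("https://", adapter)
session.mount("http://", adapter)
```

Retry honors the Retry-After header by default (respect_retry_after_header=True), which matches the behavior described above.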

codecov-io commented Oct 16, 2020

Codecov Report

Merging #343 into master will decrease coverage by 0.36%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
- Coverage   84.90%   84.53%   -0.37%     
==========================================
  Files          39       39              
  Lines        2749     2703      -46     
==========================================
- Hits         2334     2285      -49     
- Misses        415      418       +3     
Impacted Files                       Coverage Δ
databricks_cli/dbfs/exceptions.py    100.00% <ø> (ø)
setup.py                             0.00% <ø> (ø)
databricks_cli/dbfs/api.py           63.88% <100.00%> (-7.27%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6677bb0...cc2848c.

nfx (Contributor) left a comment
Please check that the global 503 handling is not going to affect things much, and replace the deprecated constant name with the newer one.

def put_file(self, src_path, dbfs_path, overwrite, headers=None):
-    handle = self.create(dbfs_path, overwrite, headers=headers)['handle']
+    handle = self.client.create(dbfs_path.absolute_path, overwrite, headers=headers)['handle']
nfx (Contributor) commented:
more for SDK side of things - we should consider some day later the "context manager" for DBFS files:

with self.client.open(dbfs_path.absolute_path) as dbfs_file:
    # alternatively ".write(contents)" - that will do base64 stuff
    dbfs_file.add_block(base64encode(...), ...)
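The context manager above is only a suggestion for the SDK; as a purely hypothetical sketch (all names here are made up, and _StubClient stands in for the real DBFS client), it could look roughly like this:

```python
import base64
from contextlib import contextmanager

class _StubClient:
    """Stand-in for the real DBFS API client (hypothetical)."""
    def __init__(self):
        self.blocks = {}
    def create(self, path, overwrite):
        self.blocks[path] = []
        return {"handle": path}
    def add_block(self, handle, data):
        self.blocks[handle].append(data)
    def close(self, handle):
        pass

class DbfsFile:
    """Hypothetical file wrapper: .write() does the base64 encoding."""
    def __init__(self, client, handle):
        self._client = client
        self._handle = handle
    def write(self, contents: bytes):
        self._client.add_block(
            self._handle, base64.b64encode(contents).decode("ascii"))

@contextmanager
def dbfs_open(client, path, overwrite=True):
    # Acquire a write handle, hand out the wrapper, always close the handle.
    handle = client.create(path, overwrite)["handle"]
    try:
        yield DbfsFile(client, handle)
    finally:
        client.close(handle)

client = _StubClient()
with dbfs_open(client, "/tmp/demo.txt") as dbfs_file:
    dbfs_file.write(b"hello world")
```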

def delete(self, dbfs_path, recursive, headers=None):
    num_files_deleted = 0
    while True:
        try:
-            self.client.delete(dbfs_path.absolute_path,
-                               recursive=recursive, headers=headers)
+            self.client.delete(dbfs_path.absolute_path, recursive=recursive, headers=headers)
        except HTTPError as e:
            if e.response.status_code == 503:
bogdanghita-db (Collaborator, Author) commented Oct 19, 2020
Why do you say there will be global retries on 503? I am overriding the default status codes with status_forcelist=[429] in databricks_cli/sdk/api_client.py to ensure retries are made only for 429. These are only the defaults:

#: Default status codes to be used for ``status_forcelist``
    RETRY_AFTER_STATUS_CODES = frozenset([413, 429, 503])

nfx (Contributor):
then good

total=6,
backoff_factor=1,
status_forcelist=[429],
method_whitelist=set({'POST'}) | set(Retry.DEFAULT_METHOD_WHITELIST),
nfx (Contributor):

Please use DEFAULT_ALLOWED_METHODS, because DEFAULT_METHOD_WHITELIST is deprecated.

bogdanghita-db (Collaborator, Author):

The new constant is not available in urllib3 v1.25.10, which seems to be installed with databricks-cli by default. It was introduced in v1.26.0. Confirmed by trying to use it:

Error: AttributeError: type object 'Retry' has no attribute 'DEFAULT_ALLOWED_METHODS'
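Until the minimum urllib3 version is bumped, one way to bridge the rename is a small compatibility shim; this is a sketch under the assumption that one of the two constants always exists (the constructor argument was likewise renamed, from method_whitelist to allowed_methods, in urllib3 1.26):

```python
from urllib3.util.retry import Retry

# DEFAULT_ALLOWED_METHODS (urllib3 >= 1.26) replaced the deprecated
# DEFAULT_METHOD_WHITELIST (urllib3 < 1.26); resolve whichever exists
# at runtime instead of hard-coding one name.
DEFAULT_METHODS = getattr(
    Retry, "DEFAULT_ALLOWED_METHODS",
    getattr(Retry, "DEFAULT_METHOD_WHITELIST", frozenset()),
)
```

The resolved set can then be passed as allowed_methods=DEFAULT_METHODS | {'POST'} on newer urllib3, or method_whitelist=... on older versions.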

nfx (Contributor):

OK, then we need to schedule version updates, as this code would have to be updated down the line anyway.

nfx commented Oct 20, 2020

LGTM

@bogdanghita-db bogdanghita-db merged commit 945690f into databricks:master Oct 20, 2020