Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add max_concurrency support for azure #642

Merged
merged 4 commits into from
Aug 18, 2021
Merged

Add max_concurrency support for azure #642

merged 4 commits into from
Aug 18, 2021

Conversation

omBratteng
Copy link
Contributor

@omBratteng omBratteng commented Aug 16, 2021

Motivation

download_blob in azure.storage.blob supports setting a max concurrency for downloading blobs, which can improve download speeds. In a similar implementation of smart_open, we've noticed that downloading a 313 GiB file averaged 83 minutes download time, whilst smart_open at 122 minutes.

Some benchmarks (note: these also calculates the sha256 checksum of the blob downloaded)
buffer_size is set to 268435456

smart_open (max_concurrency=4) smart_open (max_concurrency=1) our implementation (max_concurrency=4) our implementation (max_concurrency=1)
49.8843s 69.1530s 49.7495s 69.4825s
49.3389s 69.1241s 49.0315s 68.9774s
49.3781s 66.0186s 48.4432s 69.1307s
avg: 49.5338s avg: 68.0986s avg: 49.0747s avg: 69.1969s

As one can see, setting max_concurrency to 4, gives a decrease of 27%, which I think is significant, and is a nice feature to add to the azure connector in smart_open

Tests

If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

Failing tests seems to not be related to azure, but s3

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 18, 2021

@omBratteng Could you please enable maintainer edits for this PR? I need to push some changes to get the tests working, but I'm getting permission denied.

@omBratteng
Copy link
Contributor Author

@mpenkov I can't see where to enable it after I've created the PR, but I've added you to our fork, so you should be able to edit it there directly.
If not, I can close and open a new PR.

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 18, 2021

I'd rather get things working in this PR. I'm still getting permission denied, though.

Can you please have a look here: https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

That has instructions for giving maintainer access to PRs.

@omBratteng omBratteng marked this pull request as draft August 18, 2021 06:48
@omBratteng omBratteng marked this pull request as ready for review August 18, 2021 06:48
@omBratteng
Copy link
Contributor Author

Sorry @mpenkov, I can't seem to have that option anywhere. If it's just to get the latest commits in from the develop branch, I can do a rebase.

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 18, 2021

No need to rebase, just merge the develop HEAD.

@omBratteng
Copy link
Contributor Author

Done

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@mpenkov mpenkov merged commit 2d7b024 into piskvorky:develop Aug 18, 2021
@omBratteng omBratteng deleted the azure-max-concurrency branch August 18, 2021 07:34
@mpenkov
Copy link
Collaborator

mpenkov commented Aug 18, 2021

Thank you for your contribution @omBratteng !

mpenkov added a commit that referenced this pull request Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants