Datasets: add download_azure_container utility function #1887
Conversation
@microsoft-github-policy-service agree

@Haimantika are you still working on this? @darkblue-b may be willing to help finish this PR.

Hi, I have tried wrapping my head around it, but I couldn’t get past it. Apologies for not finishing it :(

Do you mind if @darkblue-b takes it from here? Can he push directly to your PR branch?

Yes please. I will follow the changes to learn :)
I tested that the TropicalCyclone dataset can be downloaded using:

```python
split = "train"
account_url = "https://radiantearth.blob.core.windows.net"
container_name = "mlhub"
name_starts_with = f"nasa-tropical-storm-challenge/{split}"
```

However, much of the dataset format has changed, so this dataset will require a large rewrite. I'm going to add tests and merge this just as a new utility function, then we can open separate PRs for each dataset.
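For reference, here is a hypothetical end-to-end call to the new utility with the parameters above. The import path and the forwarding of `account_url`/`container_name` to `ContainerClient` are assumptions based on the diff below, not the final API.

```python
# Hypothetical usage sketch; import path and argument forwarding are assumed.
from torchgeo.datasets.utils import download_azure_container

split = "train"
download_azure_container(
    root="data/nasa-tropical-storm-challenge",
    name_starts_with=f"nasa-tropical-storm-challenge/{split}",
    account_url="https://radiantearth.blob.core.windows.net",
    container_name="mlhub",
)
```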
```python
client = ContainerClient(*args, **kwargs)
for blob in client.list_blob_names(name_starts_with=name_starts_with):
```
Could wrap this with multiprocessing to speed things up, tqdm to add a progress bar, and various other improvements. There is also an asynchronous version of the API, although I don't think that will benefit us here.
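Purely for illustration, a minimal sketch of that idea using a thread pool (the work is I/O-bound, so threads rather than processes) plus tqdm. The helper names here are made up, not part of the PR.

```python
# Sketch only: parallel blob downloads with a progress bar.
# download_one/download_all are hypothetical names, not the PR's API.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from azure.storage.blob import ContainerClient
from tqdm import tqdm


def download_one(client: ContainerClient, root: str, name: str) -> None:
    """Download a single blob under *root*, creating parent directories."""
    path = os.path.join(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(client.download_blob(name).readall())


def download_all(client: ContainerClient, root: str, name_starts_with: str) -> None:
    """Download all matching blobs concurrently, with a progress bar."""
    names = list(client.list_blob_names(name_starts_with=name_starts_with))
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(download_one, client, root, name) for name in names]
        for future in tqdm(as_completed(futures), total=len(futures)):
            future.result()  # re-raise any download error
```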
TODO: compare download speeds to the CLI version and see if multiprocessing is required to make this comparable.
At first glance, this is orders of magnitude slower than `azcopy sync`. I'll try parallelizing it, but if it's still noticeably slower, I suggest we rely on the CLI tool instead. The downside is that it can't be pip installed, but being slow is a bigger sin.
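(For reference, the CLI equivalent being compared against is roughly `azcopy sync "https://radiantearth.blob.core.windows.net/mlhub/nasa-tropical-storm-challenge" <local-dir> --recursive`; exact flags may vary by azcopy version.)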
```python
def download_azure_container(
    root: str, name_starts_with: str, *args: Any, **kwargs: Any
) -> None:
    """Download a container from Azure blob storage.
```
I would just add some examples of how to actually instantiate `ContainerClient` here.
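Something along these lines, assuming the public radiantearth container used above (no credential needed; both constructors are part of azure-storage-blob):

```python
# Two common ways to create a ContainerClient for an anonymous (public) container.
from azure.storage.blob import ContainerClient

# 1. From an account URL plus container name:
client = ContainerClient(
    account_url="https://radiantearth.blob.core.windows.net",
    container_name="mlhub",
)

# 2. From a full container URL:
client = ContainerClient.from_container_url(
    "https://radiantearth.blob.core.windows.net/mlhub"
)
```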
Would it help if we made `account_url` and `container_name` explicit parameters instead of putting them in args/kwargs? They are required by `ContainerClient`, but I didn't explicitly list them to keep the code as simple as possible. Either way, this function currently doesn't show up in the docs, as it's only intended for use within TorchGeo, not in other programs. A sketch of the alternative signature follows.
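This is only a sketch of the suggestion being discussed, not the code in the PR:

```python
# Hypothetical variant with the two required ContainerClient arguments made explicit.
from typing import Any

from azure.storage.blob import ContainerClient


def download_azure_container(
    root: str,
    account_url: str,
    container_name: str,
    name_starts_with: str = "",
    **kwargs: Any,
) -> None:
    """Download a container from Azure blob storage."""
    client = ContainerClient(account_url, container_name, **kwargs)
    ...
```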
```python
# TODO: filehandle leak
f = open(path, "rb", buffering=0)
return f
```
I don't really know what to do about this. On macOS, we're only allowed to open 256 files at a time. I don't think any of our text data will exceed this, but I don't know how else to code this to mimic the actual behavior.
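One possible workaround, if the files are small enough to hold in memory (just a sketch, not something the PR does): read the bytes eagerly and hand back an in-memory buffer, so the OS descriptor is released immediately.

```python
# Sketch: return an in-memory copy so no OS file descriptor stays open.
import io


def open_file(path: str) -> io.BytesIO:
    """Read *path* fully and return a BytesIO, releasing the file handle."""
    with open(path, "rb") as f:
        return io.BytesIO(f.read())
```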
Unfortunately I think manual downloads with
Needed for #1830