Datasets: add azcopy download support #2043
Closed
This PR adds an `azcopy` function to `torchgeo.datasets.utils` that makes it easier to download datasets from Azure Blob Storage (such as Source Cooperative). It's basically just a wrapper around `subprocess.run`, but with a more useful error message if azcopy isn't installed. It can be used as follows:

The hardest part was testing. We don't want our tests to require internet access or download massive datasets, so we need to use local fake data to test. But we also can't get full test coverage unless we actually attempt to "download" the data, and `azcopy` doesn't support local <-> local file transfers like rsync does. My solution was to create a fake `azcopy` command that can copy local files and inject this first in the `PATH`. I don't know of a reliable way to test when this command isn't available, so we may need to change CI a bit.

Prerequisite for #1830
Closes #1887
Closes #1915
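
For reviewers who want a concrete picture of the approach described above, here is a minimal sketch of a `subprocess.run` wrapper with a friendlier error, plus a fake `azcopy` executable that copies local directories. It is not the code in this PR: the signature of `azcopy`, the `RuntimeError`, the helper `install_fake_azcopy`, and the script contents are all assumptions made for illustration, and the fake script assumes a POSIX system where a shebang line is honored.

```python
import shutil
import stat
import subprocess
import sys
from pathlib import Path


def azcopy(*args: str, **kwargs) -> subprocess.CompletedProcess:
    """Run the azcopy CLI with the given arguments (hypothetical signature)."""
    # Fail early with a readable message instead of a bare FileNotFoundError.
    if shutil.which("azcopy") is None:
        raise RuntimeError(
            "azcopy was not found on PATH; install it to download this dataset."
        )
    # Raise CalledProcessError on a non-zero exit code unless the caller opts out.
    kwargs.setdefault("check", True)
    return subprocess.run(("azcopy", *args), **kwargs)


# Stand-in executable for tests: it ignores the subcommand and any flags and
# simply copies a local directory tree. The real azcopy does much more.
FAKE_AZCOPY = """#!{python}
import shutil
import sys

# Expected form: azcopy <copy|sync> <source> <destination> [flags...]
_, _command, source, destination, *_flags = sys.argv
shutil.copytree(source, destination, dirs_exist_ok=True)
"""


def install_fake_azcopy(bin_dir: Path) -> Path:
    """Write an executable named 'azcopy' into bin_dir (hypothetical helper)."""
    script = bin_dir / "azcopy"
    script.write_text(FAKE_AZCOPY.format(python=sys.executable))
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    return script
```

In a test, one would create a temporary directory, call `install_fake_azcopy` on it, prepend that directory to `PATH` (for example with pytest's `monkeypatch.setenv`), and then call `azcopy("sync", source_dir, destination_dir)` on local fake data. Because both `shutil.which` and `subprocess.run` resolve the command through `PATH`, the fake is picked up ahead of any real installation.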
@Haimantika @darkblue-b Once this is reviewed and merged, I could use your help in porting our existing datasets to use this (full list in #1830). Unfortunately, many of the datasets seem to have completely changed their file hierarchy, so some of them may require more than a simple one-function update.