
Feature request: download multiple files from a collection in parallel #412

Open
LogicalKnee opened this issue Jun 9, 2021 · 4 comments


@LogicalKnee

The docs provide examples for using GNU Parallel to perform tasks simultaneously. However, this appears to be limited to operations at the item level. For the use case of downloading an entire item containing many large files, performing the downloads in parallel would provide a significant speed boost. While it is currently possible to achieve this with external tools (e.g. obtaining a file list with ia, then using curl/wget with parallel), it would be nice if ia supported this natively.

While this feature could be implemented with the existing requests library, I assume it would likely tie in with any effort to port to pycurl (#244, #247).

@JustAnotherArchivist
Contributor

You can already do that, though the docs don't give an example. ia list produces one filename per line, which you can feed to ia download in parallel with GNU Parallel, xargs, or whichever tool you prefer. For example:

ia list identifier | xargs -P 8 -n 1 ia download identifier

Downsides: you need to repeat the item identifier, and it may be very inefficient if the item has many small files.

If this were to be implemented directly in ia, I'd argue that aiohttp or similar is the least terrible route. Parallel requests aren't trivial with requests or PycURL, as they fundamentally lack parallelism and you need to use threads (though there are, of course, packages implementing that, at least for requests). I'm not sure that's worth the effort though.
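For a sense of what that route might look like, here is a minimal aiohttp sketch, not anything ia actually ships: it assumes the filenames come from ia list and that each file is reachable under the usual https://archive.org/download/&lt;identifier&gt;/&lt;filename&gt; URL; the identifier, file list, and download_all helper are all made up for illustration.

```python
import asyncio
import aiohttp

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

async def fetch(session, sem, filename):
    # Assumed URL layout; ia itself resolves file locations internally.
    url = f"https://archive.org/download/{IDENTIFIER}/{filename}"
    async with sem:                     # cap concurrent transfers
        async with session.get(url) as resp:
            resp.raise_for_status()
            with open(filename, "wb") as f:
                # Stream in chunks so large files aren't buffered in memory.
                async for chunk in resp.content.iter_chunked(64 * 1024):
                    f.write(chunk)

async def download_all(files, parallel=8):
    sem = asyncio.Semaphore(parallel)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, sem, f) for f in files))

asyncio.run(download_all(FILES))
```

The semaphore plays the role of a --parallel-max style cap, limiting how many transfers run at once.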

@LogicalKnee
Author

> You can already do that, though the docs don't give an example.

Yes, that's what I was getting at in the original post; there are already ways to download in parallel with a list generated by ia. The heart of the feature request was an integrated method to achieve the same thing, something akin to curl's --parallel (and --parallel-max) flags.

> Parallel requests aren't trivial with requests

Having a quick look around at how to achieve this with requests, I'd tend to agree. The general consensus seems to be "don't" or "use threading/multiprocessing", the latter requiring careful consideration to ensure the workers are handled correctly.
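For comparison, a rough sketch of the threaded approach with requests and concurrent.futures, under the same assumed archive.org URL pattern (none of these names come from ia itself):

```python
import concurrent.futures
import requests

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

def fetch(filename):
    # Assumed URL layout, as in the aiohttp sketch above.
    url = f"https://archive.org/download/{IDENTIFIER}/{filename}"
    # A plain requests.get per worker; sharing a requests.Session across
    # threads is exactly the part that needs careful handling.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, f) for f in FILES]
    for future in concurrent.futures.as_completed(futures):
        future.result()                 # re-raise any download error
```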

On the other hand, pycurl contains a CurlMulti class, which is a wrapper around libcurl's multi interface. The pycurl project provides sample usage of this functionality.
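A bare-bones sketch of that interface, loosely modelled on the retriever examples in the pycurl repository; the URL pattern and file list are again assumptions for illustration:

```python
import pycurl

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

multi = pycurl.CurlMulti()
handles = []
for filename in FILES:
    c = pycurl.Curl()
    c.fp = open(filename, "wb")         # keep a reference for cleanup
    c.setopt(pycurl.URL, f"https://archive.org/download/{IDENTIFIER}/{filename}")
    c.setopt(pycurl.WRITEDATA, c.fp)
    c.setopt(pycurl.FOLLOWLOCATION, True)  # archive.org downloads redirect
    multi.add_handle(c)
    handles.append(c)

# Drive all transfers on a single thread until every one has finished.
num_active = len(handles)
while num_active:
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    if num_active:
        multi.select(1.0)               # wait for socket activity

for c in handles:
    multi.remove_handle(c)
    c.fp.close()
    c.close()
multi.close()
```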

@laptopsftw

This reminds me of youtube-dl, which can use external downloaders (like aria2c) to do the actual downloading.

I don't know how complicated that would be to implement, though.

@jjjake
Owner

jjjake commented Nov 1, 2021

Here's an example of how you could download files from an item concurrently as well:

ia list nasa | parallel 'ia download nasa {}'

I'll leave this open in case others have feedback, but I personally think this is best handled with external tools like parallel or xargs.
