
Feature request: download multiple files from a collection in parallel #412

Open
LogicalKnee opened this issue Jun 9, 2021 · 4 comments


@LogicalKnee

The docs provide examples for using GNU Parallel to perform tasks simultaneously. However, this appears to be limited to operations at the item level. For the use case of downloading an entire item containing many large files, performing the downloads in parallel would provide a significant speed boost. While it is currently possible to achieve this with external tools (e.g. obtaining a file list with ia, then using curl/wget with parallel), it would be nice if ia supported this natively.

While this feature could be implemented with the existing requests library, I assume it would likely tie in with any effort to port to pycurl (#244, #247).

@JustAnotherArchivist
Contributor

You can already do that, though the docs don't give an example. ia list produces one filename per line, which you can feed to ia download in parallel with GNU Parallel, xargs, or whichever tool you prefer. For example:

ia list identifier | xargs -P 8 -n 1 ia download identifier

Downsides: you need to repeat the item identifier, and it may be very inefficient if the item has many small files.

If this were to be implemented directly in ia, I'd argue that aiohttp or similar is the least terrible route. Parallel requests aren't trivial with requests or PycURL, as they fundamentally lack parallelism and you need to use threads (though there are, of course, packages implementing that, at least for requests). I'm not sure that's worth the effort though.
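For a sense of what that route might look like, here is a minimal aiohttp sketch, not anything ia actually ships: it assumes the filenames come from ia list and that each file is reachable under the usual https://archive.org/download/&lt;identifier&gt;/&lt;filename&gt; URL; the identifier, file list, and download_all helper are all made up for illustration.

```python
import asyncio
import aiohttp

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

async def fetch(session, sem, filename):
    # Assumed URL layout; ia itself resolves file locations internally.
    url = f"https://archive.org/download/{IDENTIFIER}/{filename}"
    async with sem:                     # cap concurrent transfers
        async with session.get(url) as resp:
            resp.raise_for_status()
            with open(filename, "wb") as f:
                # Stream in chunks so large files aren't buffered in memory.
                async for chunk in resp.content.iter_chunked(64 * 1024):
                    f.write(chunk)

async def download_all(files, parallel=8):
    sem = asyncio.Semaphore(parallel)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, sem, f) for f in files))

asyncio.run(download_all(FILES))
```

The semaphore plays the role of a --parallel-max style cap, limiting how many transfers run at once.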

@LogicalKnee
Author

> You can already do that, though the docs don't give an example.

Yes, that's what I was getting at in the original post; there are already ways to download in parallel with a list generated by ia. The heart of the feature request was an integrated method to achieve the same thing, something akin to curl's --parallel (and --parallel-max) flags.

> Parallel requests aren't trivial with requests

Having a quick look around at how to achieve this with requests, I'd tend to agree. The general consensus seems to be "don't" or "use threading/multiprocessing", the latter requiring careful consideration to ensure the workers are handled correctly.
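For comparison, a rough sketch of the threaded approach with requests and concurrent.futures, under the same assumed archive.org URL pattern (none of these names come from ia itself):

```python
import concurrent.futures
import requests

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

def fetch(filename):
    # Assumed URL layout, as in the aiohttp sketch above.
    url = f"https://archive.org/download/{IDENTIFIER}/{filename}"
    # A plain requests.get per worker; sharing a requests.Session across
    # threads is exactly the part that needs careful handling.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, f) for f in FILES]
    for future in concurrent.futures.as_completed(futures):
        future.result()                 # re-raise any download error
```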

On the other hand, pycurl contains a CurlMulti class, which is a wrapper around libcurl's multi interface. The pycurl project provides sample usage of this functionality.
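A bare-bones sketch of that interface, loosely modelled on the retriever examples in the pycurl repository; the URL pattern and file list are again assumptions for illustration:

```python
import pycurl

IDENTIFIER = "nasa"                     # hypothetical example item
FILES = ["file1.jpg", "file2.jpg"]      # e.g. the output of `ia list nasa`

multi = pycurl.CurlMulti()
handles = []
for filename in FILES:
    c = pycurl.Curl()
    c.fp = open(filename, "wb")         # keep a reference for cleanup
    c.setopt(pycurl.URL, f"https://archive.org/download/{IDENTIFIER}/{filename}")
    c.setopt(pycurl.WRITEDATA, c.fp)
    c.setopt(pycurl.FOLLOWLOCATION, True)  # archive.org downloads redirect
    multi.add_handle(c)
    handles.append(c)

# Drive all transfers on a single thread until every one has finished.
num_active = len(handles)
while num_active:
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    if num_active:
        multi.select(1.0)               # wait for socket activity

for c in handles:
    multi.remove_handle(c)
    c.fp.close()
    c.close()
multi.close()
```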

@laptopsftw

This reminds me of youtube-dl, which can use external downloaders (like aria2c) to do the actual downloading.

I don't know how complicated that would be to implement, though.

@jjjake
Owner

jjjake commented Nov 1, 2021

Here's an example of how you could download files from an item concurrently as well:

ia list nasa | parallel 'ia download nasa {}'

I'll leave this open in case others have feedback, but I personally think this is best handled with external tools like parallel or xargs.
