Auto-download files and collections from Internet Archive

rob-sve/iadownloader

iadownloader

Summary

iadownloader is a tool for automatically downloading files from the Internet Archive. Given an Internet Archive upload URL, it downloads all the files in that upload, either individually or as a single compressed archive, to a configurable download location (the current working directory by default). It can also download complete collections, all uploads by a given creator, and similar sets by parsing JSON or CSV files generated by Internet Archive’s advanced search tool.

Usage

iadownloader.py [-h] [-c] [-o OUTPUT_DIR] [-t THREADS] [-T] url

positional arguments:
  url                   URL or path to json/csv file

optional arguments:
  -h, --help            show this help message and exit
  -c, --compressed      Get the compressed archive download instead of the individual files
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to output directory
  -t THREADS, --threads THREADS
                        Number of simultaneous downloads (maximum of 10)
  -T, --torrent         Only download the torrent file if available
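
The help text above has the shape of output produced by Python’s argparse module. As a rough illustration only, and not the script’s actual source, a parser exposing the same options could look like the sketch below. The function name `build_parser` is hypothetical, and the default of 4 threads is taken from the usage notes further down.

```python
import argparse

DEFAULT_THREADS = 4  # default stated in this README


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="iadownloader.py")
    parser.add_argument("url", help="URL or path to json/csv file")
    parser.add_argument(
        "-c", "--compressed", action="store_true",
        help="Get the compressed archive download instead of the individual files",
    )
    parser.add_argument(
        "-o", "--output_dir", default=".",  # "." stands in for the current working directory
        help="Path to output directory",
    )
    parser.add_argument(
        "-t", "--threads", type=int, default=DEFAULT_THREADS,
        help="Number of simultaneous downloads (maximum of 10)",
    )
    parser.add_argument(
        "-T", "--torrent", action="store_true",
        help="Only download the torrent file if available",
    )
    return parser


if __name__ == "__main__":
    # Parse an example command line instead of sys.argv so the sketch runs standalone.
    example = ["-c", "-o", "/download/path", "https://archive.org/download/<url>"]
    print(build_parser().parse_args(example))
```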

The basic usage is to invoke iadownloader with a download URL:

python iadownloader.py https://archive.org/download/<url>

This downloads all the files in the upload to the directory the script was invoked from.

Optionally specify the download location:

python iadownloader.py -o /download/path https://archive.org/download/<url>

To download the compressed archive of the upload just add the ‘-c’ flag:

python iadownloader.py -c -o /download/path https://archive.org/download/<url>

You can also specify the number of threads (up to 10):

python iadownloader.py -t 8 -o /download/path https://archive.org/download/<url>

It defaults to 4 threads if not specified.
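
To illustrate what a capped number of simultaneous downloads can mean in practice, here is a minimal sketch built on the standard library’s `ThreadPoolExecutor`. It is not taken from iadownloader itself. The names `clamp_threads`, `download_all`, and `fetch` are hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 10     # maximum stated in the help text
DEFAULT_THREADS = 4  # default stated in this README


def clamp_threads(requested=None):
    """Return a worker count between 1 and MAX_THREADS (assumed behaviour)."""
    if requested is None:
        return DEFAULT_THREADS
    return max(1, min(requested, MAX_THREADS))


def download_all(urls, fetch, threads=None):
    """Call fetch(url) for every URL with at most `threads` calls in flight.

    `fetch` is any callable supplied by the caller; results are returned
    in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=clamp_threads(threads)) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # A stand-in fetch function so the sketch runs without network access.
    print(download_all(["file1", "file2", "file3"], str.upper, threads=8))
```

A thread pool suits this kind of work because downloads spend most of their time waiting on the network, so a handful of threads can keep several transfers going at once.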

Don’t confuse a “download URL” with an individual file URL. Individual files are trivially downloaded through your web browser. This tool simplifies downloading all the files included in an upload on Internet Archive. Even that can be done quite easily using the Web UI. Where iadownloader shines is its ability to download full collections automatically.

To download a whole collection, all files from a certain creator, etc., go to Internet Archive’s advanced search tool and follow these steps:

  1. Scroll down to “Advanced Search returning JSON, XML, and more”. In the “Query” field enter collection:<name of collection> for collections, creator:<name of creator> for creators, etc. In “Field to return” select “identifier” if not already selected. Select an appropriate “Number of results” depending on the collection.
  2. Choose either JSON format or CSV format. CSV format is a bit more convenient, since it prompts you to download the file immediately, while the JSON format opens a JavaScript page with embedded JSON data. If you choose CSV, save the .csv file somewhere convenient. If you choose JSON, save the page, making sure to use the .json extension rather than the suggested .js one.
  3. Run iadownloader.py like this:
    python iadownloader.py -o /download/path /path/to/csv-or-json-file
        

iadownloader will go through all the downloads of the collection and download them into the download path.
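
The following sketch shows one way the identifiers in such a file could be turned into download URLs. It is not iadownloader’s actual parsing code. It assumes the CSV has an `identifier` header column, that the JSON stores results under `response.docs`, and that the saved JSON page may be wrapped in a JavaScript callback. The function names are hypothetical.

```python
import csv
import json
from pathlib import Path

DOWNLOAD_BASE = "https://archive.org/download/"


def read_identifiers(path):
    """Return the item identifiers listed in a saved search result file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8-sig")  # tolerate a leading BOM
    if path.suffix.lower() == ".csv":
        # Assumption: the first row is a header containing "identifier".
        rows = csv.DictReader(text.splitlines())
        return [row["identifier"] for row in rows if row.get("identifier")]
    # Assumption: the JSON may be wrapped as callback({...}); keep only the
    # outermost object before decoding.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in {path}")
    data = json.loads(text[start:end + 1])
    # Assumption: results live under response -> docs.
    return [doc["identifier"] for doc in data["response"]["docs"] if "identifier" in doc]


def download_urls(path):
    """Map each identifier to its https://archive.org/download/ URL."""
    return [DOWNLOAD_BASE + identifier for identifier in read_identifiers(path)]
```

Each resulting URL has the same form as the download URLs used in the single-upload examples above.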

Requirements

iadownloader uses requests, lxml, and tqdm to do its magic. To make sure you have them, install from the included requirements.txt:

pip install -r requirements.txt

Of course, you need Python and pip as well.
