WaybackDownloader is a CLI tool for downloading the latest copy of every page of a website from the Wayback Machine.
Build from source, or download it from the releases page.
To use the tool in its simplest form, use the following command:
WaybackDownloader.exe "www.example.com" "./example"
In this command, www.example.com is the website to download, and ./example is the directory where the downloaded pages will be stored.
WaybackDownloader keeps a log of the webpages it has already downloaded. By default, the log is stored in a downloadHistory folder created in the current working directory. To specify a custom path, use the --historyLogDir option.
WaybackDownloader.exe <matchUrl> <outputDir> --historyLogDir ../../customHistoryLogFolder
Specify the match type using the -m or --matchType option. The default value is 'exact'. The other possible values are 'prefix', 'domain', and 'host'.
WaybackDownloader.exe <matchUrl> <outputDir> -m Prefix
The matchType option determines how <matchUrl> is matched against the URLs in the Wayback Machine. Using example.com as an example:
Match Type | Description | Command |
---|---|---|
exact (default) | Returns results matching exactly example.com | WaybackDownloader.exe example.com outputDir -m exact |
prefix | Returns all results under the path example.com | WaybackDownloader.exe example.com outputDir -m prefix |
host | Returns results from host example.com | WaybackDownloader.exe example.com outputDir -m host |
domain | Returns results from host example.com and all subhosts *.example.com | WaybackDownloader.exe example.com outputDir -m domain |
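The four match types in the table can be pictured with a small Python sketch. It only illustrates the semantics described above and is not the tool's implementation: the function name `url_matches` is hypothetical, and the sketch assumes that the scheme, trailing slashes, and query strings are ignored.

```python
from urllib.parse import urlsplit


def _split(url: str) -> tuple[str, str]:
    """Return (host, path) for a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def url_matches(match_type: str, match_url: str, candidate: str) -> bool:
    """Illustrative semantics only (assumption: scheme and trailing slash are ignored)."""
    want_host, want_path = _split(match_url)
    host, path = _split(candidate)
    kind = match_type.lower()
    if kind == "exact":
        return host == want_host and path == want_path
    if kind == "prefix":
        return host == want_host and path.startswith(want_path)
    if kind == "host":
        return host == want_host
    if kind == "domain":
        # The host itself, plus any subhost such as blog.example.com.
        return host == want_host or host.endswith("." + want_host)
    raise ValueError(f"unknown match type: {match_type}")
```

Under these assumptions, `url_matches("domain", "example.com", "http://blog.example.com/")` is true, while the same candidate fails the `host` check.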
Define a time range using the --from and --to options. Timestamps follow the Wayback Machine format yyyyMMddHHmmss. A timestamp may be truncated, but it must include at least the 4-digit year.
WaybackDownloader.exe <matchUrl> <outputDir> --from 20200101 --to 20201231
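If the timestamps are generated by a script, a small helper can produce and sanity-check them. The Python sketch below is not part of the tool; both function names are hypothetical, and the check assumes that any length from 4 to 14 digits is acceptable.

```python
import re
from datetime import datetime


def is_valid_wayback_timestamp(value: str) -> bool:
    """Assumption: a valid value is 4 to 14 ASCII digits (a prefix of yyyyMMddHHmmss)."""
    return re.fullmatch(r"[0-9]{4,14}", value) is not None


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as a full 14-digit yyyyMMddHHmmss timestamp."""
    return moment.strftime("%Y%m%d%H%M%S")
```

For example, `to_wayback_timestamp(datetime(2020, 12, 31, 23, 59, 59))` gives `20201231235959`, which could be passed to --to.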
Apply filters using the -f or --filters option. The default filters are 'statuscode:200' and 'mimetype:text/html'.
WaybackDownloader.exe <matchUrl> <outputDir> -f !statuscode:404 -f !statuscode:302
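The filter examples above share the shape `[!]field:value`, where a leading `!` negates the filter. The Python sketch below shows one way to read that shape. It is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, values are compared literally, and a record is assumed to pass only if every filter holds.

```python
def parse_filter(text: str) -> tuple[bool, str, str]:
    """Split '[!]field:value' into (negated, field, value)."""
    negated = text.startswith("!")
    body = text[1:] if negated else text
    field, sep, value = body.partition(":")
    if not sep or not field or not value:
        raise ValueError(f"expected [!]field:value, got {text!r}")
    return negated, field, value


def record_passes(record: dict[str, str], filters: list[str]) -> bool:
    """Assumption: all filters must hold, and values are compared literally."""
    for text in filters:
        negated, field, value = parse_filter(text)
        matched = record.get(field) == value
        # A plain filter must match; a negated filter must not match.
        if matched == negated:
            return False
    return True
```

With these assumptions, a record with status code 200 passes `!statuscode:404` and `!statuscode:302`, and fails `!statuscode:200`.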
Use the -p or --pageFilters option to apply page filters. A downloaded page is saved to disk only if it contains at least one of the words in this list.
WaybackDownloader.exe <matchUrl> <outputDir> -p keyword1 -p keyword2
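The "contains at least one keyword" rule can be sketched in a few lines of Python. This is an illustration only: `should_save` is a hypothetical name, and the case-insensitive substring comparison and the "no keywords means save everything" behavior are assumptions that the text above does not confirm.

```python
def should_save(page_text: str, keywords: list[str]) -> bool:
    """Keep the page if it contains at least one keyword.

    Assumptions: matching is a case-insensitive substring test, and an
    empty keyword list disables the check.
    """
    if not keywords:
        return True
    lowered = page_text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

With `-p keyword1 -p keyword2`, a page containing only keyword2 would still be saved, since one match is enough.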
Limit the number of pages processed using the --limitPages option. This is an absolute limit on the number of pages processed; two versions of the same page count twice.
WaybackDownloader.exe <matchUrl> <outputDir> --limitPages 100
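To make the counting rule concrete, the Python sketch below applies a limit of 2 to three captures. The data and the name `take_pages` are made up for illustration, and the order in which the tool processes captures is an assumption.

```python
from itertools import islice
from typing import Iterable


def take_pages(snapshots: Iterable[tuple[str, str]], limit: int) -> list[tuple[str, str]]:
    """Return at most `limit` snapshots, counting every capture separately."""
    return list(islice(snapshots, limit))


# Hypothetical (url, timestamp) captures.
captures = [
    ("example.com/", "20200101000000"),
    ("example.com/", "20200601000000"),  # second capture of the same page
    ("example.com/about", "20200301000000"),
]
selected = take_pages(captures, 2)
```

Here both selected entries are captures of example.com/, so the limit is used up before example.com/about is reached.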
Warning
Setting a high rate limit is not recommended, as it can lead to throttling or temporary blacklisting by the Wayback Machine and archive.org.
Set the rate limit for the number of pages to download per second using the -r or --rateLimit option. The default value is 5.
WaybackDownloader.exe <matchUrl> <outputDir> -r 10
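A pages-per-second limit amounts to a minimum spacing between requests. The Python sketch below shows one simple way to enforce such a spacing. It is a generic illustration, not the limiter the tool uses; the class name and the injectable `clock` and `sleep` parameters are assumptions made to keep the example testable.

```python
import time


class RateLimiter:
    """Space calls at least 1 / pages_per_second seconds apart."""

    def __init__(self, pages_per_second: float, clock=time.monotonic, sleep=time.sleep):
        if pages_per_second <= 0:
            raise ValueError("pages_per_second must be positive")
        self._interval = 1.0 / pages_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following request one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

At the default of 5 pages per second the spacing is 0.2 seconds; `-r 10` halves it to 0.1 seconds, which doubles the load on the server.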
Clear the history of previously downloaded pages using the --clearHistory option.
WaybackDownloader.exe <matchUrl> <outputDir> --clearHistory
Enable verbose logging using the -v or --verbose option.
WaybackDownloader.exe <matchUrl> <outputDir> -v
WaybackDownloader.exe http://example.com ./downloads -m Prefix --from 20200101 --to 20201231 -f !statuscode:404 -p keyword1 -p keyword2 --limitPages 100 -r 10
This command will download pages from ‘http://example.com’, save them to the ‘./downloads’ directory, match URLs that start with ‘http://example.com’, only download pages from the year 2020, exclude pages with a 404 status code, only save pages that contain ‘keyword1’ or ‘keyword2’, process a maximum of 100 pages, and download a maximum of 10 pages per second.
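When the tool is driven from a script, the same command can be assembled as an argument list. The Python sketch below uses only the options documented above; the function name `build_command`, its parameter names, and the default executable name are assumptions. Passing the arguments as a list, rather than as one shell string, also avoids quoting problems with characters such as `!`.

```python
from typing import List, Optional, Sequence


def build_command(
    match_url: str,
    output_dir: str,
    *,
    match_type: Optional[str] = None,
    from_ts: Optional[str] = None,
    to_ts: Optional[str] = None,
    filters: Sequence[str] = (),
    page_filters: Sequence[str] = (),
    limit_pages: Optional[int] = None,
    rate_limit: Optional[int] = None,
    executable: str = "WaybackDownloader.exe",  # assumption: adjust to your install path
) -> List[str]:
    """Assemble the argument list for a WaybackDownloader run."""
    args = [executable, match_url, output_dir]
    if match_type is not None:
        args += ["-m", match_type]
    if from_ts is not None:
        args += ["--from", from_ts]
    if to_ts is not None:
        args += ["--to", to_ts]
    for item in filters:
        args += ["-f", item]  # -f is repeated once per filter
    for word in page_filters:
        args += ["-p", word]  # -p is repeated once per keyword
    if limit_pages is not None:
        args += ["--limitPages", str(limit_pages)]
    if rate_limit is not None:
        args += ["-r", str(rate_limit)]
    return args
```

The resulting list can be handed to `subprocess.run(args, check=True)` from the standard library.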
Building from source requires the .NET 8.0 SDK or higher. With the SDK installed, run the following command:
dotnet build