wayback-keyword-search

IMPORTANT NOTICE: WAYBACK IS NOW RATE LIMITING. TOOL IS STILL RUNNING, BUT SLOWLY (IT CAN'T LEVERAGE PARALLELISM ANYMORE).

This tools downloads each page from the Wayback Machine for a specific input domain and saves each page as a local .txt file, so that you can later search for keyword matches within the saved files.

Downloading is done with the "download" file; and searching with the "search" file.

You can download pages saved in specific years (i.e.: 2020), or years and months (i.e.: 202001), or years and months and days (i.e: 20200101), just specifying the date format in the prompt. If you want to download everything in the 2000's or 19**'s regardless the saved date, just type "2" (for the pages saved past 2000) or "1" (for the pages saved in the XXth century) in the prompt, and the tool will save each page matching that criteria. So, if you want to download a website that has been archived across 1999 and 2000, you will need to run the tool twice.

If you need to download big websites (thousands of saved pages), it may require quite a long time now. Still better than nothing. I advice using a VPN with auto switch, changing IP address every 30 minutes to avoid blocking.

Sometimes, when the Wayback API is down, you cannot fetch the entire list of URLs it has archived (this happens quite often based on recent experience); so be patient and retry.

There is a Python3 version and a Go version.

[*] Python usage:

python3 download.py > specify your domain like: nytimes.com (no quotes!)

When the download is completed, a directory named as the domain will be saved in the local path.

So you can search for keyword matches within each file in the local dir using the "search.py" file:

python3 search.py > specify your keyword (no quotes!).

[*] Go usage: [FOLLOWING PULL REQUEST THE CODE HAS BEEN REFACTORED - remember that setting too many workers may get you blocked.]

make run-downloader

If you use the downloader, you can use the following arguments:

downloader --domain=YOUR_SITE --timeStamp=2023 --workers=10

where the parameters are:

domain - specify the target domain (only lowercase)
timeStamp - specify timestamp in the format:'yyyymmdd' (also: 'yyyy' > download only a specific year; 'yyyymm' > year and month; '2' or '1' > everything for the years past 20** or 19**
workers - specify the max workers (default=10)

and then:

make run-search
```bash

The best way to use the Go version is by running the compiled executables:

```bash
make builds

Notice that the Go version also features a cmd/download_channels/main.go version (thanks to Stephen Paulger for such improvement) which is a bit more efficient. Consider testing both!

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.vscode		.vscode
cmd		cmd
internal/engine		internal/engine
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
download.py		download.py
go.mod		go.mod
requirements.txt		requirements.txt
search.py		search.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

wayback-keyword-search

About

Releases

Packages

Languages

lorenzoromani1983/wayback-keyword-search

Folders and files

Latest commit

History

Repository files navigation

wayback-keyword-search

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages