Tools for interacting with the Cookiemonster API

Run a crawl

Init

Dependencies

Initialize the Python virtual environment and install the requirements. For example:

python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt

List of domains to crawl

You will need a list of domains to crawl. The crawl script presumes a Tranco list of domains, though other lists might work.

Run the get_latest_tranco.sh script, which uses curl to download the latest top 1M Tranco list of domains with subdomains and then unzips it.

Alternatively, you can download the latest (zipped) Tranco list with subdomains by hand from: https://tranco-list.eu/top-1m-incl-subdomains.csv.zip
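If you download the zip by hand, it can help to check its contents before starting a long crawl. The following is a minimal sketch, not part of this repository: it assumes the archive contains a member named top-1m.csv with one rank,domain pair per line (the usual Tranco layout), and the function name iter_tranco_domains is hypothetical.

```python
import csv
import io
import zipfile
from typing import Iterator, Tuple


def iter_tranco_domains(zip_source, member: str = "top-1m.csv") -> Iterator[Tuple[int, str]]:
    """Yield (rank, domain) pairs from a zipped Tranco CSV.

    zip_source may be a filesystem path or a binary file-like object.
    The member name "top-1m.csv" is an assumption about the archive layout.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv.reader receives text lines.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                # Skip blank or malformed rows rather than failing mid-list.
                if len(row) != 2:
                    continue
                rank, domain = row
                yield int(rank), domain.strip()
```

Printing the first few pairs, for instance with itertools.islice, is a quick way to confirm the download is intact.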

Using

Pass a CSV file of domains as input and a path for the output file. If the output file already exists, results are appended to it rather than overwriting it. You can also pass a --skip parameter to skip the first N rows of the input. Together, appending and skipping make it easy to restart an interrupted crawl.

The script will both print to stdout and write to the output file.

python3 crawl.py -i top-1m.csv -o sept14-crawl-results.txt
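The append-and-skip behaviour described above can be sketched as follows. This is an illustration only, not the actual code in crawl.py: run_crawl is a hypothetical name, and check_domain is a hypothetical callable standing in for whatever request the real script makes to the Cookiemonster API. The output line format (domain, tab, result) is likewise an assumption.

```python
import csv


def run_crawl(input_path, output_path, check_domain, skip=0):
    """Process domains from a CSV, skipping the first `skip` rows.

    Results are appended to output_path, so an interrupted run can be
    resumed without losing earlier results. Returns the number of rows
    processed in this run.
    """
    processed = 0
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "a", encoding="utf-8") as out:  # "a" = append
        for index, row in enumerate(csv.reader(src)):
            if index < skip or not row:
                continue
            # Tranco rows are "rank,domain"; taking the last column also
            # works for a plain one-column list of domains.
            domain = row[-1].strip()
            line = f"{domain}\t{check_domain(domain)}"
            print(line)              # echo to stdout
            out.write(line + "\n")   # and record in the output file
            out.flush()              # keep the file current if the run dies
            processed += 1
    return processed
```

To resume after an interruption under this sketch, count the lines already in the output file and pass that number as skip, keeping the same output path so new results are appended after the old ones.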
