A one-shot 591 crawler designed for flexibility and peace of mind.
Major features:
- Crawl any rental list, including 住家 (residential), 店面 (storefront), and 辦公 (office).
- Accept incomplete data, such as pages that return 404 Not Found.
- Aggregate all results and export the union of their attributes.
- Crawl the target website politely: at most 1 request per second (see the sketch below).
This crawler is NOT designed for:
- Parallel execution or anything related to efficiency
- Aggregation of data across multiple time periods
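The politeness rule above (at most 1 request per second) can be pictured as a simple client-side throttle. The sketch below is illustrative only, not this crawler's actual implementation; the requests dependency and the PoliteFetcher name are assumptions made for the example.

```python
import time

import requests  # assumed for illustration only; not necessarily a dependency of this project


class PoliteFetcher:
    """Illustrative throttle that keeps consecutive requests at least 1 second apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_request = 0.0

    def get(self, url, **kwargs):
        # Sleep just long enough so the request rate never exceeds 1/sec.
        wait = self.min_interval - (time.monotonic() - self.last_request)
        if wait > 0:
            time.sleep(wait)
        self.last_request = time.monotonic()
        return requests.get(url, **kwargs)
```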
Requirements:
- Docker environment - Docker 18+ and docker-compose 1.18.0+
Build the development image and update Python packages:
docker-compose build espresso
This tool provides a simple CLI, manage.py, which supports:
- list - list all existing crawler jobs
- crawl - create a new crawler job
- resume - resume a previously stopped job (HELP NEEDED)
- delete - delete the specified job and its data (HELP NEEDED)
- export - export data from the specified job
For detailed usage, please see the help message:
docker-compose run espresso python manage.py -h
Examples:
docker-compose run espresso python manage.py crawl 'https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1'
docker-compose run espresso python manage.py crawl 'https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1' -c '台北市,澎湖縣'
docker-compose run espresso python manage.py crawl 'https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1' --novip
docker-compose run espresso python manage.py list
docker-compose run espresso python manage.py export <job_id> <export_config_yaml>
See config/店面.yaml for an example config.
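The export builds on the union-of-attributes idea from the feature list above: different listing kinds expose different fields, and the exported table covers all of them, leaving missing values blank. A minimal sketch of that idea follows; it is not the tool's actual export logic, and the export_union function and sample rows are hypothetical.

```python
import csv


def export_union(rows, path):
    """Write dict rows with possibly different keys to CSV, using the union of all keys as columns."""
    columns = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)  # attributes a row lacks are left blank


# Hypothetical example: a 住家 listing and a 店面 listing share only some attributes.
export_union(
    [
        {"title": "信義區套房", "kind": "住家", "rent": 12000},
        {"title": "一樓店面", "kind": "店面", "rent": 56000, "deposit_months": 2},
    ],
    "export.csv",
)
```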