A crawler that responds promptly to data changes on public rental housing platforms.
- Partial update of houses newly created since a given timestamp.
- Full status check on the entire dataset.
- Publish house data to rentea-db
- Docker environment (recommended) - Docker 18+ and docker-compose 1.18.0+, or
- Host environment - Python 3.7.
Build the development image and update Python packages:

```sh
docker-compose build crawler
```
- Initialize a virtualenv

  ```sh
  virtualenv -p <python3.7 bin path> .venv
  . .venv/bin/activate
  ```
- Install required packages

  ```sh
  pip install -r requirements.txt
  ```
This package currently supports only one crawler, `periodic591`, which performs partial updates from the 591 website.
In addition, it's recommended to enable a persistent job queue to control the memory usage of request data.
Supported parameters:

- `minuteago`: time range to look back, in minutes
- `target_cities`: comma-separated list of cities; use 台 instead of 臺
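Because 591 expects 台 rather than 臺 in city names, the `target_cities` value can be built programmatically. A minimal sketch (`make_target_cities` is a hypothetical helper, not part of this package):

```python
def make_target_cities(cities):
    """Join city names into the comma-separated format that the
    periodic591 spider's target_cities argument expects, replacing
    the variant character 臺 with 台."""
    return ','.join(city.replace('臺', '台') for city in cities)

print(make_target_cities(['臺南市', '屏東縣']))  # → 台南市,屏東縣
```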
To get new houses created on 591 in the last 15 minutes:

```sh
scrapy crawl periodic591 -a minuteago=15 -s JOBDIR=data/spider-1
```
To get new houses created in the last hour in 台南市 and 屏東縣:

```sh
scrapy crawl periodic591 -a minuteago=60 -a target_cities='台南市,屏東縣' -s JOBDIR=data/spider-1
```
To run the crawler in Docker, prepend `docker-compose run crawler` to the command:

```sh
docker-compose run crawler scrapy crawl periodic591 -a minuteago=15 -s JOBDIR=data/spider-1
```
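Since `periodic591` only fetches houses within the given time window, it is typically run on a schedule. A sketch of a crontab entry that triggers the 15-minute partial update (the repository path `/srv/rentea-crawler` is an assumption; adjust it to your checkout):

```
# Illustrative only: run the partial update every 15 minutes
*/15 * * * * cd /srv/rentea-crawler && docker-compose run crawler scrapy crawl periodic591 -a minuteago=15 -s JOBDIR=data/spider-1
```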
Help Wanted!
- Integrate with VS Code remote development. For now, please install the virtualenv on the host so VS Code autocompletion works.