Database-based Scrapy components, similar to scrapy-redis but using a database as the queue.
- Distributed crawling/scraping

  You can start multiple spider instances that share a single DB queue. Best suited for broad multi-domain crawls.
- Distributed post-processing

  Scraped items get pushed into a DB queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue (see the worker sketch after this list).
- Scrapy plug-and-play components

  Scheduler + Duplication Filter, Base Spiders.
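For the post-processing side, here is a minimal sketch of a worker that polls the items queue with peewee (one of the project's dependencies). The table name, column names, and serialization format are assumptions for illustration only; adapt them to the schema scrapy-db actually creates.

```python
# Hypothetical post-processing worker. The table and column names below are
# assumptions -- adapt them to the schema scrapy-db actually creates.
import json
import time

from peewee import AutoField, Model, MySQLDatabase, TextField

# Same MySQL database that backs the shared queue (credentials are placeholders).
db = MySQLDatabase("scrapy_queue", host="localhost", port=3306,
                   user="user", password="password")


class ScrapedItem(Model):
    """Maps an assumed table that the crawl side fills with scraped items."""
    id = AutoField()
    data = TextField()  # serialized item, assumed here to be JSON

    class Meta:
        database = db
        table_name = "scraped_items"  # assumed table name


def process(item: dict) -> None:
    """Placeholder post-processing step; replace with your own logic."""
    print(item)


if __name__ == "__main__":
    db.connect()
    while True:
        rows = list(ScrapedItem.select().order_by(ScrapedItem.id).limit(100))
        if not rows:
            time.sleep(1)  # queue is empty, wait before polling again
            continue
        for row in rows:
            process(json.loads(row.data))
            row.delete_instance()  # pop the item so it is not processed again
```

You can run several such workers in parallel; in a real deployment you would add row locking (for example `SELECT ... FOR UPDATE` inside a transaction) so concurrent workers do not pick up the same item twice.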
- Python 3.7+
- peewee >= 3.16.0
- Scrapy >= 2.7.0
- pymysql >= 1.0.3
From pip
pip install scrapy-db
From GitHub
git clone https://github.com/libra146/scrapy-db.git
cd scrapy-db
python setup.py install
From poetry
poetry add scrapy-db
If you are running distributed crawling tasks, scrapy-db is a practical Scrapy component that can help you complete them more efficiently.
Clone the current project and run the example crawler in example-project to try it out.
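The example project contains the full configuration; as a rough sketch, enabling the components in your own settings.py would look something like the following. The module paths and setting names here are assumptions modeled on scrapy-redis, so check example-project for the exact values.

```python
# settings.py sketch -- module paths and setting names are assumptions
# modeled on scrapy-redis; check example-project for the exact values.

# Use the database-backed scheduler instead of Scrapy's default (assumed path).
SCHEDULER = "scrapy_db.scheduler.Scheduler"

# Deduplicate requests through the database instead of in memory (assumed path).
DUPEFILTER_CLASS = "scrapy_db.dupefilters.DBDupeFilter"

# Connection string for the shared queue database (assumed setting name);
# the mysql:// form is what peewee/pymysql understand.
DB_URL = "mysql://user:password@localhost:3306/scrapy_queue"

# Keep the queue and dupefilter between runs so crawls can be paused and
# resumed (assumed setting name, mirroring scrapy-redis' SCHEDULER_PERSIST).
SCHEDULER_PERSIST = True
```

With settings like these, every process started with `scrapy crawl <spider>` pulls requests from the same database queue, so you can launch as many spider instances as you need, on one machine or several.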
This repository is still under development and may be unstable.
Because I have a huge request pool and not enough memory for Redis to hold it, I turned to a database instead. I built this project with reference to scrapy-redis, and it works fine.