A distributed, scalable and lightweight environment for deploying and running Scrapy spiders/projects hassle-free on commodity hardware. It is also compatible with the scrapyd `/schedule.json` and `/daemonstatus.json` endpoints.
$ pip install -U git+https://github.com/speakol-ads/scrapy-x.git
Let's assume that you have a project called `TestCrawler`:
- `cd` into `TestCrawler`
- run `scrapy x`
- that is all!
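Once `scrapy x` is running, one way to confirm that the server is up is to request the scrapyd-compatible `daemonstatus.json` endpoint. The following is a minimal sketch, assuming the server is reachable on `localhost` at port 6800 (the default port shown in the settings); the helper names `daemon_status_url` and `fetch_daemon_status` are hypothetical and not part of scrapy-x.

```python
import json
from urllib.request import urlopen


def daemon_status_url(host="localhost", port=6800):
    """Build the URL of the scrapyd-compatible status endpoint.

    Assumption: the server listens on the default port 6800.
    """
    return f"http://{host}:{port}/daemonstatus.json"


def fetch_daemon_status(host="localhost", port=6800, timeout=5):
    """Request daemonstatus.json and decode the JSON body.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urlopen(daemon_status_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires a running server):
# print(fetch_daemon_status())
```

The exact fields in the response are not documented here, so the sketch only decodes the JSON and leaves interpretation to the caller.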
It reads its configuration from your project's default `settings.py` file:
import os

# whether to enable debug mode or not
X_DEBUG = True
# the default queue name that the system will use
# actually it will be used as a prefix for its internal
# queues, currently there is only one queue called `X_QUEUE_NAME + '.BACKLOG'`
# which holds all jobs that should be crawled.
X_QUEUE_NAME = 'SCRAPY_X_QUEUE'
# the queue workers
# by default it uses the cpu cores count
# try to adjust it based on your resources & needs
X_QUEUE_WORKERS_COUNT = {
"page_debug": 2,
"latest_news": 4,
"kstream": 10,
"trending_news": 10,
}
# the webserver workers count
# the number of worker processes uvicorn should spawn
# defaults to the available cpu count
# try to adjust it based on your resources & needs
X_SERVER_WORKERS_COUNT = os.cpu_count()
# the port the http server should listen on
X_SERVER_LISTEN_PORT = 6800
# the host used by the http server to listen on
X_SERVER_LISTEN_HOST = '0.0.0.0'
# whether to enable access log or not
X_ENABLE_ACCESS_LOG = True
# redis host
X_REDIS_HOST = 'localhost'
# redis port
X_REDIS_PORT = 6379
# redis db
X_REDIS_DB = 0
# redis password
X_REDIS_PASSWORD = ''
# the maximum allowed wait time for a running task
# it will be killed after that time.
X_TASK_TIMEOUT = 25
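As a small illustration of how these values relate, the sketch below derives the backlog queue name from `X_QUEUE_NAME` (using the `'.BACKLOG'` suffix described in the comments above) and a base URL that a local client could use. The helper names `backlog_queue_name` and `client_base_url` are hypothetical, and mapping `0.0.0.0` to `localhost` is an assumption about a client running on the same machine.

```python
# Values copied from the example settings above.
X_QUEUE_NAME = "SCRAPY_X_QUEUE"
X_SERVER_LISTEN_HOST = "0.0.0.0"
X_SERVER_LISTEN_PORT = 6800


def backlog_queue_name(prefix):
    """Return the backlog queue name: the configured prefix plus '.BACKLOG'."""
    return prefix + ".BACKLOG"


def client_base_url(listen_host, port):
    """Return a base URL that a client can use to reach the HTTP server.

    '0.0.0.0' tells the server to listen on all interfaces; it is not an
    address to connect to, so a same-machine client uses 'localhost' instead
    (an assumption of this sketch).
    """
    host = "localhost" if listen_host == "0.0.0.0" else listen_host
    return f"http://{host}:{port}"


queue = backlog_queue_name(X_QUEUE_NAME)
base_url = client_base_url(X_SERVER_LISTEN_HOST, X_SERVER_LISTEN_PORT)
```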
In addition to the core scrapyd endpoints (`schedule.json`, `daemonstatus.json`), the following are available:
GET /
Returns information about the engine, such as the available spiders and the backlog queue length.
GET|POST /run/{spider_name}
Executes the spider named in `{spider_name}` and waits for it to return its result. P.S: any query parameters and JSON POST data are passed to the spider as `key=value` arguments.
GET|POST /enqueue/{spider_name}
Adds the spider named in `{spider_name}` to the backlog to be executed later. P.S: any query parameters and JSON POST data are used as spider arguments.
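The two spider endpoints can be called from any HTTP client. The sketch below builds the requests with the Python standard library only, assuming a server on `localhost:6800`. The function `build_spider_request`, the spider name `example_spider` and the `category` argument are hypothetical, chosen only for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:6800"  # assumption: local server, default port


def build_spider_request(action, spider_name, query=None, payload=None,
                         base_url=BASE_URL):
    """Build a request for /run/{spider_name} or /enqueue/{spider_name}.

    `query` is sent as URL query parameters. If `payload` is given, it is
    sent as a JSON body with POST; otherwise the request uses GET.
    """
    if action not in ("run", "enqueue"):
        raise ValueError("action must be 'run' or 'enqueue'")
    url = f"{base_url}/{action}/{quote(spider_name, safe='')}"
    if query:
        url += "?" + urlencode(query)
    if payload is None:
        return Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Run a spider and wait for its result, passing an argument as a query param.
run_req = build_spider_request("run", "example_spider",
                               query={"category": "tech"})

# Add a spider to the backlog, passing the same argument as JSON POST data.
enqueue_req = build_spider_request("enqueue", "example_spider",
                                   payload={"category": "tech"})

# To actually send a request (requires a running server):
# from urllib.request import urlopen
# with urlopen(run_req) as resp:
#     print(resp.read().decode("utf-8"))
```

Since `/run` waits for the spider to finish, a client may want a generous timeout for that call, while `/enqueue` only adds the job to the backlog.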
I'm Mohamed, a software engineer who enjoys writing code in his free time. I speak Python, PHP, Go, Rust and JS.
P.S: star the project if you liked it ^_^