Scrapyd-k8s is an application for deploying and running Scrapy spiders as either Docker instances or Kubernetes jobs. It aims to be compatible with scrapyd, but adapted to a container-based environment.
There are some important differences, though:
- Spiders are distributed as Docker images, not as Python eggs. This allows bundling spiders together with their dependencies, with all the benefits (and downsides) that brings.
- Each spider is run as a Docker instance or Kubernetes job, instead of a process. This gives good visibility within an already running cluster.
- Projects are specified in the configuration file, which means they cannot be modified at run-time. On the other hand, scrapyd-k8s can be restarted without affecting any running spiders.
At this moment, each spider job is directly linked to a Docker instance or Kubernetes job, and the daemon will retrieve its state by looking at those jobs. This makes it easy to inspect and adjust the spider queue even outside scrapyd-k8s.
No scheduling is happening (yet?), so all jobs created will be started immediately.
Typically this application will be run in a (Docker or Kubernetes) container. You will need to provide a configuration file; use one of the sample configuration files as a template (scrapyd_k8s.sample-k8s.conf or scrapyd_k8s.sample-docker.conf).
The next sections explain how to get this running on Docker, Kubernetes, or locally. Then read on for an example of how to use the API.
cp scrapyd_k8s.sample-docker.conf scrapyd_k8s.conf
docker build -t ghcr.io/q-m/scrapyd-k8s:latest .
docker run \
--rm \
-v ./scrapyd_k8s.conf:/opt/app/scrapyd_k8s.conf:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $HOME/.docker/config.json:/root/.docker/config.json:ro \
-u 0 \
-p 127.0.0.1:6800:6800 \
ghcr.io/q-m/scrapyd-k8s:latest
You'll be able to talk to localhost on port 6800.
Make sure to pull the spider image so it is known locally. In case of the default example spider:
docker pull ghcr.io/q-m/scrapyd-k8s-spider-example
Note that running like this in Docker is not really recommended for production, as it exposes the Docker socket and runs as root. It may be useful for trying things out, though.
- Adapt the spider configuration in kubernetes.yaml (scrapyd_k8s.conf in the configmap)
- Create the resources:
kubectl create -f kubernetes.yaml
You'll be able to talk to the scrapyd-k8s service on port 6800.
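To reach it from your own machine, one option is port-forwarding. A minimal sketch, assuming the scrapyd-k8s service lives in your current namespace:
# forward local port 6800 to the scrapyd-k8s service
kubectl port-forward service/scrapyd-k8s 6800:6800
After this, the API examples further down also work against localhost:6800.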
For development, or just a quick start, you can also run this application locally.
Requirements:
- Python 3
- Skopeo available in PATH (for remote repositories; see the example below)
- Either Docker or Kubernetes set up and accessible (scheduling will require Kubernetes 1.24+)
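Skopeo is used to query remote image repositories, for example to find the available versions (tags) of a spider image. Roughly, it does something like the following (shown here against the example spider image):
# list the tags of a remote image without pulling it
skopeo list-tags docker://ghcr.io/q-m/scrapyd-k8s-spider-example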
This will work with either Docker or Kubernetes (provided it is set up). For example, for Docker:
cp scrapyd_k8s.sample-docker.conf scrapyd_k8s.conf
python3 -m scrapyd_k8s
You'll be able to talk to localhost on port 6800.
For Docker, make sure to pull the spider image so it is known locally. In case of the default example spider:
docker pull ghcr.io/q-m/scrapyd-k8s-spider-example
With scrapyd-k8s running and set up, you can access it. Here we assume that it listens on localhost:6800 (for Kubernetes, you would use the service name scrapyd-k8s:6800 instead).
curl http://localhost:6800/daemonstatus.json
{"spiders":0,"status":"ok"}
curl http://localhost:6800/listprojects.json
{"projects":["example"],"status":"ok"}
curl 'http://localhost:6800/listversions.json?project=example'
{"status":"ok","versions":["latest"]}
curl 'http://localhost:6800/listspiders.json?project=example&_version=latest'
{"spiders":["quotes","static"],"status":"ok"}
curl -F project=example -F _version=latest -F spider=quotes http://localhost:6800/schedule.json
{"jobid":"e9b81fccbec211eeb3b109f30f136c01","status":"ok"}
curl http://localhost:6800/listjobs.json
{
"finished":[],
"pending":[],
"running":[{"id":"e9b81fccbec211eeb3b109f30f136c01","project":"example","spider":"quotes","state":"running", "start_time":"2012-09-12 10:14:03.594664", "end_time":null}],
"status":"ok"
}
To see what the spider has done, look at the container logs:
docker ps -a
CONTAINER ID   IMAGE                                           COMMAND                 CREATED   STATUS               NAMES
8c514a7ac917   ghcr.io/q-m/scrapyd-k8s-spider-example:latest   "scrapy crawl quotes"   42s ago   Exited (0) 30s ago   scrapyd_example_cb50c27cbec311eeb3b109f30f136c01
docker logs 8c514a7ac917
[scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: example)
...
[scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'author': 'Albert Einstein', 'tags': 'change'}
...
[scrapy.core.engine] INFO: Spider closed (finished)
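On Kubernetes, you can do roughly the same with kubectl; a sketch, where the exact job name depends on your deployment:
# find the Kubernetes job created for the spider run
kubectl get jobs
# show the logs of the pod belonging to that job
kubectl logs job/<job-name>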
- Spiders are distributed as Docker images.
- One can run scrapy crawl <spider> in the container to run a spider, without any additional setup (so set SCRAPY_SETTINGS_MODULE).
- Each Docker image has specific labels to indicate its project and spiders:
  - org.scrapy.project - the project name
  - org.scrapy.spiders - the spiders (those returned by scrapy list, comma-separated)
An example spider is available at q-m/scrapyd-k8s-example-spider, including a Github Action for building a container.
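To verify these labels on an image you have locally, something like the following should work (assuming the example spider image has been pulled):
# print the labels of the example spider image as JSON
docker inspect --format '{{ json .Config.Labels }}' ghcr.io/q-m/scrapyd-k8s-spider-example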
daemonstatus.json
(➽)
Lists scrapyd jobs by looking at Docker containers or Kubernetes jobs.
addversion.json
(➽)
Not supported, by design. If you want to add a version, add a Docker image to the repository.
schedule.json
(➽)
Schedules a new spider by creating a Docker container or Kubernetes job.
cancel.json
(➽)
Removes a scheduled spider, kills it when running, does nothing when finished.
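For example, to cancel the job scheduled earlier (a sketch, assuming the scrapyd-compatible project and job parameters):
curl -F project=example -F job=e9b81fccbec211eeb3b109f30f136c01 http://localhost:6800/cancel.json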
listprojects.json
(➽)
Lists projects from the configuration file.
listversions.json
(➽)
Lists versions from the project's Docker repository.
listspiders.json
(➽)
Lists spiders from the spider image's org.scrapy.spiders label.
listjobs.json
(➽)
Lists current jobs by looking at Docker containers or Kubernetes jobs.
Note that end_time is not yet supported for Docker.
delversion.json
(➽)
Not supported, by design. If you want to delete a version, remove the corresponding Docker image from the repository.
delproject.json
(➽)
Not supported, by design. If you want to delete a project, remove it from the configuration file.
This is done in the file scrapyd_k8s.conf; the options are explained in the Configuration Guide.
This software is distributed under the MIT license.