- About
- Development setup
- Production setup
- Environment variables
- Rest API and Queues board
- Working with scenarios
- Working with scenario schedulers
- Tutorial: Creating the first scenario
- Integrations
- License
Crawler is a standalone Node.js application built on top of Express.js, Crawlee, Puppeteer and BullMQ that allows you to crawl data from web pages by defining scenarios. Everything is controlled through the Rest API.
- Docker compose
- Make
$ git clone https://github.com/68publishers/crawler.git crawler
$ cd crawler
$ make init
HTTP Basic authorization is required for API access and administration, so we need to create a user to access the application.
$ docker exec -it crawler-app npm run user:create
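Once the user is created, you can verify that the application is running and that the credentials work, for example by requesting the Swagger UI (the development setup listens on port 3000 by default; replace the placeholders with your own credentials):
$ curl -u <USER>:<PASSWORD> -I http://localhost:3000/api-docs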
- Docker
- Postgres >=14.6
- Redis >=7
For production use, the following Redis settings are required (a sample configuration follows this list):
- Configure persistence with the Append-only-file strategy - https://redis.io/docs/management/persistence/#aof-advantages
- Set the Max memory policy to noeviction - https://redis.io/docs/reference/eviction/#eviction-policies
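A minimal configuration fragment covering both requirements might look like this (a sketch using standard redis.conf directives; tune the rest of the configuration to your environment):
# enable Append-only-file persistence
appendonly yes
# never evict keys under memory pressure - queued jobs must not be dropped
maxmemory-policy noeviction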
First, run the database migrations with the following command:
$ docker run \
--network <NETWORK> \
-e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
--entrypoint '/bin/sh' \
-it \
--rm \
68publishers/crawler:latest \
-c 'npm run migrations:up'
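For illustration, a concrete invocation could look like this (the network name, credentials and database name are placeholders, not defaults shipped with the image):
$ docker run \
--network crawler-network \
-e DB_URL=postgres://crawler:secret@postgres:5432/crawler \
--entrypoint '/bin/sh' \
-it \
--rm \
68publishers/crawler:latest \
-c 'npm run migrations:up'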
Then download the seccomp file, which is required to run Chrome:
$ curl -C - -O https://raw.githubusercontent.com/68publishers/crawler/main/.docker/chrome/chrome.json
And run the application:
$ docker run \
--init \
--network <NETWORK> \
-e APP_URL=<APPLICATION_URL> \
-e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
-e REDIS_HOST=<HOSTNAME> \
-e REDIS_PORT=<PORT> \
-e REDIS_AUTH=<PASSWORD> \
-p 3000:3000 \
--security-opt seccomp=$(pwd)/chrome.json \
-d \
--name 68publishers_crawler \
68publishers/crawler:latest
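After the container starts, you can check that it booted correctly by following its logs:
$ docker logs -f 68publishers_crawler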
HTTP Basic authorization is required for API access and administration, so we need to create a user to access the application.
$ docker exec -it 68publishers_crawler npm run user:create
Name | Required | Default | Description |
---|---|---|---|
APP_URL | yes | - | Full origin of the application, e.g. https://www.example.com. The variable is used to create links to screenshots etc. |
APP_PORT | no | 3000 | Port on which the application listens |
DB_URL | yes | - | Connection string to the Postgres database, e.g. postgres://root:root@localhost:5432/crawler |
REDIS_HOST | yes | - | Redis hostname |
REDIS_PORT | yes | - | Redis port |
REDIS_AUTH | no | - | Optional Redis password |
REDIS_DB | no | 0 | Redis database number |
WORKER_PROCESSES | no | 5 | Number of workers that process the queue of running scenarios |
CRAWLEE_STORAGE_DIR | no | ./var/crawlee | Directory where the crawler stores runtime data |
CHROME_PATH | no | /usr/bin/chromium-browser | Path to the Chromium executable |
SENTRY_DSN | no | - | Logging into Sentry is enabled if the variable is set |
SENTRY_SERVER_NAME | no | crawler | Server name that is passed to the Sentry logger |
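For convenience, the variables can also be kept in an env file and passed to docker run via --env-file instead of individual -e options. A sketch with placeholder values:
$ cat > crawler.env <<'EOF'
APP_URL=https://crawler.example.com
DB_URL=postgres://crawler:secret@postgres:5432/crawler
REDIS_HOST=redis
REDIS_PORT=6379
EOF
$ docker run --init --env-file crawler.env -p 3000:3000 \
--security-opt seccomp=$(pwd)/chrome.json \
-d --name 68publishers_crawler 68publishers/crawler:latest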
The Rest API specification (Swagger UI) is available at the /api-docs endpoint, usually http://localhost:3000/api-docs in the case of the development setup. You can try out all the endpoints there.
Alternatively, the specification can be viewed online.
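When calling the API from scripts, pass the same HTTP Basic credentials with every request, for example (the endpoint path below is hypothetical; check the Swagger UI for the actual routes):
# hypothetical endpoint - see /api-docs for the real paths
$ curl -u <USER>:<PASSWORD> http://localhost:3000/api/scenarios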
BullBoard is located at /admin/queues. Here you can see all the scenarios that are currently running or have already run.
@todo
@todo
@todo
- PHP Client for Crawler's API - 68publishers/crawler-client-php
The package is distributed under the MIT License. See LICENSE for more information.