This is more of a toy project, so don't expect a full-fledged crawler.
To run it, the easiest way is to use the included Docker image:

```sh
docker build -t webcrawl .
docker run --rm -ti --name webcrawl -p 3000:3000 webcrawl -a 0.0.0.0:3000
```

The API should then be accessible on the host at http://localhost:3000.
Example requests:

```sh
curl -i -XPOST \
    -d '{"url": "http://some.host.example.com", "throttle": 100}' \
    http://localhost:3000/api/crawl

curl -i -XGET http://localhost:3000/api/domains

curl -i -XGET 'http://localhost:3000/api/results?id=http://some.host.example.com'

curl -i -XGET 'http://localhost:3000/api/results/count?id=http://some.host.example.com'
```
### GET /api/domains

### POST /api/crawl

Request body:
```json
{
    "url": "http://example.com",
    "throttle": 50
}
```

- `url`: the URL to be crawled
- `throttle`: the maximum number of concurrent requests
Response:

```json
{
    "id": "http://example.com"
}
```

Status codes:

- `400` if the payload is malformed or contains an invalid URL
- `409` if the crawl is already pending
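For programmatic use, submitting a crawl might look like the following Python sketch. It assumes the service is listening at `localhost:3000` as in the `docker run` above; `crawl_payload` and `start_crawl` are illustrative names, not part of the project.

```python
import json
import urllib.error
import urllib.request

API = "http://localhost:3000/api"  # assumed host/port from the docker run above


def crawl_payload(url, throttle=50):
    """Build the JSON body expected by POST /api/crawl."""
    return json.dumps({"url": url, "throttle": throttle})


def start_crawl(url, throttle=50):
    """Submit a crawl; return its id, or None if one is already pending (409)."""
    req = urllib.request.Request(
        API + "/crawl",
        data=crawl_payload(url, throttle).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["id"]
    except urllib.error.HTTPError as err:
        if err.code == 409:  # a crawl for this URL is already pending
            return None
        raise  # 400: malformed payload or invalid URL
```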
### GET /api/results?id={id}

Response: a JSON list of retrieved URLs.

Status codes:

- `202` if the crawl is pending and the result is not yet available
- `404` if the `id` is not present in the results cache
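Because a pending crawl answers with 202, a client has to poll until the results are ready. A minimal sketch, under the same assumed base URL (`results_url` and `wait_for_results` are illustrative names):

```python
import json
import time
import urllib.parse
import urllib.request

API = "http://localhost:3000/api"  # assumed host/port from the docker run above


def results_url(crawl_id):
    """Build the GET /api/results URL, percent-encoding the id."""
    return API + "/results?" + urllib.parse.urlencode({"id": crawl_id})


def wait_for_results(crawl_id, poll_interval=1.0, timeout=60.0):
    """Poll GET /api/results until the crawl finishes; 202 means still pending."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(results_url(crawl_id)) as resp:
            if resp.status == 200:  # done: the body is the JSON list of URLs
                return json.loads(resp.read())
        time.sleep(poll_interval)  # 202: crawl still pending, retry later
    raise TimeoutError(f"crawl {crawl_id!r} did not finish within {timeout}s")
```

Note that the `id` has to be percent-encoded, since it is itself a URL embedded in a query string.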
### GET /api/results/count?id={id}

Response:

```json
{
    "http://example.com": 123
}
```

Status codes:

- `202` if the crawl is pending and the result is not yet available
- `404` if the `id` is not present in the results cache