
libcrawl is a framework for building web crawlers. It provides the core logic of a crawler, including the ability to fetch resources using libcurl, while leaving it to the application to supply queue and policy implementations.

The public interface to libcrawl can be found in libcrawl.h.

Crawl context

A crawl context is represented by a CRAWL * pointer, which is created and destroyed via crawl_create() and crawl_destroy(). The CRAWL structure itself is opaque; applications must use the provided accessor methods.
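As a minimal sketch, an application might create and tear down a context like this (assuming crawl_create() takes no arguments and returns NULL on failure; error handling is illustrative):

```c
#include <stdio.h>
#include <libcrawl.h>

int
main(void)
{
	CRAWL *crawl;

	/* Create a new crawl context (assumed to return NULL on failure) */
	crawl = crawl_create();
	if(!crawl)
	{
		fprintf(stderr, "failed to create crawl context\n");
		return 1;
	}
	/* ... configure callbacks, the cache, etc., here ... */

	/* Release the context and any resources it holds */
	crawl_destroy(crawl);
	return 0;
}
```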

Many of the properties of a crawl context are callbacks which an application should provide. An application does not have to provide implementations for all of the callbacks, only those which are relevant to it: for example, an application which doesn't need a URI policy handler need not specify a URI policy callback.

Crawl objects

A crawl object, represented by a CRAWLOBJ * pointer, describes a resource which can be, or has been, retrieved from some source. Properties of a crawl object are set and obtained using the crawl_obj_xxx() methods. At a minimum, a crawl object has an associated URI and a cache key, which is generated automatically by libcrawl (see Caches, below).
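As an illustrative sketch only, inspecting a crawl object might look like the following; the accessor names used here are guesses at the crawl_obj_xxx() set, and libcrawl.h should be consulted for the real ones:

```c
#include <stdio.h>
#include <libcrawl.h>

static void
dump_object(CRAWLOBJ *obj)
{
	/* crawl_obj_uristr() and crawl_obj_key() are assumed accessor names */
	printf("URI: %s\n", crawl_obj_uristr(obj));
	printf("cache key: %s\n", crawl_obj_key(obj));
}
```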

Caches

libcrawl uses a cache implementation to store and retrieve crawled resources. libcrawl represents metadata about a resource in the form of a JSON object, via Jansson. Cache implementations can be set by an application in two ways: either by looking up a built-in implementation by its URI scheme, or by providing a fully-populated CRAWLCACHEIMPL structure (the lifetime of which must exceed that of any crawl context that it's attached to).

An application can call crawl_set_cache_uri(), which examines the provided URI, attempts to match its scheme to a known implementation, associates that implementation with the context, and invokes crawl_set_cache_path() automatically. For example, if the cache URI is file:///var/cache/anansi, the file cache implementation is attached to the context and initialised, and the cache path is set to /var/cache/anansi.
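A sketch of configuring the file cache this way, assuming crawl_set_cache_uri() accepts the context and a URI string and returns zero on success (it may instead expect a parsed URI object):

```c
#include <stdio.h>
#include <libcrawl.h>

static int
configure_cache(CRAWL *crawl)
{
	/* Attach the built-in file cache and set the cache path in one call */
	if(crawl_set_cache_uri(crawl, "file:///var/cache/anansi"))
	{
		fprintf(stderr, "failed to configure cache\n");
		return -1;
	}
	return 0;
}
```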

Built-in cache implementations

file

The file cache stores resources and metadata on disk beneath the path specified.

s3

The s3 cache stores resources and metadata in an Amazon S3 bucket (or a service providing a compatible API). An application must also provide an access and secret key, via crawl_set_username() and crawl_set_password(), and optionally crawl_set_endpoint() if s3.amazonaws.com should not be used as the S3 API host.
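A sketch of configuring the s3 cache; the bucket name and credentials below are placeholders, and the exact signatures and return values of the setters are assumptions:

```c
#include <libcrawl.h>

static int
configure_s3_cache(CRAWL *crawl)
{
	/* "my-crawl-bucket" is a placeholder bucket name */
	if(crawl_set_cache_uri(crawl, "s3://my-crawl-bucket"))
	{
		return -1;
	}
	crawl_set_username(crawl, "ACCESS-KEY-ID");     /* S3 access key */
	crawl_set_password(crawl, "SECRET-ACCESS-KEY"); /* S3 secret key */
	/* Optional: use an S3-compatible service other than s3.amazonaws.com */
	crawl_set_endpoint(crawl, "s3.example.com");
	return 0;
}
```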

Queue

libcrawl tracks all resources via a specialised queue. An application must provide an implementation of crawl_next_cb, which is invoked whenever libcrawl is ready to fetch a resource; the queue implementation provided by the application should return the URI to fetch next and its current state (using the CRAWLSTATE enumeration).

libcrawl does not provide a method to add things to a queue: this is a private matter between an application's resource processing (if any) and the queue implementation. All an application needs to do to implement a queue from libcrawl's perspective is provide an answer to the question "what should be fetched next?"
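The sketch below shows what such a queue callback might look like. The crawl_next_cb signature shown here is an assumption (URIs are passed as plain strings, and COS_NEW is an assumed CRAWLSTATE constant for a not-yet-fetched resource); the real prototype is defined in libcrawl.h.

```c
#include <stddef.h>
#include <libcrawl.h>

/* A trivial in-memory queue owned by the application */
struct my_queue
{
	char **items;
	size_t count;
};

/* Hypothetical crawl_next_cb: answer "what should be fetched next?" */
static int
my_next(CRAWL *crawl, char **uri, CRAWLSTATE *state, void *userdata)
{
	struct my_queue *queue = (struct my_queue *) userdata;

	(void) crawl;
	if(!queue->count)
	{
		*uri = NULL; /* nothing left to fetch */
		return 0;
	}
	*uri = queue->items[--queue->count];
	*state = COS_NEW; /* assumed constant for a not-yet-fetched resource */
	return 0;
}
```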

See also Processing below.

Policies and checkpoints

There are three types of policy callback, each giving an application an opportunity to abort a fetch at a different point in the process:—

A URI policy callback is supplied the URI of the resource to fetch, and can opt to proceed, skip, or fail with an error.

A pre-fetch callback is similar, but its return value has no effect on the process. It is invoked later than the URI policy callback—immediately prior to libcurl being invoked to actually fetch the resource—and exists primarily as a debugging tool.

A checkpoint callback is invoked once all of the resource's HTTP headers (or equivalent) have been retrieved. It is supplied the crawl object (which includes all of the processed response headers) and the response status code, and returns a CRAWLSTATE constant. If the returned value is anything other than COS_ACCEPTED, the crawl object's state is set accordingly and the fetch is aborted as soon as possible. This allows an application to, for example, skip resources whose content type it cannot process, or to treat 4xx and 5xx HTTP responses as COS_FAILED. A checkpoint callback can even modify the HTTP status if it wishes, and the modified status will be cached along with the other metadata.
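A sketch of a checkpoint callback, assuming it receives the crawl object, a pointer to the response status code (so that it can be modified), and the application's userdata; the actual prototype is defined in libcrawl.h:

```c
#include <libcrawl.h>

static CRAWLSTATE
my_checkpoint(CRAWLOBJ *obj, int *status, void *userdata)
{
	(void) obj;
	(void) userdata;

	/* Treat client and server errors as failed fetches */
	if(*status >= 400)
	{
		return COS_FAILED;
	}
	/* Otherwise accept the resource and allow the fetch to continue */
	return COS_ACCEPTED;
}
```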

Fetching

libcrawl uses libcurl to perform network requests. An application can specify the User-Agent and Accept headers which are sent; at present, further customisation of requests is not possible, although the pre-fetch callback would be a good place for such customisation to happen.

Processing

libcrawl provides three further callbacks which are invoked after a fetch has taken place: crawl_updated_cb, crawl_unchanged_cb and crawl_failed_cb, each registered via a corresponding crawl_set_xxx() method. An application can use these to update the queue and to perform any further processing—for example, an "updated" handler might examine the fetched resource to locate and queue any links it contains.
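A sketch of an "updated" handler; the callback signature and the crawl_set_updated() setter name are assumptions based on the crawl_set_xxx() pattern described above:

```c
#include <stdio.h>
#include <libcrawl.h>

/* Hypothetical crawl_updated_cb: called after a resource has been fetched
   and found to have changed */
static int
my_updated(CRAWL *crawl, CRAWLOBJ *obj, void *userdata)
{
	(void) crawl;
	(void) obj;
	(void) userdata;
	/* e.g. parse the fetched payload for links and push them onto the queue */
	fprintf(stderr, "resource updated: examining for links\n");
	return 0;
}

/* ... elsewhere, after crawl_create():
 *     crawl_set_updated(crawl, my_updated);   (assumed setter name)
 */
```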

Application-private data

An application can store and retrieve a pointer that it provides, via crawl_set_userdata() and crawl_userdata(). libcrawl will never modify this pointer while the crawl context remains valid.
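For example (assuming crawl_set_userdata() takes a void pointer and crawl_userdata() returns it unchanged):

```c
#include <libcrawl.h>

/* Application-defined state shared with callbacks */
struct app_state
{
	int fetched;
};

static void
attach_state(CRAWL *crawl, struct app_state *state)
{
	crawl_set_userdata(crawl, state);
}

static struct app_state *
state_for(CRAWL *crawl)
{
	return (struct app_state *) crawl_userdata(crawl);
}
```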

Examples

There are a number of small examples in the util directory:—

  • crawl-fetch creates a crawl context configured with a cache URI, but does not provide a queue implementation: instead, it invokes crawl_fetch() to force retrieval of the resource, and dumps information about the crawl object.
  • crawl-locate attempts to locate a crawl object within a cache by its URI, and displays the result. It will not itself fetch any data.
  • crawl-mirror is a more complex example; it implements a minimal in-memory queue, which is seeded by the provided URI. libxml2 is used to parse any HTML files which are retrieved and push any links within them into the queue.

However, the most complete example is the Anansi crawler itself, which couples libcrawl with implementations of queues, policies and resource processing.