libcrawl
libcrawl is a framework for implementing a web crawler. It implements the core logic of a crawler, including the ability to fetch resources using libcurl, while leaving it to the application to provide implementations of queues and policies.
The public interface to libcrawl can be found in libcrawl.h.
A crawl context is represented by a CRAWL * pointer, which is created and destroyed via crawl_create() and crawl_destroy(). The CRAWL structure itself is opaque; applications must use the provided accessor methods.
Many of the properties of a crawl context are callbacks which an application should provide. An application does not have to provide implementations for all of the callbacks, only those which are relevant to it. For example, if an application doesn't need a URI policy handler, then it should not specify a URI policy callback.
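A minimal sketch of creating and destroying a crawl context is shown below. The exact prototypes are declared in libcrawl.h; crawl_create() is assumed here to take no arguments and to return NULL on failure.

```c
#include <stdio.h>
#include "libcrawl.h"

int
main(void)
{
	CRAWL *crawl;

	crawl = crawl_create();
	if(!crawl)
	{
		fprintf(stderr, "failed to create crawl context\n");
		return 1;
	}
	/* set callbacks, the cache and other properties here, then run the crawl */
	crawl_destroy(crawl);
	return 0;
}
```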
A crawl object, represented by a CRAWLOBJ * pointer, represents a resource which can be or has been retrieved from some source. Properties of a crawl object are set and obtained using the crawl_obj_xxx() methods. At a minimum, a crawl object has a URI and a cache key, which is generated automatically by libcrawl (see Caches, below).
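For illustration, an application might inspect a crawl object along these lines. The accessor names crawl_obj_uristr() and crawl_obj_key() are assumed examples of the crawl_obj_xxx() family; consult libcrawl.h for the actual set.

```c
#include <stdio.h>
#include "libcrawl.h"

/* Print the URI and cache key of a crawl object; crawl_obj_uristr() and
 * crawl_obj_key() are assumed names following the crawl_obj_xxx() pattern */
static void
dump_object(CRAWLOBJ *obj)
{
	printf("URI: %s\n", crawl_obj_uristr(obj));
	printf("key: %s\n", crawl_obj_key(obj));
}
```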
libcrawl uses a cache implementation to store and retrieve crawled resources. libcrawl represents metadata about a resource in the form of a JSON object, via Jansson. Cache implementations can be set by an application in two ways: either by looking up a built-in implementation by its URI scheme, or by providing a fully-populated CRAWLCACHEIMPL structure (the lifetime of which must exceed that of any crawl context that it's attached to).
An application can call crawl_set_cache_uri(), which will examine the provided URI, attempt to match its scheme to a known implementation and associate that implementation with the context, and invoke crawl_set_cache_path() automatically. For example, if the cache URI was file:///var/cache/anansi, then the file cache implementation would be attached to the context and initialised, and the cache path set to /var/cache/anansi.
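A hedged sketch of configuring the cache follows; crawl_set_cache_uri() is assumed here to accept the cache URI as a string and to return zero on success (the real prototype, which may take a parsed URI object instead, is declared in libcrawl.h).

```c
#include "libcrawl.h"

/* Attach the built-in 'file' cache to a context by URI; the string
 * argument and zero-on-success return value are assumptions */
static int
configure_cache(CRAWL *crawl)
{
	if(crawl_set_cache_uri(crawl, "file:///var/cache/anansi"))
	{
		/* the scheme was not recognised or the cache could not be initialised */
		return -1;
	}
	return 0;
}
```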
The file cache stores resources and metadata on disk beneath the path specified.
The s3 cache stores resources and metadata in an Amazon S3 bucket (or a service providing a compatible API). An application must also provide an access key and secret key, via crawl_set_username() and crawl_set_password(), and optionally crawl_set_endpoint() if s3.amazonaws.com should not be used as the S3 API host.
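Configuring the s3 cache might look like the sketch below. The s3://BUCKET cache URI form is an assumption for illustration, the setters are assumed to take plain strings, and the placeholder values stand in for real credentials.

```c
#include "libcrawl.h"

/* Configure the 's3' cache with credentials and a non-AWS endpoint;
 * the URI form and string arguments are assumptions */
static int
configure_s3_cache(CRAWL *crawl)
{
	if(crawl_set_cache_uri(crawl, "s3://my-crawl-bucket"))
	{
		return -1;
	}
	crawl_set_username(crawl, "my-access-key");   /* S3 access key */
	crawl_set_password(crawl, "my-secret-key");   /* S3 secret key */
	crawl_set_endpoint(crawl, "s3.example.com");  /* optional: alternative API host */
	return 0;
}
```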
libcrawl tracks all resources via a specialised queue. An application must provide an implementation of crawl_next_cb, which is invoked whenever libcrawl is ready to fetch a resource. The queue implementation provided by the application should then return the URI and current status (using the CRAWLSTATE enumeration).
libcrawl does not provide a method to add things to a queue: this is a private matter between an application's resource processing (if any) and the queue implementation. All an application needs to do to implement a queue from libcrawl's perspective is provide an answer to the question "what should be fetched next?"
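A queue implementation can therefore be very small. The sketch below assumes a callback signature in which the next URI is returned as a newly-allocated string along with its state via out-parameters, and assumes a COS_NEW state value for a not-yet-fetched resource; the real crawl_next_cb type and CRAWLSTATE members are declared in libcrawl.h and may differ.

```c
#include <string.h>
#include "libcrawl.h"

/* A trivial single-URI queue: the signature, out-parameters and COS_NEW
 * constant are assumptions; consult libcrawl.h for the real definitions */
static int
next_cb(CRAWL *crawl, char **uri, CRAWLSTATE *state, void *userdata)
{
	const char **pending = (const char **) userdata;

	(void) crawl;
	if(!*pending)
	{
		/* nothing left to fetch */
		return 0;
	}
	*uri = strdup(*pending);
	*state = COS_NEW;
	*pending = NULL;
	return 1;
}
```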
See also Processing below.
There are three types of policy callback, which provide an application an opportunity to abort a fetch at different points in the process:—
A URI policy callback is supplied the URI of the resource to fetch, and can opt to proceed, skip, or fail with an error.
A pre-fetch callback is similar, but its return value has no effect on the process. It is invoked later than the URI policy callback—immediately prior to libcurl being invoked to actually fetch the resource—and exists primarily as a debugging tool.
A checkpoint callback is invoked once all of the resource's HTTP headers (or equivalent) have been retrieved, and returns a CRAWLSTATE constant. If the returned value is anything other than COS_ACCEPTED, the crawl object's state will be set accordingly and the fetch aborted as soon as possible. The checkpoint callback is supplied the crawl object (which includes all of the processed response headers) and the response status code. This allows an application to, for example, skip resources whose content type it cannot process, or to consider 4xx and 5xx HTTP responses to be COS_FAILED. A checkpoint callback can even modify the HTTP status if it wishes, and the modified status will be cached along with the other metadata.
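For example, a checkpoint callback which treats failed HTTP responses as COS_FAILED might look like the following sketch. The callback signature, including the modifiable status pointer, is an assumption; the real prototype is declared in libcrawl.h.

```c
#include "libcrawl.h"

/* Treat 4xx and 5xx responses as failures; the signature used here is an
 * assumption (the status is passed as a pointer because the callback is
 * permitted to modify it) */
static CRAWLSTATE
checkpoint_cb(CRAWL *crawl, CRAWLOBJ *obj, int *status, void *userdata)
{
	(void) crawl;
	(void) obj;
	(void) userdata;
	if(*status >= 400)
	{
		return COS_FAILED;
	}
	/* a content-type check via a crawl_obj_xxx() accessor could be added here */
	return COS_ACCEPTED;
}
```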
libcrawl uses libcurl to perform network requests. An application can specify the User-Agent and Accept headers which are sent; at present, further customisation of the requests is not possible, although the pre-fetch callback would be a good place for this to happen.
libcrawl provides three further callbacks which are invoked after the fetch has occurred: crawl_updated_cb, crawl_unchanged_cb, and crawl_failed_cb, along with their corresponding crawl_set_xxx() methods. These can be used by an application to update the queue, and to perform any further processing—for example, an "updated" implementation may trigger examination of the fetched resource to locate and queue any links.
An application can store and retrieve a pointer that it provides, via crawl_set_userdata() and crawl_userdata(). libcrawl will never modify this pointer while the crawl context remains valid.
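This is typically how application state (such as a queue) is shared with the callbacks; a minimal sketch, assuming the two functions simply take and return a void pointer, is shown below.

```c
#include "libcrawl.h"

/* Application-defined state shared between callbacks */
struct app_state
{
	int fetched;
};

static void
attach_state(CRAWL *crawl, struct app_state *state)
{
	crawl_set_userdata(crawl, state);
}

static struct app_state *
get_state(CRAWL *crawl)
{
	return (struct app_state *) crawl_userdata(crawl);
}
```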
There are a number of small examples in the util directory:—

- crawl-fetch creates a crawl context configured with a cache URI, but does not provide a queue implementation: instead, it invokes crawl_fetch() to force retrieval of the resource, and dumps information about the crawl object.
- crawl-locate attempts to locate a crawl object within a cache by its URI, and displays the result. It will not itself fetch any data.
- crawl-mirror is a more complex example; it implements a minimal in-memory queue, which is seeded by the provided URI. libxml2 is used to parse any HTML files which are retrieved and push any links within them into the queue.
However, the most complete example is the Anansi crawler itself, which couples libcrawl with implementations of queues, policies and resource processing.