Home

Welcome to the wiki of the CHAOS Base Harvester!

This wiki collects getting-started information about a boilerplate framework for developing harvesters that bring metadata from external services into a CHAOS repository.

If you are not familiar with CHAOS, please visit chaos-community.org for more information.

When is this needed?

As mentioned, this boilerplate framework is useful if you are about to write an automated program that creates and maintains CHAOS objects from external objects or records exposed via a web service (most likely outside of CHAOS-land).

The typical use case is an organization that hosts or has access to a CHAOS repository, which it needs to fill and keep updated with metadata on objects from another service.

Requirements

The harvester was built and run in a Linux environment, but nothing should prevent it from running in a Windows environment instead. The only requirements on the environment are:

PHP 5.3.5 or later.
The cURL extension must be enabled in PHP.
The iconv extension must be enabled in PHP (it is by default).

These are essentially the same requirements that the CHAOS PHP SDK has.
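
The requirements above can be checked from a terminal before going any further. The following is a minimal sketch, not part of the harvester itself: check_php_requirements is a hypothetical helper name, and it only relies on the php command-line interpreter being on your PATH.

```shell
#!/bin/sh
# Check the PHP requirements listed above.
# check_php_requirements is a hypothetical helper, not part of the harvester.
check_php_requirements() {
    if ! command -v php >/dev/null 2>&1; then
        echo "php: not found on PATH"
        return 1
    fi
    # The PHP snippet is single-quoted, so the shell does not expand $ok or $ext.
    php -r '
        $ok = true;
        if (version_compare(PHP_VERSION, "5.3.5", "<")) {
            echo "PHP version too old: ", PHP_VERSION, "\n";
            $ok = false;
        }
        foreach (array("curl", "iconv") as $ext) {
            if (!extension_loaded($ext)) {
                echo "missing extension: $ext\n";
                $ok = false;
            }
        }
        if ($ok) {
            echo "PHP requirements satisfied\n";
        }
        exit($ok ? 0 : 1);
    '
}
```

Calling check_php_requirements prints one line per problem found and returns a non-zero exit status if anything is missing.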

Building your own harvester

Building a working harvester from the base harvester involves the following steps in sequence:

Create a configuration file. The configuration file is an XML file which validates against the schemas/ChaosHarvesterConfiguration.xsd XML schema.
Implement a class wrapping each external client that the harvester needs.
Make sure this wrapper class is reachable from one of the paths specified as IncludePath elements in the configuration file.

Defining a configuration

The configuration file defines how the harvester can operate, which CHAOS service to connect to and what values to use for the different objects, files and metadata produced when harvesting.

To get an overview of how to configure a harvester, please read the Configuration file page.
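
Since the configuration file has to validate against schemas/ChaosHarvesterConfiguration.xsd, it can be worth checking this before the first run. The sketch below assumes that xmllint (from libxml2) is installed and that you run it from the root of the repository; validate_config and the SCHEMA variable are hypothetical names used only here.

```shell
#!/bin/sh
# Validate a harvester configuration against the XML schema in the repository.
# validate_config and SCHEMA are hypothetical names used by this sketch only.
SCHEMA="${SCHEMA:-schemas/ChaosHarvesterConfiguration.xsd}"

validate_config() {
    config="$1"
    if [ ! -f "$config" ]; then
        echo "configuration file not found: $config" >&2
        return 2
    fi
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "xmllint is not installed; cannot validate $config" >&2
        return 3
    fi
    # --noout suppresses the echoed document; --schema validates against the XSD.
    xmllint --noout --schema "$SCHEMA" "$config"
}
```

For example, validate_config /path/to/configuration.xml returns a zero exit status only if the file exists and xmllint accepts it.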

Running the harvester

Once you've got a configuration file specified, it is time to boot up the harvester. The configuration file specifies a set of modes that the harvester supports. This could be harvesting a single external record, harvesting all of the records of the external service, etc.

Open a terminal and navigate to the Harvester-Base repository that you've just cloned onto your machine. Now invoke the php command-line interpreter with the following command:

php ./src/CHAOS/Harvester/ChaosHarvester.php --configuration="/path/to/configuration.xml" --mode=some-cool-mode

You might encounter a couple of PHP warnings telling you that the values of some tags should be fetched from environment variables which have not been set. This is because the configuration file supports fetching its string values from environment variables, in case you would like these values to be specified at runtime. This is useful when dealing with multiple CHAOS environments, or simply as a means of not storing login credentials in the configuration files.
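
A pre-flight check along these lines can catch unset variables before the harvester starts. It is only a sketch: require_env is a hypothetical helper, and CHAOS_EMAIL and CHAOS_PASSWORD are placeholder names. Use whichever variable names your own configuration file refers to.

```shell
#!/bin/sh
# require_env is a hypothetical helper: it reports every named variable
# that is unset or empty and returns 1 if any were missing.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (empty if unset).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "environment variable $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use (commented out so the sketch has no side effects).
# CHAOS_EMAIL and CHAOS_PASSWORD are placeholder variable names.
# export CHAOS_EMAIL="harvester@example.org"
# export CHAOS_PASSWORD="secret"
# require_env CHAOS_EMAIL CHAOS_PASSWORD && php ./src/CHAOS/Harvester/ChaosHarvester.php \
#     --configuration="/path/to/configuration.xml" --mode=some-cool-mode
```

Setting the variables in the calling shell, rather than in the configuration file, keeps credentials out of version control.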

Additionally, every chc:path element in the chc:ChaosHarvesterConfiguration/chc:IncludePaths element is evaluated, i.e. all of these paths must point to existing folders, given either as absolute paths or relative to the configuration file's path or to the value of the chc:ChaosHarvesterConfiguration/chc:BasePath element.

The --mode=some-cool-mode runtime option tells the harvester to start in the mode some-cool-mode, which must match the name of a mode in the configuration file. Additionally, the harvester might require a --reference=... option if the mode is a *ByReference mode (see below).
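
If you run the harvester often, a small wrapper can assemble these options for you. The sketch below is an assumption-laden convenience, not part of the repository: harvest, HARVESTER_DIR, HARVESTER_CONFIG and DRY_RUN are hypothetical names, and the mode names you pass must be ones defined in your own configuration file.

```shell
#!/bin/sh
# Hypothetical wrapper around the command shown above.
# HARVESTER_DIR: root of the Harvester-Base checkout (defaults to the current folder).
# HARVESTER_CONFIG: path to the configuration file.
HARVESTER_DIR="${HARVESTER_DIR:-.}"
HARVESTER_CONFIG="${HARVESTER_CONFIG:-/path/to/configuration.xml}"

harvest() {
    mode="$1"
    reference="${2:-}"
    # Rebuild the positional parameters as the argument list for php.
    set -- "$HARVESTER_DIR/src/CHAOS/Harvester/ChaosHarvester.php" \
        "--configuration=$HARVESTER_CONFIG" "--mode=$mode"
    # Only *ByReference modes need a reference, so the option is optional here.
    if [ -n "$reference" ]; then
        set -- "$@" "--reference=$reference"
    fi
    # DRY_RUN=1 prints the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo php "$@"
    else
        php "$@"
    fi
}
```

With this in place, harvest some-cool-mode starts a mode without a reference, and harvest some-cool-mode 12345 passes --reference=12345 as well (12345 being a made-up reference value).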

When developing your harvester, it might make sense to have more verbose output than you get from simply running the harvester. This is where the following runtime options make sense:

--debug This makes the different processors and modes print more information on what they are doing.
--debug-metadata This prints the XML result of any MetadataProcessor when it has finished processing an external object.

Setting up a periodic run of the harvester

One could think of a couple of ways to do this:

Running the src/CHAOS/Harvester/ChaosHarvester.php script as a cron job.
Running the src/CHAOS/Harvester/ChaosHarvester.php script in a continuous integration environment such as Jenkins CI. This is what is currently used at DR.
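
For the cron option, a wrapper script that records each run in a log file makes failures easier to spot. This is a sketch under assumptions: the script name harvest-cron.sh, the run_logged helper, the LOG_FILE location and the schedule in the comment are all hypothetical choices, not something the harvester prescribes.

```shell
#!/bin/sh
# harvest-cron.sh: hypothetical wrapper script for a periodic run.
# Example crontab entry (runs every night at 02:00):
#   0 2 * * * /path/to/harvest-cron.sh
LOG_FILE="${LOG_FILE:-/tmp/chaos-harvester.log}"

run_logged() {
    # Runs the given command, appending its output and exit status to LOG_FILE.
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] starting: $*" >> "$LOG_FILE"
    status=0
    "$@" >> "$LOG_FILE" 2>&1 || status=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] finished with exit status $status" >> "$LOG_FILE"
    return "$status"
}

# In a real script you would end with something like:
# run_logged php /path/to/Harvester-Base/src/CHAOS/Harvester/ChaosHarvester.php \
#     --configuration="/path/to/configuration.xml" --mode=some-cool-mode
```

Using absolute paths matters here, since cron jobs usually start in a different working directory than your interactive shell.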

What's inside?

UML Class diagram of the system entities

Entities

Modes

The modes of a harvester are the different ways in which the harvester can process the external objects. Six different types of modes have been identified from the previous version of the harvester:

SingleByReference

This mode-type describes a mode that can fetch a single external object by its external reference (id, URL, etc.).

SingleByPosition

This mode-type describes a mode that can fetch a single external object by its position on an external list of all available external objects.

SetByPositionInterval

This mode-type describes a mode that can fetch a range of external objects by an interval of position values on an external list of all available external objects.

SetByReferenceInterval

This mode-type describes a mode that can fetch a range of external objects by an interval of reference values (ids, URLs, etc.).

SetByReference

This mode-type describes a mode that can fetch a set of external objects from a reference to this external set (id, URL, etc.).

All

This mode-type describes a mode that can fetch all external objects exposed by the service.

Processors

A processor is what is invoked on each of the external objects. Three types of processors have been identified.

ObjectProcessor

The outcome of an object processor is a CHAOS object of a specific type.

MetadataProcessor

The outcome of a metadata processor is XML metadata that is valid with respect to a specific schema.

FileProcessor

The outcome of a file processor is file references.

PreProcessor

This is a processor that is called before another processor - it can be seen as a type of extension to an existing processor, manipulating the external object before it is sent to that particular processor. This can be used to correct systematic errors from a web service or the like.

Processor filters

A processor filter is basically a simple piece of code that is called just before an external object is passed to a processor. If any of the filters complains, the external object isn't processed.

Two types of filters exist:

A filter based on a class defined in the include path, i.e. it can be implemented as a simple PHP class extending the abstract Filter class.
An embedded filter which is a small snippet of code, embedded directly in the configuration file. This is better for simpler filtering tasks.

External clients

An external client is usually an extension of a regular PHP client class, which implements the IExternalClient interface. A harvester can utilise any number of external clients.
