Skip to content
Kræn Hansen edited this page Jun 10, 2013 · 21 revisions

Introducing the CHAOS Base Harvester

Welcome to wiki of the CHAOS Base Harvester!

This is the home of get-started-easily information related to a boilerplate framework, useful when developing metadata harvesters of external services into a CHAOS repository.

If you have no idea on what CHAOS is, please visit chaos-community.org for more information.

If you would like to know more about what's underneath the hood of this beast - please have a look at the What's inside? WIKI page.

When is this needed?

As mentioned, this boilerplate framework is useful if you are about to create some automated program, creating and maintaining CHAOS objects from external objects/records exposed via a web service (most likely outside of CHAOS-land).

The typical use case is an organization which hosts or has access to a CHAOS repository, that they need to fill and keep updated with metadata on objects from another service.

UML Class diagram of the system entities

Requirements

The harvester was build and run on a Linux environment, but there should be no limitations to running it in a windows environment instead. The only requirements on the environment is

  • PHP 5.3.5+ is required.
  • The cURL plugin must be enabled in PHP.
  • The iconv plugun must be enabled in PHP (it is by default).

These are essentially the same requirements that the CHAOS PHP SDK has.

Key terminology

Language is important. Throughout these wiki pages and when talking about metadata harvesting in general, some terms are used to describe concepts and the behaviour of key activities.

Service

The term service is widely used to describe a computer program running on a remote machine, which provides a set of actions to read, write, update or delete data on the remote machine. It is often also called a Web Service or an API (Application Programming Interface).

A common feature across services are, that they are made for machines to interact with machines. They often support the same features as a regular HTTP website for humans will do, but in a much simpler way, without the graphical user interface and therefore easier for machines to interact with.

The types of services which are often provided are ex. WSDL / SOAP services, RESTful web services, OAI-PMH services.

Records / Objects / Items / Assets / Posts / Elements / Content

These are all names which are used inconsistently for the same basic concept: The atomic entity of which we are harvesting. The Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH) call this a record, the CHAOS system calls this an object, Flickr would call these photos, photosets or galleries. The important thing is that the name of these are not of importance as long as we can define a way to understand them, see the term Home#translation--mapping for this.

Filtering

Not all content is relevant for everyone. More specifically not all records are relevant for every external consumer of a service. This is why we often need to perform a filtering on the records consumed when harvesting. This filtering can happen at none, one or both sides of the bridge between two services.

Translation / Mapping

Having metadata on records in different structures is like having information stored in different human languages. Some might understand one of them independent of another, but they have no obvious connection.

Building your own harvester

Building a working harvester from the base harvester involves the following steps in sequence:

  1. Create a configuration file. The configuration file is a an XML file which validates against the schemas/ChaosHarvesterConfiguration.xsd XML schema.
  2. Implement a class wrapping any external client, important for the harvester.
  3. Make sure this wrapped class is reachable from one of the paths specified as IncludePath elements in the configuration file.

Defining a configuration

The configuration file defines how the harvester can operate, which CHAOS service to connect to and what values to use for the different objects, files and metadata produced when harvesting.

To get an overview of how to configure a harvester, please read the Configuration file page.

Running the harvester

Once you've got a configuration file specified, it is time to boot up the harvester. The configuration file specifies a set of modes that the harvester supports. This could be harvesting a single external record or harvesting the all of the records of the external service, etc.

Pop up a terminal, navigate to the Harvester-Base repository that you've just cloned onto your machine. Now invoke the php command-line interpreter with the following command:

php ./src/CHAOS/Harvester/ChaosHarvester.php --configuration="/path/to/configuration.xml" --mode=some-cool-mode

You might encounter a couple of PHP warnings telling you that values of tags should be fetched from environment variables, which has not been set. This is because the configuration file supports fetching its string values from environment variables, if for some reason you would like these values to be specified at runtime. This is useful when dealing with multiple CHAOS environments or simply as a means of not storing login credentials in the configuration files.

Additionally every chc:path element in the chc:ChaosHarvesterConfiguration/chc:IncludePaths element is evaluated, i.e. all of these paths must point to existing folders as absolute paths or relative to either the configuration files path, the value of the chc:ChaosHarvesterConfiguration/chc:BasePath element or the working directory.

The --mode=some-cool-mode runtime option tells the harvester to start in the mode some-cool-mode which must match the name of a mode in the configuration file. Additionally the harvester might require a --reference=... option, if the mode is *ByReference mode see the What's inside page.

When developing your harvester it might make sense to have a more verbose output than you get from simply running the harvester, this is where the following runtime options makes sence:

  • --debug This makes the different processors and modes print a more information on what they are doing.
  • --debug-metadata This prints the XML result of any MetadataProcessor when it has finished processing an external object.

Setting up a periodic run of the harvester

One could think of a couple of ways to do this: