README

Percolate is a simple-stupid application for combining command-line programs
into flexible, fault-tolerant data transformation workflows. Its goal is to
allow complex workflows to be expressed using only standard Ruby operators.


Percolate's features are:

- The ability to create complex, parallel workflows as plain Ruby code. The
  workflow paths are implicit in the code, being defined by method arguments
  and return values.

- Workflows may contain any combination of Ruby code, local system calls and
  asynchronous batch queue jobs.

- Workflows may be restarted after failure without repeating successful steps.
  Partially complete workflows may be paused and archived, to be continued later.

- Parallel workflows may be executed using fork/exec or by integration with
  Platform LSF for large clusters.

- Small and lightweight. These things are relative, of course, but the entire
  system is less than 2000 LOC, including the driver and auditor.


Percolate's restrictions are:

- Methods on the workflow execution path must follow a simple convention:
  each must accept 'nil' as an argument, and return 'nil', for any resource
  that is unavailable at the time of invocation.

- Heavy compute should be done by the command-line programs called by the
  workflow, not the workflow script itself.
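
The 'nil' convention above can be pictured with a small, self-contained
sketch. It is plain Ruby and does not use Percolate itself: the method names,
the file names and the FINISHED hash are all hypothetical stand-ins, and the
hash merely simulates an asynchronous job whose result is not yet available.

```ruby
# Hypothetical illustration only; none of these names are part of Percolate.

# Simulates asynchronous jobs: a result is absent (nil) until the job
# has been recorded as finished.
FINISHED = {}

# A step whose output is unavailable until its (simulated) job completes.
def align(reads)
  return nil if reads.nil?       # input not ready, so neither is the output
  FINISHED["align:#{reads}"]     # nil while the job is still pending
end

# A downstream step: passes 'nil' through when its input is unavailable.
def summarise(alignment)
  return nil if alignment.nil?
  "#{alignment}.summary"
end

# The workflow path is implied by arguments and return values.
def workflow(reads)
  summarise(align(reads))
end

first_pass = workflow("reads.fq")           # job pending, result is nil
FINISHED["align:reads.fq"] = "reads.bam"    # pretend the batch job finished
second_pass = workflow("reads.fq")          # result is now available
```

Because every method tolerates 'nil', the whole workflow can be invoked
repeatedly and simply yields 'nil' until all of its inputs exist.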


To create and run a workflow, the steps are:

1. Use Percolate's helpers to wrap each command-line program in a Ruby method so
   that the essential resources required for the run are represented by the
   method arguments and the resources created by the run are represented by the
   method return values. Choose whether to run synchronously (via system) or
   asynchronously (via fork/exec or on a cluster via Platform LSF).

2. Write the body of the workflow using these methods and any Ruby flow control
   operators. Within a single workflow, any combination of Ruby methods,
   fork/exec jobs or Platform LSF jobs is permitted.

3. Create a Workflow class as an entry point, having a 'run' method that
   invokes the workflow. Workflows may create and invoke more instances of any
   Workflow class.

4. Start the Beanstalk message queue.

5. Launch workflows by placing a YAML file into the Percolate 'in' directory.
   The file describes the Workflow class to instantiate and the arguments to
   the 'run' method.

6. Run the Percolate driver repeatedly at intervals (e.g. via cron) until the
   system moves your input YAML file to the Percolate 'pass' directory
   or to the 'fail' directory (if one of the steps has failed).

7. If there was a failure, look at the logs, fix the problem and move the YAML
   file back to the 'in' directory to resume the workflow.

8. Run the auditor on the log to see a breakdown of what happened during the run.


Percolate's dependencies are:

- Beanstalk (http://kr.github.com/beanstalkd/)

- The beanstalk-client Ruby gem (http://beanstalk.rubyforge.org/)

- The gibbler Ruby gem (http://github.com/delano/gibbler)

and optionally, for the auditor:

- The Ruport Ruby gem (http://www.rubyreports.org/)