This is a job submission library for Go (#golang) which is compatible with the DRMAA standard. The Go library is a wrapper around the DRMAA C library implementation provided by many distributed resource managers (cluster schedulers).
This library fully supports Open Cluster Scheduler and Gridware Cluster Scheduler.
The library was originally developed using Univa Grid Engine's libdrmaa.so. It was tested with different versions of Grid Engine, Torque, and SLURM, but it should also work with other resource managers / cluster schedulers that provide libdrmaa.so.
The "gestatus" subpackage only works with Grid Engine (some values are only available on Univa Grid Engine).
The DRMAA (Distributed Resource Management Application API) standard is now available in version 2. DRMAA2 provides more functionality around cluster monitoring and job session management. DRMAA and DRMAA2 are not compatible, so both libraries are expected to coexist for a while. The Go DRMAA2 can be found here.
First download the package:
export GOPATH=${GOPATH:-~/src/go}
mkdir -p $GOPATH
go get -d github.com/dgruber/drmaa
cd $GOPATH/src/github.com/dgruber/drmaa
Next, we need to compile the code.
For Univa Grid Engine and original SGE:
source /path/to/grid/engine/installation/default/settings.sh
./build.sh
cd examples/simplesubmit
go build
export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
./simplesubmit
For Son of Grid Engine ("loveshack"):
source /path/to/grid/engine/installation/default/settings.sh
./build.sh --sog
cd examples/simplesubmit
go build
export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
./simplesubmit
For Torque:
If your Torque drmaa.h header file is not located under /usr/include/torque, you will have to modify the build.sh script before running it.
./build.sh --torque
cd examples/simplesubmit
go build
./simplesubmit
For SLURM and the updated SLURM C drmaa binding:
./build.sh --slurm /usr/local
The example program submits a sleep job into the system and prints out detailed job information as soon as the job is started.
Go DRMAA applications must open a DRMAA session before any other DRMAA call can be made. Opening a session usually establishes a connection to the cluster scheduler (distributed resource manager). Once no more DRMAA calls are needed, the session's Exit() method must be called to tear down that connection. If an application does not call Exit(), a communication handle can remain open on the cluster scheduler side, and it can take a while before it is removed automatically, so always call Exit(). In Go the defer statement is convenient for this, but remember that deferred functions are not executed when os.Exit() is called.
Creating a DRMAA session:
s, err := drmaa.MakeSession()
Usually jobs and job workflows are submitted within DRMAA applications. To submit a job, a job template must first be allocated:
jt, errJT := s.AllocateJobTemplate()
if errJT != nil {
	fmt.Printf("Error during allocating a new job template: %s\n", errJT)
	return
}
Underneath, a C job template is allocated, which is outside the control of the Go garbage collector. Hence the job template must be deleted explicitly when it is no longer used. The Go defer statement is useful here as well.
// prevent memory leaks by freeing the allocated C job template at the end
defer s.DeleteJobTemplate(&jt)
The job template contains the specification of the job, such as the command to be executed and its parameters. These can be set with the setter methods of the job template.
// set the application to submit
jt.SetRemoteCommand("sleep")
// set the parameter (use SetArgs() when having more parameters)
jt.SetArg("1")
A job is submitted with the session's RunJob() method. If the same command needs to be executed many times, it makes sense to run it as an array job. In Grid Engine each instance is assigned a task ID, which the job can read from the SGE_TASK_ID environment variable (set to "undefined" for normal jobs). The task ID can be used to select the data set that the array job task needs to process. An array job is submitted with the RunBulkJobs() method.
jobID, errSubmit := s.RunJob(&jt)
// submitting 1000 instances of the same job
jobIDs, errBulkSubmit := s.RunBulkJobs(&jt, 1, 1000, 1)
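On the job side, a program running as an array job task could use the SGE_TASK_ID environment variable to pick its input. The sketch below shows one way to do that using only the Go standard library. The helper names `taskID` and `inputFile` and the file naming scheme are hypothetical, and any value that is not a positive integer is treated as "not an array job task".

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// taskID parses the value of SGE_TASK_ID. It reports ok == false when the
// value is not a positive integer, as is the case for non-array jobs.
func taskID(value string) (id int, ok bool) {
	n, err := strconv.Atoi(value)
	if err != nil || n < 1 {
		return 0, false
	}
	return n, true
}

// inputFile maps a task ID to a data set name. The naming scheme
// (input-0001.dat, input-0002.dat, ...) is purely hypothetical.
func inputFile(id int) string {
	return fmt.Sprintf("input-%04d.dat", id)
}

func main() {
	id, ok := taskID(os.Getenv("SGE_TASK_ID"))
	if !ok {
		fmt.Println("not running as an array job task")
		return
	}
	fmt.Println("processing", inputFile(id))
}
```

Submitted with RunBulkJobs(&jt, 1, 1000, 1), each of the 1000 tasks would then work on its own file.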
The state of a job can also be changed (suspended / resumed / put on hold / deleted):
errTerm := s.TerminateJob(jobID)
The JobInfo data structure contains the runtime information of the job, such as the exit status or the amount of resources used (memory / IO / etc.). It is returned by the Wait() method.
jinfo, errWait := s.Wait(jobID, drmaa.TimeoutWaitForever)
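Once Wait() has returned, an application typically needs to decide how the job ended. The following sketch shows one way to turn that information into a short summary. It does not use the package directly: the `jobResult` interface is an assumption about the kind of accessors JobInfo offers (check the package documentation for the actual method set), and `fakeResult` and `summarize` are hypothetical names used so the example runs without a cluster.

```go
package main

import "fmt"

// jobResult is a hypothetical interface describing how a job ended.
// It is an assumption for this sketch, not the documented JobInfo API.
type jobResult interface {
	HasExited() bool
	ExitStatus() int64
	HasSignaled() bool
	TerminationSignal() string
	HasAborted() bool
}

// fakeResult is a hand-filled stand-in for the value returned by Wait().
type fakeResult struct {
	exited   bool
	status   int64
	signaled bool
	signal   string
	aborted  bool
}

func (r fakeResult) HasExited() bool           { return r.exited }
func (r fakeResult) ExitStatus() int64         { return r.status }
func (r fakeResult) HasSignaled() bool         { return r.signaled }
func (r fakeResult) TerminationSignal() string { return r.signal }
func (r fakeResult) HasAborted() bool          { return r.aborted }

// summarize checks the most specific end conditions first and falls
// back to a generic message when none of them applies.
func summarize(r jobResult) string {
	switch {
	case r.HasAborted():
		return "job aborted (never ran)"
	case r.HasSignaled():
		return fmt.Sprintf("job killed by signal %s", r.TerminationSignal())
	case r.HasExited():
		return fmt.Sprintf("job exited with status %d", r.ExitStatus())
	default:
		return "job finished with unknown status"
	}
}

func main() {
	fmt.Println(summarize(fakeResult{exited: true, status: 0}))
	fmt.Println(summarize(fakeResult{signaled: true, signal: "SIGKILL"}))
}
```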
For more details please consult the documentation and the DRMAA standard specifications.
More examples can be found on my blog at http://www.gridengine.eu.