Skip to content

Latest commit

 

History

History
895 lines (793 loc) · 21.5 KB

spec_14.adoc

File metadata and controls

895 lines (793 loc) · 21.5 KB

14/Canonical Job Specification

A domain specific language based on YAML is defined to express the resource requirements and other attributes of one or more programs submitted to a Flux instance for execution. This RFC describes the canonical jobspec form, which represents a request to run exactly one program.

  • Name: github.com/flux-framework/rfc/spec_14.adoc

  • Editor: Tom Scogland <scogland1@llnl.gov>

  • State: raw

Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Goals

  • Express the resource requirements of a program to the scheduler.

  • Allow graph-oriented resource requirements to be expressed.

  • Express program attributes such as arguments, run time, and task layout, to be considered by the program execution service (RFC 12)

  • Express dependencies relative to other programs executing within the same Flux instance.

  • Emphasize expressivity over simplicity, as this canonical form may be generated from other user-friendly forms or interfaces.

  • Facilitate reproducible runs.

  • Promote sharing and reuse of jobspec.

Overview

This RFC describes the canonical form of "jobspec", a domain specific language based on YAML[1]. The canonical jobspec SHALL consist of a single YAML document representing a reusable request to run exactly one program. Hereafter, "jobspec" refers to the canonical form, and "non-canonical jobspec" refers to the non-canonical form.

Non-canonical jobspec SHALL be decomposed into jobspec before it is enqueued for the scheduler and program execution service.

User facing tools MAY generate jobspec from non-canonical jobspec, or other sources. Such tools MAY:

  • generate a batch of dependent jobspecs representing a scientific workflow

  • generate a stream of jobspecs representing a steered parameter study

  • convert simulation parameters into jobspec containing computed resource requirements, etc.

  • convert command line arguments to jobspec, e.g. "flux mpirun"

Jobspec and Program Life Cycle

The jobspec SHALL be submitted to a job submission service. Malformed jobspec SHALL be immediately rejected by the job submission service. A stack of plugins SHALL test jobspec against site or user defined criteria, and on failure, MAY reject the jobspec, or MAY warn the user and continue on. The job submission service SHALL enqueue the jobspec for consideration by the scheduler.

The scheduler SHALL consider each enqueued jobspec in the context of its dependencies and the pool of available resources. When the scheduler chooses to execute a job, it allocates resources, associates them with the jobspec, and notifies the program execution service to start the program(s).

The program execution service, described in RFC 12, launches the program(s). Task slots, containment, and task layout SHALL be created within the allocated resources as described by the jobspec, or if that is not possible, the job SHALL enter a failed state and resources SHALL be returned to the scheduler.

Once a job is retired, the jobspec SHALL be retained as part of its provenance record.

Resource Matching

Resources are represented as hierarchies or graphs, as described in RFC 4.

FIXME: describe how Flux hierarchical resource representation affects jobspec design.

Terminology

FIXME: Fill in

Jobspec Language Definition

A canonical jobspec YAML document SHALL consist of a dictionary defining the resources, tasks and other attributes of a single program. The dictionary MUST contain the keys resources, tasks, attributes, and version.

Each of the listed jobspec keys SHALL meet the form and requirements listed in detail in the sections below. For reference, a ruleset for compliant canonical jobspec is provided using JSON Content Rules[2] in the Content Rules at the end of this section.

Resources

The value of the resources key SHALL be a strict list which MUST define at least one resource. Each list element SHALL represent a resource vertex or resource descriptor object as a dictionary (described below). The list of resources defined under the resources key SHALL represent a composite resource request for the program defined in the jobspec.

A resource vertex SHALL contain the following keys:

type

The type key for a resource SHALL indicate the type of resource to be matched. Some type names MAY be reserved for use in the jobspec language itself. The currently reserved type is slot, used to define task slots. Reserved types are described in the Reserved Resource Types section below.

count

The count key SHALL indicate the desired number or range of resources matching the current vertex. The count SHALL have one of two possible values: either a single integer value representing a fixed count, or a dictionary which SHALL contain the following keys:

min

The minimum required count or amount of this resource

max

The maximum required count or amount of this resource

operator

An operator applied between min and max which returns the next acceptable value

operand

The operand used in conjunction with operator to get the next value between min and max.

A resource vertex MAY additionally contain one or more of the following keys

unit

The unit key, if supplied, SHALL have a string value indicating the chosen units applied to the count value or values.

exclusive

The exclusive key SHALL be a boolean indicating, when true, that the current resource is requested to be allocated exclusively to the current program. If unset, the default value for exclusive SHALL be false for vertices that are not within a task slot. The default value for exclusive SHALL be true for task slots (type: slot) and their associated resources.

with

The with key SHALL indicate an edge of type out from this resource vertex to another resource. Therefore, the value of the with key SHALL be a dictionary conforming to the resource vertex specification.

label

The label key SHALL be a string that may be used to reference this resource vertex from other locations within the same jobspec. label SHALL be local to the namespace of the current jobspec, and each label in the current jobspec must be unique. label SHALL be mandatory in resource vertices of type slot.

id

The value of the id key SHALL be a string indicating a set of matching resource identifiers.

Reserved Resource Types

slot

A resource type of type: slot SHALL indicate a grouping of resources into a named task slot. A slot SHALL be a valid resource spec including a label key, the value of which may be used to reference the named task slot during tasks definition. The label provided SHALL be local to the namespace of the current jobspec.

A task slot SHALL have at least one edge specified using with:, and the resources associated with a slot SHALL be exclusively allocated to the program described in the jobspec.

Tasks

The value of the tasks key SHALL be a strict list which MUST define at least one task. Each list element SHALL be a dictionary representing a task or tasks to run as part of the program. A task descriptor SHALL contain the following keys:

command

The value of the command key SHALL be a string OR list representing an executable and its arguments.

slot

The value of the slot key SHALL be used to indicate the task slot on which this task or tasks shall be contained and executed. The number of tasks executed per task slot SHALL be a function of the number of resource slots and total number of tasks requested to execute.

The value of the slot key SHALL be a dictionary with supported key of either label or type. The label key SHALL reference a label of a resource vertex of type slot, indicating an explicitly created and named task slot. The type key SHALL reference a real resource type, such as core or node, indicating an implicitly created task slot on which to map the defined tasks. type SHALL NOT be slot; slots must be referred to by their label.

count

The value of the count key SHALL be a dictionary supporting at least the keys per_slot and total, with other keys reserved for future or site-specific extensions.

per_slot

The value of per_slot SHALL be a number indicating the number of tasks to execute per task slot allocated to the program.

total

The value of the total field SHALL indicate the total number of tasks to be run across all task slots, possibly oversubscribed.

attributes

The attributes key SHALL be a free-form dictionary of keys which may be used for platform independent or optional extensions.

distribution

The value of the distribution key SHALL be a string, which MAY be used as input to the launcher’s algorithm for task placement and layout among task slots.

Attributes

The value of the attributes key SHALL be a dictionary of dictionaries. The attributes dictionary MAY contain one or more of the following keys which, if present, must have dictionary values:

user

Attributes in the user dictionary are unrestricted, and may be used as the application demands. Flux may provide addition tools that can identify jobs based on user attributes.

system

Attributes in the system dictionary are additional parameters to a Flux instance that affect program execution, scheduling, etc. All attributes in system are reserved words, however unrecognized words SHALL trigger no more than a warning. This permits jobspec reuse between multiple flux instances which may be configured differently and recognize different sets of attributes.

Most system attributes are optional. Flux modules SHALL provide reasonable defaults for any system attributes that they recognize when at all possible.

Some common system attributes are:

duration

The value of the duration attribute is a string representing time span. The scheduler will make an effort to allocate the requested resources for the time specified in duration.

Example Jobspec

Under the description above, the following is an example of a fully compliant version 1 jobspec. The example below declares a request for 4 "nodes" each of which with 1 task slot consisting of 2 cores each, for a total of 4 task slots. A single copy of the command app will be run on each task slot for a total of 4 tasks.

version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
tasks:
  - command: app
    slot:
      label: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

A simpler example using implicit task slot definition to run 4 tasks across 4 nodes

version: 1
resources:
  - type: node
    count: 4
tasks:
  - command: hostname
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Content Rules

A jobspec conforming to version 1 of the language definition SHALL adhere to the following ruleset, described using JSON Content Rules[2] draft version 0.6.

# jcr-version 0.6

{
   "resources" : [ $vertex + ],
   "tasks" : tasks,
   "attributes" : { $system_attributes, $user_attributes },
   "version" : 1
}

$label := string

$vertex_common = {
    "count" : ( 1.. | $complex_range),
    ?"exclusive" : boolean,
    ?"with" : [ $vertex + ]
}

slot_vertex = {
    "type"  : "slot",
    "label" : $label,
    $vertex_common,
}

resource_vertex = {
    "type" : ( :string, + @{reject} ("slot")),
    $vertex_common
    ?"id" : string,
    ?"unit" : string,
}

vertex = ( $slot_vertex | $resource_vertex )

$complex_range = {
    "min" : 1..,
    "max" : 1..,
    "operator" : ( :"+" | :"*" | : "^" ),
    "operand" : 1..,
}

$tasks = {
    "command" : [ string + ],
    "slot" : { "label" : string  | "type": string },
    "count" : { "per_slot" : 1.. | "total" : 1.. },
    "distribution" : string,
    ?"attributes" : { /.*/ : any },
}

$system_attributes = {
    "duration" : string,
    /.*/ : string
}

$user_attributes = { /.*/ : string }

Basic Use Cases

To implement basic resource manager functionality, the following use cases SHALL be supported by the jobspec:

Section 1: Resource only requests

The following "resource only" requests are assumed to be the equivalent of existing resource manager batch job submission or allocation requests, i.e. equivalent to oarsub, qsub, and salloc. In terms of a canonical jobspec, these requests are assumed to be requests to start an instance, i.e. run a single copy of flux start per allocated node.


Use Case 1.1

Request Single Resource with Count

Specific Example

Request 4 nodes

Existing Equivalents

Slurm

salloc -N4

PBS

qsub -l nodes=4

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 1.2

Request a range of a type of resource

Specific Example

Request between 3 and 30 nodes

Existing Equivalents

Slurm

salloc -N3-30

Jobspec YAML
version: 1
resources:
  - type: node
    count:
      min: 3
      max: 30
      operator: "+"
      operand: 1
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 1.3

Request M nodes with a minimum number of sockets per node and cores per socket

Specific Example

Request 4 nodes with at least 2 sockets each, and 4 cores per socket

Existing Equivalents

Slurm (a)

srun -N4 --sockets-per-node=2 --cores-per-socket=4

Slurm (b)

srun -N4 -B '2:4:*'

OAR

oarsub -l nodes=4/sockets=2/cores=4

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: socket
        count: 2
        with:
          - type: core
            count: 4
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 1.4

Exclusively allocate nodes, while constraining cores and sockets.

Specific Example

Request an exclusive allocation of 4 nodes that have at least two sockets and 4 cores per socket:

Jobspec YAML
version: 1
resources:
  - type: slot
    with:
    - type: node
      count: 4
      with:
        - type: socket
          count: 2
          with:
            - type: core
              count: 4
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 1.5

Complex example from OAR

Specific Example
ask for 1 core on 2 nodes on the same cluster with 4096 GB of memory and Infiniband 10G + 1 cpu on 2 nodes on the same switch with bicore processors for a walltime of 4 hours
Existing Equivalents

OAR

oarsub -I -l "{memnode=4096 and ib10g='YES'}/cluster=1/nodes=2/core=1+{nbcore=2}/switch=1/nodes=2/cpu=1,walltime=4:0:0"

Jobspec YAML
version: 1
resources:
  - type: cluster
    count: 1
    with:
      - type: node
        count: 2
        with:
          - type: memory
            count: 4
            unit: GB
          - type: ib10g
            count: 4
      - type: switch
        with:
          type: node
            count: 2
            with:
                - type: core
                  count: 1
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 4 hours

Use Case 1.6

Request resources across multiple clusters

Specific Example

Ask for 1 core on 15 nodes across 2 clusters (total = 30 cores)

Existing Equivalents

OAR

oarsub -I -l /cluster=2/nodes=15/core=1

Jobspec YAML
version: 1
resources:
    - type: cluster
      count: 2
      with:
          - type: node
            count: 15
            with:
              - type: core
                count: 1
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 1.7

Request N cores across M switches

Specific Example

Request 3 cores across 3 switches

Existing Equivalents

OAR

oarsub -I -l /switch=3/core=1

Jobspec YAML
version: 1
resources:
  - type: switch
    count: 3
    with:
      - type: core
        count: 1
tasks:
  - command: [ "flux", "start" ]
    slot:
      type: node
    count:
      per_slot: 1
attributes:
  system:
    duration: 1h

Section 2: Resource and task jobspec

The following use cases include task specification in addition to resource request, demonstrating the use of task slot labels and types to relate tasks to resources.


Use Case 2.1

Run N tasks across M nodes

Specific Example

Run hostname 20 times on 4 nodes, 5 per node

Existing Equivalents

Slurm

srun -N4 -n20 hostname or srun -N4 --ntasks-per-node=5 hostname

PBS

qsub -l nodes=4,mppnppn=5

Jobspec YAML
version: 1
resources:
  - type: slot
    label: default
    count: 4
    with:
    - type: node
      count: 1
tasks:
  - command: hostname
    slot: default
    count:
      per_slot: 5
attributes:
  system:
    duration: 1 hour

Use Case 2.2

Run N tasks across M nodes, unequal distribution

Specific Example

Run 5 copies of hostname across 4 nodes, default distribution

Existing Equivalents

Slurm

srun -n5 -N4 hostname

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
tasks:
  - command: hostname
    slot:
      type: node
    count:
      total: 5
attributes:
  system:
    duration: 1 hour

or, with explicit task slots:

version: 1
resources:
  - type: slot
    count: 4
    label: myslot
    with:
      - type: node
        count: 1
tasks:
  - command: hostname
    slot:
      label: myslot
    count:
      total: 5
attributes:
  system:
    duration: 1 hour

Use Case 2.3

Run N tasks, Require M cores per task

Specific Example

Run 10 copies of myapp, require 2 cores per copy, for a total of 20 cores

Existing Equivalents

Slurm

srun -n10 -c 2 myapp

Jobspec YAML
version: 1
resources:
  - type: slot
    label: default
    count: 10
    with:
      - type: core
        count: 2
tasks:
  - command: myapp
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 2.4

Run different binaries with differing resource requirements as single program

Specific Example

11 tasks, one node, first 10 using one core and 4G of RAM for read-db, last using 6 cores and 24G of RAM for db

Existing Equivalents

None Known

Jobspec YAML
version: 1
resources:
  - type: node
    with:
      - type: slot
        label: read-db
        count: 10
        with:
          - type: core
            count: 1
          - type: memory
            count: 4
            unit: GB
      - type: slot
        label: db
        count: 1
        with:
          - type: core
            count: 6
          - type: memory
            count: 24
            unit: GB
tasks:
  - command: read-db
    slot: read-db
    count:
      per_slot: 1
  - command: db
    slot: db
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 2.5

Run command requesting minimum amount of RAM per core

Specific Example

Run 10 copies of app across 10 cores with at least 2GB per core

Existing Equivalents

Slurm

srun -n 10 --mem-per-cpu=2048 app

Jobspec YAML
version: 1
resources:
  - type: slot
    label: default
    count: 10
    with:
    - type: memory
      count: 2
      unit: GB
    - type: core
      count: 1
tasks:
  - command: app
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 1 hour

Use Case 2.6

Run N copies of a command with minimum amount of RAM per node

Specific Example

Run 10 copies of app across 2 nodes with at least 4GB per node

Existing Equivalents

Slurm

srun -n10 -N2 --mem=4096 app

OAR

oarsub -p memnode=4096 -l nodes=2 "taktuk -c oarsh -f $OAR_FILE_NODES broadcast exec [app]"

Jobspec YAML
version: 1
resources:
  - type: slot
    label: 4GB-node
    count: 2
    with:
    - type: node
      count: 1
      with:
        - type: memory
          count: 4
          unit: GB
tasks:
  - command: app
    slot: 4GB-node
    count:
      total: 10
attributes:
  system:
    duration: 1 hour