forked from garlick/flux-plans
Don Lipari edited this page Nov 19, 2014
### Smallest Serviceable Slurm Substitute
What follows are the requirements to replace the SLURM version currently in use at LC, not a wish list for the perfect batch system. The requirements are listed as bullet items with minimal text describing each item. This assumes an understanding of SLURM and its features; for further details, refer to the SLURM man pages. References to SLURM commands are listed where appropriate. Features introduced in SLURM versions after v2.3.3 are not listed.
- Task launch
  - Specify number of tasks
  - Specify resources (at least nodes and cores)
    - Number of resources (e.g., 4 nodes), including ranges
    - Named resources (e.g., cluster, node[4-8], core[0-3])
    - Memory size
    - Generic resources
    - Features
  - Task distribution:
    - Cyclic
    - Block
    - Plane
    - Custom (based on a configuration file)
  - Task to resource mapping
    - Number of tasks per node (or core)
    - Number of cores per task
    - Hardware threading (desired? allowed? disabled?)
  - Task containment: confine tasks to allocated resources (sockets, cores, memory)
  - Wall clock limit
  - Task prolog and epilog options
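The cyclic and block distributions listed above can be sketched as follows. This is a simplified illustration of the two mapping policies, not SLURM's implementation; the node names are invented:

```python
def block_dist(ntasks, nodes):
    """Block distribution: consecutive task ranks fill one node before the next."""
    per_node = -(-ntasks // len(nodes))  # ceiling division
    return {rank: nodes[rank // per_node] for rank in range(ntasks)}

def cyclic_dist(ntasks, nodes):
    """Cyclic distribution: task ranks are dealt out to nodes round-robin."""
    return {rank: nodes[rank % len(nodes)] for rank in range(ntasks)}

nodes = ["node4", "node5"]    # hypothetical 2-node allocation
print(block_dist(4, nodes))   # ranks 0,1 on node4; ranks 2,3 on node5
print(cyclic_dist(4, nodes))  # ranks alternate: node4, node5, node4, node5
```

A plane distribution generalizes these by filling nodes in fixed-size blocks before cycling, and a custom distribution would read the mapping from a configuration file.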
- Resource management
  - Resources managed: clusters, nodes, sockets, cores, threads, memory, GPUs, burst buffers, file systems, licenses, etc.
  - Add and remove resources from management
  - Report and change status of resources: up, down, draining, allocated, idle
  - Resource pools (aka partitions, queues)
  - Resource weights (governs priority for selection)
  - Resource sharing allowed (if so, to what degree?)
  - Network topology
    - Contiguous resources
    - Switch topology
- Resource status (`sinfo`)
  - Summary of nodes and states (idle, allocated, down, draining)
  - Summarize for each node partition
  - Rich reports of specific resources
    - By node (`scontrol show node`)
    - By partition (`scontrol show partition`)
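An `sinfo`-style summary amounts to counting nodes by partition and state. A minimal sketch, with an invented node inventory:

```python
from collections import Counter

def summarize(nodes):
    """Count node states per partition, sinfo-style.

    `nodes` is a list of (partition, state) pairs; the inventory below
    is made up for illustration.
    """
    counts = Counter(nodes)
    return {f"{part}/{state}": n for (part, state), n in counts.items()}

inventory = [("batch", "idle"), ("batch", "allocated"),
             ("batch", "allocated"), ("debug", "down")]
print(summarize(inventory))  # {'batch/idle': 1, 'batch/allocated': 2, 'debug/down': 1}
```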
- Job Specification
  - Job category
    - Batch script (`sbatch`)
    - Interactive (`salloc`) - includes xterm request (`mxterm`/`sxterm`)
    - Single job step as job (`srun`)
  - User / group
  - Bank account
  - Workload characterization key
  - Min/max run times
  - Priority (includes nice factor if any)
  - QoS
  - Queue
  - Resource requirements
    - Min/max node counts
    - Features, tags, processor architecture, processor speed
    - (Minimum or specific) memory per (socket or node)
    - (Minimum or specific) (sockets or cores) per node
    - Tasks per node (or core)
    - Cores per task
    - Shared or exclusive
    - Preferred network topology / node contiguity
    - Licenses
    - File systems
    - Installed packages and libraries
  - Allocated resources
    - By count (e.g., number of nodes and cores)
    - By name (e.g., node names, CPUs, GPUs, etc.)
    - Node on which batch script is running
  - State (includes reason for not running)
  - Dependency (other job(s) starting/completing/exit code)
  - Reservation
  - Prolog and Epilog
  - Re-queue request
    - If preempted
    - If resource fails
  - Terminate (or continue) on resource failure
  - Times
    - Submit time
    - Start-after time
    - Estimated start time
    - Actual start time
    - Run time limit
    - Actual run time
    - Terminate time
  - Exit status (includes if signaled and by which signal)
  - Job run info
    - Job name
    - Command
    - Working directory
    - Standard in / out / error
    - Batch script
- Job Submission
  - Option to intercept submit request and alter, override, or insert policy-related options
  - Job submission fails at submit time (as opposed to run time) when invalid options are specified
- Job status
  - One-line job summary (`squeue`)
    - Queued as well as running jobs
    - Includes jobs of other users
  - Verbose job record report (`scontrol show job`)
  - Job step reports
  - Includes record of associated batch script
- Job control
  - Job removal and signaling (`scancel`)
  - Job signal prior to termination (per specified grace time)
  - Job modification (`scontrol update job`)
  - Job hold/release
- Job prioritization factors
  - Fair share
  - Job size (favoring large or small)
  - Queued time (FIFO)
  - QoS contribution
  - Queue contribution
  - User nicing
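Combining these factors is typically a weighted sum, in the spirit of SLURM's multifactor priority plugin. A minimal sketch; the weights and factor values below are invented, and each factor is assumed normalized to [0, 1]:

```python
def job_priority(factors, weights):
    """Combine normalized priority factors (each in [0, 1]) as a weighted sum."""
    return sum(weights[name] * value for name, value in factors.items())

# Hypothetical site weights and per-job factor values
weights = {"fairshare": 10000, "age": 1000, "size": 100, "qos": 5000}
factors = {"fairshare": 0.25, "age": 1.0, "size": 0.5, "qos": 0.0}
print(job_priority(factors, weights))  # 2500 + 1000 + 50 + 0 = 3550.0
```

A nice factor would then be subtracted from (or added to) this total per user request.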
- Scheduling (starting with a prioritized queue)
  - Matches job’s requests with available resources
  - Supports multiple rules for resource selection:
    - Best fit
    - First fit
    - Balanced workload
  - Job submission requires a bank account and user permission to use that account
  - Honors time and resource size limits imposed by
    - Queue
    - QoS
    - User/Bank
  - Imposes limits on
    - Number of jobs that can be queued at any given time
    - Number of jobs that can be running at any given time
  - Accommodates sharing requests and allowed sharing levels
  - Waits a specified time to accommodate a node topology request
  - Backfill option
    - Conservative backfill: no higher-priority job is delayed
    - EASY backfill: only the top-priority job cannot be delayed
  - Provides estimated start times
  - Considers jobs for multiple queues
  - Supports job dependencies from other clusters
  - Provides job preemption based on QoS or queue. Preemption action can be
    - Suspension
    - Checkpoint
    - Terminate and re-queue
    - Terminate
  - Support for job growth and shrinkage
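The first-fit and best-fit selection rules above differ only in which feasible node they pick. A simplified single-node sketch (the free-core map is invented; a real selector would also weigh topology, sharing, and limits):

```python
def first_fit(free_cores, need):
    """Return the first node with enough free cores, or None."""
    for node, free in free_cores.items():
        if free >= need:
            return node
    return None

def best_fit(free_cores, need):
    """Return the feasible node that leaves the fewest cores stranded."""
    feasible = {n: f for n, f in free_cores.items() if f >= need}
    return min(feasible, key=lambda n: feasible[n] - need) if feasible else None

free = {"n1": 16, "n2": 4, "n3": 8}  # hypothetical free-core map
print(first_fit(free, 4))  # n1: first node that fits
print(best_fit(free, 4))   # n2: exact fit, strands no cores
```

A balanced-workload rule would instead prefer the node with the *most* free cores, spreading load.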
- Quality of Service
  - Affects job priority
  - Allows exemptions from time and size limits
  - Can impose an associated set of time and size limits
  - Can amplify or dampen the usage charges
- Bank Accounts
  - A user must have permission to use a bank account in order to submit jobs
  - Reflects the sponsors’ claim to the cluster’s resources (i.e., the shares in fair share)
  - Can impose an associated set of time and size limits
- Reservations
  - Resources can be reserved in advance (DATs)
  - Permitted jobs can run within those reservations
- Email user at job state transitions
  - Begin
  - End
  - Fail
  - Re-queue
  - All
- Resource accounting
  - Resource utilization (`sreport`)
  - Times reported for specified time periods under the following categories:
    - Allocated
    - Idle
    - Reserved
    - System maintenance
    - Unplanned down time
- Job accounting
  - Individual job records (`sacct`)
    - Job and job step records for a prescribed time period
    - Includes most of the job parameters listed in Job Specification above
  - Composite job reports (`sreport`)
    - Aggregate job reports based on user, account, and workload characterization key
    - Over a prescribed time period
    - Includes listing of top users and top accounts
    - Includes reports by job size
- Security
  - Jobs can only be run by the submitting user
  - Job output can only be seen by the submitting user
  - System parameters can only be changed by authorized roles (see next item)
- Administration
  - Role-based system administration and overrides
    - User can monitor and alter (some of) own job parameters
    - Operator can alter other users’ job parameters
    - Coordinator can populate bank account memberships and limits
    - Administrator can do all of the above and alter resource definitions
  - User/bank management (`sacctmgr`)
    - Cluster/partition/user/bank granularity
    - Implicit permission to use bank
    - Limits imposed at each level of the hierarchy
    - Limits include:
      - Max number of jobs running at any time in the bank
      - Max number of nodes for any job running in the bank
      - Max number of CPUs for any job running in the bank
      - Max number of pending + running jobs at any time in the bank
      - Max wall clock time each job in the bank can run
      - Max CPU*minutes each job in the bank can run
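Since limits are imposed at each level of the hierarchy, a job must satisfy not only its own bank's limits but those of every ancestor bank. A minimal sketch; the bank tree is invented, and a single max-nodes limit stands in for the richer per-bank limits listed above:

```python
def check_limits(job, bank, banks):
    """Walk up the bank hierarchy; the job must satisfy every ancestor's limits.

    `banks` maps bank name -> (parent, max_nodes); None means no limit.
    """
    while bank is not None:
        parent, max_nodes = banks[bank]
        if max_nodes is not None and job["nodes"] > max_nodes:
            return False
        bank = parent
    return True

# Hypothetical hierarchy: root (64 nodes) -> physics (32) -> lasers (8)
banks = {"root": (None, 64), "physics": ("root", 32), "lasers": ("physics", 8)}
print(check_limits({"nodes": 16}, "lasers", banks))  # False: exceeds lasers' cap of 8
print(check_limits({"nodes": 4}, "lasers", banks))   # True: within every level
```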
- System
  - Save state and recover on restart
    - Resources
    - Jobs
    - Usage statistics
  - System can be restarted without losing queued jobs or killing running jobs
- Reliability
  - High availability: a backup takes over when the primary dies or hangs
  - Resilient: able to adapt to failing or failed resources
  - 24x7 operation
    - System updates possible on a live system without losing queued or running jobs
  - Robust
    - Atomic changes
    - System can never get into a corrupt or inconsistent state
    - Complete recovery after crashes
- Performance
  - Response to user commands in under one or two seconds
  - Scheduling loops complete in under one minute
- Scalability
  - Thousands of jobs
  - Thousands of resources
  - Thousands of users
- Visibility
  - Pertinent info is logged
  - System diagnostics facilitate a quick discovery of what went wrong
- Configuration
  - System configuration read from file or database
  - System configuration parameters can be changed live
- API
  - Library to retrieve remaining time (`libyogrt`)
  - Interface to lorenz