forked from garlick/flux-plans
Don Lipari edited this page Nov 19, 2014
### Smallest Serviceable Slurm Substitute
What follows are the requirements to replace the SLURM version currently in use at LC, not a wish list for the perfect batch system. The requirements are listed as bullet items with minimal text describing each item. This assumes an understanding of SLURM and its features; for further details, refer to the SLURM man pages. References to SLURM commands are listed where appropriate. Features introduced in SLURM versions after v2.3.3 are not listed.
- Task launch
  - Specify number of tasks
  - Specify resources (at least nodes and cores)
    - Number of resources (e.g., 4 nodes), including ranges
    - Named resources (e.g., cluster, node[4-8], core[0-3])
    - Memory size
    - Generic resources
    - Features
  - Task distribution:
    - Cyclic
    - Block
    - Plane
    - Custom (based on a configuration file)
  - Task to resource mapping
    - Number of tasks per node (or core)
    - Number of cores per task
    - Hardware threading (desired? allowed? disabled?)
  - Task containment: confine tasks to allocated resources (sockets, cores, memory)
  - Wall clock limit
  - Task prolog and epilog options
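The cyclic and block distributions listed above can be sketched as follows. This is a simplified illustration of the two mapping policies, not SLURM's implementation; the node names are invented:

```python
def block_dist(ntasks, nodes):
    """Block distribution: consecutive task ranks fill one node before the next."""
    per_node = -(-ntasks // len(nodes))  # ceiling division
    return {rank: nodes[rank // per_node] for rank in range(ntasks)}

def cyclic_dist(ntasks, nodes):
    """Cyclic distribution: task ranks are dealt out to nodes round-robin."""
    return {rank: nodes[rank % len(nodes)] for rank in range(ntasks)}

nodes = ["node4", "node5"]    # hypothetical 2-node allocation
print(block_dist(4, nodes))   # ranks 0,1 on node4; ranks 2,3 on node5
print(cyclic_dist(4, nodes))  # ranks alternate: node4, node5, node4, node5
```

A plane distribution generalizes these by filling nodes in fixed-size blocks before cycling, and a custom distribution would read the mapping from a configuration file.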
- Resource management
  - Resources managed: clusters, nodes, sockets, cores, threads, memory, GPUs, burst buffers, file systems, licenses, etc.
  - Add and remove resources from management
  - Report and change status of resources: up, down, draining, allocated, idle
  - Resource pools (aka partitions, queues)
  - Resource weights (governs priority for selection)
  - Resource sharing allowed (if so, to what degree?)
  - Network topology
    - Contiguous resources
    - Switch topology
- Resource status (`sinfo`)
  - Summary of nodes and states (idle, allocated, down, draining)
  - Summarize for each node partition
  - Rich reports of specific resources
    - By node (`scontrol show node`)
    - By partition (`scontrol show partition`)
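An `sinfo`-style summary amounts to counting nodes by partition and state. A minimal sketch, with an invented node inventory:

```python
from collections import Counter

def summarize(nodes):
    """Count node states per partition, sinfo-style.

    `nodes` is a list of (partition, state) pairs; the inventory below
    is made up for illustration.
    """
    counts = Counter(nodes)
    return {f"{part}/{state}": n for (part, state), n in counts.items()}

inventory = [("batch", "idle"), ("batch", "allocated"),
             ("batch", "allocated"), ("debug", "down")]
print(summarize(inventory))  # {'batch/idle': 1, 'batch/allocated': 2, 'debug/down': 1}
```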
- Job Specification
  - Job category
    - Batch script (`sbatch`)
    - Interactive (`salloc`) - includes xterm request (`mxterm`/`sxterm`)
    - Single job step as job (`srun`)
  - User / group
  - Bank account
  - Workload characterization key
  - Min/max run times
  - Priority (includes nice factor if any)
  - QoS
  - Queue
  - Resource requirements
    - Min/max node counts
    - Features, tags, processor architecture, processor speed
    - (Minimum or specific) memory per (socket or node)
    - (Minimum or specific) (sockets or cores) per node
    - Tasks per node (or core)
    - Cores per task
    - Shared or exclusive
    - Preferred network topology / node contiguity
    - Licenses
    - File systems
    - Installed packages and libraries
  - Allocated resources
    - By count (e.g., number of nodes and cores)
    - By name (e.g., node names, CPUs, GPUs, etc.)
    - Node on which batch script is running
  - State (includes reason for not running)
  - Dependency (other job(s) starting/completing/exit code)
  - Reservation
  - Prolog and Epilog
  - Re-queue request
    - If preempted
    - If resource fails
  - Terminate (or continue) on resource failure
  - Times
    - Submit time
    - Start-after time
    - Estimated start time
    - Actual start time
    - Run time limit
    - Actual run time
    - Terminate time
  - Exit status (includes if signaled and by which signal)
  - Job run info
    - Job name
    - Command
    - Working directory
    - Standard in / out / error
    - Batch script
- Job Submission
  - Option to intercept submit request and alter, override, or insert policy-related options
  - Job submission fails at submit time (as opposed to run time) when invalid options are specified
- Job status
  - One-line job summary (`squeue`)
    - Queued as well as running jobs
    - Includes jobs of other users
  - Verbose job record report (`scontrol show job`)
  - Job step reports
  - Includes record of associated batch script
- Job control
  - Job removal and signaling (`scancel`)
  - Job signal prior to termination (per specified grace time)
  - Job modification (`scontrol update job`)
  - Job hold/release
- Job prioritization factors
  - Fair share
  - Job size (favoring large or small)
  - Queued time (FIFO)
  - QoS contribution
  - Queue contribution
  - User nicing
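Combining these factors is typically a weighted sum, in the spirit of SLURM's multifactor priority plugin. A minimal sketch; the weights and factor values below are invented, and each factor is assumed normalized to [0, 1]:

```python
def job_priority(factors, weights):
    """Combine normalized priority factors (each in [0, 1]) as a weighted sum."""
    return sum(weights[name] * value for name, value in factors.items())

# Hypothetical site weights and per-job factor values
weights = {"fairshare": 10000, "age": 1000, "size": 100, "qos": 5000}
factors = {"fairshare": 0.25, "age": 1.0, "size": 0.5, "qos": 0.0}
print(job_priority(factors, weights))  # 2500 + 1000 + 50 + 0 = 3550.0
```

A nice factor would then be subtracted from (or added to) this total per user request.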
- Scheduling (starting with a prioritized queue)
  - Matches job’s requests with available resources
  - Supports multiple rules for resource selection:
    - Best fit
    - First fit
    - Balanced workload
  - Job submission requires a bank account and user permission to use that account
  - Honors time and resource size limits imposed by
    - Queue
    - QoS
    - User/Bank
  - Imposes limits on
    - Number of jobs that can be queued at any given time
    - Number of jobs that can be running at any given time
  - Accommodates sharing requests and allowed sharing levels
  - Waits a specified time to accommodate a node topology request
  - Backfill option
    - Conservative backfill: no higher-priority job is delayed
    - EASY backfill: only the top-priority job cannot be delayed
  - Provides estimated start times
  - Considers jobs for multiple queues
  - Supports job dependencies from other clusters
  - Provides job preemption based on QoS or queue. Preemption action can be
    - Suspension
    - Checkpoint
    - Terminate and re-queue
    - Terminate
  - Support for job growth and shrinkage
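The first-fit and best-fit selection rules above differ only in which feasible node they pick. A simplified single-node sketch (the free-core map is invented; a real selector would also weigh topology, sharing, and limits):

```python
def first_fit(free_cores, need):
    """Return the first node with enough free cores, or None."""
    for node, free in free_cores.items():
        if free >= need:
            return node
    return None

def best_fit(free_cores, need):
    """Return the feasible node that leaves the fewest cores stranded."""
    feasible = {n: f for n, f in free_cores.items() if f >= need}
    return min(feasible, key=lambda n: feasible[n] - need) if feasible else None

free = {"n1": 16, "n2": 4, "n3": 8}  # hypothetical free-core map
print(first_fit(free, 4))  # n1: first node that fits
print(best_fit(free, 4))   # n2: exact fit, strands no cores
```

A balanced-workload rule would instead prefer the node with the *most* free cores, spreading load.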
- Quality of Service
  - Affects job priority
  - Allows exemptions from time and size limits
  - Can impose an associated set of time and size limits
  - Can amplify or dampen the usage charges
- Bank Accounts
  - A user must have permission to use a bank account in order to submit jobs
  - Reflects the sponsors’ claim to the cluster’s resources (i.e., the shares in fair share)
  - Can impose an associated set of time and size limits
- Reservations
  - Resources can be reserved in advance (DATs)
  - Permitted jobs can run within those reservations
- Email user at job state transitions
  - Begin
  - End
  - Fail
  - Re-queue
  - All
- Resource accounting
  - Resource utilization (`sreport`)
  - Times reported for specified time periods under the following categories:
    - Allocated
    - Idle
    - Reserved
    - System maintenance
    - Unplanned down time
- Job accounting
  - Individual job records (`sacct`)
    - Job and job step records for a prescribed time period
    - Includes most of the job parameters listed in Job Specification above
  - Composite job reports (`sreport`)
    - Aggregate job reports based on user, account, and workload characterization key
    - Over a prescribed time period
    - Includes listing of top users and top accounts
    - Includes reports by job size
- Security
  - Jobs can only be run by the submitting user
  - Job output can only be seen by the submitting user
  - System parameters can only be changed by authorized roles (see next item)
- Administration
  - Role-based system administration and overrides
    - User can monitor and alter (some of) own job parameters
    - Operator can alter other users’ job parameters
    - Coordinator can populate bank account memberships and limits
    - Administrator can do all of the above and alter resource definitions
  - User/bank management (`sacctmgr`)
    - Cluster/partition/user/bank granularity
    - Implicit permission to use bank
    - Limits imposed at each level of the hierarchy
    - Limits include:
      - Max number of jobs running at any time in the bank
      - Max number of nodes for any job running in the bank
      - Max number of CPUs for any job running in the bank
      - Max number of pending + running jobs at any time in the bank
      - Max wall clock time each job in the bank can run
      - Max CPU*minutes each job in the bank can run
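Since limits are imposed at each level of the hierarchy, a job must satisfy not only its own bank's limits but those of every ancestor bank. A minimal sketch; the bank tree is invented, and a single max-nodes limit stands in for the richer per-bank limits listed above:

```python
def check_limits(job, bank, banks):
    """Walk up the bank hierarchy; the job must satisfy every ancestor's limits.

    `banks` maps bank name -> (parent, max_nodes); None means no limit.
    """
    while bank is not None:
        parent, max_nodes = banks[bank]
        if max_nodes is not None and job["nodes"] > max_nodes:
            return False
        bank = parent
    return True

# Hypothetical hierarchy: root (64 nodes) -> physics (32) -> lasers (8)
banks = {"root": (None, 64), "physics": ("root", 32), "lasers": ("physics", 8)}
print(check_limits({"nodes": 16}, "lasers", banks))  # False: exceeds lasers' cap of 8
print(check_limits({"nodes": 4}, "lasers", banks))   # True: within every level
```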
- System
  - Save state and recover on restart
    - Resources
    - Jobs
    - Usage statistics
  - System can be restarted without losing queued jobs or killing running jobs
- Reliability
  - High availability: a backup takes over when the primary dies or hangs
  - Resilient: able to adapt to failing or failed resources
  - 24x7 operation
    - System updates possible on a live system without losing queued or running jobs
  - Robust
    - Atomic changes
    - System can never get into a corrupt or inconsistent state
    - Complete recovery after crashes
- Performance
  - Response to user commands in under one or two seconds
  - Scheduling loops complete in under one minute
- Scalability
  - Thousands of jobs
  - Thousands of resources
  - Thousands of users
- Visibility
  - Pertinent info is logged
  - System diagnostics facilitate a quick discovery of what went wrong
- Configuration
  - System configuration read from file or database
  - System configuration parameters can be changed live
- API
  - Library to retrieve remaining time (`libyogrt`)
  - Interface to lorenz