Feature List: TOSS4: 2020 Q2
Stephen Herbein edited this page Jun 14, 2019 · 3 revisions
- Limit KVS Content Growth
- Garbage Collect after Restart
- Tolerate Compute Nodes Down
- Drain Nodes
- Detect/Monitor Nodes/Resources Up/Down
- User Management
- Admin role
- Dynamically add/remove users (watch /etc/passwd?)
- Configuration files
- I/O to/from files with file per process
- Multi-prog support (MPMD)
- pty support
- affinity/mapping
- Jobspec + R -> local
- Environment
- Debugger support
- MPIR
- Distributed Sync
- Co-locating processes
- Launch OpenMPI 3.1+
- PMI
- job completion log
- simple append interface
- offline & online query (x-post w/ porcelain)
- real job shell
- signal jobs (x-post w/ porcelain)
- Job Priorities (x-post w/ bank/accounting)
- Job Dependencies
- Job Feasibility
- Ingest plugin to ensure job request is not larger than cluster can provide
- Job request abides by QoS limits
- Query available/allocated/down resources (x-post w/ porcelain)
- Resource configuration language
- Resource discovery vs config file
- Connect to WhatsUp
- Provide kvs key with idset of "up" nodes
- List jobs in queue order with filtering
- Run/submit
- scheduler front-end work
- alter job priorities
- hold
- cancel
- expedite
- query completed jobs (x-post w/ execution system)
- Transition Tools
- flux srun
- signal jobs (x-post w/ execution system)
- Resource status summary tool (x-post w/ resource management)
- User guides for transitions to Flux commands
- Specify bank on submission
- Tools/storage for EOY analysis
- User permissions
- Fair-Share Scheduling
- Job Priorities (x-post w/ job submission)
- Slurm Database
- Resource matching interfaces w/ new exec system
- Scheduler ? support
- Scheduler performance optimization
- Scheduler resiliency improvements
- Support unload/load via job manager
- Scheduler memory optimization
- Planner optimization
- Queue Equivalent (e.g., job tags)
- W/ policy support (e.g., wall time limit)
- Power Monitoring
- monitoring support for job-level power/perf data from various databases
- Tools Interface
- Storage ???
- Burst Buffer support w/in simulator
- Add stage-in/out support in jobspec
- Data staging flux module
- GPU
- IMP + Contain
- IMP PAM Support
- IMP Prolog/Epilog support
- Fully-baked, bulletproof resiliency
- Node loss within a job allocation will result in job failure
- Crash/loss of management node will result in loss of running jobs (i.e., they will be killed)
- Scheduling
- Resources besides nodes/cores/gpus
- Standby Jobs
- Pre-emption
- Email Notification
- Job Requeue
- Modifying job properties post-submission (e.g., walltime, num nodes/cores, queue)
- Providing "reasons" for job not currently running
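
The item 'Provide kvs key with idset of "up" nodes' can be pictured with a small sketch. The code below is illustrative only: it assumes an idset is written as a comma-separated list of ranks and inclusive ranges (e.g. `0-3,7`), and the helper names `encode_idset` and `decode_idset` are hypothetical, not part of any Flux API. It does not touch the KVS; it only shows how a set of "up" node ranks could be turned into a compact string value and read back.

```python
def encode_idset(ids):
    """Encode non-negative integer ranks as a compact range string.

    Assumed notation (for illustration): "0-3,7,9-11".
    """
    ranges = []
    start = prev = None
    for i in sorted(set(ids)):
        if start is None:
            start = prev = i          # open the first range
        elif i == prev + 1:
            prev = i                  # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = i          # gap found: open a new range
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def decode_idset(text):
    """Decode a range string produced by encode_idset back into a set of ints."""
    ids = set()
    for part in filter(None, text.split(",")):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


# Hypothetical example: an 8-node cluster where rank 4 is down.
all_nodes = set(range(8))
down_nodes = {4}
up_value = encode_idset(all_nodes - down_nodes)
```

A consumer such as a scheduler or a status tool could decode the value and intersect it with a job's requested ranks to decide whether the allocation is still usable.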