Job Manager

Support setting different scheduling policies per VC.
- RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
- FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
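The difference between the two policies can be sketched as below. This is a minimal illustration, not the scheduler's actual code: the `Job` fields, the `schedule` function, and the single GPU counter are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue_time: int  # lower value = queued earlier (hypothetical field)
    gpus: int        # GPUs requested (hypothetical field)


def schedule(jobs, free_gpus, policy):
    """Return the names of jobs that can start now, visited in queue order."""
    started = []
    for job in sorted(jobs, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        elif policy == "FIFO":
            # The blocked job holds its place, so everything behind it waits.
            break
        # Under RF the blocked job is skipped and later jobs are still considered.
    return started


queue = [Job("large", 0, 8), Job("small", 1, 1)]
fifo_started = schedule(queue, free_gpus=4, policy="FIFO")
rf_started = schedule(queue, free_gpus=4, policy="RF")
```

With 4 free GPUs, the 8-GPU job cannot start under either policy; only RF lets the 1-GPU job behind it run.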
Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
Support limiting number of interactive GPUs per VC.
Support user global public keys, enabling users to access jobs in any cluster using their own private key.
Requeue preempted jobs at the head of the job queue.
Add an init process to jobs that handles signal broadcast and zombie process reaping, propagating SIGTERM to the user process.
Delete very old jobs in small batches to avoid locking the DB.
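Batched deletion might look like the following sketch. It uses SQLite only to stay self-contained; the `jobs` table, its `id` and `finished_at` columns, and the `delete_old_jobs` helper are assumptions for illustration and do not reflect the real schema or database.

```python
import sqlite3


def delete_old_jobs(conn, cutoff, batch_size=100):
    """Delete jobs finished before `cutoff`, at most `batch_size` rows per transaction."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM jobs WHERE id IN ("
            "SELECT id FROM jobs WHERE finished_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Committing after every batch keeps each transaction short, so other readers and writers are blocked only briefly instead of for the whole cleanup.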

RESTful API

Allow specifying max retry count for each job.
Support changing parameters per VC:
- Max job time
- Max number of interactive GPUs
- Scheduling policy
Allow adding user IPs to the allowlist.
Support VC quota management proportional to GPU/CPU.

Support default storage quota per person (with configurable hard/soft limit and grace period).
Support multi-MDT in auto-deployment pipeline.
Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.

Periodically compare the currently allowed user IPs in the DB with those in the Azure NSG rule, and make changes accordingly.
Expire user IPs after a specified number of days.
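The periodic comparison and the expiry can be sketched together as below. This is only an illustration under stated assumptions: it treats the DB as the source of truth, represents the NSG rule as a plain set of addresses, and makes no Azure calls. The function names `active_ips` and `reconcile` and the data shapes are hypothetical.

```python
from datetime import datetime, timedelta


def active_ips(entries, now, ttl_days):
    """Keep only IPs added less than `ttl_days` ago (entries: ip -> time added)."""
    ttl = timedelta(days=ttl_days)
    return {ip for ip, added in entries.items() if now - added < ttl}


def reconcile(db_ips, nsg_ips):
    """Return (to_add, to_remove) that would make the NSG rule match the DB."""
    return sorted(db_ips - nsg_ips), sorted(nsg_ips - db_ips)


# Example data using documentation-reserved address ranges.
db_entries = {
    "203.0.113.5": datetime(2024, 1, 1),
    "203.0.113.9": datetime(2024, 1, 20),
}
nsg_rule = {"203.0.113.5", "198.51.100.7"}

allowed = active_ips(db_entries, now=datetime(2024, 1, 25), ttl_days=14)
to_add, to_remove = reconcile(allowed, nsg_rule)
```

Here the first address has expired, so it is removed from the rule along with the address that is not in the DB, and the still-valid address is added.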