Deep Learning Training Service v1.7.0
Job Manager
Support setting different scheduling policies per VC.
RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
Support limiting number of interactive GPUs per VC.
Support user global public keys, enabling users to access jobs in any cluster using their own private key.
Requeue preempted jobs at the head of the job queue.
Add an INIT process to jobs to broadcast signals and reap zombie processes, propagating SIGTERM to the user process.
Delete very old jobs in small batches to avoid locking DB.
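To illustrate the difference between the RF and FIFO policies listed above, here is a minimal sketch, not the Job Manager's actual code. The `Job` class, the `schedule` function, and the GPU-only resource model are hypothetical names and simplifications introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int        # GPUs requested (hypothetical, simplified resource model)
    queue_time: int  # smaller value = queued earlier


def schedule(jobs, free_gpus, policy):
    """Return the names of jobs that can start now under the given policy."""
    if policy not in ("FIFO", "RF"):
        raise ValueError(f"unknown policy: {policy}")
    started = []
    for job in sorted(jobs, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        elif policy == "FIFO":
            # FIFO: a job that does not fit blocks everything queued after it.
            break
        # RF: skip the job that does not fit and keep scanning the queue.
    return started


queue = [Job("large", 8, 0), Job("small-a", 1, 1), Job("small-b", 2, 2)]
print(schedule(queue, 4, "FIFO"))  # []
print(schedule(queue, 4, "RF"))    # ['small-a', 'small-b']
```

With 4 free GPUs, the 8-GPU job at the head of the queue cannot start. Under FIFO nothing behind it starts either, while under RF both small jobs run.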
RESTful API
Allow specifying max retry count for each job.
Support changing parameters per VC:
Max job time
Max number of interactive GPUs
Scheduling policy
Allow adding user IPs to the allowlist.
VC quota management proportional to GPU/CPU.
Dashboard
VC notification
Show worker node count for pure CPU cluster.
Add timeout column for jobs in View and Manage Jobs.
Show insight message(s) on job details page for running jobs.
Show repair message(s) on job details page for running jobs.
Add Visual Studio Code (alpha) as an endpoint on job details page.
Allow downloading full job logs.
Allow specifying max retry count on job submission page.
Show repair status for worker nodes.
Show snapshot time on STORAGE tab.
Support exporting STORAGE tab as CSV.
Add SETTINGS tab for VC admins to manage VC parameters.
Add a hidden page for cluster admins to manage VC quota.
Add My SSH Keys page for users to upload global public keys.
Add My Allowed IP page for users to add their own IP to the allowlist.
Monitoring and RepairManager
Fix incorrect mapping for DCGM GPU metrics.
Auto-manage the repair cycle of nodes according to a predefined set of rules.
Add a Node Repair State dashboard for repair monitoring.
Storage Manager
Delete an expired directory file-by-file to avoid locking NFS.
Take ctime into consideration when expiring files.
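The two Storage Manager changes above could look roughly like the following sketch. It rests on assumptions: the notes do not say how ctime is combined with other timestamps, so this example treats a file as expired only when both its mtime and its ctime are older than the cutoff. The function names and the optional `pause_s` throttle are hypothetical.

```python
import os
import time


def is_expired(path, max_age_days, now=None):
    """Assumed rule: expired only if neither mtime nor ctime is recent."""
    now = time.time() if now is None else now
    st = os.stat(path)
    last_touched = max(st.st_mtime, st.st_ctime)
    return now - last_touched > max_age_days * 86400


def delete_expired(root, max_age_days, pause_s=0.0, now=None):
    """Remove expired files one at a time instead of one bulk recursive delete."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_expired(path, max_age_days, now):
                os.remove(path)
                removed.append(path)
                if pause_s:
                    # Hypothetical throttle so the file server is not saturated.
                    time.sleep(pause_s)
    return removed
```

Deleting file by file keeps each operation short, which is the stated goal of avoiding long NFS locks. Empty directories are left in place in this sketch.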
Lustre
Support default storage quota per person (with configurable hard/soft limit and grace period).
Support multi-MDT in auto-deployment pipeline.
Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.
(Azure) AllowList Manager
Periodically compare the allowed user IPs in the DB with those in the Azure NSG rule, and make changes accordingly.
Expire user IPs after a specified number of days.
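The compare-and-expire behavior described above amounts to a set reconciliation. The sketch below is illustrative only: `reconcile`, the shape of `db_entries`, and the 14-day TTL are assumptions, and it deliberately makes no Azure SDK calls. It only computes which IPs would be added to or removed from the NSG rule.

```python
from datetime import datetime, timedelta


def reconcile(db_entries, nsg_ips, ttl_days, now):
    """Return (to_add, to_remove) so the NSG rule matches non-expired DB IPs.

    db_entries maps IP -> time it was added (hypothetical schema).
    """
    cutoff = now - timedelta(days=ttl_days)
    desired = {ip for ip, added_at in db_entries.items() if added_at > cutoff}
    current = set(nsg_ips)
    return sorted(desired - current), sorted(current - desired)


now = datetime(2021, 1, 31)
db = {
    "10.0.0.1": datetime(2021, 1, 30),  # fresh, missing from the NSG rule
    "10.0.0.2": datetime(2020, 12, 1),  # older than the TTL, should expire
    "10.0.0.3": datetime(2021, 1, 29),  # fresh, already allowed
}
nsg = ["10.0.0.2", "10.0.0.3", "10.0.0.9"]

to_add, to_remove = reconcile(db, nsg, ttl_days=14, now=now)
print(to_add)     # ['10.0.0.1']
print(to_remove)  # ['10.0.0.2', '10.0.0.9']
```

Here `10.0.0.9` is removed because it is in the NSG rule but not in the DB, and `10.0.0.2` is removed because it has passed the assumed 14-day expiry.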