Deep Learning Training Service v1.7.0

@Anbang-Hu Anbang-Hu released this 15 Jul 01:00
· 16 commits to v1.7 since this release
97268d2

Job Manager

  • Support setting different scheduling policies per VC.
    • RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
    • FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
  • Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
  • Support limiting number of interactive GPUs per VC.
  • Support user global public keys, enabling users to access jobs in any cluster using their own private key.
  • Requeue preempted jobs at the head of the job queue.
  • Add an INIT process in jobs to manage signal broadcast and zombie process reaping, propagating SIGTERM to the user process.
  • Delete very old jobs in small batches to avoid locking DB.
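
The difference between the RF and FIFO policies can be illustrated with a minimal sketch. This is not the Job Manager's actual code: the `Job` class, the `schedule` function, and the single GPU counter are hypothetical names introduced only to show the blocking behavior described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical job record, for illustration only.
    name: str
    gpus: int
    queue_time: int  # smaller value means the job was queued earlier


def schedule(jobs: List[Job], free_gpus: int, policy: str) -> List[str]:
    """Return the names of jobs that can start in one scheduling pass."""
    started = []
    for job in sorted(jobs, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        elif policy == "FIFO":
            # The job at the head of the queue does not fit, so it
            # blocks every job queued behind it.
            break
        # RF: skip the job that does not fit and keep scanning.
    return started
```

With 4 free GPUs, an 8-GPU job queued first, and a 1-GPU job queued second, this sketch starts the small job under RF and starts nothing under FIFO.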

RESTful API

  • Allow specifying max retry count for each job.
  • Support changing parameters per VC:
    • Max job time
    • Max number of interactive GPUs
    • Scheduling policy
  • Allow adding user IPs to the allowlist.
  • VC quota management proportional to GPU/CPU.

Dashboard

  • VC notification
  • Show worker node count for pure CPU cluster.
  • Add timeout column for jobs in View and Manage Jobs.
  • Show insight message(s) on job details page for running jobs.
  • Show repair message(s) on job details page for running jobs.
  • Add Visual Studio Code (alpha) as an endpoint on job details page.
  • Allow downloading full job logs.
  • Allow specifying max retry count on job submission page.
  • Show repair status for worker nodes.
  • Show snapshot time on STORAGE tab.
  • Support exporting STORAGE tab as CSV.
  • Add SETTINGS tab for VC admins to manage VC parameters.
  • Add a hidden page for cluster admins to manage VC quota.
  • Add My SSH Keys page for users to upload global public keys.
  • Add My Allowed IP page for users to add their own IP to the allowlist.

Monitoring and RepairManager

  • Fix incorrect mapping for DCGM GPU metrics.
  • Auto-manage the repair cycle of nodes according to a predefined set of rules.
  • Add a Node Repair State dashboard for repair monitoring.

Storage Manager

  • Delete an expired directory file-by-file to avoid locking NFS.
  • Take ctime into consideration when expiring files.
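
A rough sketch of the two Storage Manager changes, using only the Python standard library. The function names, the `pause_seconds` parameter, and the rule of taking the newer of mtime and ctime are assumptions made for illustration; the release notes do not say how ctime is actually weighed.

```python
import os
import time


def is_expired(path, max_age_seconds, now=None):
    """Assumed rule: a file is expired only if both its mtime and its
    ctime are older than max_age_seconds."""
    now = time.time() if now is None else now
    st = os.stat(path)
    newest = max(st.st_mtime, st.st_ctime)
    return now - newest > max_age_seconds


def delete_tree_file_by_file(root, pause_seconds=0.0):
    """Remove a directory tree one entry at a time, bottom-up.

    Returns the number of files removed. The optional pause between
    deletions is a hypothetical knob for throttling load on the
    file server.
    """
    removed = 0
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
            removed += 1
            if pause_seconds:
                time.sleep(pause_seconds)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # Symlinks to directories are not walked; unlink them here.
            if os.path.islink(path):
                os.unlink(path)
        # Real subdirectories were already removed when they were visited.
        os.rmdir(dirpath)
    return removed
```

Deleting entry by entry keeps each filesystem operation small, in contrast to one long recursive removal issued against the NFS server.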

Lustre

  • Support default storage quota per person (with configurable hard/soft limit and grace period).
  • Support multi-MDT in auto-deployment pipeline.
  • Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.

(Azure) AllowList Manager

  • Periodically compare the allowed user IPs in the DB with those in the Azure NSG rule, and make changes accordingly.
  • Expire user IPs after a specified number of days.
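
The compare-and-expire cycle above can be sketched as a pure function. This sketch assumes the DB is the source of truth and that each DB entry records when the IP was added; `plan_allowlist_changes` and its parameters are hypothetical names, and no Azure SDK calls are shown.

```python
from datetime import datetime, timedelta


def plan_allowlist_changes(db_entries, nsg_ips, now, max_age_days):
    """Compute the changes needed to bring the NSG rule in line with the DB.

    db_entries: dict mapping IP string to the datetime it was added
                (hypothetical schema).
    nsg_ips:    set of IP strings currently present in the NSG rule.
    Returns (to_add, to_remove, expired) as sorted lists.
    """
    cutoff = now - timedelta(days=max_age_days)
    valid = {ip for ip, added in db_entries.items() if added >= cutoff}
    expired = set(db_entries) - valid
    to_add = valid - nsg_ips        # allowed in DB but missing from NSG
    to_remove = nsg_ips - valid     # in NSG but expired or unknown to DB
    return sorted(to_add), sorted(to_remove), sorted(expired)
```

A periodic task could call this function, apply `to_add` and `to_remove` to the NSG rule, and delete the `expired` entries from the DB.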