Releases: microsoft/pai
v1.1.0: July 2020 Release
Release v1.1.0
New Features
- Storage:
- Support readonly storage. (#4523)
- Security
- If ssl is enabled, all requests will use https. (#4550)
- Authentication
- Support nested AD group in AAD Mode. (#4639)
- Marketplace
- Integrate with new version of PAI marketplace.
Improvements
- Add stress test for PAI API. (#4665)
- Resolve jobs always retrying on port conflict. (#4384)
- Webportal/VScode use JS SDK + SDK improvement. (#4660)
- Align webportal submit default value with backend. (#4682)
- Documentation enhancements. (#4700)
Bug Fixes
- Fix incorrect logdir in TensorBoard v2.
- Fix broken help link on the webportal job submission page.
- Fix ssh barrier bug.
v1.0.1: June 2020 Release
v1.0.0: May 2020 Release
Release v1.0.0
New Features
- AAD Support
- Introduce kubespray to deploy clusters #3757
- Framework Controller
- hivedscheduler provides a Kubernetes Scheduler Extender for multi-tenant GPU clusters.
- Hived scheduler deployment #3495, #3579
- Hived as the default k8s scheduler #3599
- Job Near FIFO scheduling #3726, #3731
- Expose LazyPreemptionStatus #3917
- Disable leader election #3928
- HiveD intra-vc preemption for restart #3861
- Check suggested nodes after preemption #3843
- Update hived config validation #3812
- HiveD reconfiguration #3768
- openpai-runtime is a module that provides runtime support to job containers.
- Port kube runtime #3013
- Job ssh for kube-runtime #3153, #3729
- Add PAI env variables in init scripts #3154
- Generate random ports for scheduling #3224
- Refine init and runtime script in k8s pods #3245
- Port conflict check #3259
- Kubernetes ErrorSpec #3585
- Add job exit code #3559
- Add sshbarrier to ssh plugin #3587
- Clean ${PAI_WORK_DIR} before mv content to this folder #3695
- Force to flush after user command finished #3794
- Decompress the framework when the size is large #3820
- Apt package cache #4226
- openpaisdk provides a JavaScript SDK designed to help OpenPAI developers offer a user-friendly experience.
- openpaimarketplace provides a webportal plugin, which stores examples and job templates. Users can use openpaimarketplace to share their jobs or to run and learn from jobs shared by others.
- Enable RBAC
- Device plugin
- Storage
- Limited internal storage and postgres db #3813
- Use postgres db to show job history #4164
- TensorBoard integration #3257
Improvements
- Deployment
- Support to manage a list of services in paictl #3432
- Choose services for different cluster type #3528
- Improve deployment process, reduce the time cost #4022
- Add deployment pre-check #1893
- Remove yarn version components in k8s version object model #4027
- Inform user of pai cluster id, configuration, username, password PR #4267
- Clean docker image in service-boot.sh PR #4248
- Disable yarn value in cluster-type #4445
- Remove yarn content from deployment doc #4447
- Webportal
- Quick wizards or templates for users #3430
- Enhance data storage functions #3416
- Available GPU chart and virtual cluster list #3265
- User filter #3310
- Show VC utilization and alerts notification #3411
- A confirm dialog for stop actions #3408
- Add a button in job-detail page to get merged stdout&stderr log #3282
- Add more information when SSH is disabled #3389
- Job history UI #3831
- User profile page #3804, #3853, #3884
- Add SSH config to webportal for pure-k8s based PAI #3596
- SSH Generator for webportal #3644
- Hide pages and links that are not supported in pure k8s version #3574
- Hide grafana dashboard in k8s webportal #3688
- Map rest server level job status completing, retry pending to running, waiting #3636
- Separate 'waiting' and 'running' states on task role's statistics #3727
- Display stopped task count in task role's header #3840
- Disable utilization charts' animation #3730
- Render clone job button as a link if possible #3854
- Add stopping status to task #3868
- Lazy loading of monaco editor component #3871
- Static content caching optimization #3852
- Webportal redirection to the origin page after login #3914
- Show CPU/Mem usage in home page for normal user #3784
- new UX design
- Replace code injection in webportal with new plugin implementation #3823
- Rest-server
- Cluster info api #3281
- Update hived config validation #3812
- Job history API #3831
- Token API #3774 #3834 #3835
- Change to default k8s scheduler #3599
- Check groups every time restserver start #3458
- Add pod GPU number for default scheduler #3642
- Map rest server level job status completing, retry pending to running, waiting #3636
- Make -1 compatible in launcher completion policy #3870
- Update hived resource validation #3867
- Update restart policy to avoid stuck pending pods #3856
- Filter pods without nodename and completed pods #3841
- Reverse encoded framework name #3824
- Mask secret in framework annotations #3821
- Update priority class owner references #3808
- Refine all APIs and documents #4355
- Change gpuType(s) to skuType(s) #4362
- Update virtual cluster metrics using scheduler api #4329
- openpai-protocol provides a specification of OpenPAI job protocol.
- openpaivscode provides a VSCode extension to connect OpenPAI clusters, submit AI jobs, simulate jobs locally, manage files, and so on.
- Support AAD login in VSCode Extension #3647
- Security
- Migrate components to separate repos #4319 #4307 #4311 #4324
- AMD GPU Support #4127
- Others
- Access Log Manager through Pylon #3600
- Remove invalid chart in Grafana Dashboard #4020
- Watchdog auto delete leaked priorityclass #3866
- Check ACS Docker Image in initContainer #3572
- For basic authentication mode, prepare document to explain how to create a group and associate users to the group #4130
- Support to add Pod Creation http error pattern to errorspec #4125
- Change watchdog default mem limit #4413
Documentation
- Update job submission doc #3347
- AAD end2end document #3362
- Upgrade document #4238
- Add installation FAQs and troubleshooting PR #4249
- Rest-server API documents refinement #4355
- End-to-end manual for cluster users and administrators #4023
- Remove outdated docs #4446
Bug Fixes
- Docker's data-root will be lost on Azure node restart #3307
- Fix kubelet.service in add machines #3807
- Fix missing dependencies during installation #3800
- Fix VC view link bug #3689
- Fix job list page's stopping status #3869
- Fix bug in hived resources calculation #3595
- Fix Grafana being inaccessible behind the gateway #3659
- Fix job issues on k8s based PAI #3555
- Remove framework owner reference for priority class & default not to create priority class #4131
- Job retry link invalid with unknown reason #4008
- Do not change the semantic meaning of user submitted/cloned job config #3823
- Storage manager constantly restart #4081
- Wrong retry log path in job history issue #4237
- API server overloaded by job detail page when containers are too many #4270 #4279
- Job API does not return correct appLaunchedTime #4295
- Fix Azure File issues in storage #4438
- Fix job retry url #4442
- Add rate limit for RESTful API #4418 #4422
Known Issues
- Weave Net causes MPI jobs to hang #4394
- Hivedscheduler is prone to misconfiguration due to daemon Pods, such as Weave Net and nginx proxy #4331
- Cert expiration will fail the access to the bed #4216
- Can not access job pod in k8s-dashboard #4181
- A job (or a pod of a job) may get stuck in a state neither running nor waiting #4141
- Job config is modified after it is imported/uploaded/cloned #3823
- Get recursive nested AD Users in AAD Mode #3440
v0.17.0
This is an intermediate release, mainly in preparation for the upcoming PureK8S release. Because there are breaking changes from PAI's K8S+YARN version to the PureK8S version, if you are currently using the K8S+YARN version in production, please stay on version 0.14.0 and plan to upgrade later.
v0.14.0: July 2019 Release
Release v0.14.0
New Features
- Web portal:
- Python SDK:
- Sdk release v0.4.00 #3018
- New scheduler:
- Dedicated vc support #2960
- PAI vscode extension:
- Team storage plugin:
- New team-wise manage cli #2943
Improvements
- Web portal:
- Refine job detail page's task list #2953
- add new webHDFSUri in env.js.template (#3048)
- Tweak job submission page layout (#3043)
- css tweak (#3041)
- refine submit job UI page (#3037)
- refine home page's error handling (#3196)
- renew docker image list and add tooltip (#3181)
- remove prettier config file (#3184)
- remove tachyons css to avoid classname error (#3173)
- Add confirm dialog before batch edit admin's password (#3174)
- fix broken UI when all VCs are chosen while creating a user (#3177)
- redesign batch edit behavior (#3172)
- trim the docker url after job submission; fix job detail page's clone button's padding (#3169)
- Change 'Import CSV' to 'Create Bulk Users' in user management (#3136)
- update command section's placeholder (#3150)
- move documents link to top nav bar (#3126)
- tweak home page's gpu chart's height (#3131)
- display red border when a task role is invalid (#3072)
- change label of container size to resources per instance (#3101)
- add error message when command is empty (#3098)
- disable edit user form's auto fill (#3107)
- Rest server:
- Yarn cluster && Framework launcher:
- Deployment:
- Plugins:
- Security:
Documentation
- Rewording and some format fixes. #2927
- Chinese translation and placeholder. #2919
- add submit v2 job to readme (#3017)
- api doc update (#3216)
- add release note (#3204)
- pai upgrade doc (#3195)
- add job submission docs (#3183)
- change link of external project like python sdk (#3237)
Bug Fixes
- Web portal:
- hide 0 gpu nodes from Available Nodes Chart #2915
- fix job detail page's gpu attributes display bug. #3027
- disable submit if command is empty (#3080)
- auto remove empty lines in command (#3074)
- keep mount config selection state (0af835)
- fix the set-state warning after clone job (#3070)
- fix clone job bug (#3068)
- fix <p> tag and prop-types warning (#3067)
- Return empty command if no teamwise mounts (4afd2e)
- Adjust team-mount-list view (1b4190)
- Align job submit page's submission section to task role (#3065)
- Add tooltips to job submission page's field label. (#3046)
- remove plugin in webportal config (#3063)
- remove pylon address dependency (#3040)
- fix export yaml bug (#3047)
- fix webhdfs wrong request (#3044)
- customized docker image inputField may disappear (#3033)
- add dependency of joi for node server (#3031)
- remove duplicated v of feedback (#3230)
- change webportal doc link (#3229)
- fix stdout/stderr's full log link bug when pylon is not used (#3219)
- change tutorial link of home page (#3213)
- fix bug of GPU available number (#3210)
- hot fix for hdfs CORS problem (#3145)
- fix docker bug #3134
- hot fix hdfscli proxy problem (#3130)
- fix virtual cluster's default value after job clone (#3128)
- refine hdfs check for robustness (#3116)
- change to lowercase letter for 'Completion Policy' (#3119)
- fix data command error (#3111)
- fix job submission page's jobRetryCount and taskRetryCount field (#3112)
- redirect v2 job to default submission page if plugin not installed (#3091)
- fix docker (#3097)
- add empty key check to key-value list control (#3096)
- change command section's default comments to placeholder (#3095)
- fix deployment field missing bug (#3238)
- [Web Portal] fix port list bug (#3240)
- Rest server:
- Add quotes for masked secrets field in protocol
- Trap SIGTERM in entrypoint to avoid yarn container early stop #2947
- fix bug #3009
- Update http errors in get job v2 #3022
- User Migrate Script Fix. (#3090)
- Fix issue in updateUserVirtualCluster of rest-server (565073)
- Fix user migration issue (#3036)
- api permission fix (#3211)
- Fix AAD group in dedicated vc create/delete (#3143)
- API to create vc and remove vc and do the same operation to group (#3064)
- Groupname schema (#3099)
- Hadoop:
- Deployment:
Known Issues
- All lines in the command will be concatenated by &&, so using # or \ in the command will cause bugs. This will be fixed in the future.
- According to the official doc, different GPU driver versions may support different CUDA versions. In our tests, the current 384.111 GPU driver version does not support the cuda10 image.
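A minimal sketch of why the && concatenation in the first known issue breaks commands that contain # or \. The helper name join_command_lines is hypothetical and only mimics the behavior described above; it is not the actual OpenPAI code, and shlex is used only to approximate POSIX shell tokenization.

```python
import shlex


def join_command_lines(lines):
    """Join command lines with ' && ', as the known issue describes (hypothetical helper)."""
    return " && ".join(lines)


# A trailing comment swallows every later line once the lines are joined.
with_comment = join_command_lines(["echo start # log the start", "echo done"])
# Tokenize roughly as a POSIX shell would, honoring '#' comments.
tokens = shlex.split(with_comment, comments=True)

# A line-continuation backslash now escapes the space before '&&' instead of a newline.
with_continuation = join_command_lines(["python train.py \\", "--epochs 10"])

print(with_comment)
print(tokens)
print(with_continuation)
```

In the first case only "echo start" survives tokenization, so "echo done" never runs. In the second case the backslash no longer joins the two lines into one command. Until the fix lands, keeping # and \ out of the command avoids both problems.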
v0.13.0: June 2019 Release
Release v0.13.0
New Features
- OpenPAI protocol:
- Introduce OpenPAI protocol and job submission v2 (#2260)
- Add new job submission v2 plugin (#2461)
- Web portal:
Improvements
- OpenPAI protocol:
- Update example jobs in marketplace v2 for OpenPAI protocol (#2827)
- Web portal:
- Rest server:
- Framework launcher:
- Upgrade to Hadoop 2.9.0 (#2704)
- Job exporter:
- Watchdog:
- Use /api/v1/pods to get all pods (#2750)
- Deployment:
Documentation
- Refine document of VS Code extension (#2707)
- Add document for PAI storage (#2822)
- OpenPAI protocol specification document (#2260)
- Job submission v2 plugin document (#2820)
- Update RESTful API document for API v2 (#2816)
- Fix typos in document (#2818)
Bug Fixes
- Web portal:
- Rest server:
- Check duplicate job in submission v2 (#2837)
- Hadoop:
Known Issues
- Deployment issues on NVIDIA DGX2 (#2742)
v0.12.0: April. 2019 Release
Release v0.12.0
New Features
- Web portal:
- Deployment
Improvements
- Web portal:
- REST server:
- Framework Launcher:
- Add more info into SummarizedFrameworkInfo #2435
- Alert manager:
Documentation
Bug Fixes
- Web portal:
- REST server:
- Hadoop:
- Remove duplicate diagnostics #2527
- Alert manager:
- Fix alert label error #2521
- Drivers:
- Storage plugin
- Add environment and handle corner cases #2525
Known Issues
N/A
Upgrading from Earlier Release
Please follow the Upgrading to v0.12 for detailed instructions.
v0.11.0: April. 2019 Release
Release v0.11.0
New Features
- Support team wise NFS storage, including:
Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
- New alerts for unhealthy GPUs, currently including the following alerts #2209:
- gpu used by zombie container
- gpu used by external process
- gpu ecc error
- gpu hangs
- gpu memory leak
- Admins can see all running jobs on a node. #2197
- Filter support in Job List View. #302
- Hold the Env for failed jobs which are caused by user error. #2272
Improvements
Service
- Webportal:
- Alert-manager: Increase node memory and CPU threshold to reduce false alerts. #2345, #2296
- Hadoop: Persist yarn and hdfs service log to host. #2244
- Runtime: Support samba shares in container. #2318
Documentation
Examples
- Remove TensorFlow mpi example which cannot be run currently. #2337
Others
- Operations: Add a command-line tool to query unhealthy GPU information from Prometheus. #2319
Notable Fixes
- Hadoop: Scheduler may get stuck in an indefinite loop. #2365
- Hadoop: Sometimes, hadoop-ai can't detect ecc error. #2343
- Runtime: Users might see unallocated gpus. #2352
- Runtime: Jobs might get a free retry when exceeding their memory limit. #1108
- Drivers: Fix IB installation bugs. #2278, #2271, #2269
Known Issues
- There might be a mismatch between linux kernel and driver. #2446
- Retry link of new job details page is missing. #2466
Upgrading from Earlier Release
Please follow the Upgrading to v0.11 for detailed instructions.
v0.10.1: Mar. 2019 Release
Release v0.10.1
New Features
- Admin can configure MaxCapacity through REST API for a given Virtual Cluster so that the virtual cluster can use idle resources as bonus. #2147
- Support Azure RDMA. #2091; how-to doc
- New Disk Cleaner for abnormal disk usage: the disk cleaner checks disk usage every 60 seconds (configurable). If disk usage is above 94% (configurable), it kills the container that uses the most disk space with a specific signal (10); the container exits with code 1 and the related job fails. Admins and users can track the reason in job logs. #2119
- Web portal: add "My jobs" filter button. #2111
- "Submit Simple Job" web portal plugin. #2131 Document
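The Disk Cleaner rule described in the list above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, constant names, and the shape of the input (a mapping from container id to bytes used) are hypothetical and not taken from the OpenPAI source; only the 60-second interval, the 94% threshold, and signal 10 come from the release note.

```python
# Defaults quoted in the release note; the note says interval and threshold are configurable.
CHECK_INTERVAL_SECONDS = 60
USAGE_THRESHOLD_PERCENT = 94
KILL_SIGNAL = 10


def pick_container_to_kill(disk_usage_percent, container_usage_bytes,
                           threshold=USAGE_THRESHOLD_PERCENT):
    """Return the id of the container using the most disk, or None if no action is needed.

    Hypothetical helper: container_usage_bytes maps container id -> bytes used.
    """
    # Act only when usage is strictly above the threshold and there is a candidate.
    if disk_usage_percent <= threshold or not container_usage_bytes:
        return None
    # Choose the container with the largest disk usage.
    return max(container_usage_bytes, key=container_usage_bytes.get)


victim = pick_container_to_kill(95, {"job-a": 10_000, "job-b": 500_000})
no_action = pick_container_to_kill(94, {"job-a": 10_000, "job-b": 500_000})
print(victim, no_action)
```

A real cleaner would run this check every CHECK_INTERVAL_SECONDS and send KILL_SIGNAL to the selected container; that part is omitted here because it depends on the container runtime.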
Improvements
Service
- Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
- Extended the job log retention time from 7 days to 30 days. Made the log retention time configurable by admins. #2034
- Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
- Pylon: WebHDFS library compatibility. #2134
- Extend the NM expiry time from 15 mins to 60 mins to provide a better tolerable experience for NM downtime. #2142
- Alert Manager: Make alerts clearer when a service is not up. #2105
- Web Portal: Allow jsonc in job submission. #2084
Deployment
- Only restart the Docker daemon if the configuration is updated. #2138
Documentation
- Update document about docker data root's configuration. #2052
- Improved how-to-setup-dev-box.md with more details. #2087
- Improved hdfs_service.md with more details. #2096
Examples
- Add an example of horovod with rdma & intel mpi. #2112
Others
- Build: Add error message when image build failed. #2133
Bug Fixes
- Issue #2099 is fixed by
- Kubernetes: Disable kubernetes's pod eviction. #2124
- Grafana: Use yarn's metrics in cluster view. #2148
- Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043
Upgrading from Earlier Release
Known Issue
Issue: There is a known issue #2433 in the v0.10.1 upgrade that some users might hit. When it occurs, deploying a Kubernetes cluster with OpenPAI will hang.
Resolution: We have provided a hotfix #2441 for it. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade by a week, by which time we will release v0.11.0 #2307, in which the known issue is officially fixed.
Please follow the Upgrading to v0.10 for detailed instructions.