This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[WIP] 0.11.0 release notes (#2410)
* add release note

* refine layout

* refine layout

* refine layout

* replace images

* Revert "replace images"

This reverts commit 361afac.

* replace images

* typo

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* fix

* remove tail #

* remove trailing space

* add known issue

* add retry link missing

* add nfs wiki link
mzmssg authored Apr 3, 2019
1 parent 248ced4 commit 4d63434
Showing 5 changed files with 76 additions and 0 deletions.
76 changes: 76 additions & 0 deletions RELEASE_NOTES.md
@@ -0,0 +1,76 @@
# Release v0.11.0

## New Features

* Support team-wise NFS storage, including:
  - An NFS configuration plug-in and a command-line tool. [#2346](https://github.com/Microsoft/pai/pull/2346)
  - A simple job submission plug-in for NFS deployments. [#2358](https://github.com/Microsoft/pai/pull/2358)

Refer to [Simplified Job Submission for OpenPAI + NFS deployment](https://github.com/Microsoft/pai/wiki/Simplified-Job-Submission-for-OpenPAI-with-NFS-deployment) for more details.

* New alerts for unhealthy GPUs, currently including the following [#2209](https://github.com/Microsoft/pai/pull/2209):
  - GPU used by a zombie container
  - GPU used by an external process
  - GPU ECC error
  - GPU hang
  - GPU memory leak

* Admins can now see all running jobs on a node. [#2197](https://github.com/Microsoft/pai/pull/2197)
![](./docs/release_note/images/node_tasks.png)

* Filter support in the job list view. [#302](https://github.com/Microsoft/pai/pull/302)
![](./docs/release_note/images/filter.png)

* Hold the environment of jobs that fail because of user errors. [#2272](https://github.com/Microsoft/pai/pull/2272)


## Improvements

### Service

* Webportal:
  - New look and feel for the job list page. [#302](https://github.com/Microsoft/pai/pull/302)
    ![](./docs/release_note/images/job_list.png)
  - New job detail page. [#2211](https://github.com/Microsoft/pai/pull/2211)
    ![](./docs/release_note/images/job_detail.png)

* Alert-manager:
Increase the node memory and CPU thresholds to reduce false alerts. [#2345](https://github.com/Microsoft/pai/pull/2345), [#2296](https://github.com/Microsoft/pai/pull/2296)

* Hadoop:
Persist YARN and HDFS service logs to the host. [#2244](https://github.com/Microsoft/pai/pull/2244)

* Runtime:
Support Samba shares in containers. [#2318](https://github.com/Microsoft/pai/pull/2318)


### Documentation

* Add a troubleshooting guide for jobs. [#2305](https://github.com/Microsoft/pai/pull/2305)
* Refine the job submission documentation for new users. [#2278](https://github.com/Microsoft/pai/pull/2278)

### Examples

* Remove the TensorFlow MPI example, which currently cannot run. [#2337](https://github.com/Microsoft/pai/pull/2337)

### Others

* Operations:
Add a command-line tool to query unhealthy GPU information from Prometheus. [#2319](https://github.com/Microsoft/pai/pull/2319)
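
The notes do not describe how the operations tool queries Prometheus, so the following is only a minimal Python sketch of the idea, assuming Prometheus's standard instant-query HTTP API (`GET /api/v1/query`) and its built-in `ALERTS` series for firing alerts. The `Gpu.*` alert-name pattern, the use of the `instance` label to identify a node, and all helper names are hypothetical and are not taken from OpenPAI.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical PromQL selector: Prometheus exposes firing alerts through the
# built-in ALERTS series, but the "Gpu.*" alert-name pattern is an assumption,
# not the real OpenPAI alert names.
UNHEALTHY_GPU_QUERY = 'ALERTS{alertname=~"Gpu.*", alertstate="firing"}'


def build_query_url(prometheus_url, query):
    """Return a Prometheus instant-query URL (GET /api/v1/query?query=...)."""
    base = prometheus_url.rstrip("/")
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": query})


def parse_unhealthy_gpus(payload):
    """Extract sorted (alertname, instance) pairs from an instant-query response.

    The "instance" label is assumed to identify the node; adjust it to the
    labels your exporter actually attaches.
    """
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %s" % payload.get("error", "unknown error"))
    return sorted(
        (item["metric"].get("alertname", "?"), item["metric"].get("instance", "?"))
        for item in payload["data"]["result"]
    )


def query_unhealthy_gpus(prometheus_url, timeout=10.0):
    """Run the query against a live Prometheus server (requires network access)."""
    url = build_query_url(prometheus_url, UNHEALTHY_GPU_QUERY)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_unhealthy_gpus(json.load(resp))


if __name__ == "__main__":
    # Canned response in the Prometheus instant-query format;
    # the alert name and address are made up for the demo.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"__name__": "ALERTS", "alertname": "GpuEccError",
                            "instance": "10.0.0.2:9100"},
                 "value": [1554249600, "1"]},
            ],
        },
    }
    print(parse_unhealthy_gpus(sample))  # [('GpuEccError', '10.0.0.2:9100')]
```

Keeping the HTTP call separate from the response parsing lets the parsing be checked against a saved response without a running Prometheus server.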

## Notable Fixes

* Hadoop: The scheduler might get stuck in an infinite loop. [#2365](https://github.com/Microsoft/pai/pull/2365)
* Hadoop: hadoop-ai sometimes fails to detect ECC errors. [#2343](https://github.com/Microsoft/pai/pull/2343)
* Runtime: Users might see GPUs that were not allocated to their jobs. [#2352](https://github.com/Microsoft/pai/pull/2352)
* Runtime: Jobs might get a free retry when they exceed their memory limit. [#1108](https://github.com/Microsoft/pai/pull/1108)
* Drivers: Fix InfiniBand (IB) installation bugs. [#2278](https://github.com/Microsoft/pai/pull/2278), [#2271](https://github.com/Microsoft/pai/pull/2271), [#2269](https://github.com/Microsoft/pai/pull/2269)

## Known Issues

* There might be a mismatch between the Linux kernel and the driver. [#2446](https://github.com/Microsoft/pai/pull/2446)
* The retry link is missing from the new job detail page. [#2466](https://github.com/Microsoft/pai/pull/2466)

## Upgrading from Earlier Release

Please follow the [Upgrading to v0.11](./docs/upgrade/upgrade_to_v0.11.md) guide for detailed instructions.
Binary file added docs/release_note/images/filter.png
Binary file added docs/release_note/images/job_detail.png
Binary file added docs/release_note/images/job_list.png
Binary file added docs/release_note/images/node_tasks.png