Dev weight sharing update (#579)
* add pycharm project files to .gitignore list

* update pylintrc to conform to vscode settings

* fix RemoteMachineMode for wrong trainingServicePlatform

* simple weight sharing

* update gitignore file

* change tuner codedir to relative path

* add python cache files to gitignore list

* move extract scalar reward logic from dispatcher to tuner

* update tuner code corresponding to last commit

* update doc for receive_trial_result api change

* add numpy to package whitelist of pylint

* distinguish param value from return reward for tuner.extract_scalar_reward

* update pylintrc

* add comments to dispatcher.handle_report_metric_data

* update install for mac support

* fix root mode bug on Makefile

* Quick fix bug: nnictl port value error (#245)

* fix port bug

* Dev exp stop more (#221)

* Exp stop refactor (#161)

* Update RemoteMachineMode.md (#63)

* Remove unused classes for SQuAD QA example.

* Remove more unused functions for SQuAD QA example.

* Fix default dataset config.

* Add Makefile README (#64)

* update document (#92)

* Edit readme.md

* updated a word

* Update GetStarted.md

* Update GetStarted.md

* refactor readme, getstarted and write your trial md.

* Update README.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Fix nnictl bugs and add new feature (#75)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* remove Buffer warning (#100)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* Add support for debugging mode

* fix setup.py (#115)

* Add DAG model configuration format for SQuAD example.

* Explain config format for SQuAD QA model.

* Add more detailed introduction about the evolution algorithm.

* Fix install.sh add add trial log path (#109)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* show trial log path

* update document

* fix install.sh

* set default value for maxTrialNum and maxExecDuration

* fix nnictl

* Dev smac (#116)

* support package install (#91)

* fix nnictl bug

* support package install

* update

* update package install logic

* Fix package install issue (#95)

* fix nnictl bug

* fix package install

* support SMAC as a tuner on nni (#81)

* update doc

* update doc

* update doc

* update hyperopt installation

* update doc

* update doc

* update description in setup.py

* update setup.py

* modify encoding

* encoding

* add encoding

* remove pymc3

* update doc

* update builtin tuner spec

* support smac in sdk, fix logging issue

* support smac tuner

* add optimize_mode

* update config in nnictl

* add __init__.py

* update smac

* update import path

* update setup.py: remove entry_point

* update rest server validation

* fix bug in nnictl launcher

* support classArgs: optimize_mode

* quick fix bug

* test travis

* add dependency

* add dependency

* add dependency

* add dependency

* create smac python package

* fix trivial points

* optimize import of tuners, modify nnictl accordingly

* fix bug: incorrect algorithm_name

* trivial refactor

* for debug

* support virtual

* update doc of SMAC

* update smac requirements

* update requirements

* change debug mode

* update doc

* update doc

* refactor based on comments

* fix comments

* modify example config path to relative path and increase maxTrialNum (#94)

* modify example config path to relative path and increase maxTrialNum

* add document

* support conda (#90) (#110)

* support install from venv and travis CI

* support install from venv and travis CI

* support install from venv and travis CI

* support conda

* support conda

* modify example config path to relative path and increase maxTrialNum

* undo messy commit

* undo messy commit

* Support pip install as root (#77)

* Typo on #58 (#122)

* PAI Training Service implementation (#128)

* PAI Training service implementation:
  1. Implement PAITrainingService
  2. Add trial-keeper python module, and modify setup.py to install the module
  3. Add PAITrainingService rest server to collect metrics from the PAI container.

* fix datastore for multiple final result (#129)

* Update NNI v0.2 release notes (#132)

Update NNI v0.2 release notes

* Update setup.py Makefile and documents (#130)

* update makefile and setup.py

* update makefile and setup.py

* update document

* update document

* Update Makefile no travis

*  update doc

*  update doc

* fix convert from ss to pcs (#133)

* Fix bugs about webui (#131)

* Fix webui bugs

* Fix tslint

* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d17483.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc

* Merge branch V0.2 to Master (#143)

* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d17483.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc

* fix bug (#147)

* Refactor nnictl and add config_pai.yml (#144)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* add config_pai.yml

* refactor nnictl create logic and add colorful print

* fix nnictl stop logic

* add annotation for config_pai.yml

* add document for start experiment

* fix config.yml

* fix document

* Fix trial keeper wrongly exit issue (#152)

* Fix trial keeper bug, use actual exitcode to exit rather than 1

* Fix bug of table sort (#145)

* Update doc for PAIMode and v0.2 release notes (#153)

* Update v0.2 documentation regards to release note and PAI training service

* Update document to describe NNI docker image

* fix antd (#159)

* refactor experiment stopping logic

* support change concurrency

* remove trialJobs.ts

* trivial changes

* fix bugs

* fix bug

* support updating maxTrialNum

* Modify IT scripts for supporting multiple experiments

* Update ci (#175)

* Update RemoteMachineMode.md (#63)

* Remove unused classes for SQuAD QA example.

* Remove more unused functions for SQuAD QA example.

* Fix default dataset config.

* Add Makefile README (#64)

* update document (#92)

* Edit readme.md

* updated a word

* Update GetStarted.md

* Update GetStarted.md

* refactor readme, getstarted and write your trial md.

* Update README.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Fix nnictl bugs and add new feature (#75)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* remove Buffer warning (#100)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* Add support for debugging mode

* modify CI because of refactoring exp stop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* file saving

* fix issues from code merge

* remove $(INSTALL_PREFIX)/nni/nni_manager before install

* fix indent

* fix merge issue

* socket close

* update port

* fix merge error

* modify ci logic in nnimanager

* fix ci

* fix bug

* change suspended to done

* update ci (#229)

* update ci

* update ci

* update ci (#232)

* update ci

* update ci

* update azure-pipelines

* update azure-pipelines

* update ci (#233)

* update ci

* update ci

* update azure-pipelines

* update azure-pipelines

* update azure-pipelines

* run.py (#238)

* Nnupdate ci (#239)

* run.py

* test ci

* Nnupdate ci (#240)

* run.py

* test ci

* test ci

* Udci (#241)

* run.py

* test ci

* test ci

* test ci

* update ci (#242)

* run.py

* test ci

* test ci

* test ci

* update ci

* revert install.sh (#244)

* run.py

* test ci

* test ci

* test ci

* update ci

* revert install.sh

* add comments

* remove assert

* trivial change

* trivial change

* update Makefile (#246)

* update Makefile

* update Makefile

* quick fix for ci (#248)

* add update trialNum and fix bugs (#261)

* Add builtin tuner to CI (#247)

* update Makefile

* update Makefile

* add builtin-tuner test

* add builtin-tuner test

* refactor ci

* update azure.yml

* add built-in tuner test

* fix bugs

* Doc refactor (#258)

* doc refactor

* image name refactor

* Refactor nnictl to support listing stopped experiments. (#256)

Refactor nnictl to support listing stopped experiments.

* Show experiment parameters more beautifully (#262)

* fix error on example of RemoteMachineMode (#269)

* add pycharm project files to .gitignore list

* update pylintrc to conform vscode settings

* fix RemoteMachineMode for wrong trainingServicePlatform

* Update docker file to use latest nni release (#263)

* fix bug about execDuration and endTime (#270)

* fix bug about execDuration and endTime

* modify time interval to 30 seconds

* refactor based on Gems's suggestion

* for triggering ci

* Refactor dockerfile (#264)

* refactor Dockerfile

* Support nnictl tensorboard (#268)

support tensorboard

* Sdk update (#272)

* Rename get_parameters to get_next_parameter

* annotations add get_next_parameter

* updates

* updates

* updates

* updates

* updates

* add experiment log path to experiment profile (#276)

* refactor extract reward from dict by tuner

* update Makefile for mac support, wait for aka.ms support

* refix Makefile for colorful echo

* unversion config.yml with machine information

* sync graph.py between tuners & trial of ga_squad

* sync graph.py between tuners & trial of ga_squad

* copy weight shared ga_squad under weight_sharing folder

* mv ga_squad code back to master

* simple tuner & trial ready

* Fix nnictl multiThread option

* weight sharing with async dispatcher simple example ready

* update for ga_squad

* fix bug

* modify multihead attention name

* add min_layer_num to Graph

* fix bug

* update share id calc

* fix bug

* add save logging

* fix ga_squad tuner bug

* sync bug fix for ga_squad tuner

* fix same hash_id bug

* add lock to simple tuner in weight sharing

* Add readme to simple weight sharing

* update

* update

* add paper link

* update

* reformat with autopep8

* add documentation for weight sharing

* test for weight sharing

* delete irrelevant files

* move details of weight sharing into code comments

* add example section

* update weight sharing tutorial
leckie-chn authored Jan 8, 2019
1 parent 02ee0ac commit 6a03a73
Showing 2 changed files with 25 additions and 11 deletions.
docs/AdvancedNAS.md (36 changes: 25 additions & 11 deletions)
@@ -6,6 +6,31 @@
This is a tutorial on how to enable weight sharing in NNI.
## Weight Sharing among trials
Currently we recommend sharing weights through NFS (Network File System), which supports sharing files across machines and is lightweight and (relatively) efficient. We also welcome contributions from the community on more efficient techniques.

### Weight Sharing through NFS file
With the NFS setup (see below), trial code can share model weights by loading and saving files. We recommend that the user feed the tuner the storage path:
```yaml
tuner:
  codeDir: path/to/customer_tuner
  classFileName: customer_tuner.py
  className: CustomerTuner
  classArgs:
    ...
    save_dir_root: /nfs/storage/path/
```
Then let the tuner decide where to save and load weights, and feed those paths to the trials through `nni.get_next_parameter()`:

![weight_sharing_design](./img/weight_sharing.png)
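
On the tuner side, the path-assignment logic is not spelled out in this section. The sketch below is illustrative only (the class name, the `parent_of` bookkeeping, and the per-trial directory layout are assumptions, not the repository's actual `CustomerTuner`): the tuner gives every trial its own directory under `save_dir_root` and, when weights should be inherited, passes the parent trial's directory as `restore_path`.

```python
import os

class PathAssigningTuner:
    """Illustrative sketch of the tuner side of weight sharing."""

    def __init__(self, save_dir_root):
        self.save_dir_root = save_dir_root
        self.parent_of = {}  # parameter_id -> parent parameter_id, filled by the search logic

    def generate_parameters(self, parameter_id):
        params = {'save_path': os.path.join(self.save_dir_root, str(parameter_id))}
        # a real tuner would also put the searched hyper-parameters / architecture in params
        parent_id = self.parent_of.get(parameter_id)
        if parent_id is not None:
            params['restore_path'] = os.path.join(self.save_dir_root, str(parent_id))
        return params  # delivered to the trial by nni.get_next_parameter()
```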

For example, on the trial side in TensorFlow:
```python
# 'save_path' and 'restore_path' are delivered by the tuner via nni.get_next_parameter()
params = nni.get_next_parameter()
# save models
saver = tf.train.Saver()
saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
# load models (restore the parent trial's checkpoint, if one was given)
if 'restore_path' in params:
    saver.restore(sess, os.path.join(params['restore_path'], 'model.ckpt'))
```
where the `'save_path'` and `'restore_path'` hyper-parameters are managed by the tuner.

### NFS Setup
In NFS, files are physically stored on a server machine, and trials on the client machine can read/write those files in the same way that they access local files.

@@ -34,17 +59,6 @@
```bash
sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
```
where `10.10.10.10` should be replaced by the real IP address of the NFS server machine.

### Weight Sharing through NFS file
With the NFS setup, trial code can share model weight through loading & saving files. For example, in tensorflow:
```python
# save models
saver = tf.train.Saver()
saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
# load models
tf.init_from_checkpoint(params['restore_path'])
```
where `'save_path'` and `'restore_path'` in hyper-parameter can be managed by the tuner.

## Asynchronous Dispatcher Mode for trial dependency control
Weight sharing enables trials to run on different machines, and in most cases **read-after-write** consistency must be assured: a child model should not load the parent model's weights before the parent trial has finished training. To handle this, users can enable **asynchronous dispatcher mode** by setting `multiThread: true` in NNI's `config.yml`. In this mode the dispatcher assigns a tuner thread each time a `NEW_TRIAL` request comes in, and each tuner thread can decide when to submit a new trial by blocking and unblocking itself. For example:
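(The original example code is collapsed in this diff view; the following is an illustrative sketch only — `AsyncSharingTuner` and its toy parent-selection rule are assumptions, not the repository's actual tuner.) Each tuner thread can wait on a per-trial `threading.Event` that is set when the parent trial reports its result:

```python
import threading

class AsyncSharingTuner:
    """Illustrative sketch: under multiThread mode each generate_parameters call runs in
    its own tuner thread, so a child trial can simply block until its parent finishes."""

    def __init__(self):
        self.events = {}                   # parameter_id -> threading.Event
        self.thread_lock = threading.Lock()

    def generate_parameters(self, parameter_id):
        with self.thread_lock:
            self.events[parameter_id] = threading.Event()
            parent_id = parameter_id - 1 if parameter_id > 0 else None  # toy parent choice
        if parent_id is not None:
            self.events[parent_id].wait()  # block this thread until the parent trial finishes
        return {'parent_id': parent_id}    # plus save_path / restore_path as in the section above

    def receive_trial_result(self, parameter_id, parameters, value):
        self.events[parameter_id].set()    # unblock any child trial waiting on this one
```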
Binary file added docs/img/weight_sharing.png
