This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Dev weight sharing (#568) (#576)
* Dev weight sharing (#568)

* add pycharm project files to .gitignore list

* update pylintrc to conform vscode settings

* fix RemoteMachineMode for wrong trainingServicePlatform

* simple weight sharing

* update gitignore file

* change tuner codedir to relative path

* add python cache files to gitignore list

* move extract scalar reward logic from dispatcher to tuner

* update tuner code corresponding to last commit

* update doc for receive_trial_result api change

* add numpy to package whitelist of pylint

* distinguish param value from return reward for tuner.extract_scalar_reward

* update pylintrc

* add comments to dispatcher.handle_report_metric_data

* update install for mac support

* fix root mode bug on Makefile

* Quick fix bug: nnictl port value error (#245)

* fix port bug

* Dev exp stop more (#221)

* Exp stop refactor (#161)

* Update RemoteMachineMode.md (#63)

* Remove unused classes for SQuAD QA example.

* Remove more unused functions for SQuAD QA example.

* Fix default dataset config.

* Add Makefile README (#64)

* update document (#92)

* Edit readme.md

* updated a word

* Update GetStarted.md

* Update GetStarted.md

* refact readme, getstarted and write your trial md.

* Update README.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Fix nnictl bugs and add new feature (#75)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* remove Buffer warning (#100)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* Add support for debugging mode

* fix setup.py (#115)

* Add DAG model configuration format for SQuAD example.

* Explain config format for SQuAD QA model.

* Add more detailed introduction about the evolution algorithm.

* Fix install.sh add add trial log path (#109)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* show trial log path

* update document

* fix install.sh

* set default vallue for maxTrialNum and maxExecDuration

* fix nnictl

* Dev smac (#116)

* support package install (#91)

* fix nnictl bug

* support package install

* update

* update package install logic

* Fix package install issue (#95)

* fix nnictl bug

* fix pakcage install

* support SMAC as a tuner on nni (#81)

* update doc

* update doc

* update doc

* update hyperopt installation

* update doc

* update doc

* update description in setup.py

* update setup.py

* modify encoding

* encoding

* add encoding

* remove pymc3

* update doc

* update builtin tuner spec

* support smac in sdk, fix logging issue

* support smac tuner

* add optimize_mode

* update config in nnictl

* add __init__.py

* update smac

* update import path

* update setup.py: remove entry_point

* update rest server validation

* fix bug in nnictl launcher

* support classArgs: optimize_mode

* quick fix bug

* test travis

* add dependency

* add dependency

* add dependency

* add dependency

* create smac python package

* fix trivial points

* optimize import of tuners, modify nnictl accordingly

* fix bug: incorrect algorithm_name

* trivial refactor

* for debug

* support virtual

* update doc of SMAC

* update smac requirements

* update requirements

* change debug mode

* update doc

* update doc

* refactor based on comments

* fix comments

* modify example config path to relative path and increase maxTrialNum (#94)

* modify example config path to relative path and increase maxTrialNum

* add document

* support conda (#90) (#110)

* support install from venv and travis CI

* support install from venv and travis CI

* support install from venv and travis CI

* support conda

* support conda

* modify example config path to relative path and increase maxTrialNum

* undo messy commit

* undo messy commit

* Support pip install as root (#77)

* Typo on #58 (#122)

* PAI Training Service implementation (#128)

* PAI Training service implementation
**1. Implement PAITrainingService
**2. Add trial-keeper python module, and modify setup.py to install the module
**3. Add PAItrainingService rest server to collect metrics from PAI container.

* fix datastore for multiple final result (#129)

* Update NNI v0.2 release notes (#132)

Update NNI v0.2 release notes

* Update setup.py Makefile and documents (#130)

* update makefile and setup.py

* update makefile and setup.py

* update document

* update document

* Update Makefile no travis

*  update doc

*  update doc

* fix convert from ss to pcs (#133)

* Fix bugs about webui (#131)

* Fix webui bugs

* Fix tslint

* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d174836d3146a0363e9c9c88094bf9cff865faa.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc

* Merge branch V0.2 to Master (#143)

* fix bug (#147)

* Refactor nnictl and add config_pai.yml (#144)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* add config_pai.yml

* refactor nnictl create logic and add colorful print

* fix nnictl stop logic

* add annotation for config_pai.yml

* add document for start experiment

* fix config.yml

* fix document

* Fix trial keeper wrongly exit issue (#152)

* Fix trial keeper bug, use actual exitcode to exit rather than 1

* Fix bug of table sort (#145)

* Update doc for PAIMode and v0.2 release notes (#153)

* Update v0.2 documentation regards to release note and PAI training service

* Update document to describe NNI docker image

* fix antd (#159)

* refactor experiment stopping logic

* support change concurrency

* remove trialJobs.ts

* trivial changes

* fix bugs

* fix bug

* support updating maxTrialNum

* Modify IT scripts for supporting multiple experiments

* Update ci (#175)

* modify CI cuz of refracting exp stop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* update CI for expstop

* file saving

* fix issues from code merge

* remove $(INSTALL_PREFIX)/nni/nni_manager before install

* fix indent

* fix merge issue

* socket close

* update port

* fix merge error

* modify ci logic in nnimanager

* fix ci

* fix bug

* change suspended to done

* update ci (#229)

* update ci

* update ci

* update ci (#232)

* update ci

* update ci

* update azure-pipelines

* update azure-pipelines

* update ci (#233)

* update ci

* update ci

* update azure-pipelines

* update azure-pipelines

* update azure-pipelines

* run.py (#238)

* Nnupdate ci (#239)

* run.py

* test ci

* Nnupdate ci (#240)

* run.py

* test ci

* test ci

* Udci (#241)

* run.py

* test ci

* test ci

* test ci

* update ci (#242)

* run.py

* test ci

* test ci

* test ci

* update ci

* revert install.sh (#244)

* run.py

* test ci

* test ci

* test ci

* update ci

* revert install.sh

* add comments

* remove assert

* trivial change

* trivial change

* update Makefile (#246)

* update Makefile

* update Makefile

* quick fix for ci (#248)

* add update trialNum and fix bugs (#261)

* Add builtin tuner to CI (#247)

* update Makefile

* update Makefile

* add builtin-tuner test

* add builtin-tuner test

* refractor ci

* update azure.yml

* add built-in tuner test

* fix bugs

* Doc refactor (#258)

* doc refactor

* image name refactor

* Refactor nnictl to support listing stopped experiments. (#256)

Refactor nnictl to support listing stopped experiments.

* Show experiment parameters more beautifully (#262)

* fix error on example of RemoteMachineMode (#269)

* add pycharm project files to .gitignore list

* update pylintrc to conform vscode settings

* fix RemoteMachineMode for wrong trainingServicePlatform

* Update docker file to use latest nni release (#263)

* fix bug about execDuration and endTime (#270)

* fix bug about execDuration and endTime

* modify time interval to 30 seconds

* refactor based on Gems's suggestion

* for triggering ci

* Refactor dockerfile (#264)

* refactor Dockerfile

* Support nnictl tensorboard (#268)

support tensorboard

* Sdk update (#272)

* Rename get_parameters to get_next_parameter

* annotations add get_next_parameter

* updates

* updates

* updates

* updates

* updates

* add experiment log path to experiment profile (#276)

* refactor extract reward from dict by tuner

* update Makefile for mac support, wait for aka.ms support

* refix Makefile for colorful echo

* unversion config.yml with machine information

* sync graph.py between tuners & trial of ga_squad

* sync graph.py between tuners & trial of ga_squad

* copy weight shared ga_squad under weight_sharing folder

* mv ga_squad code back to master

* simple tuner & trial ready

* Fix nnictl multiThread option

* weight sharing with async dispatcher simple example ready

* update for ga_squad

* fix bug

* modify multihead attention name

* add min_layer_num to Graph

* fix bug

* update share id calc

* fix bug

* add save logging

* fix ga_squad tuner bug

* sync bug fix for ga_squad tuner

* fix same hash_id bug

* add lock to simple tuner in weight sharing

* Add readme to simple weight sharing

* update

* update

* add paper link

* update

* reformat with autopep8

* add documentation for weight sharing

* test for weight sharing

* delete irrelevant files

* move details of weight sharing in to code comments

* Dev weight sharing update doc (#577)

* add example section

* Dev weight sharing update (#579)

* update weight sharing tutorial

* Dev weight sharing (#581)

* fix divide by zero risk

* update tuner thread exception handling

* fix bug for async test
leckie-chn authored Jan 8, 2019
1 parent e6eb6ea commit 358efb2
Showing 26 changed files with 3,086 additions and 9 deletions.
87 changes: 87 additions & 0 deletions docs/AdvancedNAS.md
@@ -0,0 +1,87 @@
# Tutorial for Advanced Neural Architecture Search
Currently, many NAS algorithms leverage the technique of **weight sharing** among trials to accelerate their training process. For example, [ENAS][1] delivers a 1000x efficiency gain through '_parameter sharing between child models_', compared with the earlier [NASNet][2] algorithm. Other NAS algorithms, such as [DARTS][3], [Network Morphism][4], and [Evolution][5], also leverage, or have the potential to leverage, weight sharing.

This is a tutorial on how to enable weight sharing in NNI.

## Weight Sharing among trials
Currently we recommend sharing weights through NFS (Network File System), which supports sharing files across machines and is lightweight and (relatively) efficient. We also welcome contributions from the community on more efficient techniques.

### Weight Sharing through NFS file
With NFS set up (see below), trial code can share model weights by loading and saving files. We recommend that users provide the tuner with the storage path:
```yaml
tuner:
  codeDir: path/to/customer_tuner
  classFileName: customer_tuner.py
  className: CustomerTuner
  classArgs:
    ...
    save_dir_root: /nfs/storage/path/
```
Then let the tuner decide where to save and load weights, and feed the paths to trials through `nni.get_next_parameter()`:

![weight_sharing_design](./img/weight_sharing.png)

For example, in tensorflow:
```python
# save models under the tuner-assigned directory
saver = tf.train.Saver()
saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
# load models: map every variable in the parent checkpoint onto the current graph
tf.train.init_from_checkpoint(params['restore_path'], {'/': '/'})
```
where the `'save_path'` and `'restore_path'` hyper-parameters can be managed by the tuner.
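
On the trial side, these paths arrive together with the other hyper-parameters returned by `nni.get_next_parameter()`. Below is a minimal sketch of how a trial might consume them; `build_graph` and `run_training` are hypothetical placeholders for the trial's own model construction and training loop, and the key names follow the example above:

```python
import os

import nni
import tensorflow as tf


def main():
    params = nni.get_next_parameter()          # contains 'save_path' / 'restore_path' set by the tuner
    train_op, accuracy = build_graph(params)   # hypothetical: build the model for this trial
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        if params.get('restore_path'):
            # warm-start from the parent trial's checkpoint on the shared NFS storage
            saver.restore(sess, os.path.join(params['restore_path'], 'model.ckpt'))
        acc = run_training(sess, train_op, accuracy)   # hypothetical training loop
        saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
    nni.report_final_result(acc)


if __name__ == '__main__':
    main()
```

If the child architecture only partially overlaps with its parent, a `Saver` restricted to the shared variables (`tf.train.Saver(var_list=...)`) can be used for restoring instead.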

### NFS Setup
In NFS, files are physically stored on a server machine, and trials on the client machine can read/write those files in the same way that they access local files.

#### Install NFS on server machine
First, install NFS server:
```bash
sudo apt-get install nfs-kernel-server
```
Suppose `/tmp/nni/shared` is used as the physical storage; then run:
```bash
sudo mkdir -p /tmp/nni/shared
echo "/tmp/nni/shared *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
```
You can check whether the directory has been successfully exported by running `sudo showmount -e localhost`.

#### Install NFS on client machine
First, install NFS client:
```bash
sudo apt-get install nfs-common
```
Then create a mount point and mount the shared directory:
```bash
sudo mkdir -p /mnt/nfs/nni/
sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
```
where `10.10.10.10` should be replaced with the actual IP address of the NFS server machine.

## Asynchronous Dispatcher Mode for trial dependency control
Weight sharing introduces dependencies between trials, possibly running on different machines, and in most cases **read-after-write** consistency must be assured: a child model should not load the parent model before the parent trial finishes training. To deal with this, users can enable **asynchronous dispatcher mode** by setting `multiThread: true` in NNI's `config.yml`. In this mode, the dispatcher assigns a tuner thread each time a `NEW_TRIAL` request comes in, and the tuner thread can decide when to submit a new trial by blocking and unblocking itself. For example:
```python
def generate_parameters(self, parameter_id):
    self.thread_lock.acquire()
    indiv = ...  # generate the configuration for a new trial here
    self.events[parameter_id] = threading.Event()
    self.thread_lock.release()
    if indiv.parent_id is not None:
        # block until the parent trial has reported its result
        self.events[indiv.parent_id].wait()

def receive_trial_result(self, parameter_id, parameters, reward):
    self.thread_lock.acquire()
    # code for processing trial results
    self.thread_lock.release()
    # wake up any child trial waiting on this one
    self.events[parameter_id].set()
```
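
To connect this with the NFS-based weight sharing above, the same tuner typically also decides where each trial saves and restores its weights. Below is a rough sketch of how the returned parameters could carry those paths, assuming the tuner keeps the `save_dir_root` from `classArgs` and that each individual exposes illustrative `hash_id` / `parent_id` / `config` fields (these names are illustrative, not NNI APIs):

```python
import os

def _checkpoint_paths(self, indiv):
    """Illustrative helper: derive per-trial checkpoint directories under save_dir_root."""
    save_path = os.path.join(self.save_dir_root, str(indiv.hash_id))
    restore_path = (os.path.join(self.save_dir_root, str(indiv.parent_id))
                    if indiv.parent_id is not None else None)
    return save_path, restore_path

def generate_parameters(self, parameter_id):
    # pick the next individual, using the same locking and blocking as shown above
    indiv = ...
    save_path, restore_path = self._checkpoint_paths(indiv)
    return {'graph': indiv.config, 'save_path': save_path, 'restore_path': restore_path}
```

The trial then reads `save_path` and `restore_path` from `nni.get_next_parameter()` as shown earlier, so the tuner fully controls which checkpoints are shared between parent and child trials.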

## Examples
For details, please refer to this [simple weight sharing example](../test/async_sharing_test). We also provide a [practical example](../examples/trials/weight_sharing/ga_squad) for reading comprehension, based on the previous [ga_squad](../examples/trials/ga_squad) example.

[1]: https://arxiv.org/abs/1802.03268
[2]: https://arxiv.org/abs/1707.07012
[3]: https://arxiv.org/abs/1806.09055
[4]: https://arxiv.org/abs/1806.10282
[5]: https://arxiv.org/abs/1703.01041
Binary file added docs/img/weight_sharing.png
6 changes: 3 additions & 3 deletions examples/trials/ga_squad/trial.py
@@ -338,7 +338,7 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
answers = generate_predict_json(
position1, position2, ids, contexts)
if save_path is not None:
- with open(save_path + 'epoch%d.prediction' % epoch, 'w') as file:
+ with open(os.path.join(save_path, 'epoch%d.prediction' % epoch), 'w') as file:
json.dump(answers, file)
else:
answers = json.dumps(answers)
@@ -359,8 +359,8 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
bestacc = acc

if save_path is not None:
- saver.save(sess, save_path + 'epoch%d.model' % epoch)
- with open(save_path + 'epoch%d.score' % epoch, 'wb') as file:
+ saver.save(os.path.join(sess, save_path + 'epoch%d.model' % epoch))
+ with open(os.path.join(save_path, 'epoch%d.score' % epoch), 'wb') as file:
pickle.dump(
(position1, position2, ids, contexts), file)
logger.debug('epoch %d acc %g bestacc %g' %
171 changes: 171 additions & 0 deletions examples/trials/weight_sharing/ga_squad/attention.py
@@ -0,0 +1,171 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge,
# to any person obtaining a copy of this software and associated
# documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import math

import tensorflow as tf
from tensorflow.python.ops.rnn_cell_impl import RNNCell


def _get_variable(variable_dict, name, shape, initializer=None, dtype=tf.float32):
    if name not in variable_dict:
        variable_dict[name] = tf.get_variable(
            name=name, shape=shape, initializer=initializer, dtype=dtype)
    return variable_dict[name]


class DotAttention:
    '''
    DotAttention
    '''

    def __init__(self, name,
                 hidden_dim,
                 is_vanilla=True,
                 is_identity_transform=False,
                 need_padding=False):
        self._name = '/'.join([name, 'dot_att'])
        self._hidden_dim = hidden_dim
        self._is_identity_transform = is_identity_transform
        self._need_padding = need_padding
        self._is_vanilla = is_vanilla
        self._var = {}

    @property
    def is_identity_transform(self):
        return self._is_identity_transform

    @property
    def is_vanilla(self):
        return self._is_vanilla

    @property
    def need_padding(self):
        return self._need_padding

    @property
    def hidden_dim(self):
        return self._hidden_dim

    @property
    def name(self):
        return self._name

    @property
    def var(self):
        return self._var

    def _get_var(self, name, shape, initializer=None):
        with tf.variable_scope(self.name):
            return _get_variable(self.var, name, shape, initializer)

    def _define_params(self, src_dim, tgt_dim):
        hidden_dim = self.hidden_dim
        self._get_var('W', [src_dim, hidden_dim])
        if not self.is_vanilla:
            self._get_var('V', [src_dim, hidden_dim])
            if self.need_padding:
                self._get_var('V_s', [src_dim, src_dim])
                self._get_var('V_t', [tgt_dim, tgt_dim])
            if not self.is_identity_transform:
                self._get_var('T', [tgt_dim, src_dim])
        self._get_var('U', [tgt_dim, hidden_dim])
        self._get_var('b', [1, hidden_dim])
        self._get_var('v', [hidden_dim, 1])

    def get_pre_compute(self, s):
        '''
        :param s: [src_sequence, batch_size, src_dim]
        :return: [src_sequence, batch_size, hidden_dim]
        '''
        hidden_dim = self.hidden_dim
        src_dim = s.get_shape().as_list()[-1]
        assert src_dim is not None, 'src dim must be defined'
        W = self._get_var('W', shape=[src_dim, hidden_dim])
        b = self._get_var('b', shape=[1, hidden_dim])
        return tf.tensordot(s, W, [[2], [0]]) + b

    def get_prob(self, src, tgt, mask, pre_compute, return_logits=False):
        '''
        :param src: [src_sequence_length, batch_size, src_dim]
        :param tgt: [batch_size, tgt_dim] or [tgt_sequence_length, batch_size, tgt_dim]
        :param mask: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        :param pre_compute: [src_sequence_length, batch_size, hidden_dim]
        :return: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        '''
        s_shape = src.get_shape().as_list()
        h_shape = tgt.get_shape().as_list()
        src_dim = s_shape[-1]
        tgt_dim = h_shape[-1]
        assert src_dim is not None, 'src dimension must be defined'
        assert tgt_dim is not None, 'tgt dimension must be defined'

        self._define_params(src_dim, tgt_dim)

        if len(h_shape) == 2:
            tgt = tf.expand_dims(tgt, 0)
        if pre_compute is None:
            pre_compute = self.get_pre_compute(src)

        buf0 = pre_compute
        buf1 = tf.tensordot(tgt, self.var['U'], axes=[[2], [0]])
        buf2 = tf.tanh(tf.expand_dims(buf0, 0) + tf.expand_dims(buf1, 1))

        if not self.is_vanilla:
            xh1 = tgt
            xh2 = tgt
            s1 = src
            if self.need_padding:
                xh1 = tf.tensordot(xh1, self.var['V_t'], 1)
                xh2 = tf.tensordot(xh2, self.var['S_t'], 1)
                s1 = tf.tensordot(s1, self.var['V_s'], 1)
            if not self.is_identity_transform:
                xh1 = tf.tensordot(xh1, self.var['T'], 1)
                xh2 = tf.tensordot(xh2, self.var['T'], 1)
            buf3 = tf.expand_dims(s1, 0) * tf.expand_dims(xh1, 1)
            buf3 = tf.tanh(tf.tensordot(buf3, self.var['V'], axes=[[3], [0]]))
            buf = tf.reshape(tf.tanh(buf2 + buf3), shape=tf.shape(buf3))
        else:
            buf = buf2
        v = self.var['v']
        e = tf.tensordot(buf, v, [[3], [0]])
        e = tf.squeeze(e, axis=[3])
        tmp = tf.reshape(e + (mask - 1) * 10000.0, shape=tf.shape(e))
        prob = tf.nn.softmax(tmp, 1)
        if len(h_shape) == 2:
            prob = tf.squeeze(prob, axis=[0])
            tmp = tf.squeeze(tmp, axis=[0])
        if return_logits:
            return prob, tmp
        return prob

    def get_att(self, s, prob):
        '''
        :param s: [src_sequence_length, batch_size, src_dim]
        :param prob: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        :return: [batch_size, src_dim] or [tgt_sequence_length, batch_size, src_dim]
        '''
        buf = s * tf.expand_dims(prob, axis=-1)
        att = tf.reduce_sum(buf, axis=-3)
        return att
31 changes: 31 additions & 0 deletions examples/trials/weight_sharing/ga_squad/config_remote.yml
@@ -0,0 +1,31 @@
authorName: default
experimentName: ga_squad_weight_sharing
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 200
#choice: local, remote, pai
trainingServicePlatform: remote
#choice: true, false
useAnnotation: false
multiThread: true
tuner:
  codeDir: ../../../tuners/weight_sharing/ga_customer_tuner
  classFileName: customer_tuner.py
  className: CustomerTuner
  classArgs:
    optimize_mode: maximize
    population_size: 32
    save_dir_root: /mnt/nfs/nni/ga_squad
trial:
  command: python3 trial.py --input_file /mnt/nfs/nni/train-v1.1.json --dev_file /mnt/nfs/nni/dev-v1.1.json --max_epoch 1 --embedding_file /mnt/nfs/nni/glove.6B.300d.txt
  codeDir: .
  gpuNum: 1
machineList:
  - ip: remote-ip-0
    port: 8022
    username: root
    passwd: screencast
  - ip: remote-ip-1
    port: 8022
    username: root
    passwd: screencast

