[Heterps]Refactor heterogenous worker #37244
Merged
Thanks for your contribution!
Thunderbrook previously approved these changes on Nov 16, 2021
zmxdream changed the title from "Refactor heter trainer" to "[heterps]Refactor heterogenrous worker" on Nov 16, 2021
zmxdream changed the title from "[heterps]Refactor heterogenrous worker" to "[heterps]Refactor heterogenous worker" on Nov 16, 2021
Thunderbrook approved these changes on Nov 17, 2021
zmxdream added a commit to zmxdream/Paddle that referenced this pull request on Nov 22, 2021
* fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * refactor heter trainer. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop
fuyinno4 pushed a commit that referenced this pull request on Nov 23, 2021
* bug fix for DeserializeSelectedRows. test=develop (#36520) * fix SerializeSelectedRows (#36543) * bug fix for DeserializeSelectedRows. test=develop * fix bug for SerializeSelectedRows. test=develop * update. test=develop * [Heterps]Refactor Heter Pipeline Parameter Server (#36845) * change username * fix * fix * fix * fix * fix * update * update * update unittests * fix * update * fix * update * fix * fix * fix * update * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update send_and_recv op. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix unit. notest,test=coverage * fix ut. notest, test=coverage * update. notest,test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix. notest, test=coverage * fix. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * add func. notest, test=coverage * fix ut. notest, test=coverage * fix. test=develop * fix. test=develop * Fix unit test for send_and_recv_cpu & send_and_recv_gpu (#37129) * [heterps]fix ut for heter_pipeline_trainer.cc (#37136) * fix ut. test=develop * fix ut. 
test=develop * [heterps]bug fix for local training with --heter_worker_num (#37166) * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * [heterps]Refactor heterogenous worker (#37244) * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * refactor heter trainer. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * [heterps]add heterps mode judgement (#37298) * [heterps]change default executor for heter trainer (#37314) * fix pslib. test=develop * add device to train_from_dataset. test=develop * refine fleet.stop_worker. test=develop * fix ut. test=develop * fix ut. test=develop * fix executor & ut. test=develop * fix executor & ut. test=develop * fix executor & ut. test=develop * [heterps]remove api for heter pipeline ps (#37396) * fix api. test=develop * fix api. test=develop * fix code style. test=release/2.2 * fix CMakeLists. test=develop (#37454)
zmxdream changed the title from "[heterps]Refactor heterogenous worker" to "[Heterps]Refactor heterogenous worker" on Mar 10, 2022
PR types
Others
PR changes
Others
Describe
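Background: in pipeline training with the heterogeneous parameter server, there are two types of trainer, the cpu trainer and the heter trainer. The heter trainer mainly uses various accelerators to train sub-graphs of the model.
Reason for the change: the previous design was poor. The heter worker had to take a dataset as input, mainly because it needed to know how many threads each cpu trainer had started. That thread count was obtained from dataset->GetReaders(), and the dataset dynamically adjusts the number of threads according to the number of files actually being trained on, so every cpu trainer also had to read the same number of files. This PR changes that design: each cpu trainer can now read a different number of files and therefore start a different number of threads, and the heter trainer dynamically adjusts its threads according to the messages sent by the cpu trainers.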
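The idea behind the new design can be illustrated with a small Python sketch. It is not the code from this PR: the names Message, HeterTrainerSketch, on_message, thread_count and simulate are all hypothetical, and the rule "one cpu trainer thread per input file" is a simplifying assumption made only for the illustration. The sketch shows a heter trainer that never consults a dataset and instead creates a worker slot the first time it receives a message from a given cpu trainer thread.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """Hypothetical message a cpu trainer thread sends to the heter trainer."""
    trainer_id: int
    thread_id: int
    payload: str


@dataclass
class HeterTrainerSketch:
    """Hypothetical heter trainer that sizes itself from incoming messages."""
    # Maps (trainer_id, thread_id) -> payloads handled by that worker slot.
    workers: dict = field(default_factory=dict)

    def on_message(self, msg: Message) -> None:
        key = (msg.trainer_id, msg.thread_id)
        # No dataset is consulted: a slot is created the first time a
        # sending thread is seen, then reused for its later messages.
        self.workers.setdefault(key, []).append(msg.payload)

    def thread_count(self, trainer_id: int) -> int:
        """Number of worker slots currently serving one cpu trainer."""
        return sum(1 for (tid, _) in self.workers if tid == trainer_id)


def simulate(files_per_trainer):
    """Assumption for illustration: one cpu trainer thread per input file."""
    heter = HeterTrainerSketch()
    for trainer_id, num_files in enumerate(files_per_trainer):
        for thread_id in range(num_files):
            heter.on_message(Message(trainer_id, thread_id, f"batch-{thread_id}"))
    return heter
```

Calling simulate([3, 1]) models two cpu trainers that read three files and one file respectively. The sketch ends up with three slots for the first trainer and one for the second, without the two trainers having to agree on a file count in advance.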