[Heterps]Refactor heterogenous worker #37244

Merged
Thunderbrook merged 25 commits into PaddlePaddle:develop on Nov 17, 2021

Conversation

@zmxdream (Contributor) commented Nov 16, 2021

PR types

Others

PR changes

Others

Describe

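Background: In pipeline training with the heterogeneous parameter server, there are two types of trainer: CPU trainers and heter trainers. A heter trainer mainly uses various accelerators to train a subgraph of the model.
Reason for the change: The previous design was poor. The heter worker had to take a dataset as input, mainly because it needed to know how many threads each CPU trainer had started. That thread count was obtained from dataset->GetReaders(), and the dataset adjusts the number of threads dynamically according to the number of files actually being trained on, so every CPU trainer also had to read the same number of files. This PR changes that design: each CPU trainer can now read a different number of files and therefore start a different number of threads, while the heter trainer adjusts its threads dynamically based on the messages sent by the CPU trainers.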
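
The idea of sizing the heter trainer's threads from incoming messages, rather than from a dataset, can be illustrated with a minimal Python sketch. This is not Paddle's implementation: the class `HeterWorkerSketch`, its methods, and the `(trainer_id, thread_id)` message key are all hypothetical names chosen for illustration, and the real worker is written in C++. The sketch only shows the general pattern of starting a worker thread lazily the first time a given CPU trainer thread is heard from.

```python
import queue
import threading


class HeterWorkerSketch:
    """Hypothetical stand-in for a heter trainer; not Paddle's actual class."""

    def __init__(self, handler):
        self._handler = handler   # assumed callable that runs the subgraph
        self._queues = {}         # (trainer_id, thread_id) -> task queue
        self._threads = []
        self._lock = threading.Lock()
        self.results = queue.Queue()

    def on_message(self, trainer_id, thread_id, payload):
        """Handle one message sent by a CPU trainer thread."""
        key = (trainer_id, thread_id)
        with self._lock:
            q = self._queues.get(key)
            if q is None:
                # First message from this CPU thread: start a matching worker.
                q = queue.Queue()
                self._queues[key] = q
                t = threading.Thread(target=self._run, args=(key, q), daemon=True)
                self._threads.append(t)
                t.start()
        q.put(payload)

    def _run(self, key, q):
        while True:
            payload = q.get()
            if payload is None:   # sentinel: shut this worker down
                break
            self.results.put((key, self._handler(payload)))

    def num_threads(self):
        with self._lock:
            return len(self._queues)

    def stop(self):
        with self._lock:
            queues = list(self._queues.values())
            threads = list(self._threads)
        for q in queues:
            q.put(None)
        for t in threads:
            t.join()


if __name__ == "__main__":
    worker = HeterWorkerSketch(lambda x: x * 2)
    # CPU trainer 0 runs 2 threads and CPU trainer 1 runs 3 threads,
    # e.g. because they were assigned different numbers of files.
    for tid in range(2):
        worker.on_message(0, tid, 10)
    for tid in range(3):
        worker.on_message(1, tid, 10)
    worker.stop()
    print(worker.num_threads())
```

In this sketch the worker never needs to be told the thread count in advance, so the CPU trainers are free to run different numbers of threads.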

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Thunderbrook previously approved these changes Nov 16, 2021
@zmxdream zmxdream changed the title Refactor heter trainer [heterps]Refactor heterogenrous worker Nov 16, 2021
@zmxdream zmxdream changed the title [heterps]Refactor heterogenrous worker [heterps]Refactor heterogenous worker Nov 16, 2021
@Thunderbrook Thunderbrook merged commit 54d2626 into PaddlePaddle:develop Nov 17, 2021
zmxdream added a commit to zmxdream/Paddle that referenced this pull request Nov 22, 2021
* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* refactor heter trainer. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop
fuyinno4 pushed a commit that referenced this pull request Nov 23, 2021
* bug fix for  DeserializeSelectedRows. test=develop (#36520)

* fix SerializeSelectedRows (#36543)

* bug fix for  DeserializeSelectedRows. test=develop

* fix bug for SerializeSelectedRows. test=develop

* update. test=develop

* [Heterps]Refactor Heter Pipeline Parameter Server (#36845)

* change username

* fix

* fix

* fix

* fix

* fix

* update

* update

* update unittests

* fix

* update

* fix

* update

* fix

* fix

* fix

* update

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update send_and_recv op. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* update. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix unit. notest,test=coverage

* fix ut. notest, test=coverage

* update. notest,test=coverage

* fix ut. notest, test=coverage

* fix ut. notest, test=coverage

* fix. notest, test=coverage

* fix. notest, test=coverage

* fix ut. notest, test=coverage

* fix ut. notest, test=coverage

* fix ut. notest, test=coverage

* fix ut. notest, test=coverage

* add func. notest, test=coverage

* fix ut. notest, test=coverage

* fix. test=develop

* fix. test=develop

* Fix unit test for send_and_recv_cpu & send_and_recv_gpu (#37129)

* [heterps]fix ut for heter_pipeline_trainer.cc  (#37136)

* fix ut. test=develop

* fix ut. test=develop

* [heterps]bug fix for local training with --heter_worker_num (#37166)

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* [heterps]Refactor heterogenous worker (#37244)

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* refactor heter trainer. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix ut. test=develop

* [heterps]add heterps mode judgement (#37298)

* [heterps]change default executor for heter trainer (#37314)

* fix pslib. test=develop

* add device to train_from_dataset. test=develop

* refine fleet.stop_worker. test=develop

* fix ut. test=develop

* fix ut. test=develop

* fix executor & ut. test=develop

* fix executor & ut. test=develop

* fix executor & ut. test=develop

* [heterps]remove api for heter pipeline ps (#37396)

* fix api. test=develop

* fix api. test=develop

* fix code style. test=release/2.2

* fix CMakeLists. test=develop (#37454)
@zmxdream zmxdream changed the title [heterps]Refactor heterogenous worker [Heterps]Refactor heterogenous worker Mar 10, 2022
2 participants