[fleet executor] Comm init for dist model inf #39012

Merged: 1 commit, Jan 18, 2022

@FeixLiu FeixLiu commented Jan 18, 2022

PR types

Others

PR changes

Others

Describe

Communication (comm) initialization for the distributed model inference system.

Tested with the following code:

import paddle.distributed.fleet as fleet
import paddle
from paddle.fluid import core

paddle.enable_static()
fleet.init(is_collective=True)

config = core.DistModelConfig()
config.model_dir = "./inference_model/rank_" + str(fleet.worker_index()) + "/step_0"
config.place = 'GPU'
config.device_id = fleet.worker_index()
config.current_endpoint = "127.0.0.1:700" + str(fleet.worker_index())
config.trainer_endpoints = ["127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002", "127.0.0.1:7003", "127.0.0.1:7004", "127.0.0.1:7005", "127.0.0.1:7006", "127.0.0.1:7007"]
config.pp_degree = 2
config.mp_degree = 4
config.mp_ring_id = 0
if fleet.worker_index() <= 3:
    config.pp_downstream_ring_id = 20
    config.pp_upstream_ring_id = -1
if fleet.worker_index() >= 4:
    config.pp_downstream_ring_id = -1
    config.pp_upstream_ring_id = 20
config.local_rank = fleet.worker_index()
config.nranks = 8

dist = core.DistModel(config)
dist.init()
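
For readers who want to see what each of the 8 workers ends up with, here is a minimal sketch that reproduces the per-rank values from the test script above as a plain dictionary, without importing Paddle. The helper name `build_rank_settings` and the module-level constants are hypothetical and only mirror the script; the stage split (ranks 0 to 3 get the downstream ring, ranks 4 to 7 the upstream ring) is hard-coded for the `pp_degree = 2`, `mp_degree = 4` layout used here and is not a general rule.

```python
# Hypothetical helper mirroring the test script; not part of Paddle.
PP_DEGREE = 2
MP_DEGREE = 4
NRANKS = PP_DEGREE * MP_DEGREE  # 8 workers, as in the test script
BASE_PORT = 7000
PP_RING_ID = 20  # ring id used for the pipeline link in the test script


def build_rank_settings(rank):
    """Return the values the test script assigns to DistModelConfig for `rank`."""
    if not 0 <= rank < NRANKS:
        raise ValueError(f"rank must be in [0, {NRANKS}), got {rank}")

    endpoints = [f"127.0.0.1:{BASE_PORT + i}" for i in range(NRANKS)]
    # Only valid for two pipeline stages: the first MP_DEGREE ranks form stage 0.
    first_stage = rank < MP_DEGREE

    return {
        "model_dir": f"./inference_model/rank_{rank}/step_0",
        "place": "GPU",
        "device_id": rank,
        "current_endpoint": endpoints[rank],
        "trainer_endpoints": endpoints,
        "pp_degree": PP_DEGREE,
        "mp_degree": MP_DEGREE,
        "mp_ring_id": 0,
        "pp_downstream_ring_id": PP_RING_ID if first_stage else -1,
        "pp_upstream_ring_id": -1 if first_stage else PP_RING_ID,
        "local_rank": rank,
        "nranks": NRANKS,
    }


if __name__ == "__main__":
    for r in range(NRANKS):
        s = build_rank_settings(r)
        print(r, s["current_endpoint"],
              s["pp_upstream_ring_id"], s["pp_downstream_ring_id"])
```

Printing the table this way makes it easy to check that each rank's endpoint and ring ids line up before launching the real distributed run.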

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

VLOG(3) << "Init comm group for mp.";
std::vector<std::string> peer_endpoints;
for (int64_t
idx = (config_.local_rank / config_.mp_degree) * config_.mp_degree,
Contributor:

Remember CoordSys? It would be best to abstract this later.

Contributor Author:

I don't think it's really necessary. The inference network has at most two dimensions, pp and mp, so building a coord sys just for those two seems like overkill. Honestly, the main reason is that I already moved the coord sys from the C++ side to the Python side... and I don't want to move it back 😂
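
To make the CoordSys exchange concrete, here is a rough sketch of how small a two-dimensional (pp, mp) mapping can be. It is not Paddle's CoordSys and not the code in this PR; all three function names are hypothetical. The group start index follows the expression in the reviewed fragment, `(local_rank / mp_degree) * mp_degree`. That a group then spans `mp_degree` consecutive ranks is an assumption, since the loop in the fragment is cut off.

```python
# Hypothetical two-dimensional rank mapping; not Paddle's CoordSys.

def rank_to_coord(rank, mp_degree):
    """Split a flat rank into (pp_idx, mp_idx), with mp as the fastest-varying axis."""
    return divmod(rank, mp_degree)


def coord_to_rank(pp_idx, mp_idx, mp_degree):
    """Inverse of rank_to_coord."""
    return pp_idx * mp_degree + mp_idx


def mp_group_ranks(rank, mp_degree):
    """Ranks assumed to share the model-parallel group of `rank`.

    The start index matches the reviewed C++ expression
    (local_rank / mp_degree) * mp_degree, using integer division.
    """
    start = (rank // mp_degree) * mp_degree
    return list(range(start, start + mp_degree))


if __name__ == "__main__":
    mp_degree = 4
    for rank in range(8):
        print(rank, rank_to_coord(rank, mp_degree), mp_group_ranks(rank, mp_degree))
```

With only two axes the whole mapping fits in a few lines of integer arithmetic, which is consistent with the author's point that a full coordinate-system class may not pay for itself here.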

comm_init_block, config_.pp_downstream_ring_id);
}
}
framework::NaiveExecutor e(place_);
Contributor:

Actually, you don't need an executor to run ops here; calling the API directly would work. That said, this approach is fine too.

Contributor Author:

I think this way is more concise, and if other ops are needed later they can be added right here.

@wangxicoding wangxicoding merged commit 4c46eed into PaddlePaddle:develop Jan 18, 2022
@FeixLiu FeixLiu deleted the comm_init branch January 18, 2022 11:28