[fleet executor] Comm init for dist model inf #39012

FeixLiu · 2022-01-18T03:55:29Z

PR types

Others

PR changes

Others

Describe

Initialize communication for the distributed model inference system.

Tested with the following code:

import paddle.distributed.fleet as fleet
import paddle
from paddle.fluid import core

paddle.enable_static()
fleet.init(is_collective=True)

config = core.DistModelConfig()
config.model_dir = "./inference_model/rank_" + str(fleet.worker_index()) + "/step_0"
config.place = 'GPU'
config.device_id = fleet.worker_index()
config.current_endpoint = "127.0.0.1:700" + str(fleet.worker_index())
config.trainer_endpoints = ["127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002", "127.0.0.1:7003", "127.0.0.1:7004", "127.0.0.1:7005", "127.0.0.1:7006", "127.0.0.1:7007"]
config.pp_degree = 2
config.mp_degree = 4
config.mp_ring_id = 0
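# Ranks 0-3 form the first pipeline stage: downstream ring only.
if fleet.worker_index() <= 3:
    config.pp_downstream_ring_id = 20
    config.pp_upstream_ring_id = -1
# Ranks 4-7 form the second pipeline stage: upstream ring only.
if fleet.worker_index() >= 4:
    config.pp_downstream_ring_id = -1
    config.pp_upstream_ring_id = 20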
config.local_rank = fleet.worker_index()
config.nranks = 8

dist = core.DistModel(config)
dist.init()
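
The endpoint list and the per-rank ring ids in the script above are written out by hand. Below is a minimal sketch of deriving the same values from the rank, assuming a single host, consecutive ports starting at 7000, and a two-stage pipeline as in this test. `build_endpoints` and `pp_ring_ids` are hypothetical helper names and are not part of Paddle.

```python
def build_endpoints(nranks, host="127.0.0.1", base_port=7000):
    """Return one "host:port" endpoint per rank, on consecutive ports."""
    return ["{}:{}".format(host, base_port + rank) for rank in range(nranks)]


def pp_ring_ids(rank, mp_degree, pp_degree, ring_id=20):
    """Return (upstream_ring_id, downstream_ring_id) for a two-stage pipeline.

    Mirrors the hand-written test config: the first stage only gets a
    downstream ring and the last stage only gets an upstream ring.
    Assumption: pp_degree == 2, so one ring id links the two stages.
    """
    if pp_degree != 2:
        raise ValueError("this sketch only covers pp_degree == 2")
    stage = rank // mp_degree  # pipeline stage the rank belongs to
    if stage == 0:
        return -1, ring_id
    return ring_id, -1
```

With these helpers, `build_endpoints(8)` would reproduce the `trainer_endpoints` list above, and `build_endpoints(8)[rank]` the `current_endpoint` of each rank. Unlike the string concatenation `"127.0.0.1:700" + str(rank)`, the port arithmetic also stays valid for ten or more ranks.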

paddle-bot-old · 2022-01-18T03:55:41Z

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

wangxicoding · 2022-01-18T11:12:32Z

paddle/fluid/distributed/fleet_executor/dist_model.cc

+    VLOG(3) << "Init comm group for mp.";
+    std::vector<std::string> peer_endpoints;
+    for (int64_t
+             idx = (config_.local_rank / config_.mp_degree) * config_.mp_degree,


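Remember CoordSys? It would be best to abstract this later on.

I don't think that's really necessary. The inference network has at most two dimensions, pp and mp, and building a coord sys just for those two feels redundant. Mostly, though, the C++-side coord sys was moved to the Python side earlier... and I don't want to move it back 😂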
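
The quoted loop starts at `(local_rank / mp_degree) * mp_degree`, which is the first rank of the caller's model-parallel group. A rough Python sketch of that rank arithmetic, done without a coordinate system, could look as follows. It assumes the mp group consists of the `mp_degree` consecutive ranks from that start (the loop bound is cut off in the quote) and that pipeline neighbours sit `mp_degree` ranks apart, as the test configuration suggests. The function names are hypothetical.

```python
def mp_group_ranks(local_rank, mp_degree):
    """Ranks in the same model-parallel group as local_rank.

    Assumption: each group is a block of mp_degree consecutive ranks.
    """
    start = (local_rank // mp_degree) * mp_degree  # first rank of the block
    return list(range(start, start + mp_degree))


def pp_neighbors(local_rank, mp_degree, nranks):
    """Return (upstream_rank, downstream_rank); None where no neighbour exists.

    Assumption: adjacent pipeline stages are mp_degree ranks apart.
    """
    upstream = local_rank - mp_degree
    downstream = local_rank + mp_degree
    return (
        upstream if upstream >= 0 else None,
        downstream if downstream < nranks else None,
    )
```

For the 8-rank test setup (`mp_degree = 4`, `pp_degree = 2`), rank 5 would share its mp group with ranks 4 to 7 and have rank 1 as its upstream pipeline peer.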

wangxicoding · 2022-01-18T11:22:58Z

paddle/fluid/distributed/fleet_executor/dist_model.cc

+                   comm_init_block, config_.pp_downstream_ring_id);
+    }
+  }
+  framework::NaiveExecutor e(place_);


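Actually, you don't need to run the ops through an executor; you could just call the API directly. That said, this approach is fine too.

This way is more concise, I think. If other ops are needed later, they can be added right here.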

FeixLiu force-pushed the comm_init branch from 74d446d to b69aebe Compare January 18, 2022 08:21

[fleet executor] add comm init for dist model inf

ffffada

FeixLiu force-pushed the comm_init branch from b69aebe to ffffada Compare January 18, 2022 08:24

FeixLiu requested a review from wangxicoding January 18, 2022 11:02

wangxicoding approved these changes Jan 18, 2022

wangxicoding merged commit 4c46eed into PaddlePaddle:develop Jan 18, 2022

FeixLiu deleted the comm_init branch January 18, 2022 11:28
