【PIR】Distributed operator registration under PIR #60436

Closed
xingmingyyj opened this issue Dec 28, 2023 · 11 comments
Labels
PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc status/close 已关闭

xingmingyyj commented Dec 28, 2023

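1. Background

PaddlePaddle is building a new IR system. Under the new IR, PaddlePaddle generates its operators from the more standardized, dynamic-graph-based operator definitions (ops.yaml, legacy_ops.yaml). The new IR system must still remain compatible with the old IR. For this purpose PaddlePaddle provides ProgramTranslator (code located in paddle/fluid/ir_adaptor/translator/), which translates a computation graph represented in the old IR into a computation graph in the new IR. At present, the core job of ProgramTranslator is to translate individual ops, that is, to translate an op defined under the old IR (usually in the paddle/fluid/operators folder) into the corresponding operator defined under the new IR.

Some distributed operators are not yet defined under the new IR. We need to add definitions for them under the new IR and make sure ProgramTranslator can translate them successfully.

The distributed operators to be registered are listed below: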

No. | Op (unit test) | Claimed by / Status / PR
1 | push_sparse_v2 | @enkilee #60473
2 | distributed_push_sparse | @enkilee #60805
3 | c_allreduce_min | @enkilee #60584
4 | global_scatter | @xiaoyewww #62579
5 | partial_allgather | @xiaoyewww #62735
6 | c_scatter | @DrRyanHuang, @enkilee #62369
7 | c_reduce_prod | @DrRyanHuang, @enkilee #62270
8 | dgc | @xiaoyewww #62781
9 | partial_recv | @enkilee #62412
10 | pull_gpups_sparse | @xiaoyewww #62935
11 | dgc_momentum | @xiaoyewww #63013
12 | all_reduce | @xiaoyewww #62634
13 | partial_send | @Difers #60484
14 | send_and_recv | @Difers #62589, @xiaoyewww #64203
15 | push_dense | @Difers, @enkilee #62505
16 | c_split | @DrRyanHuang, @enkilee #62416
17 | barrier | @xiaoyewww #62802
18 | lars_momentum | @enkilee #60838
19 | pull_box_sparse | @LittleNoob2333, @enkilee #62982
20 | global_gather | @Eacient, @xingmingyyj #63867
21 | c_allreduce_prod | @enkilee #60790
22 | pull_sparse_v2 | @xiaoyewww #63014
23 | c_reduce_max | @enkilee #62270
24 | distributed_lookup_table | @xiaoyewww #60911
25 | distributed_fused_lamb_init | @xiaoyewww #62050
26 | limit_by_capacity | @xiaoyewww #62579
27 | distributed_fused_lamb | @enkilee #61293
28 | random_routing | @xiaoyewww #62443 #62781
29 | prune_gate_by_capacity | @xiaoyewww #62494
30 | nop | @xiaoyewww #62541

PR submission template

  • PR title
【PIR Dist Op Reg No.1】 reg c_reduce_min
  • PR content
### PR types
Others

### PR changes
Others

### Description


Register the operator `c_reduce_min`

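How to claim a task

Please claim tasks by posting a comment, for example:

【报名】:1、3、12-13

Here 报名 means "sign up". Separate multiple tasks with the Chinese enumeration comma (、); a run of consecutive tasks can be written with a hyphen, e.g. 2-5.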
PR submission format: start the PR title with 【PIR Dist Op Reg No.xxx】, as in the template above, so that the task number is stated.
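
The sign-up format above (for example 【报名】:1、3、12-13) can be expanded into individual task numbers mechanically. The following is a minimal sketch, not a tool used by the project; the function name `parse_signup` is hypothetical, and accepting a missing colon is an assumption made because some comments in this thread omit it.

```python
import re
from typing import List


def parse_signup(comment: str) -> List[int]:
    """Expand a sign-up comment such as '【报名】:1、3、12-13' into task numbers."""
    # Accept a full-width or ASCII colon, or no colon at all (assumption).
    match = re.match(r"\s*【报名】\s*[::]?\s*(.+)", comment)
    if match is None:
        raise ValueError(f"not a sign-up comment: {comment!r}")
    tasks: List[int] = []
    # Tasks are separated by the Chinese enumeration comma.
    for item in match.group(1).split("、"):
        item = item.strip()
        if "-" in item:
            # A hyphen denotes an inclusive range of consecutive tasks.
            start, end = (int(part) for part in item.split("-", 1))
            tasks.extend(range(start, end + 1))
        else:
            tasks.append(int(item))
    return tasks


print(parse_signup("【报名】:1、3、12-13"))  # [1, 3, 12, 13]
```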

Dashboard

Track | Tasks | Submitted / Claimed | Submission rate | Completed | Completion rate
Happy Open Source (快乐开源) | 30 | 29 / 29 | 96.67% | 29 | 96.67%
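
The percentages in the dashboard follow directly from the task counts. As a quick check, here is a small sketch; the helper name `rate_percent` is made up for illustration.

```python
def rate_percent(done: int, total: int) -> float:
    """Return done/total as a percentage rounded to two decimals."""
    return round(done / total * 100, 2)


# 29 of the 30 tasks were submitted and completed.
print(f"{rate_percent(29, 30):.2f}%")  # 96.67%
```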

2. Tutorial

The work for each task falls into three parts:

  • Register the operator
  • Write a unit test
  • Modify test/ir/pir/translator/CMakeLists.txt

Each part is described below.

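2.1 Operator registration

For the operator registration steps, see Section 2 (Tutorial) of #59382.

2.2 Writing the unit test

To verify that a newly registered distributed operator can be translated successfully, we need to write a unit test.

First, all unit tests must be placed in the test/ir/pir/translator folder and must inherit from TestOpTranslator or TestOpWithBackwardTranslator. A test that registers only a forward operator inherits from TestOpTranslator; a test that registers both the forward and backward operators inherits from TestOpWithBackwardTranslator.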

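# Imports inferred from the names used below (assumption).
import unittest

import paddle
from paddle import pir
from paddle.base import core

paddle.enable_static()


class TestOpTranslator(unittest.TestCase):
    def setUp(self):
        self.place = core.Place()
        self.place.set_place(paddle.CPUPlace())
        self.new_scope = paddle.static.Scope()
        self.main_program = paddle.static.Program()

    def append_op(self):
        # Subclasses override this to add the op under test to the program.
        raise Exception("Define the op to be tested here!")

    def build_model(self):
        with paddle.static.scope_guard(self.new_scope):
            with paddle.static.program_guard(self.main_program):
                self.append_op()

    def check(self):
        self.build_model()
        # Translate the old-IR program into a new-IR (PIR) program.
        pir_program = pir.translate_to_pir(self.main_program.desc)
        assert hasattr(self, "op_type"), "Op_type should be specified!"
        # Translation succeeded if the op name appears in the printed program.
        assert self.op_type in str(pir_program), (
            self.op_type
            + " should be translated to pd_op."
            + self.op_type
            + '!'
        )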

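When inheriting from TestOpTranslator, override the append_op method so that the op under test is added while the network is built. The idea behind check is to use ProgramTranslator to translate the computation graph from the old IR into the new IR and then print the new-IR graph; if the printed graph contains the op being registered, the translation succeeded.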
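
Because check relies on a plain substring test, an op name that is a prefix of another (pull_sparse versus pull_sparse_v2, for instance) would also match. The sketch below shows one possible stricter, whole-name comparison. It is an illustration only and not part of the Paddle test base class: the helper `contains_op`, the sample program text, and the assumption that in-place variants print with a trailing underscore are all hypothetical.

```python
import re


def contains_op(program_text: str, op_type: str) -> bool:
    """Return True if `pd_op.<op_type>` appears as a whole op name."""
    # Allow an optional trailing underscore (assumed in-place naming),
    # then require that the identifier ends there.
    pattern = r"pd_op\." + re.escape(op_type) + r"_?(?![A-Za-z0-9_])"
    return re.search(pattern, program_text) is not None


# Hypothetical printed program text, for illustration only.
printed = '(%0) = "pd_op.pull_sparse_v2" (%1) {...}'

print("pull_sparse" in printed)                  # True: substring test matches
print(contains_op(printed, "pull_sparse"))       # False: whole-name test does not
print(contains_op(printed, "pull_sparse_v2"))    # True
```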
Class names uniformly follow the form TestXXXOpTranslator:

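# Imports inferred from the names used below (assumption).
import unittest

import test_op_transcriber

import paddle
from paddle.base.layer_helper import LayerHelper


class TestCReduceMinOpTranslator(test_op_transcriber.TestOpTranslator):
    def append_op(self):
        # Name of the op that check() looks for in the translated program.
        self.op_type = "c_reduce_min"
        x = paddle.ones(shape=(100, 2, 3), dtype='float32')
        y = paddle.ones(shape=(100, 2, 3), dtype='float32')
        attrs = {'ring_id': 0, 'root_id': 0, 'use_calc_stream': False}
        helper = LayerHelper(self.op_type)
        # Append the old-IR op directly to the current program.
        helper.append_op(
            type=self.op_type,
            inputs={"X": x},
            outputs={"Out": y},
            attrs=attrs,
        )

    def test_translator(self):
        self.check()


if __name__ == "__main__":
    unittest.main()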

The code above is an example that tests c_reduce_min.
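
The naming convention can be derived mechanically from the op name. The sketch below assumes that the test module is named test_<op>_translator, which is the pattern used for test_c_reduce_min_translator in the CMake section; the helper `translator_test_names` is hypothetical.

```python
from typing import Tuple


def translator_test_names(op_type: str) -> Tuple[str, str]:
    """Return (test class name, test module name) for an op name."""
    # c_reduce_min -> CReduceMin
    camel = "".join(part.capitalize() for part in op_type.split("_"))
    return f"Test{camel}OpTranslator", f"test_{op_type}_translator"


print(translator_test_names("c_reduce_min"))
# ('TestCReduceMinOpTranslator', 'test_c_reduce_min_translator')
```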

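2.3 Modify test/ir/pir/translator/CMakeLists.txt

Because the operators being registered are distributed operators, they are not compiled or registered unless the WITH_DISTRIBUTE build option is enabled. As a result, even after completing the steps above, you may still hit the following error on some CI pipelines: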

ValueError: Operator "xxx" has not been registered.

The fix is to modify CMakeLists.txt.

file(
  GLOB TEST_INTERP_CASES
  RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}"
  "test_*.py")
string(REPLACE ".py" "" TEST_INTERP_CASES "${TEST_INTERP_CASES}")

set(DISTRIBUTED_OP_TRANSLATOR_TEST test_c_reduce_min_translator)

if(NOT WITH_DISTRIBUTE)
  list(REMOVE_ITEM TEST_INTERP_CASES ${DISTRIBUTED_OP_TRANSLATOR_TEST})
endif()

foreach(target ${TEST_INTERP_CASES})
  py_test_modules(${target} MODULES ${target})
endforeach()

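As shown, DISTRIBUTED_OP_TRANSLATOR_TEST records the unit tests that belong to distributed operators. When the WITH_DISTRIBUTE option is not enabled, these tests are removed from TEST_INTERP_CASES, so CI does not run them.
Taking the c_allreduce_min operator as an example, the corresponding unit test is named test_c_allreduce_min_translator, so: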

set(DISTRIBUTED_OP_TRANSLATOR_TEST test_c_reduce_min_translator
                                   test_c_allreduce_min_translator)

Simply add the corresponding unit test name to the list.
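
To make the effect of the CMake logic concrete, here is a small Python model of it. This is only an illustration of the filtering behaviour, not code from the repository; `select_tests` and the test name `test_hypothetical_dense_op_translator` are invented for the example.

```python
from typing import Iterable, List


def select_tests(
    all_tests: Iterable[str],
    distributed_tests: Iterable[str],
    with_distribute: bool,
) -> List[str]:
    """Model of list(REMOVE_ITEM ...) guarded by if(NOT WITH_DISTRIBUTE)."""
    if with_distribute:
        return list(all_tests)
    skipped = set(distributed_tests)
    return [name for name in all_tests if name not in skipped]


all_tests = [
    "test_c_reduce_min_translator",
    "test_c_allreduce_min_translator",
    "test_hypothetical_dense_op_translator",  # hypothetical non-distributed test
]
distributed = ["test_c_reduce_min_translator", "test_c_allreduce_min_translator"]

# Without WITH_DISTRIBUTE only the non-distributed test remains.
print(select_tests(all_tests, distributed, with_distribute=False))
# With WITH_DISTRIBUTE every test is kept.
print(select_tests(all_tests, distributed, with_distribute=True))
```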

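3. Q&A

1. Where should the backward operator be defined?

A: It depends on where the forward operator is defined. If the forward operator is defined in paddle/phi/api/yaml/ops.yaml, define the backward operator in paddle/phi/api/yaml/backward.yaml. If the forward operator is defined in paddle/fluid/pir/dialect/operator/ir/ops.yaml, define the backward operator in paddle/fluid/pir/dialect/operator/ir/ops_backward.yaml.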
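
The rule in this answer amounts to a two-entry lookup. The sketch below simply restates it as data; the names `BACKWARD_YAML` and `backward_yaml_for` are hypothetical, and only the two paths mentioned in the answer are covered.

```python
# Forward definition file -> file where the backward op should be defined.
BACKWARD_YAML = {
    "paddle/phi/api/yaml/ops.yaml": "paddle/phi/api/yaml/backward.yaml",
    "paddle/fluid/pir/dialect/operator/ir/ops.yaml": (
        "paddle/fluid/pir/dialect/operator/ir/ops_backward.yaml"
    ),
}


def backward_yaml_for(forward_yaml: str) -> str:
    """Return the backward YAML path for a forward YAML path."""
    try:
        return BACKWARD_YAML[forward_yaml]
    except KeyError:
        raise ValueError(f"no rule stated for {forward_yaml!r}") from None


print(backward_yaml_for("paddle/phi/api/yaml/ops.yaml"))
# paddle/phi/api/yaml/backward.yaml
```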

Statistics

In no particular order: @enkilee (12) @xiaoyewww (15) @Difers (1) @xingmingyyj (1)

@DrRyanHuang

【报名】:6、7、16

enkilee commented Dec 28, 2023

【报名】:1、3

@xiaoyewww

【报名】:24、25

@paddle-bot paddle-bot bot added the PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc label Dec 28, 2023

Difers commented Dec 30, 2023

【报名】:13、14、15

@xiaoyewww

【报名】12、17、26

@PaddlePaddle PaddlePaddle deleted a comment from sanbuphy Mar 8, 2024
@xiaoyewww

【报名】4、5、8、10、11

Eacient commented Mar 18, 2024

【报名】:20

@LittleNoob2333

【报名】:19

@xiaoyewww

【报名】:22

luotao1 commented May 11, 2024

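【PIR】Distributed operator registration under PIR is now fully complete. Thanks to everyone who took part!

In no particular order: @enkilee (12) @xiaoyewww (15) @Difers (1) @xingmingyyj (1)

You are welcome to keep contributing to other Happy Open Source (快乐开源) tasks.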

@luotao1 luotao1 closed this as completed May 11, 2024
@paddle-bot paddle-bot bot added the status/close 已关闭 label May 11, 2024