Bug-fix handbook for the PIR Python API adaptation and upgrade task #58259
Comments
Problem description: the test fails with the following traceback:
Traceback (most recent call last):
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/pir_utils.py", line 119, in impl
func(*args, **kwargs)
File "/luq/docker/paddle-docker/Paddle-bak/test/legacy_test/test_mse_loss.py", line 53, in test_mse_loss
fetch_list=[output],
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1633, in run
return_numpy=return_numpy,
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1936, in _run_pir_impl
scope,
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1026, in get_pir_program_and_executor
program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
global_block, fetch_list, fetch_var_name, fetch_op
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 426, in has_fetch_operations
if op.name() == fetch_op:
AttributeError: 'Operator' object has no attribute 'name'
Code to reproduce the problem:
# imports needed to make the snippet self-contained
import unittest

import numpy as np

import paddle
from paddle import base
from paddle.base import core
from paddle.base.executor import Executor
from paddle.pir_utils import test_with_pir_api


class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        sub = input_val - label_val
        np_result = np.mean(sub * sub)
        input_var = paddle.static.data(
            name="input", shape=[-1, 3], dtype="float32"
        )
        label_var = paddle.static.data(
            name="label", shape=[-1, 3], dtype="float32"
        )
        output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
        for use_cuda in (
            [False, True] if core.is_compiled_with_cuda() else [False]
        ):
            place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
            exe = Executor(place)
            # Running against base.default_main_program() under
            # @test_with_pir_api is what triggers the failure (see the
            # root-cause notes below).
            (result,) = exe.run(
                base.default_main_program(),
                feed={"input": input_val, "label": label_val},
                fetch_list=[output],
            )
            np.testing.assert_allclose(np_result, result, rtol=1e-05)
Solution or approach:
class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        sub = input_val - label_val
        np_result = np.mean(sub * sub)
        # Build the network in an explicit Program instead of relying on
        # base.default_main_program(), which IRGuard does not switch.
        main = paddle.static.Program()
        startup = paddle.static.Program()
        with paddle.static.program_guard(main, startup):
            input_var = paddle.static.data(
                name="input", shape=[-1, 3], dtype="float32"
            )
            label_var = paddle.static.data(
                name="label", shape=[-1, 3], dtype="float32"
            )
            output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
        for use_cuda in (
            [False, True] if core.is_compiled_with_cuda() else [False]
        ):
            place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
            exe = Executor(place)
            (result,) = exe.run(
                main,
                feed={"input": input_val, "label": label_val},
                fetch_list=[output],
            )
            np.testing.assert_allclose(np_result, result, rtol=1e-05)
Root cause investigation: IRGuard does not execute the logic at Paddle/python/paddle/pir_utils.py, lines 72 to 77 (commit 21d7d04).
That code was commented out in PR #57956, for the following reason: pir_guard does not switch base.default_main_program(); if it did, the PIR path in OpTest would be unable to obtain the old static graph's proto when calling get_kernel_signature. The proper fix is to write a new get_kernel_signature for PIR that does not depend on the old IR structures. For now, unit tests that use base.default_main_program can be switched to paddle.static.default_main_program instead.
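To make the interim workaround concrete, here is a minimal sketch, assuming a test that previously resolved its program through base; the program/executor names and the feed/fetch placeholders are illustrative, not taken from any existing test:

import paddle

paddle.enable_static()

# Before (fails under @test_with_pir_api, because IRGuard leaves it untouched):
#     prog = base.default_main_program()

# Interim workaround, per the note above: resolve the program through paddle.static.
prog = paddle.static.default_main_program()

exe = paddle.static.Executor(paddle.CPUPlace())
# exe.run(prog, feed={...}, fetch_list=[...])  # placeholders for the test's own feed/fetch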
Problem description: Under the @test_with_pir_api decorator, two different networks are built in the same Program; the first run executes the first network and the second run executes the second network. The executor fails while running the second network:
Traceback (most recent call last):
File "/home/aistudio/Paddle-gpu/build/python/paddle/pir_utils.py", line 119, in impl
func(*args, **kwargs)
File "/home/aistudio/Paddle-gpu/test/legacy_test/test_fused_feedforward_op.py", line 319, in test_static
fetch_list=[res],
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1644, in run
return_numpy=return_numpy,
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1947, in _run_pir_impl
scope,
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1037, in get_pir_program_and_executor
program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
global_block, fetch_list, fetch_var_name, fetch_op
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 430, in has_fetch_operations
"There is a fetch op in Program which will fetch variable that is not belong to fetch_targets."
Exception: There is a fetch op in Program which will fetch variable that is not belong to fetch_targets.
Code to reproduce the problem:
Solution or approach:
# imports needed to make the snippet self-contained
import unittest

import numpy as np

import paddle
from paddle.pir_utils import test_with_pir_api


class APITestStaticFusedFFN(unittest.TestCase):
    @test_with_pir_api
    def test_static(self):
        paddle.enable_static()
        dtype = "float32"
        layer_norm_dtype = "float32"
        batch_size = 1
        d_model = 8
        dim_feedforward = 8
        x_data = np.random.random(
            (batch_size, d_model, dim_feedforward)
        ).astype(dtype)
        linear1_weight_data = np.random.random(
            (d_model, dim_feedforward)
        ).astype(dtype)
        linear1_bias_data = np.zeros(dim_feedforward).astype(dtype)
        linear2_weight_data = np.random.random(
            (dim_feedforward, d_model)
        ).astype(dtype)
        linear2_bias_data = np.zeros(d_model).astype(dtype)
        ln1_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln1_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        ln2_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln2_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        main_1 = paddle.static.Program()
        startup_1 = paddle.static.Program()
        main_1.random_seed = 42
        # <------------ build the first network ------------>
        with paddle.static.program_guard(main_1, startup_1):
            # <------------ code: first network definition ------------>
            ... ...
        main_2 = paddle.static.Program()
        startup_2 = paddle.static.Program()
        main_2.random_seed = 42
        # <------------ build the second network ------------>
        with paddle.static.program_guard(main_2, startup_2):
            # <------------ code: second network definition ------------>
            ... ...
        # compare the two results
        ... ...
Root cause investigation: at this stage, pir.Program may misbehave when multiple networks share one program; the exact cause is still being investigated.
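The reproduction snippet is not included in the comment above. As a purely hypothetical illustration of the pattern it describes (two networks built into one shared program and fetched in separate runs), a test along these lines should exercise the same code path; the class name, ops, and shapes below are invented for illustration only:

import unittest

import numpy as np

import paddle
from paddle.pir_utils import test_with_pir_api


class HypotheticalSharedProgramTest(unittest.TestCase):
    @test_with_pir_api
    def test_two_networks_one_program(self):
        paddle.enable_static()
        # Two tiny "networks" built into the same default program.
        x = paddle.static.data(name="x", shape=[-1, 4], dtype="float32")
        out1 = paddle.nn.functional.relu(x)  # stand-in for the first network
        out2 = paddle.nn.functional.sigmoid(x)  # stand-in for the second network

        exe = paddle.static.Executor(paddle.CPUPlace())
        feed = {"x": np.ones([2, 4], dtype="float32")}
        exe.run(paddle.static.default_main_program(), feed=feed, fetch_list=[out1])
        # Fetching a different target from the same shared program in a second
        # run is where the "fetch variable ... not belong to fetch_targets"
        # exception is reported under PIR, presumably because the fetch op
        # added by the first run is still present in the program.
        exe.run(paddle.static.default_main_program(), feed=feed, fetch_list=[out2])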
Dear developers 👋,
Hello everyone! While working on the "PIR Python API adaptation and upgrade task" you will likely run into 🐛 and fix 🔧 all kinds of bugs, and the same bugs tend to show up in other API adaptation scenarios as well. To reduce the overall development cost, we maintain this bug-fix handbook: please record bug descriptions and their solutions as comments on this issue.
A comment can follow the template below:
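The template body itself is not captured here; judging from the headings used in the comments above, it presumably has the following shape (a reconstruction, not the verbatim template):

Problem description
<error message or traceback>

Code to reproduce the problem
<minimal test case>

Solution or approach
<fixed code or workaround>

Root cause investigation
<analysis, related PRs or code locations>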