Bug-fix handbook for the PIR Python API adaptation and upgrade task #58259
Comments
Problem description: the test fails with the following traceback:
Traceback (most recent call last):
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/pir_utils.py", line 119, in impl
func(*args, **kwargs)
File "/luq/docker/paddle-docker/Paddle-bak/test/legacy_test/test_mse_loss.py", line 53, in test_mse_loss
fetch_list=[output],
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1633, in run
return_numpy=return_numpy,
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1936, in _run_pir_impl
scope,
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1026, in get_pir_program_and_executor
program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
global_block, fetch_list, fetch_var_name, fetch_op
File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 426, in has_fetch_operations
if op.name() == fetch_op:
AttributeError: 'Operator' object has no attribute 'name'
Code to reproduce the problem:
# imports needed to make the snippet self-contained
import unittest

import numpy as np

import paddle
from paddle import base
from paddle.base import core
from paddle.base.executor import Executor
from paddle.pir_utils import test_with_pir_api


class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        sub = input_val - label_val
        np_result = np.mean(sub * sub)
        input_var = paddle.static.data(
            name="input", shape=[-1, 3], dtype="float32"
        )
        label_var = paddle.static.data(
            name="label", shape=[-1, 3], dtype="float32"
        )
        output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
        for use_cuda in (
            [False, True] if core.is_compiled_with_cuda() else [False]
        ):
            place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
            exe = Executor(place)
            # Running against base.default_main_program() under
            # @test_with_pir_api is what triggers the failure (see the
            # root-cause notes below).
            (result,) = exe.run(
                base.default_main_program(),
                feed={"input": input_val, "label": label_val},
                fetch_list=[output],
            )
            np.testing.assert_allclose(np_result, result, rtol=1e-05)
Solution or approach:
class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        sub = input_val - label_val
        np_result = np.mean(sub * sub)
        # Build the network in an explicit Program instead of relying on
        # base.default_main_program(), which IRGuard does not switch.
        main = paddle.static.Program()
        startup = paddle.static.Program()
        with paddle.static.program_guard(main, startup):
            input_var = paddle.static.data(
                name="input", shape=[-1, 3], dtype="float32"
            )
            label_var = paddle.static.data(
                name="label", shape=[-1, 3], dtype="float32"
            )
            output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
        for use_cuda in (
            [False, True] if core.is_compiled_with_cuda() else [False]
        ):
            place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
            exe = Executor(place)
            (result,) = exe.run(
                main,
                feed={"input": input_val, "label": label_val},
                fetch_list=[output],
            )
            np.testing.assert_allclose(np_result, result, rtol=1e-05)
Root cause investigation: IRGuard does not execute the logic at Paddle/python/paddle/pir_utils.py, lines 72 to 77 (commit 21d7d04).
That code was commented out in PR #57956, for the following reason: pir_guard does not switch base.default_main_program(); if it did, the PIR path in OpTest would be unable to obtain the old static graph's proto when calling get_kernel_signature. The proper fix is to write a new get_kernel_signature for PIR that does not depend on the old IR structures. For now, unit tests that use base.default_main_program can be switched to paddle.static.default_main_program instead.
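To make the interim workaround concrete, here is a minimal sketch, assuming a test that previously resolved its program through base; the program/executor names and the feed/fetch placeholders are illustrative, not taken from any existing test:

import paddle

paddle.enable_static()

# Before (fails under @test_with_pir_api, because IRGuard leaves it untouched):
#     prog = base.default_main_program()

# Interim workaround, per the note above: resolve the program through paddle.static.
prog = paddle.static.default_main_program()

exe = paddle.static.Executor(paddle.CPUPlace())
# exe.run(prog, feed={...}, fetch_list=[...])  # placeholders for the test's own feed/fetch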
Problem description: Under the @test_with_pir_api decorator, two different networks are built in the same Program; the first run executes the first network and the second run executes the second network. The executor fails while running the second network:
Traceback (most recent call last):
File "/home/aistudio/Paddle-gpu/build/python/paddle/pir_utils.py", line 119, in impl
func(*args, **kwargs)
File "/home/aistudio/Paddle-gpu/test/legacy_test/test_fused_feedforward_op.py", line 319, in test_static
fetch_list=[res],
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1644, in run
return_numpy=return_numpy,
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1947, in _run_pir_impl
scope,
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1037, in get_pir_program_and_executor
program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
global_block, fetch_list, fetch_var_name, fetch_op
File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 430, in has_fetch_operations
"There is a fetch op in Program which will fetch variable that is not belong to fetch_targets."
Exception: There is a fetch op in Program which will fetch variable that is not belong to fetch_targets.
Code to reproduce the problem:
Solution or approach:
# imports needed to make the snippet self-contained
import unittest

import numpy as np

import paddle
from paddle.pir_utils import test_with_pir_api


class APITestStaticFusedFFN(unittest.TestCase):
    @test_with_pir_api
    def test_static(self):
        paddle.enable_static()
        dtype = "float32"
        layer_norm_dtype = "float32"
        batch_size = 1
        d_model = 8
        dim_feedforward = 8
        x_data = np.random.random(
            (batch_size, d_model, dim_feedforward)
        ).astype(dtype)
        linear1_weight_data = np.random.random(
            (d_model, dim_feedforward)
        ).astype(dtype)
        linear1_bias_data = np.zeros(dim_feedforward).astype(dtype)
        linear2_weight_data = np.random.random(
            (dim_feedforward, d_model)
        ).astype(dtype)
        linear2_bias_data = np.zeros(d_model).astype(dtype)
        ln1_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln1_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        ln2_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln2_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        main_1 = paddle.static.Program()
        startup_1 = paddle.static.Program()
        main_1.random_seed = 42
        # <------------ build the first network ------------>
        with paddle.static.program_guard(main_1, startup_1):
            # <------------ code: first network definition ------------>
            ... ...
        main_2 = paddle.static.Program()
        startup_2 = paddle.static.Program()
        main_2.random_seed = 42
        # <------------ build the second network ------------>
        with paddle.static.program_guard(main_2, startup_2):
            # <------------ code: second network definition ------------>
            ... ...
        # compare the two results
        ... ...
Root cause investigation: at this stage, pir.Program may misbehave when multiple networks share one program; the exact cause is still being investigated.
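The reproduction snippet is not included in the comment above. As a purely hypothetical illustration of the pattern it describes (two networks built into one shared program and fetched in separate runs), a test along these lines should exercise the same code path; the class name, ops, and shapes below are invented for illustration only:

import unittest

import numpy as np

import paddle
from paddle.pir_utils import test_with_pir_api


class HypotheticalSharedProgramTest(unittest.TestCase):
    @test_with_pir_api
    def test_two_networks_one_program(self):
        paddle.enable_static()
        # Two tiny "networks" built into the same default program.
        x = paddle.static.data(name="x", shape=[-1, 4], dtype="float32")
        out1 = paddle.nn.functional.relu(x)  # stand-in for the first network
        out2 = paddle.nn.functional.sigmoid(x)  # stand-in for the second network

        exe = paddle.static.Executor(paddle.CPUPlace())
        feed = {"x": np.ones([2, 4], dtype="float32")}
        exe.run(paddle.static.default_main_program(), feed=feed, fetch_list=[out1])
        # Fetching a different target from the same shared program in a second
        # run is where the "fetch variable ... not belong to fetch_targets"
        # exception is reported under PIR, presumably because the fetch op
        # added by the first run is still present in the program.
        exe.run(paddle.static.default_main_program(), feed=feed, fetch_list=[out2])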
Dear developers 👋,
Hello everyone! While working on the "PIR Python API adaptation and upgrade task" you will likely run into 🐛 and fix 🔧 all kinds of bugs, and the same bugs tend to show up in other API adaptation scenarios as well. To reduce the overall development cost, we maintain this bug-fix handbook: please record bug descriptions and their solutions as comments on this issue.
A comment can follow the template below:
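The template body itself is not captured here; judging from the headings used in the comments above, it presumably has the following shape (a reconstruction, not the verbatim template):

Problem description
<error message or traceback>

Code to reproduce the problem
<minimal test case>

Solution or approach
<fixed code or workaround>

Root cause investigation
<analysis, related PRs or code locations>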