【PIR】Tracking operators that need fixes for full operator coverage under the new IR #58266

xingmingyyj · 2023-10-20T04:46:44Z

Operators currently being fixed

No.	Task	Failure reason	PR	Developer	Issue notes
1	test_dpsgd_op	Op dpsgd should have corresponding OpInfo pd.dpsgd	#57826	@xingmingyyj	Issue 1: After adding the registration entry for dpsgd to `ops.yaml`, the run failed with `[Hint: Expected param->numel() == sz, but received param->numel():10710 != sz:0.]`. The cause was that no infermeta had been configured for dpsgd; adding an infermeta function in `paddle/phi/infermeta/ternary.cc` resolved it. Issue 2: `error: eager_api_dpsgd was not declared in this scope`. When this compile error appears while adding an operator, the operator name needs to be added to `NO_NEED_GEN_STATIC_ONLY_APIS` in `paddle/fluid/pir/dialect/op_generator/ops_api_gen.py`.
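2	test_exponential_op	Op exponential should have corresponding OpInfo pd.exponential & The difference in accuracy is too great & The difference in accuracy is too great	#58029	@xingmingyyj	Issue 1: The output of the exponential op is random. The test logic under the new IR runs the op once under the old IR and once under the new IR, then compares the two outputs. Because the two runs of the exponential op produce different outputs, the existing test logic does not apply. The fix is to create a separate `new_ir_op_test_no_check_list` for ops of this kind, so that their outputs are not checked.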
3	test_norm_all	Attribute cast error in InferMeta Context, the expected attribute type is `St6vectorIlSaIlEE`	#57942	@changeyoung98
4	test_pixel_unshuffle		#57521	@phlrain
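5	test_randint_op	(PreconditionNotMet) ProtoType 17 has no corresponding translator	#58295	@xingmingyyj	Issue 1: The dtype of a LOD_TENSOR can be RAW, but translating the RAW type is not currently supported, so the translation follows the InferMeta logic and assigns the dtype from the attributes to Out. Issue 2: The output of randint is also random, so it is added to `new_ir_op_test_no_check_list` and its output values are not checked.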
6	test_real_imag_op	(PreconditionNotMet) op [pd_op.real_grad] kernel output args defs should equal op outputs			Issue 1: This is mainly caused by the unit-test mechanism: the test fails when `FLAGS_enable_new_ir_in_executor` is enabled, but passes when `FLAGS_PIR_OPTEST` and `FLAGS_PIR_OPTEST_WHITE_LIST` are enabled. No action is taken for now.
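7	test_repeat_interleave_op	InvalidArgumentError: repeats should be larger than zero	#58379	@xingmingyyj	Issue 1: Under the new IR, the repeat_interleave op must be translated to either `repeat_interleave_with_tensor_index` or `repeat_interleave`, depending on the input `RepeatsTensor`. Adding a RepeatInterLeaveOpTranscriber achieves this, but the corresponding grad op needs the same treatment. Issue 2: The error `the type of data we are trying to retrieve (float32) does not match the type of data (float64)` arises mainly because the tensor declared when building the network has dtype float32 while the data supplied in the test file is float64. Under the old IR, the GetExpectedKernelType function can select the kernel based on the data type of the input data. The new IR does not support this yet and selects the kernel based on the dtype of x. For this kind of problem, the unit-test file must be modified so that the input data type matches the declared dtype.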
8	test_seed_op	(NotFound) The kernel with key (CPU, Undefined(AnyLayout), int64) of kernel `seed` is not registered. Selected wrong DataType `int64`. Paddle support following DataTypes: int32.	#58552	@xingmingyyj	Issue 1: This error is mainly caused by `GetExpectedKernelType` behaving differently under the old and new IR. Under the old IR the kernel type is `INT32`, whereas under the new IR `GetExpectedKernelType` returns the dtype of Out. Modifying `GetExpectedKernelType` under the new IR resolved the problem.
9	test_share_data_op		#57212	@yangguohao
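10	test_sparse_momentum_op	Op sparse_momentum should have corresponding OpInfo pd.sparse_momentum	#58536	@xingmingyyj	Issue 1: When parsing `runtime_info.kernel_param`, `OpYamlInfoParser` puts mutable attributes into `kernel_fn_attr_params`. For sparse_momentum_op as defined under the new IR (which defines Scalar axis), this causes the `axis` attribute to be missing from the `AttributeMap`. So for such `legacy op`s, mutable attributes are for now all placed in `kernel_fn_tensor_params`. The solution is to add one more attribute to `OpYamlInfoParser` that indicates whether the op being translated is a `legacy op`.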
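11	test_sum_op	FatalError: `Segmentation fault` is detected by the operating system.			Issue 1: This is mainly caused by the unit-test mechanism: the test fails when `FLAGS_enable_new_ir_in_executor` is enabled, but passes when `FLAGS_PIR_OPTEST` and `FLAGS_PIR_OPTEST_WHITE_LIST` are enabled. No action is taken for now.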
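12	test_uniform_random_op	FatalError: `Segmentation fault` is detected by the operating system.			Issue 1: This is mainly caused by the unit-test mechanism: the test fails when `FLAGS_enable_new_ir_in_executor` is enabled, but passes when `FLAGS_PIR_OPTEST` and `FLAGS_PIR_OPTEST_WHITE_LIST` are enabled. No action is taken for now.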
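13	test_unique	PreconditionNotMetError: Tensor holds no memory. Call Tensor::mutable_data firstly.			Issue 1: Under the new IR, test_unique fails with PreconditionNotMetError: Tensor holds no memory. Call Tensor::mutable_data firstly. The cause is that, by default, the old-IR unique is translated only to the new-IR unique. Under the old IR, unique selects one of two kernels, unique or unique_raw, based on the value of the attribute `is_sorted`. The new IR has no such mechanism, so the old-IR unique must be translated to either the unique or the unique_raw op under the new IR, depending on the value of is_sorted. A definition of unique_raw was added under the new IR for this purpose. Issue 2: After the fix, running in a GPU environment triggers a null-pointer exception in the GPU version of the kernel. This is a problem in the kernel-selection logic: under the old IR, `GetReduceGradExpectedKernelType` selects the CPU kernel in a GPU environment, but the new IR is not adapted to GetReduceGradExpectedKernelType, so kernel selection goes wrong in a GPU environment. This has not been handled yet.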
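14	test_uniform_random_bf16_op	Op uniform_random_batch_size_like should have corresponding OpInfo pd_op.uniform_random_batch_size_like, RuntimeError: (NotFound) Variable is not initialized.1558: [Hint: holder_ should not be null.]	#58904	@xingmingyyj	Issue 1: The `holder_` of the `Variable` corresponding to `input` is null when `PhiContext` is built. On the Python side, the feed_names passed in when _StandaloneExecutor executes its run function are empty. Under the old IR, the run function is executed in program_interpreter, whose Variable-initialization mechanism initializes each Variable when it is constructed. pir_interpreter does not initialize Variables up front. It initializes input variables according to feed_names, so if feed_names is empty, input is never initialized, which causes the later runtime error. The solution is to pass `feed` to exe.run().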
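
The no-check handling described for test_exponential_op and test_randint_op can be sketched roughly as follows. This is a minimal illustration, not Paddle's test code: `NEW_IR_OP_TEST_NO_CHECK_LIST`, `run_random_op`, and `check_old_vs_new` are hypothetical names, and the random op is a stand-in built on Python's `random` module.

```python
import random

# Hypothetical stand-in for the no-check list named in the notes; the real
# list lives in Paddle's test utilities and may hold different entries.
NEW_IR_OP_TEST_NO_CHECK_LIST = {"test_exponential_op", "test_randint_op"}


def run_random_op(seed):
    """Stand-in for one execution of a random op (not a Paddle kernel)."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(4)]


def check_old_vs_new(test_name, run_old_ir, run_new_ir):
    """Run the op once per IR and compare outputs unless the test is exempt."""
    old_out = run_old_ir()
    new_out = run_new_ir()
    if test_name in NEW_IR_OP_TEST_NO_CHECK_LIST:
        # Random ops: both runs must still complete, but the outputs are
        # not compared because they legitimately differ between runs.
        return "skipped"
    if old_out != new_out:
        raise AssertionError(f"{test_name}: old IR and new IR outputs differ")
    return "matched"
```

With this shape, a random op still exercises both execution paths (so missing OpInfo or translator errors are caught), and only the value comparison is dropped.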
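
The translation rules noted for test_repeat_interleave_op and test_unique both map one old-IR op to one of two new-IR ops. A rough Python sketch of that dispatch is shown below. The function `translate_legacy_op` is hypothetical (the actual translators are C++ transcribers in Paddle), and two details are assumptions not stated in the notes: that a non-empty `RepeatsTensor` selects the `_with_tensor_index` variant, and that `is_sorted == True` maps to `unique` while `False` maps to `unique_raw`.

```python
def translate_legacy_op(op_type, inputs, attrs):
    """Return the new-IR op name for a legacy op (illustrative only).

    inputs: dict mapping input slot name -> list of variable names
    attrs:  dict mapping attribute name -> value
    """
    # Handle the grad op with the same rule as its forward op, as the
    # notes require for repeat_interleave.
    is_grad = op_type.endswith("_grad")
    base = op_type[: -len("_grad")] if is_grad else op_type

    if base == "repeat_interleave":
        # Assumption: a non-empty RepeatsTensor input selects the
        # tensor-index variant.
        if inputs.get("RepeatsTensor"):
            target = "repeat_interleave_with_tensor_index"
        else:
            target = "repeat_interleave"
    elif base == "unique":
        # Assumption: is_sorted=True -> unique, otherwise unique_raw.
        target = "unique" if attrs.get("is_sorted", False) else "unique_raw"
    else:
        target = base

    return target + "_grad" if is_grad else target
```

For example, `translate_legacy_op("unique", {}, {"is_sorted": False})` selects `unique_raw` under the assumption above.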
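
The feed-driven initialization described for test_uniform_random_bf16_op can be modeled with a toy scope. This sketch only mimics the behaviour described in the notes; `FeedInitScope` and its methods are hypothetical and do not correspond to pir_interpreter's real classes.

```python
class FeedInitScope:
    """Toy scope in which only fed variables receive a holder."""

    def __init__(self, var_names, feed):
        # Every declared variable starts without a holder.
        self.holders = {name: None for name in var_names}
        # Only names present in the feed are initialized.
        for name, value in feed.items():
            self.holders[name] = value

    def get(self, name):
        holder = self.holders[name]
        if holder is None:
            raise RuntimeError(
                f"Variable {name} is not initialized: holder_ should not be null"
            )
        return holder
```

With an empty feed, `FeedInitScope(["input"], {}).get("input")` raises, which mirrors the reported failure. Passing `{"input": data}` initializes the variable, which corresponds to the fix of supplying `feed` to exe.run().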

xingmingyyj added status/new-issue 新建 type/others 其他问题 labels Oct 20, 2023

paddle-bot bot assigned lugimzzz Oct 20, 2023

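xingmingyyj changed the title ~~[pir] Tracking operators that need fixes for full operator coverage under the new IR~~ 【PIR】Tracking operators that need fixes for full operator coverage under the new IR Oct 20, 2023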

paddle-bot bot added the PFCC Paddle Framework Contributor Club，https://github.com/PaddlePaddle/community/tree/master/pfcc label Oct 20, 2023

Ligoml removed status/new-issue 新建 type/others 其他问题 labels Oct 24, 2023

xingmingyyj closed this as completed Jun 2, 2024

paddle-bot bot added the status/close 已关闭 label Jun 2, 2024
