Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BugFix with ParseInputDataType from LodTensorArray #32918

Merged
merged 2 commits into from
May 17, 2021

Conversation

Aurelius84
Copy link
Contributor

@Aurelius84 Aurelius84 commented May 14, 2021

PR types

Bug fixes

PR changes

Others

Describe

1. PR内容

修复LoDTensorArray作为Op的输入时,在GetExepectedKernelType函数中ParseInputDataType的极其隐藏bug

2. BUG原因

报错信息:

2021-05-13 14:16:32 ----------------------
2021-05-13 14:16:32 Error Message Summary:
2021-05-13 14:16:32 ----------------------
2021-05-13 14:16:32 NotFoundError: Operator (slice) does not have kernel for data_type[bool]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN].
2021-05-13 14:16:32   [Hint: Expected kernel_iter != kernels.end(), but received kernel_iter == kernels.end().] (at /paddle/paddle/fluid/framework/operator.cc:1278)
2021-05-13 14:16:32   [operator < slice > error]  [operator < run_program > error]

问题代码:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/operator.cc#L1541-L1572

     // ...(省略)
      } else if (var->IsType<LoDTensorArray>()) {
        auto t_arr = var->Get<LoDTensorArray>();   // 此处为赋值构造函数
        // VLOG(1) << "t_arr :" << &t_arr;
        // auto t_arr1 = var->Get<LoDTensorArray>();   // 执行两次,是连个不同的copy
       // VLOG(1) << "t_arr1 :" << &t_arr1;
        for (size_t j = 0; j < t_arr.size(); j++) {
          if (t_arr[j].IsInitialized()) {
            t = &(t_arr[j]);
          }
        }  // 出了else if 的作用域,则t指向了一个已析构了的Tensor对象
      }
      if (t != nullptr) {
        // VLOG(1) << "tensor ptr:" << t << "holder: " << t->Holder() << " , " <<  t->Holder()->Ptr();
        proto::VarType::Type tmp = t->type();  // 存在一定的风险。因为t是个野指针

image

3. 解决方案

将t_arr改为引用,因为此处每次返回指向的Tensor都是内在的Tensor地址。

PR之前:

I0514 10:06:15.922755  3916 device_context.cc:94] DeviceContextPool Get: CUDAPlace(0)
I0514 10:06:15.922760  3916 device_context.cc:94] DeviceContextPool Get: CUDAPlace(0)
I0514 10:06:15.922771  3916 variable.h:40] a ptr: 0x3ec8d908 holder->Ptr(): 0x3ec8d908
I0514 10:06:15.922776  3916 operator.cc:1555] t_arr: 0x7fdd850267b0     # 第一次copy
I0514 10:06:15.922781  3916 variable.h:40] a ptr: 0x3ec8d908 holder->Ptr(): 0x3ec8d908
I0514 10:06:15.922785  3916 operator.cc:1557] t_arr1: 0x7fdd850267d0    # 第二次copy (不同变量)
I0514 10:06:15.922791  3916 operator.cc:1566] tensor array[0] ptr: 0x7fdcdc00efe0 holder: 0x7fdcdc0092f0 , 0x7fdd1f8c8b00
# 以下4次的 t 的指针都是一致的,但是后两行为出了if block后,t->Holder()就瞎指了
I0514 10:06:15.922794  3916 operator.cc:1568]  tensor ptr: 0x7fdcdc00efe0 holder: 0x7fdcdc0092f0 , 0x7fdd1f8c8b00 tmp dtype int  
I0514 10:06:15.922799  3916 operator.cc:1570]  tensor ptr: 0x7fdcdc00efe0 holder: 0x7fdcdc0092f0 , 0x7fdd1f8c8b00 tmp dtype int
I0514 10:06:15.922804  3916 operator.cc:1572]  tensor ptr: 0x7fdcdc00efe0 holder: 0x7fdcdc000118 , 0x7fdcdc00b0a0 tmp dtype int
I0514 10:06:15.922809  3916 operator.cc:1580]  tensor ptr: 0x7fdcdc00efe0 holder: 0x7fdcdc000118 , 0x7fdcdc00b0a0 tmp dtype int

修复之后:

I0514 08:16:24.946592 17077 device_context.cc:94] DeviceContextPool Get: CUDAPlace(0)
I0514 08:16:24.946602 17077 device_context.cc:94] DeviceContextPool Get: CUDAPlace(0)0x7fc3ef8c8700
# 以下4次的 t 的指针都是一致的,后两行为出了if block后,t->Holder()也是一致的
I0514 08:16:24.946631 17077 operator.cc:1556] tensor array[0] ptr: 0x7fc3a4013df8 holder: 0x7fc3a400de40 , 0x7fc3ef8c8600
I0514 08:16:24.946635 17077 operator.cc:1557] tensor array[0] ptr: 0x7fc3a4013df8 holder: 0x7fc3a400de40 , 0x7fc3ef8c8600
I0514 08:16:24.946640 17077 operator.cc:1561] tmp dtype int tensor ptr: 0x7fc3a4013df8 holder: 0x7fc3a400de40 , 0x7fc3ef8c8600
I0514 08:16:24.946645 17077 operator.cc:1569] tmp dtype int tensor ptr: 0x7fc3a4013df8 holder: 0x7fc3a400de40 , 0x7fc3ef8c8600

4. 之前为啥没有触发

  1. LoDTensorArray的dtype作为OpKernel的选择,此类分支被触发的比较少
  2. 个人觉得,虽然t是个野指针,但其为 undefined behavior,如果block外的访问的内存还未被释放或未被重新重新分配和修改,可能误打误撞返回有效值。
  3. 从多次执行可以稳定复现,log中也出现了有时会返回无效值

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Copy link
Member

@zhhsplendid zhhsplendid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@zhiqiu zhiqiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aurelius84 Aurelius84 merged commit 5f1c07d into PaddlePaddle:develop May 17, 2021
Aurelius84 added a commit to Aurelius84/Paddle that referenced this pull request May 19, 2021
* BugFix with ParseInputDataType from LodTensorArray

* BugFix with ParseInputDataType from LodTensorArray
lanxianghit pushed a commit that referenced this pull request May 20, 2021
* BugFix with ParseInputDataType from LodTensorArray

* BugFix with ParseInputDataType from LodTensorArray
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants