[SYCL] Possible scheduler 2.0 deadlock with write buffer sync #143

Closed
agozillon opened this issue May 15, 2019 · 8 comments

@agozillon (Contributor) commented May 15, 2019

So I have two example snippets of code that write to a single value in a buffer twice using two kernels, and use accessors to write on the device and read on the host (I believe both snippets are legal SYCL code, but please correct me if I'm wrong and making incorrect assumptions).

The first doesn't work but the second does; the only difference (from a user perspective) is the braces { } around the submit calls, which I believe force a wait/synchronization point in SYCL (perhaps I am misunderstanding, however). Both snippets work with the old scheduler when -DSCHEDULER_10 is passed to the compiler, which leads me to think it's less about the legality of the two examples and more about an incorrect synchronization event.

Tested with the following command on an unaltered top of tree (as of May 14th): $ISYCL_BIN_DIR/clang++ -std=c++11 -fsycl scheduler_2_buffer_block.cpp -o scheduler_2_buffer_block -lOpenCL

I tinkered with this for a while, and here's what I've found:

  • If the get_access in the second command group is just a read accessor, it won't block.
  • It seems to block when waiting on the second kernel to complete; however, it doesn't appear to be the kernel itself blocking, but rather a dependent event generated from the clEnqueueUnmapMemObject invocation in memory_manager.cpp (commenting out the contents of unmap makes the non-working snippet work).
  • You can replace the second host-side get_access with a queue wait and it'll still block (sketched after the two snippets below).
  • As far as the OpenCL runtime (and I) can tell, the OpenCL events generated aren't erroneous.

Before I dig any deeper, I thought it might be worth finding out whether this is a bug or a misconception/silliness on my end, and whether you are already aware of it and working on it!

Invalid, blocks when trying to wait for the second kernel submit:

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

int main() {
  cl::sycl::queue q;
  cl::sycl::buffer<int, 1> ob((int[1]){0}, 1);
  q.submit([&](handler &cgh) {
    auto wb = ob.get_access<access::mode::read_write>(cgh);
    cgh.single_task<class k1>([=]() {
      wb[0] += 1;
    });
  });
  auto rb = ob.get_access<access::mode::read>(); // host accessor; stays alive until the end of main
  std::cout << rb[0] << "\n";

  q.submit([&](handler &cgh) {
    auto wb = ob.get_access<access::mode::read_write>(cgh);
    cgh.single_task<class k2>([=]() {
      wb[0] += 1;
    });
  });
  auto rb2 = ob.get_access<access::mode::read>(); // blocks here: waits on k2, which cannot run while rb is still alive
  std::cout << rb2[0] << "\n";

  return 0;
}

Valid, no block:

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

int main() {
  cl::sycl::queue q;
  cl::sycl::buffer<int, 1> ob((int[1]){0}, 1);
  {
    q.submit([&](handler &cgh) {
      auto wb = ob.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class k1>([=]() {
        wb[0] += 1;
      });
    });
    auto rb = ob.get_access<access::mode::read>();
    std::cout << rb[0] << "\n";
  } // rb is destroyed here, releasing the buffer before the next submit

  {
    q.submit([&](handler &cgh) {
      auto wb = ob.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class k2>([=]() {
        wb[0] += 1;
      });
    });
    auto rb2 = ob.get_access<access::mode::read>();
    std::cout << rb2[0] << "\n";
  }

  return 0;
}
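
For reference, the queue-wait variant mentioned in the list above is roughly the following change to the second half of the invalid snippet (just a sketch, not the exact code I ran); it still blocks, only at the q.wait() rather than at the accessor construction:

  q.submit([&](handler &cgh) {
    auto wb = ob.get_access<access::mode::read_write>(cgh);
    cgh.single_task<class k2>([=]() {
      wb[0] += 1;
    });
  });
  // Replacing the second host-side get_access with a queue wait still blocks here.
  q.wait();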
@romanovvlad (Contributor)

@agozillon
Hi,

We are aware of a hang that occurs when a host accessor is created while another host accessor is still alive; the hang happens during construction of the second host accessor. It seems you have hit exactly the same issue. It should hang on Intel's GPU device but work fine on Intel's CPU device.
We are working on a fix.
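
A minimal sketch of that scenario (assuming the usual includes; this is just an illustration, not the exact reproducer we use) is two host accessors to the same buffer with overlapping lifetimes:

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main() {
  int data[1] = {0};
  buffer<int, 1> buf(data, 1);

  // The first host accessor is still alive here...
  auto first = buf.get_access<access::mode::read>();
  // ...so constructing a second one hangs on the affected configurations.
  auto second = buf.get_access<access::mode::read>();
  return 0;
}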

@agozillon (Contributor, Author)

Ah, that's great that you're working on a fix, thank you very much! I originally thought it was a problem on our end (it deadlocks with our runtime as well), but noticed it also happens on an unmodified build.

However, I appear to be seeing it on CPU devices, unless I am misunderstanding something (which is quite possible). When I query the device, the SYCL runtime categorizes it as a CPU. I also do not use the Intel Compute runtime, just the experimental CPU runtime (I've tried both the SYCL-modified variation stated in GetStartedWithSYCLCompiler.md and the released version on the Intel website).

Thanks for the quick answer as always.

@romanovvlad (Contributor)

Ohh, sorry, there is another problem in your example: we block all operations on the buffer while a host accessor is alive. From the spec (3.6.5.1):
Host Accessors: The constructor for a host accessor waits for all kernels that modify the same buffer (or image) in any queues to complete and then copies data back to host memory before the constructor returns. Any command groups with requirements to the same memory object cannot execute until the host accessor is destroyed (see 3.5).
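
As a sketch of the quoted rule (reusing the q and ob from your examples; k3 is just a placeholder kernel name), the host accessor has to go out of scope before a later command group that uses the same buffer can execute:

{
  // The constructor waits for all kernels that modify ob to finish and then
  // copies the data back to host memory.
  auto host_acc = ob.get_access<access::mode::read>();
  std::cout << host_acc[0] << "\n";
} // host_acc is destroyed here; only now can further command groups using ob run.

q.submit([&](handler &cgh) {
  auto wb = ob.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class k3>([=]() { wb[0] += 1; });
});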

@agozillon (Contributor, Author) commented May 16, 2019

Ah, thank you very much for the information! So the deadlock makes sense and is correct from the specification's standpoint (I'll try to be a little more thorough with the specification in the future, my apologies). It worked in triSYCL and with the previous scheduler, so I incorrectly assumed it was legal. @keryell has also recently informed me that it's a bug in the existing triSYCL implementation.

I am happy for this issue to be closed, if you wish to do so.

@keryell (Contributor) commented May 16, 2019

Yes, I think the Intel implementation is more correct than triSYCL, but still too conservative for me.
Actually, when Intel made some bug reports against triSYCL last year, it opened a can of (good) worms that we need to clarify. Thanks for this! :-)

In triSYCL: triSYCL/triSYCL#190

Some (private...) discussions inside the committee that we need to clarify for SYCL 2019: https://gitlab.khronos.org/sycl/Specification/issues/154 and https://gitlab.khronos.org/sycl/Specification/issues/174.
Now that the specification is open-sourced, we can also use https://github.com/KhronosGroup/SYCL-Docs for public discussions :-)

Probably the answers should come from @jeffhammond, @tgmattso, @jcownie-intel...

Spoiler alert: I cannot see any reason we should deviate from the usual RAR, RAW, WAR & WAW dependencies used for at least 60 years of parallel programming, especially since we spent a lot of time in SYCL to express all of these with the accessors... :-)
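
For instance (just a sketch, reusing q and ob from the snippets above; reader1 and reader2 are placeholder kernel names), two command groups that only read the same buffer form a RAR pair, so nothing should force them to be serialized:

q.submit([&](handler &cgh) {
  auto r = ob.get_access<access::mode::read>(cgh);
  cgh.single_task<class reader1>([=]() { (void)r[0]; });
});
q.submit([&](handler &cgh) {
  auto r = ob.get_access<access::mode::read>(cgh);
  cgh.single_task<class reader2>([=]() { (void)r[0]; });
});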

@romanovvlad (Contributor)

@keryell Could you please clarify why the Intel implementation is too conservative?
If we change the read_write accessor to just read in the second command group, it will not hang:

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

int main() {
  cl::sycl::queue q;
  cl::sycl::buffer<int, 1> ob((int[1]){0}, 1);
  q.submit([&](handler &cgh) {
    auto wb = ob.get_access<access::mode::read_write>(cgh);
    cgh.single_task<class k1>([=]() {
      wb[0] += 1;
    });
  });
  auto rb = ob.get_access<access::mode::read>();
  std::cout << rb[0] << "\n";

  q.submit([&](handler &cgh) {
    //auto wb = ob.get_access<access::mode::read_write>(cgh);
    auto wb = ob.get_access<access::mode::read>(cgh);
    cgh.single_task<class k2>([=]() {
      //wb[0] += 1;
      (void)wb[0];
    });
  });
  auto rb2 = ob.get_access<access::mode::read>();
  std::cout << rb2[0] << "\n";

  return 0;
}

The code above is lowered to the following OpenCL API calls:

...
>>>> clEnqueueMapBuffer: [ map count = 0 ] queue = 0x252c2d8, buffer = 0x257b818, non-blocking, map_flags = CL_MAP_READ (1), offset = 0, cb = 4, event_wait_list = ( size = 1 )[ 0x3503678 ]
<<<< clEnqueueMapBuffer created event = 0x341a718: [ map count = 1 ] returned 0x257b140 -> CL_SUCCESS
>>>> clEnqueueUnmapMemObject: [ map count = 1 ] queue = 0x252c2d8, memobj = 0x257b818, mapped_ptr = 0x257b140, event_wait_list = ( size = 2 )[ 0x341a718, 0x359c148 ]
<<<< clEnqueueUnmapMemObject created event = 0x3180f58: [ map count = 1 ] -> CL_SUCCESS
>>>> clWaitForEvents: event_list = ( size = 1 )[ 0x341a718 ]
<<<< clWaitForEvents -> CL_SUCCESS
1
>>>> clCreateKernel: program = 0x257bd88, kernel_name = _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2
<<<< clCreateKernel: returned 0x31250a8 -> CL_SUCCESS
>>>> clSetKernelArg( _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2 ): kernel = 0x31250a8, index = 0, size = 8, value = 0x257b818
<<<< clSetKernelArg -> CL_SUCCESS
>>>> clSetKernelArg( _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2 ): kernel = 0x31250a8, index = 1, size = 8, value = 0x1
<<<< clSetKernelArg -> CL_SUCCESS
>>>> clSetKernelArg( _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2 ): kernel = 0x31250a8, index = 2, size = 8, value = 0x1
<<<< clSetKernelArg -> CL_SUCCESS
>>>> clSetKernelArg( _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2 ): kernel = 0x31250a8, index = 3, size = 8, value = (nil)
<<<< clSetKernelArg -> CL_SUCCESS
>>>> clEnqueueNDRangeKernel( _ZTSZZ4mainENK3$_0clERN2cl4sycl7handlerEE2k2 ): queue = 0x252c2d8, kernel = 0x31250a8, global_work_offset = < 0 >, global_work_size = < 1 >, local_work_size = < NULL >, event_wait_list = ( size = 2 )[ 0x3503678, 0x3503678 ]
<<<< clEnqueueNDRangeKernel created event = 0x31882c8 -> CL_SUCCESS
>>>> clCreateUserEvent: context = 0x20140d8
<<<< clCreateUserEvent: returned 0x342f9e8 -> CL_SUCCESS
>>>> clEnqueueMapBuffer: [ map count = 1 ] queue = 0x252c2d8, buffer = 0x257b818, non-blocking, map_flags = CL_MAP_READ (1), offset = 0, cb = 4, event_wait_list = ( size = 1 )[ 0x3503678 ]
<<<< clEnqueueMapBuffer created event = 0x2028ac8: [ map count = 2 ] returned 0x257b140 -> CL_SUCCESS
>>>> clEnqueueUnmapMemObject: [ map count = 2 ] queue = 0x252c2d8, memobj = 0x257b818, mapped_ptr = 0x257b140, event_wait_list = ( size = 2 )[ 0x2028ac8, 0x342f9e8 ]
<<<< clEnqueueUnmapMemObject created event = 0x3140458: [ map count = 2 ] -> CL_SUCCESS
>>>> clWaitForEvents: event_list = ( size = 1 )[ 0x2028ac8 ]
<<<< clWaitForEvents -> CL_SUCCESS

...

@keryell (Contributor) commented May 22, 2019

Great! So you have implemented what I think has to be implemented, then. :-)

@keryell (Contributor) commented May 22, 2019

@agozillon if you are good with this, I guess you can close the issue.
Probably we need to make the SYCL specification clearer on this point.
Happy to see that this SYCL implementation follows the usual behavior for dependencies.
