[SYCL][CUDA] Fix unexpected async memcpy #1798

bjoernknafla · 2020-06-01T21:52:56Z

When memory buffers are created with either of the following flags

* PI_MEM_FLAGS_HOST_PTR_USE
* PI_MEM_FLAGS_HOST_PTR_COPY

we copy the data to the device.

During memory buffer creation we do not have a CUDA stream and therefore call
cuMemcpyHtoD, which operates on the CUDA default stream.

This fix synchronizes with the default stream to ensure that data copying is
finished before any other PI operation uses it on a non-default stream.

Signed-off-by: Bjoern Knafla <bjoern@codeplay.com>

smaslov-intel · 2020-06-05T00:33:33Z

sycl/plugins/cuda/pi_cuda.cpp

+          // uses it.
+          if (retErr == PI_SUCCESS) {
+            CUstream defaultStream = 0;
+            retErr = PI_CHECK_ERROR(cuStreamSynchronize(defaultStream));


The "cuMemcpyHtoD" copy is synchronous, so what else do you need to synchronize?
Also, "cuStreamSynchronize" waits for all activity on the stream to finish. Could it be that you unnecessarily wait for something else too?

bjoernknafla · 2020-06-05

The CUDA default stream in the default legacy mode synchronizes with all other CUDA streams (it synchronizes with the work on them at the point of enqueueing work onto the default stream).

While creating a PI buffer we do not know which PI queue (and its associated CUDA stream) will operate on it. Therefore we use a CUDA function that implicitly uses the default stream.

However, all PI enqueue operations target a specific queue and with that a specific, queue associated CUDA stream. These streams are not synchronizing with each other, nor are they waiting/synchronizing with the default stream.

We observed race conditions when a buffer was created with a host pointer (with the above memcpy happening on the default stream) and the buffer was then used by a kernel enqueued to a different stream (which does not synchronize with other streams). Sometimes the memcpy on the default stream finished in time for the kernel on the other stream to get all the expected input data; sometimes the input data was not completely available on the device and testing would fail.

To make sure that the memcpy handled by the default stream is finished before any other stream (accessing the same device) operates on the data, we now explicitly synchronize on the default stream inside the buffer creation.
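The ordering argument above can be sketched in plain C++ without any CUDA calls. This is only a model under stated assumptions: `ModelBuffer`, `createBufferFromHost` and `runKernelOnOtherStream` are hypothetical names, `std::async` stands in for the asynchronous part of the transfer on the default stream, and `dma.wait()` stands in for the `cuStreamSynchronize` call added by this PR. It is not the code in pi_cuda.cpp.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PI buffer; `device` models device memory.
struct ModelBuffer {
  std::vector<int> device;
};

// Models buffer creation with a host pointer.
ModelBuffer createBufferFromHost(const std::vector<int> &host) {
  ModelBuffer buf;
  buf.device.assign(host.size(), 0);

  // Synchronous part: stage the pageable host data into "pinned" memory.
  std::vector<int> staged = host;

  // Asynchronous part: the "DMA transfer" runs on its own thread, which
  // models work that is still in flight on the default stream.
  int *dst = buf.device.data();
  std::future<void> dma =
      std::async(std::launch::async, [staged = std::move(staged), dst] {
        for (std::size_t i = 0; i < staged.size(); ++i)
          dst[i] = staged[i];
      });

  // Models the fix: block until the transfer is done before the buffer is
  // handed out. Without this wait, a reader on another thread could see
  // partially copied data.
  dma.wait();
  return buf;
}

// Models a kernel enqueued on a different, non-default stream: a separate
// thread that reads the buffer without any knowledge of the transfer above.
long long runKernelOnOtherStream(const ModelBuffer &buf) {
  long long sum = 0;
  std::thread worker([&] {
    sum = std::accumulate(buf.device.begin(), buf.device.end(), 0LL);
  });
  worker.join();
  return sum;
}
```

In this model, calling `createBufferFromHost` and then `runKernelOnOtherStream` always sees the full input because the wait happens before the buffer is returned, which mirrors why the synchronization is placed inside buffer creation rather than in each enqueue operation.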

This has performance implications. However, high-performance code can easily avoid them by first creating all SYCL (and with that PI) objects and then reusing them.

PS: Also, CUDA's async/synchronous naming is sometimes (and annoyingly) misleading. Copying data from pageable memory to a device is synchronous in staging the data into internal, pinned memory, but asynchronous for copying the data from the pinned memory to the device via DMA.

smaslov-intel · 2020-06-05

The buffer creation would not return until the copy is completed (guaranteed by cuMemcpyHtoD), right? So how is it possible that any command enqueue relying on that buffer being ready on the device is attempted before the copy is completed? [sorry if I am asking dumb questions]

> but asynchronous for copying the data from the pinned memory to the device

Ah, I guess this answers my earlier question.

> This has performance implications. However, they are easily avoided by high performance code by first creating all SYCL (and with that PI) objects and then reuse them.

That might be too strict a requirement for an average user. With the currently enabled performance testing, do you see any regressions?

bjoernknafla · 2020-06-05

We are seeing non-deterministic failures (as expected with a race condition) in our work with libraries, when we set up data for testing or profiling.

I tried to create a unit test at the PI level to trigger the problem. With certain buffer sizes (and numbers of buffers) I got the test to fail 100% of the time on my machine when run through LIT. But when running the unit test directly (not through LIT, which runs many tests in parallel and puts load on the machine) I couldn't get the test to fail. I expect that another machine, or another day with a different number of open apps on my machine, would also not show the race condition in LIT.

If this were purely CPU code I would push for thread sanitizer builds and testing, or for testing with valgrind. I am not aware of a CPU-GPU race condition checker tool 😞 I haven't tried whether the cuda-memcheck initcheck tool can spot this kind of problem, but my guess is that it only detects a problem when the race actually happens, and misses it when the kernel starts just late enough to hide the non-synchronized data copy...

smaslov-intel

I don't think it is worthwhile to spend time trying to LIT test a race condition.

bjoernknafla requested a review from a team as a code owner June 1, 2020 21:52

bjoernknafla requested a review from smaslov-intel June 1, 2020 21:52

bjoernknafla force-pushed the bjoern/fix-unexpected-async-mem-copy branch from 8f9603f to 3797020 Compare June 1, 2020 22:00

smaslov-intel reviewed Jun 5, 2020

smaslov-intel approved these changes Jun 5, 2020

bjoernknafla mentioned this pull request Jun 5, 2020

handler_mem_op.cpp and buffer_dev_to_dev.cpp tests fail sporadically on cuda #1508

Closed

bader merged commit 4f0a3df into intel:sycl Jun 5, 2020
