This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add thrust::cuda::par_nosync execution policy #1515

Closed
jedbrown opened this issue Aug 30, 2021 · 9 comments · Fixed by #1568
Labels
P1: should have (Necessary, but not critical.)
type: enhancement (New feature or request.)

Comments

@jedbrown

Thrust-1.9.4 made the breaking API change:

Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.

The rationale seems to be (14f8a54):

* All Thrust synchronous algorithms for the CUDA backend now actually
  synchronize. Previously, any algorithm that did not allocate temporary
  storage (counterexample: `thrust::sort`) and did not have a
  computation-dependent result (counterexample: `thrust::reduce`) would actually
  be launched asynchronously.  Additionally, synchronous algorithms that
  allocated temporary storage would become asynchronous if a custom allocator
  was supplied that did not synchronize on allocation/deallocation, unlike
  `cudaMalloc`/`cudaFree`. So, now `thrust::for_each`, `thrust::transform`,
  `thrust::sort`, etc are truly synchronous. In some cases this may be a
  performance regression; if you need asynchrony, use the new asynchronous
  algorithms.

This is disruptive for libraries that wish to use Thrust internally without exposing its objects to the caller. The new thrust::async interfaces require holding futures that must eventually be waited on. In iterative linear algebra, that wait might come after an "essential"[^1] synchronization point, like a dot product reporting its result on the host. But the different operations can be implemented in different libraries (which may or may not call Thrust), so internal use of thrust::async seems to require all transitive callers to export Thrust futures or a suitable wrapper.

So my questions:

  1. Is this transitive disruption avoidable? Is there a way to disown the futures without blocking?
  2. What is your recommended migration path for legacy software that uses thrust::transform (for example) and desires stream-based nonblocking semantics?

[^1]: I use scare quotes because it's possible to deliver such results on-device and thus avoid an extra round-trip latency.
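For concreteness, the constraint described above can be sketched as follows. This is a hypothetical library-internal helper, assuming the Thrust >= 1.9.4 async API, where a discarded `thrust::device_event` waits for completion in its destructor:

```cpp
#include <thrust/async/transform.h>
#include <thrust/device_vector.h>

// Hypothetical library-internal helper that wants non-blocking semantics.
struct scale_op {
  float a;
  __host__ __device__ float operator()(float x) const { return a * x; }
};

thrust::device_event scale_async(thrust::device_vector<float>& v, float a) {
  // Returns immediately, but the event must be kept alive by the caller:
  // letting it go out of scope blocks in its destructor.
  return thrust::async::transform(v.begin(), v.end(), v.begin(), scale_op{a});
}
```

Every transitive caller of `scale_async` now has to hold or forward this event, which is the "export Thrust futures" burden described above.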

@alliepiper
Collaborator

There is an ongoing effort to expose the Thrust CUDA kernels as CUB device functions (https://github.com/NVIDIA/cub). These have the non-blocking, stream-ordered semantics that you're looking for.

Unfortunately, we don't have a suitable alternative for transform available in CUB yet, so there isn't a good option other than thrust::transform or thrust::async::transform at the moment. If you require this algorithm without blocking or using futures, a custom transform kernel would be needed. This is a large gap in our API that I'm hoping to have fixed soon.

I should point out that your use case is a common request, and we're planning to provide better support for stream-ordered, non-blocking, "fire-and-forget" algorithms in future iterations of the Thrust and CUB APIs.
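Until something like cub::DeviceTransform exists, the "custom transform kernel" stopgap mentioned above might look like this minimal, hypothetical sketch (names and launch configuration are illustrative):

```cpp
// Minimal fire-and-forget replacement for thrust::transform on a stream.
template <typename T, typename Op>
__global__ void transform_kernel(const T* in, T* out, size_t n, Op op) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) out[i] = op(in[i]);
}

template <typename T, typename Op>
void transform_async(const T* in, T* out, size_t n, Op op, cudaStream_t s) {
  const int block = 256;
  const int grid  = static_cast<int>((n + block - 1) / block);
  // Enqueues on `s` and returns immediately; no synchronization.
  transform_kernel<<<grid, block, 0, s>>>(in, out, n, op);
}
```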

@alliepiper alliepiper added the question (Inquiry.) and type: enhancement (New feature or request.) labels Sep 10, 2021
@jedbrown
Author

Thanks for your helpful reply. Should I file a cub::DeviceTransform issue so there's something to subscribe to?

BTW, I notice that the CUB website (linked from its GitHub) hasn't learned about any releases since early 2018.

@alliepiper
Collaborator

> Thanks for your helpful reply. Should I file a cub::DeviceTransform issue so there's something to subscribe to?

Sure, feel free to open an issue for this.

> BTW, I notice that the CUB website (linked from its GitHub) hasn't learned about any releases since early 2018.

The Thrust and CUB docs are a mess currently. There's an ongoing effort to clean them up in #1475.

@brycelelbach
Collaborator

brycelelbach commented Oct 6, 2021 via email

@alliepiper
Collaborator

> We could add a par_nosync policy for folks who want the old behavior.

That could work, though it may be misleading, since some Thrust algorithms launch multiple kernels that require synchronization between launches. But we could potentially drop the last sync.

@brycelelbach
Collaborator

brycelelbach commented Oct 6, 2021 via email

@jedbrown
Author

jedbrown commented Oct 6, 2021

Indeed, this would be really convenient, since our alternative right now is to replace PETSc's use of thrust::transform with raw CUDA to avoid some embarrassing latency costs.

@alliepiper
Collaborator

I think it's doable. We can just document it as a hint that the implementation may ignore if needed.

I'll try to spend some time looking at this soon, since it's a common request.

@alliepiper alliepiper changed the title Evolution for stream-based nonblocking after Thrust-1.9.4 Add thrust::cuda::par_nosync execution policy Oct 8, 2021
@alliepiper alliepiper self-assigned this Oct 8, 2021
@alliepiper alliepiper added the P1: should have (Necessary, but not critical.) label and removed the question (Inquiry.) label Oct 8, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 8, 2021
@alliepiper alliepiper removed their assignment Nov 12, 2021
@fkallen
Contributor

fkallen commented Nov 12, 2021

@allisonvacanti I created a pull request with a possible implementation of par_nosync.

@alliepiper alliepiper linked a pull request Nov 15, 2021 that will close this issue
petscbot pushed a commit to petsc/petsc that referenced this issue Jan 4, 2022
Version 1.16 of Thrust adds the policy thrust::cuda::par_nosync, which
accepts a stream argument and does not synchronize, thus avoiding a
stall in which the CPU waits to learn that a kernel has completed
before launching the next operation.

NVIDIA/thrust#1568

This feature (not blocking for kernels that don't need to) had been
removed (breaking change) in Thrust-1.9.4 to simplify error handling
behavior and because a futures-based async interface had been deemed
sufficient. This issue describes the history and rationale for the new
par_nosync feature.

NVIDIA/thrust#1515
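A usage sketch of the policy this commit refers to (assuming Thrust 1.16's thrust::cuda::par_nosync as added in NVIDIA/thrust#1568; the functor and names are illustrative):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

struct negate_op {
  __host__ __device__ float operator()(float x) const { return -x; }
};

void negate_async(const float* in, float* out, size_t n, cudaStream_t stream) {
  // par_nosync skips the trailing stream synchronization, so this call
  // returns as soon as the kernel is enqueued on `stream`.
  thrust::transform(thrust::cuda::par_nosync.on(stream),
                    in, in + n, out, negate_op{});
}
```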