This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add thrust::cuda::par_nosync execution policy #1515

Closed
jedbrown opened this issue Aug 30, 2021 · 9 comments · Fixed by #1568
Labels
P1: should have (Necessary, but not critical.)
type: enhancement (New feature or request.)

Comments

@jedbrown

Thrust-1.9.4 made the breaking API change:

Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.

The rationale seems to be (14f8a54):

* All Thrust synchronous algorithms for the CUDA backend now actually
  synchronize. Previously, any algorithm that did not allocate temporary
  storage (counterexample: `thrust::sort`) and did not have a
  computation-dependent result (counterexample: `thrust::reduce`) would actually
  be launched asynchronously.  Additionally, synchronous algorithms that
  allocated temporary storage would become asynchronous if a custom allocator
  was supplied that did not synchronize on allocation/deallocation, unlike
  `cudaMalloc`/`cudaFree`. So, now `thrust::for_each`, `thrust::transform`,
  `thrust::sort`, etc are truly synchronous. In some cases this may be a
  performance regression; if you need asynchrony, use the new asynchronous
  algorithms.

This is disruptive for libraries that wish to use Thrust internally without exposing its objects to the caller. The new thrust::async interfaces require holding futures that must eventually be waited on. In iterative linear algebra, that wait might come after an "essential"[^1] synchronization point, like a dot product reporting its result on the host. But the different operations can be implemented in different libraries (which may or may not call Thrust), so internal use of thrust::async seems to require all transitive callers to export Thrust futures or a suitable wrapper.

So my questions:

  1. Is this transitive disruption avoidable? Is there a way to disown the futures without blocking?
  2. What is your recommended migration path for legacy software that uses thrust::transform (for example) and desires stream-based nonblocking semantics?

[^1]: I use scare quotes because it's possible to deliver such results on-device and thus avoid an extra round-trip latency.
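For concreteness, the constraint described above can be sketched as follows. This is a hypothetical library-internal helper, assuming the Thrust >= 1.9.4 async API, where a discarded `thrust::device_event` waits for completion in its destructor:

```cpp
#include <thrust/async/transform.h>
#include <thrust/device_vector.h>

// Hypothetical library-internal helper that wants non-blocking semantics.
struct scale_op {
  float a;
  __host__ __device__ float operator()(float x) const { return a * x; }
};

thrust::device_event scale_async(thrust::device_vector<float>& v, float a) {
  // Returns immediately, but the event must be kept alive by the caller:
  // letting it go out of scope blocks in its destructor.
  return thrust::async::transform(v.begin(), v.end(), v.begin(), scale_op{a});
}
```

Every transitive caller of `scale_async` now has to hold or forward this event, which is the "export Thrust futures" burden described above.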

@alliepiper
Collaborator

There is an ongoing effort to expose the Thrust CUDA kernels as CUB device functions (https://github.com/NVIDIA/cub). These have the non-blocking, stream-ordered semantics that you're looking for.

Unfortunately, we don't have a suitable alternative for transform available in CUB yet, so there isn't a good option other than thrust::transform or thrust::async::transform at the moment. If you require this algorithm without blocking or using futures, a custom transform kernel would be needed. This is a large gap in our API that I'm hoping to have fixed soon.

I should point out that your use case is a common request, and we're planning to provide better support for stream-ordered, non-blocking, "fire-and-forget" algorithms in future iterations of the Thrust and CUB APIs.
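Until something like cub::DeviceTransform exists, the "custom transform kernel" stopgap mentioned above might look like this minimal, hypothetical sketch (names and launch configuration are illustrative):

```cpp
// Minimal fire-and-forget replacement for thrust::transform on a stream.
template <typename T, typename Op>
__global__ void transform_kernel(const T* in, T* out, size_t n, Op op) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) out[i] = op(in[i]);
}

template <typename T, typename Op>
void transform_async(const T* in, T* out, size_t n, Op op, cudaStream_t s) {
  const int block = 256;
  const int grid  = static_cast<int>((n + block - 1) / block);
  // Enqueues on `s` and returns immediately; no synchronization.
  transform_kernel<<<grid, block, 0, s>>>(in, out, n, op);
}
```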

@alliepiper alliepiper added the question (Inquiry.) and type: enhancement (New feature or request.) labels Sep 10, 2021
@jedbrown
Author

Thanks for your helpful reply. Should I file a cub::DeviceTransform issue so there's something to subscribe to?

BTW, I notice that the CUB website (linked from its GitHub) hasn't learned about any releases since early 2018.

@alliepiper
Collaborator

> Thanks for your helpful reply. Should I file a cub::DeviceTransform issue so there's something to subscribe to?

Sure, feel free to open an issue for this.

> BTW, I notice that the CUB website (linked from its GitHub) hasn't learned about any releases since early 2018.

The Thrust and CUB docs are a mess currently. There's an ongoing effort to clean them up in #1475.

@brycelelbach
Collaborator

brycelelbach commented Oct 6, 2021 via email

@alliepiper
Collaborator

> We could add a par_nosync policy for folks who want the old behavior.

That could work, though it may be misleading, since some Thrust algorithms launch multiple kernels that require synchronization between launches. But we could potentially drop the last sync.

@brycelelbach
Collaborator

brycelelbach commented Oct 6, 2021 via email

@jedbrown
Author

jedbrown commented Oct 6, 2021

Indeed, this would be really convenient, since our alternative right now is to replace PETSc's use of thrust::transform with raw CUDA to avoid some embarrassing latency costs.

@alliepiper
Collaborator

I think it's doable. We can just document it as a hint that the implementation may ignore if needed.

I'll try to spend some time looking at this soon, since it's a common request.

@alliepiper alliepiper changed the title Evolution for stream-based nonblocking after Thrust-1.9.4 Add thrust::cuda::par_nosync execution policy Oct 8, 2021
@alliepiper alliepiper self-assigned this Oct 8, 2021
@alliepiper alliepiper added the P1: should have (Necessary, but not critical.) label and removed the question (Inquiry.) label Oct 8, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 8, 2021
@alliepiper alliepiper removed their assignment Nov 12, 2021
@fkallen
Contributor

fkallen commented Nov 12, 2021

@allisonvacanti I created a pull request with a possible implementation of par_nosync.

@alliepiper alliepiper linked a pull request Nov 15, 2021 that will close this issue
petscbot pushed a commit to petsc/petsc that referenced this issue Jan 4, 2022
Version 1.16 of Thrust adds the policy thrust::cuda::par_nosync, which
accepts a stream argument and does not synchronize, thus avoiding a
stall in which the CPU waits to learn that a kernel has completed
before launching the next operation.

NVIDIA/thrust#1568

This feature (not blocking for kernels that don't need to) had been
removed (breaking change) in Thrust-1.9.4 to simplify error handling
behavior and because a futures-based async interface had been deemed
sufficient. This issue describes the history and rationale for the new
par_nosync feature.

NVIDIA/thrust#1515
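A usage sketch of the policy this commit refers to (assuming Thrust 1.16's thrust::cuda::par_nosync as added in NVIDIA/thrust#1568; the functor and names are illustrative):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

struct negate_op {
  __host__ __device__ float operator()(float x) const { return -x; }
};

void negate_async(const float* in, float* out, size_t n, cudaStream_t stream) {
  // par_nosync skips the trailing stream synchronization, so this call
  // returns as soon as the kernel is enqueued on `stream`.
  thrust::transform(thrust::cuda::par_nosync.on(stream),
                    in, in + n, out, negate_op{});
}
```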