Add thrust::cuda::par_nosync execution policy #1515
Comments
There is an ongoing effort to expose the Thrust CUDA kernels as CUB device functions (https://github.com/NVIDIA/cub). These have the non-blocking, stream-ordered semantics that you're looking for. Unfortunately, we don't have a suitable alternative for [...].

I should point out that your use case is a common request, and we're planning to provide better support for stream-ordered, non-blocking "fire-and-forget"-type algorithms in future iterations of the Thrust and CUB APIs. |
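For readers unfamiliar with the semantics being referred to: CUB's device-wide algorithms enqueue work on a stream and return to the host immediately. A minimal sketch using the real `cub::DeviceReduce::Sum` two-phase pattern (`sum_async` is a hypothetical wrapper name; `cudaMallocAsync` assumes CUDA 11.2+):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: CUB device-wide algorithms are stream-ordered and do not block
// the host thread. The two-phase call (query temp storage size, then run)
// is the standard CUB pattern.
void sum_async(const int* d_in, int* d_out, int n, cudaStream_t stream) {
  void*  d_temp      = nullptr;
  size_t temp_bytes  = 0;
  // First call: only computes the required temporary storage size.
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaMallocAsync(&d_temp, temp_bytes, stream);
  // Second call: enqueues the reduction on `stream` and returns immediately.
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaFreeAsync(d_temp, stream);
  // No cudaStreamSynchronize here: the caller decides when to wait.
}
```

The caller synchronizes the stream (or orders later work on it) whenever the result is actually needed, which is exactly the "fire-and-forget" behavior discussed in this thread.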
Thanks for your helpful reply. Should I file a cub::DeviceTransform issue so there's something to subscribe to?

BTW, I notice that the CUB website (linked from its GitHub) hasn't learned about any releases since early 2018. |
Sure, feel free to open an issue for this.
The Thrust and CUB docs are a mess currently. There's an ongoing effort to clean them up in #1475. |
We could add a `par_nosync` policy for folks who want the old behavior.
--
Bryce Adelstein Lelbach aka wash (he/him/his)
US Programming Language Standards (PL22) Chair
ISO C++ Library Evolution Chair
CppCon and C++Now Program Chair
HPC Programming Models Architect @ NVIDIA
|
That could work, though it may be misleading since some thrust algorithms run multiple kernels that require synchronization between launches. But we could potentially drop the last sync. |
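For concreteness, the behavioral difference being discussed looks roughly like this (a sketch; `par_nosync` did not yet exist at the time of this exchange and later shipped in Thrust 1.16):

```cuda
#include <thrust/transform.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>

void scale(const float* d_x, float* d_y, int n, cudaStream_t stream) {
  using thrust::placeholders::_1;

  // Blocking: synchronizes the calling thread with `stream` before
  // returning, even if the caller does not need the result yet.
  thrust::transform(thrust::cuda::par.on(stream),
                    d_x, d_x + n, d_y, 2.0f * _1);

  // Non-blocking: enqueues the kernel on `stream` and returns at once.
  // The caller must synchronize before reading d_y on the host.
  thrust::transform(thrust::cuda::par_nosync.on(stream),
                    d_x, d_x + n, d_y, 2.0f * _1);
}
```

The caveat raised above still applies: for algorithms that internally launch several dependent kernels, only the final synchronization can safely be dropped.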
Yah, that's why we added the sync in the first place, for consistency. But `par_nosync` would be nice for folks just calling `transform`, which is a lot of people.
|
Indeed, this would be really convenient, since our alternative right now is to replace PETSc's use of [...]. |
I think it's doable. We can just document it as a hint that the implementation may ignore if needed. I'll try to spend some time looking at this soon, since it's a common request. |
@allisonvacanti I created a pull request with a possible implementation of the `thrust::cuda::par_nosync` execution policy. |
Version 1.16 of Thrust adds the thrust::cuda::par_nosync policy, which accepts a stream argument and does not synchronize, preventing a stall in which the CPU must learn that a kernel has completed before its next operation is launched (NVIDIA/thrust#1568). This non-blocking behavior for kernels that don't need to block had been removed in Thrust 1.9.4 (a breaking change) to simplify error-handling behavior, and because a futures-based async interface had been deemed sufficient. This issue describes the history and rationale for the new par_nosync feature (NVIDIA/thrust#1515).
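A minimal usage sketch of the new policy (assuming Thrust >= 1.16 and `nvcc --extended-lambda` for the device lambdas): several operations are enqueued back-to-back on one stream with no intervening host synchronization, and the host waits only once at the end.

```cuda
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  thrust::device_vector<int> v(1 << 20, 1);
  auto nosync = thrust::cuda::par_nosync.on(stream);

  // Each call returns as soon as its kernel is enqueued on `stream`;
  // stream ordering guarantees the second runs after the first.
  thrust::for_each(nosync, v.begin(), v.end(),
                   [] __device__ (int& x) { x += 1; });
  thrust::for_each(nosync, v.begin(), v.end(),
                   [] __device__ (int& x) { x *= 2; });

  // One explicit wait, only when results are actually needed on the host.
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```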
Thrust-1.9.4 made the breaking API change:
The rationale seems to be (14f8a54):
This is disruptive for libraries that wish to use Thrust internally without exposing its objects to the caller. The new `thrust::async` interfaces require holding the futures to be waited on eventually. In iterative linear algebra, this might be after an "essential"[^1] synchronization point, like a dot product reporting its result on the host. But the different operations can be implemented in different libraries (which may or may not call Thrust), so using `async` seems to imply that internal use of `thrust::async` requires all transitive callers to export Thrust futures or a suitable wrapper.

So my questions:
- [...] `thrust::transform` (for example) and desires stream-based nonblocking semantics?

[^1]: I use scare quotes because it's possible to deliver such results on-device and thus avoid an extra round-trip latency.
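For context, the futures-based alternative the issue text contrasts with looks roughly like this (a sketch of the real `thrust::async::transform` API; `example` is a hypothetical function name):

```cuda
#include <thrust/async/transform.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>

// thrust::async algorithms return an event/future object that must be kept
// alive and eventually waited on. This is what forces transitive callers to
// carry Thrust future objects (or wrappers) through their own APIs.
void example(thrust::device_vector<int>& in, thrust::device_vector<int>& out) {
  auto fut = thrust::async::transform(in.begin(), in.end(), out.begin(),
                                      thrust::negate<int>());
  // ... enqueue or perform other work here ...
  fut.wait();  // the future, not a stream, is the synchronization handle
}
```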