[SYCL][CUDA][libclc] Add asynchronous barrier #5303
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| //===----------------------------------------------------------------------===// | ||
| // | ||
| // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
| // See https://llvm.org/LICENSE.txt for license information. | ||
| // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
| // | ||
| //===----------------------------------------------------------------------===// | ||
|
|
||
| #include <spirv/spirv.h> | ||
| #include <spirv/spirv_types.h> | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF void __clc_BarrierInitialize(long* state, | ||
| int expected_count) { | ||
| __nvvm_mbarrier_init(state, expected_count); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF void | ||
| __clc_BarrierInvalidate(long* state) { | ||
| __nvvm_mbarrier_inval(state); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF long __clc_BarrierArrive(long* state) { | ||
| return __nvvm_mbarrier_arrive(state); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF long __clc_BarrierArriveAndDrop(long* state) { | ||
| return __nvvm_mbarrier_arrive_drop(state); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF long __clc_BarrierArriveNoComplete(long* state, int count) { | ||
| return __nvvm_mbarrier_arrive_noComplete(state, count); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF long __clc_BarrierArriveAndDropNoComplete(long* state, int count) { | ||
| return __nvvm_mbarrier_arrive_drop_noComplete(state, count); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF void __clc_BarrierCopyAsyncArrive(long* state) { | ||
| __nvvm_cp_async_mbarrier_arrive(state); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF void __clc_BarrierCopyAsyncArriveNoInc(long* state) { | ||
| __nvvm_cp_async_mbarrier_arrive_noinc(state); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF _CLC_CONVERGENT void __clc_BarrierWait(long* state, long arrival) { | ||
| while(!__nvvm_mbarrier_test_wait(state, arrival)){} | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF _CLC_CONVERGENT bool __clc_BarrierTestWait(long* state, long arrival) { | ||
| return __nvvm_mbarrier_test_wait(state, arrival); | ||
| } | ||
|
|
||
| _CLC_OVERLOAD _CLC_DEF _CLC_CONVERGENT void __clc_BarrierArriveAndWait(long* state) { | ||
| __clc_BarrierWait(state, __clc_BarrierArrive(state)); | ||
| } |
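Non-normative sketch: `__clc_BarrierWait` above is a polling loop over the test-wait builtin, and `__clc_BarrierArriveAndWait` simply feeds the token from arrive into wait. The same composition can be modelled on the host, assuming a phase counter stands in for the 64-bit state word. `SpinBarrier` and `run_rounds` are hypothetical names introduced here, and the model assumes that exactly `expected` threads arrive per phase and that each waits before arriving again.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Host-side model only: std::atomic stands in for the mbarrier state word.
class SpinBarrier {
public:
  explicit SpinBarrier(std::uint32_t expected)
      : expected_(expected), pending_(expected), phase_(0) {
    assert(expected > 0);
  }

  // Counterpart of __clc_BarrierArrive: returns a token naming the phase.
  std::uint64_t arrive() {
    const std::uint64_t token = phase_.load(std::memory_order_acquire);
    if (pending_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      // Last arrival: reset the count, then publish the new phase.
      pending_.store(expected_, std::memory_order_relaxed);
      phase_.fetch_add(1, std::memory_order_release);
    }
    return token;
  }

  // Counterpart of __clc_BarrierTestWait: non-blocking completion check.
  bool test_wait(std::uint64_t token) const {
    return phase_.load(std::memory_order_acquire) != token;
  }

  // Counterpart of __clc_BarrierWait: poll test_wait until it succeeds.
  void wait(std::uint64_t token) const {
    while (!test_wait(token)) std::this_thread::yield();
  }

  // Counterpart of __clc_BarrierArriveAndWait.
  void arrive_and_wait() { wait(arrive()); }

private:
  const std::uint32_t expected_;
  std::atomic<std::uint32_t> pending_;
  std::atomic<std::uint64_t> phase_;
};

// Runs `rounds` barrier phases on `threads` threads. Returns true if, after
// every wait, each thread observed all arrivals of that round.
bool run_rounds(unsigned threads, unsigned rounds) {
  SpinBarrier barrier(threads);
  std::vector<std::atomic<unsigned>> arrived(rounds);
  for (auto &count : arrived) count.store(0);
  std::atomic<bool> ok{true};

  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([&] {
      for (unsigned r = 0; r < rounds; ++r) {
        arrived[r].fetch_add(1, std::memory_order_relaxed);
        barrier.arrive_and_wait();
        if (arrived[r].load(std::memory_order_relaxed) != threads)
          ok.store(false);
      }
    });
  }
  for (auto &worker : pool) worker.join();
  return ok.load();
}
```

Calling `run_rounds(4, 100)` exercises the barrier across 100 reused phases. The drop, no-complete and copy-async variants are not modelled.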
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,172 @@ | ||||||
| = SYCL_EXT_ONEAPI_BARRIER | ||||||
|
||||||
| :source-highlighter: coderay | ||||||
| :coderay-linenums-mode: table | ||||||
|
|
||||||
| // This section needs to be after the document title. | ||||||
| :doctype: book | ||||||
| :toc2: | ||||||
| :toc: left | ||||||
| :encoding: utf-8 | ||||||
| :lang: en | ||||||
|
|
||||||
| :blank: pass:[ +] | ||||||
|
|
||||||
| // Set the default source code type in this document to C++, | ||||||
| // for syntax highlighting purposes. This is needed because | ||||||
| // docbook uses c++ and html5 uses cpp. | ||||||
| :language: {basebackend@docbook:c++:cpp} | ||||||
|
|
||||||
| == Introduction | ||||||
| IMPORTANT: This specification is a draft. | ||||||
|
|
||||||
| NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos. | ||||||
|
|
||||||
| This document describes an extension that adds `barrier`, which acts similarly to https://en.cppreference.com/w/cpp/thread/barrier[`std::barrier`]. | ||||||
|
|
||||||
| NOTE: CUDA devices can hardware-accelerate this barrier from compute capability 8.0 onward. | ||||||
|
|
||||||
| == Notice | ||||||
|
|
||||||
| Copyright (c) 2021 Intel Corporation. All rights reserved. | ||||||
|
|
||||||
| == Status | ||||||
|
|
||||||
| Working Draft | ||||||
|
|
||||||
| This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension. | ||||||
|
|
||||||
| Because the interfaces defined by this specification are not final and are subject to change, they are not intended to be used by shipping software products. | ||||||
|
|
||||||
| == Version | ||||||
|
|
||||||
| Revision: 1 | ||||||
|
|
||||||
| == Contact | ||||||
| Tadej Ciglarič, Codeplay (tadej 'dot' ciglaric 'at' codeplay 'dot' com) | ||||||
|
|
||||||
| == Dependencies | ||||||
|
|
||||||
| This extension is written against the SYCL 2020 specification, Revision 4. | ||||||
|
|
||||||
| == Feature Test Macro | ||||||
|
|
||||||
| This extension provides a feature-test macro as described in the core SYCL | ||||||
| specification section 6.3.3 "Feature test macros". Therefore, an | ||||||
| implementation supporting this extension must predefine the macro | ||||||
| `SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER` to one of the values defined in the table | ||||||
| below. Applications can test for the existence of this macro to determine if | ||||||
| the implementation supports this feature, or applications can test the macro's | ||||||
| value to determine which of the extension's APIs the implementation supports. | ||||||
|
|
||||||
| [%header,cols="1,5"] | ||||||
| |=== | ||||||
| |Value |Description | ||||||
| |1 |Initial extension version. Base features are supported. | ||||||
| |=== | ||||||
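Non-normative sketch: the macro test described above could be wrapped as follows. Only the macro name `SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER` comes from the specification text; `async_barrier_version` and `has_async_barrier` are hypothetical names introduced here. The snippet defines nothing itself, so outside a SYCL toolchain that predefines the macro the fallback branch is taken.

```cpp
// In a SYCL build the macro would be predefined by the implementation
// (assumption: after including the SYCL header); it is not defined here.
#if defined(SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER)
constexpr int async_barrier_version = SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER;
#else
constexpr int async_barrier_version = 0;  // extension not available
#endif

// Value 1 is the initial extension version listed in the table above.
constexpr bool has_async_barrier() { return async_barrier_version >= 1; }
```

Application code can then branch on `has_async_barrier()` and fall back to `group_barrier` when it returns `false`.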
|
|
||||||
| == Overview | ||||||
|
|
||||||
| This extension introduces an asynchronous barrier for CUDA devices. It extends `group_barrier` by splitting it into two calls: arrive and wait. The wait call blocks until the predetermined number of work-items in the same work-group have called arrive. This is very similar to https://en.cppreference.com/w/cpp/thread/barrier[`std::barrier`], introduced in C++20. | ||||||
|
|
||||||
| Implementing this requires some space in local memory, where the state of the barrier is kept (this is true even for CUDA devices of compute capability 8 and above, which support this functionality in hardware). | ||||||
|
||||||
|
|
||||||
| This extension introduces the `barrier` class. It must be used only in local memory. Once a `barrier` is initialized, it can be reused for multiple cycles of arrivals and waits. | ||||||
|
|
||||||
| This extension can only be used on devices that support independent forward progress. | ||||||
|
||||||
|
|
||||||
| === Cycle of arrivals and waits | ||||||
|
|
||||||
| When the predetermined number of arrivals has happened, the barrier moves into the next cycle, which unblocks the waits for the current cycle: any wait previously called with an arrival token from this cycle stops blocking, and future calls to wait with an arrival token from this cycle will not block. The pending count is also reset, and any future arrivals happen in the next cycle. However, at least one wait (or test wait returning `true`) must happen with an arrival token from the current cycle before any arrivals can happen in the next cycle. An arrival happening before such a wait is undefined behavior. | ||||||
|
||||||
|
|
||||||
| Wait and test wait can only be called with an arrival token from the current cycle (in which case wait will block until the barrier moves into the next cycle) or from the previous cycle (in which case wait will not block). Calling wait or test wait with an arrival token from any other cycle is undefined behavior. | ||||||
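Non-normative sketch: the cycle rules in this section can be modelled on the host with a small single-threaded state machine. `ModelBarrier` and all of its members are hypothetical names, and the token is assumed to be simply the index of the cycle in which the arrival happened. The assert in `test_wait` marks the token range outside which the text says behavior is undefined. The rule that a wait must precede arrivals in the next cycle is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Single-threaded model of the counting rules; not the CUDA implementation.
class ModelBarrier {
public:
  using arrival_token = std::uint64_t;  // assumption: the cycle index

  void initialize(std::uint32_t expected_count) {
    expected_ = expected_count;
    pending_ = expected_count;
    cycle_ = 0;
  }

  // Registers one arrival and returns the token of the current cycle.
  arrival_token arrive() {
    assert(pending_ > 0);  // barrier must be initialized with a nonzero count
    const arrival_token token = cycle_;
    if (--pending_ == 0) {   // last arrival: the barrier moves on
      ++cycle_;
      pending_ = expected_;  // the pending count is reset
    }
    return token;
  }

  // Returns true once the cycle that issued `token` has completed.
  bool test_wait(arrival_token token) const {
    // Only tokens from the current or the previous cycle are valid.
    assert(token == cycle_ || token + 1 == cycle_);
    return token != cycle_;
  }

  std::uint64_t cycle() const { return cycle_; }
  std::uint32_t pending() const { return pending_; }

private:
  std::uint32_t expected_ = 0;
  std::uint32_t pending_ = 0;
  std::uint64_t cycle_ = 0;
};
```

With an expected count of 2, a test wait on the first arrival's token reports `false`; after the second arrival the same token reports `true`, the cycle index has advanced by one, and the pending count is back at 2.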
|
||||||
|
|
||||||
| === Interface | ||||||
|
|
||||||
| The `barrier` class has the following member functions: | ||||||
|
||||||
|
|
||||||
| `void initialize(uint32_t expected_count)` : Initializes the barrier with given count of arrivals before wait unblocks. It only needs to be called by one work-item in work group. After the initialization a barrier operation (such as `group_barrier()`)needs to be executed by all work-items using the `barrier` object before they can use the newly initialized `barrier` object. | ||||||
|
||||||
Suggested change (the `-` row is the current text, the `+` row is the proposed text):
| - `void initialize(uint32_t expected_count)` : Initializes the barrier with given count of arrivals before wait unblocks. It only needs to be called by one work-item in work group. After the initialization a barrier operation (such as `group_barrier()`)needs to be executed by all work-items using the `barrier` object before they can use the newly initialized `barrier` object. | |
| + `void initialize(uint32_t expected_count)` : Initializes the barrier with an expected number of arrivals, representing the number of arrivals required to unblock calls to `wait`. This function only needs to be called by one work-item in work group. After the initialization a barrier operation (such as `group_barrier()`) needs to be executed by all work-items using the `barrier` object before they can use the newly initialized `barrier` object. If `expected_count` is greater than the value returned by `max()`, behavior is undefined. |
I left this at "only needs to be", but should it be stronger (i.e. "should only be")? Is the behavior well-defined if multiple work-items initialize the barrier? Is it unsafe to call initialize on a previously initialized barrier without calling invalidate first?
PTX documentation is not very clear on the subject, but it does suggest that initializing multiple times is fine, while invalidating multiple times is not.
Alternatively we could be more strict and specify that initialize should also only be called once.
If the equivalent CUDA feature allows this then I won't argue. But we should make a note of this -- I think for the more generic extension we should say that it's unsafe to initialize the same barrier multiple times (since otherwise some implementations may have to track whether a barrier has already been initialized to avoid paying additional overheads).
Suggested change (the `-` row is the current text, the `+` row is the proposed text):
| - `arrival_token arrive_no_complete(int32_t count)` : Executes arrival operation that counts as `count` arrivals and returns a token that is needed for the wait call corresponding to this arrival. This must not be the last arrival that causes the cycle to complete - it would be undefined behavior. | |
| + `arrival_token arrive_no_complete(int32_t count)` : Executes arrival operation that counts as `count` arrivals and returns a token that is needed for the `wait` call corresponding to this arrival. If this is the last arrival that causes the cycle to complete, behavior is undefined. |
It's not clear from reading this why the function exists, why the behavior is undefined or how a developer can use this function safely. Would it make sense to add a non-normative note explaining why the behavior is undefined?
It feels like there should be a restriction on count here given max() below. I'm not sure if the requirement should be that the result of count arrivals cannot cause the total number of arrivals to exceed expected_count or that it cannot exceed max().
Clarified the limitation on count and added some suggestions for how this function can be used.
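Non-normative sketch of the counting rule discussed in this thread, assuming the only constraint is the one quoted above: an `arrive_no_complete` call must leave at least one arrival outstanding. `PendingModel` is a hypothetical name, the count is modelled as unsigned for simplicity, and any additional limit tied to `max()` is deliberately left out because the thread does not settle it.

```cpp
#include <cassert>
#include <cstdint>

// Model of the pending-arrival counter of one barrier cycle.
// Tokens and waiting are left out; only the counting rule is shown.
struct PendingModel {
  std::uint32_t pending;  // arrivals still needed to complete the cycle

  // Returns true and subtracts `count` when the call would be well defined,
  // that is, when at least one arrival is still outstanding afterwards.
  // Returns false, changing nothing, where the real call would be undefined.
  bool arrive_no_complete(std::uint32_t count) {
    if (count >= pending) return false;
    pending -= count;
    return true;
  }

  // An ordinary arrival may complete the cycle; returns true when it does.
  bool arrive() {
    assert(pending > 0);
    return --pending == 0;
  }
};
```

With four pending arrivals, one work-item can register three of them through `arrive_no_complete(3)`, but the final arrival has to come from an ordinary `arrive`, which is the call that completes the cycle.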
I think adding some non-normative examples to the extension would help to clarify what these functions do. From reading the description alone I'm having trouble working out what they do or how to use them.
Added some examples.
does this really need to have that "reach" interface compared to std::barrier that it draws similarity with?
I am not sure what do you mean by "reach interface".
I mean that std::barrier is happy with just arrive/wait/drop, and no token. Can we simplify this API (doing so would likely improve the chances that other backends can support it too)?
While removing this token would make the interface more similar to std::barrier, I disagree that doing so would make it easier to implement for other backends. If other backends do not need this token they can pass around a dummy value. Meanwhile implementing this without the token for CUDA would lead to more complicated implementation and most likely additional limitations, such as only one barrier being usable at once.
> If other backends do not need this token they can pass around a dummy value.

But the token is used as an input in the core wait API, it can't be dummy.

> Meanwhile implementing this without the token for CUDA would lead to more complicated implementation and most likely additional limitations, such as only one barrier being usable at once.

CUDA implementation can use "token" under the hood, of course.

I don't have any practical suggestions, just expressing a desire for a simpler and a more standard interface for the feature. I will rely on @Pennycook review/approval for the feature definition and then review the implementation.
> But the token is used as an input in the core wait API, it can't be dummy.

I guess I was not clear enough. I meant that an implementation that does not need the token can return a dummy value from arrive and ignore that token in the wait call.

> CUDA implementation can use "token" under the hood, of course.

I have a feeling that hiding the token under the hood would, depending on how it is implemented, either limit the functionality or be inefficient or complicated. I can't provide any concrete arguments off the top of my head, though; I need to think about this some more.