[SYCL] Add non-uniform group classes #8202
Conversation
Will there be tests for this over on llvm-test-suite?
Yes - the tests for this basic functionality are in intel/llvm-test-suite#1574. clang-format is currently complaining about them, but otherwise I think they work.
Converting this to draft to prevent it from being merged while I work on the group algorithms for these group types.
Hi @Pennycook, I had forgotten about this. The branch was in a working state, but it is now very old. I can remove some of the dead wood in it, try to merge the current tip into it, and post it here so you can have a look. I'm not sure to what extent it might overlap with what you plan, but it sounds like it may be quite similar.
I don't think it's a direct replacement, but it may be helpful to people working with these sorts of masks. There are still cases where a mask-based approach is useful.
I'd definitely be interested in seeing that, thanks! I do think a lot of this will end up being useful for implementing the other group types, which are likely to use masks as an implementation detail. Aside: I plan to re-purpose this PR to add the other new group types, rather than close it and open another one.
OK. I am starting to remember things better now. I think that in the cuda backend we basically only need the mask.
Although, additionally, for post-Volta we may have to think carefully about ensuring that control flow is converged, if the spec requires this for certain operations, as you point out. So is the main point that the design of this will mostly depend on what other hardware needs, and that Nvidia should then be pretty easy to fit into whatever design is required?
So actually you can get an idea of how it worked from looking at this test branch that is already available: intel/llvm-test-suite@intel...JackAKirk:llvm-test-suite:group_collectives_mask (Note that I think you can ignore all the "async_copy" tests, because as I understand it this function is not included in the extension document as a masked variant in any form, and the implementations we made for these will be completely removed).
Also, here is the corresponding implementation to go with the above linked tests: sycl...JackAKirk:llvm:bitmask-cuda. A lot of this is the async stuff, which you can ignore. You can completely ignore the changes made to libclc/generic/include/clc/async/common.h.
So I think I just need confirmation of the status of the group definitions in this document (https://github.com/AerialMantis/llvm/blob/560269af36543cd88ecd380205b4721abac15354/sycl/doc/extensions/proposed/sycl_ext_oneapi_non_uniform_groups.asciidoc), and then I can update the bitmask-cuda branch linked above to use these group definitions instead of the sub_group_mask helper functions - i.e. replacing the version that takes a sub_group_mask with a version that just takes a specialized group (roughly the two shapes in the sketch below)?
Although I remember you saying that this interface would be changed in some way, so would the group version not be correct? Thanks
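(For illustration, the two interface shapes under discussion might look roughly like the hypothetical declarations below; the function names and exact signatures are not from either branch.)

#include <sycl/sycl.hpp>

// Mask-based helper, in the style of the bitmask-cuda branch: the caller
// passes an explicit sub_group_mask describing the participating work-items.
template <typename T>
T masked_reduce(sycl::sub_group sg, sycl::ext::oneapi::sub_group_mask mask,
                T x, sycl::plus<T> binary_op);

// Group-based version, in the style of the extension proposal: the
// participating work-items are described by a specialized group object
// (e.g. an opportunistic_group), so no separate mask argument is needed.
template <typename Group, typename T>
T group_reduce(Group g, T x, sycl::plus<T> binary_op);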
I'm not 100% sure yet, but I think we might be able to skip the "in mask" check. The groups are defined such that it's UB if the developer calls one of the algorithms using a group that the work-item doesn't belong to. We might want to keep the check around for debugging purposes, but if it impacts performance we should consider removing it.
I'd phrase it slightly differently, but I think we're on the same page. The group types are being designed to be fairly high-level and portable group types that map to common use-cases. The NVIDIA intrinsics are quite flexible -- because of the arbitrary mask parameter -- so the NVIDIA implementation of each group type really just boils down to figuring out how to construct the mask. Is that what you meant?
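(As a rough illustration of "constructing the mask" on the CUDA backend - a sketch only, assuming sub_group_mask::extract_bits() from the sub_group_mask extension and the PTX activemask.b32 instruction; the helper names are hypothetical.)

#include <cstdint>
#include <sycl/sycl.hpp>

#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// A group that already stores a sub_group_mask (its member bits) can expose
// those bits directly as the intrinsic's mask operand.
inline uint32_t membership_mask(sycl::ext::oneapi::sub_group_mask m) {
  uint32_t bits = 0;
  m.extract_bits(bits);
  return bits;
}

// An opportunistic group can take its mask from the hardware's view of which
// lanes in the warp are currently active.
inline uint32_t active_lanes_mask() {
  uint32_t mask;
  asm volatile("activemask.b32 %0;" : "=r"(mask));
  return mask;
}
#endif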
I think the interface will change slightly. Sorry I don't have my code ready to share; I'll try to get that uploaded by the end of this week -- it should be easier to see the differences then.
The latest version is here, and it includes the changes I alluded to before.
Yes, exactly. The group version is what we're aiming for.
OK I see, makes sense.
Yeah exactly.
OK, this makes sense to me. Thanks for the explanation.
Great, I think we are on the same page. I can work towards this from the cuda backend side via the bitmask-cuda branch.
Some additional NonUniform instructions operating on masks are required to implement basic non-uniform groups functionality (e.g. group membership). Signed-off-by: John Pennycook <john.pennycook@intel.com>
Also includes is_fixed_topology and is_user_constructed. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
@JackAKirk I've just added the implementations of the classes themselves (i.e. without group function or algorithm support), and I consider those ready for review. As you'll see, they're quite straightforward really. There are two TODOs that I need your help in addressing before things will work on the CUDA or HIP backends.
I'm happy to defer these TODOs until later, if you'd like to address them in a later PR. I'll work on getting the tests up this afternoon, and share a link here when they're available.
The compiler didn't like the comparison between id<1> and uint32_t, so I've inserted a static cast. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
/verify with intel/llvm-test-suite#1574
@cperkinsintel, @bader: The failures didn't seem to be related to any of the changes here (something about a missing path on the CI machines?). I can't seem to get things to re-run, either; could you please take a look?
Closed and reopened in an attempt to retrigger the checks. |
@Pennycook I'm probably telling you something you already know, but on Jenkins/llvm-test-suite it looks like there are failing tests.
@cperkinsintel - I'd actually missed that the tests started running correctly, so thanks for letting me know. I think I've fixed things, so I'll try and re-run that job.
SYCL_EXTERNAL has been renamed DPCPP_SYCL_EXTERNAL
@cperkinsintel: Things seem to be passing now.
/verify with intel/llvm-test-suite#1574
@Pennycook - failure on LevelZero GPU
Do you know which GPU we test on? I know this is cliché, but the test passes on my machine!
/verify with intel/llvm-test-suite#1574
@intel/llvm-reviewers-runtime: @steffenlarsen and I were able to implement a workaround for the failing test. We're going to need to revisit this later on, but I think this can be merged now.
#include <sycl/ext/oneapi/experimental/cuda/barrier.hpp>
#include <sycl/ext/oneapi/experimental/opportunistic_group.hpp>
#include <sycl/ext/oneapi/experimental/tangle_group.hpp> |
Compile time is already not great. Can we avoid adding these includes to sycl.hpp and instead ask users to include the headers with experimental APIs explicitly?
I'm not opposed to that, but would it be acceptable to do that in a separate PR? There are some other experimental headers that I think we'd want to move (e.g. cuda_barrier) and some other headers that aren't obviously marked experimental but which implement experimental features (e.g. kernel properties).
I think it could be confusing if these new groups are the only experimental feature requiring users to include extra headers.
If we were to adopt this as a convention for experimental features, I think we'd need to update the documentation for our extensions too. So I'd like to hear @gmlueck's opinion about that.
I suggest we remove all includes of headers with API extensions (not just experimental ones) and leave only the headers for standard API features in sycl.hpp. I remember discussing this already, but I don't remember whether we decided to clarify it in any documentation. On the other hand, as far as I know there are no guidelines that say extension headers should be included in the standard sycl.hpp header either.
I'm not sure we can do this easily for existing includes, as it might be considered an API-breaking change, but at least we can avoid doing it for new headers. Once breaking changes are allowed, we can clean up the sycl.hpp file.
would it be acceptable to do that in a separate PR?
I think so, but please make sure this is done before any products based on intel/llvm are released. Otherwise it becomes an API-breaking change, and we usually keep the API stable for a long period of time.
It is definitely an API breaking change if we start requiring the user to #include a new header in order to use an extension.
I'd like to have a consistent strategy before we start changing things. I agree with @Pennycook that it will be confusing if only a few extensions require the user to #include a header. Possible options are:
- Require users to #include for experimental extensions but not for supported or deprecated ones.
- Require users to #include for all extensions. In this case, we need some plan to manage the transition. For example, can we keep existing extensions as part of sycl/sycl.hpp with a deprecation warning?
Note that some extensions cannot be extracted into a separate header. For example, extensions that add member functions to existing classes don't have a separate header.
What about extensions that add new types and also add new member functions to existing classes? It might be weird to have the extension half supported when the user didn't #include the extension's header.
Maybe it would be better to have all the extensions implemented in sycl/sycl.hpp, but enable them via some command line option. The option could predefine some macro which enables the extension code within the headers.
Maybe it would be better to have all the extensions implemented in sycl/sycl.hpp, but enable them via some command line option. The option could predefine some macro which enables the extension code within the headers.
Since every extension already defines a macro, could we just use those?
I could then update sycl.hpp to:
#ifdef SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS
#include <sycl/ext/oneapi/experimental/opportunistic_group.hpp>
#include <sycl/ext/oneapi/experimental/tangle_group.hpp>
#endif
(Or, alternatively, put that #ifdef inside each of the group headers.)
Then we could tell developers that to use experimental extensions they have to enable the macro themselves, whereas the macros for other (non-experimental) extensions are predefined?
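From the user's side, the opt-in being proposed would presumably look something like this (illustrative only; the macro name is taken from the snippet above):

// The user defines the extension's feature-test macro before including
// <sycl/sycl.hpp>, which then pulls in the experimental group headers.
#define SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS 1
#include <sycl/sycl.hpp>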
I'm not sure if we do it easily for existing includes as it might be considered as API breaking change, but at least we can avoid doing it for new headers. Once breaking changes are allowed, we can clean-up sycl.hpp file.
If we only make changes for experimental extensions, do breaking changes matter?
Following the zero-overhead principle ("you don't pay for what you don't use"), the C++ standard library API is distributed across multiple headers: https://en.cppreference.com/w/cpp/header. Users include the <optional> header to import the std::optional class declaration.
The GNU libstdc++ implementation provides a non-standard <bits/stdc++.h> header file, which includes all the headers of the C++ standard library. Using this header simplifies the use of C++ standard library functionality, but incurs the overhead of parsing all the headers.
The SYCL specification defines a single header for the entire standard library/API. The spec doesn't define whether extensions should be defined in sycl/sycl.hpp; it just says that extensions can add new headers, and if they do, they must follow the naming scheme. The more features we add, the higher the overhead of parsing sycl/sycl.hpp.
Technically, the DPC++ library already provides separate headers for each class or feature, but we use them only for internal testing and don't promote their use to our users.
I believe the SYCL specification should give users the ability to include a reduced set of declarations, similar to the standard C++ library. Having sycl/sycl.hpp include all standard declarations and extensions is fine as long as we have an alternative that avoids the "usability overhead".
Since every extension already defines a macro, could we just use those?
I don't think so. If a feature doesn't require compiler support, the macro is usually defined in the header itself (example). The macro can only be tested after the include directive.
Then we could tell developers that to use experimental extensions they have to enable the macro themselves, whereas the macros for other (non-experimental) extensions are predefined?
The macro is supposed to be set by the implementation, as it indicates whether the extension is enabled by the implementation. A SYCL standard macro defined by the user might be confusing. But we could have some controls to disable support for extensions/features (e.g. a compiler flag and/or a SYCL_EXT_DISABLE_*** macro defined before the SYCL headers are included), along the lines of the sketch below.
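For example, a sketch of what such an opt-out control might look like inside sycl.hpp; the SYCL_EXT_DISABLE_ONEAPI_NON_UNIFORM_GROUPS name is purely hypothetical:

#ifndef SYCL_EXT_DISABLE_ONEAPI_NON_UNIFORM_GROUPS
// Extension enabled by default; the implementation defines the feature macro.
#define SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS 1
#include <sycl/ext/oneapi/experimental/opportunistic_group.hpp>
#include <sycl/ext/oneapi/experimental/tangle_group.hpp>
#endif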
We should move this discussion to the SYCL-Docs project, as we are not going to solve this problem in this PR, but I wanted to raise the issue here to draw attention to the compile-time overhead problem.
I'm 100% behind splitting SYCL out into multiple headers (or C++20 modules) to improve compile-times. Any resistance I'm showing here is definitely self-serving: I'd like to get these initial class definitions merged so that we can move our attention to things like CUDA support and more complicated functionality (e.g. the algorithms).
You're right that the macro fix I proposed is probably just as confusing as requiring extra includes. I agree that this is likely a bigger problem that we should discuss elsewhere... I suspect that it will take some time to reach broad consensus about what to do here, so what should I do to move this PR forward?
I wish we didn't put extension headers into sycl.hpp and instead taught users to include the right headers. There are more than 10 headers for oneapi experimental features, but only 3 of them have snuck into sycl.hpp.
Having said that, this is not the most critical issue for this project, and we will need to address this problem for other features anyway. I won't block your patch if you want to commit it with the sycl.hpp changes.
I wish we didn't put extension headers into sycl.hpp and instead taught users to include the right headers. There are more than 10 headers for oneapi experimental features, but only 3 of them have snuck into sycl.hpp.
This would be a good topic to discuss in our Friday "SYCL Language Discussion" meeting. Would you like to summarize the current status of experimental extensions and their headers, and then present at the meeting?
I wish we didn't put extension headers into sycl.hpp and instead taught users to include the right headers. There are more than 10 headers for oneapi experimental features, but only 3 of them have snuck into sycl.hpp.
This would be a good topic to discuss in our Friday "SYCL Language Discussion" meeting. Would you like to summarize the current status of experimental extensions and their headers, and then present at the meeting?
Unfortunately, I don't track DPC++ development closely enough to feel I can present a summary of the current status. At the same time, I don't think the current status of experimental extensions and their headers is really important to the main topic of this discussion - compile-time overhead. Experimental extensions are brought up here as an example of uncontrolled compile-time overhead accumulation, because this PR implements an experimental extension and there is an easy way to avoid the overhead (my POV).
There are multiple ways to deal with the compile-time overhead problem, and the SYCL specification can help with making the solution portable across multiple implementations. I will be glad to join the discussion at the meeting if you send me an invitation.
sycl::sub_group sg = sycl::ext::oneapi::this_sub_group();
sub_group_mask mask = sycl::ext::oneapi::group_ballot(sg, true);
return opportunistic_group(mask);
#elif defined(__NVPTX__)
DPC++ supports AMD GPU targets in addition to these targets.
Is it possible to give a diagnostic on unsupported targets?
Honestly, I'm not sure. I thought about this, but I was worried that if I emitted something like a #warning then it would trigger on every AMD GPU compilation that included the headers (rather than only compilations calling the function).
@steffenlarsen, is there a way to trigger a compile-time or run-time diagnostic when a function is invoked on an unsupported backend? The only precedent I see for this is throwing a runtime_error on the host device.
I agree. A warning would likely leak into any AMDGCN/NVPTX compilations. Had this function been template-dependent, we could have used a static assertion on the template argument. I believe something like
template <typename T = void,
          typename = std::enable_if_t<std::is_same_v<T, void>>>
inline opportunistic_group get_opportunistic_group() {
  ...
#elif defined(__NVPTX__) || defined(__AMDGCN__)
  static_assert(!std::is_same_v<T, void>,
                "get_opportunistic_group() is not currently supported on the target");
#endif
  ...
}
would work, but I am not sure it's worth the dependence.
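(Under that sketch the assertion is only evaluated when the function template is actually instantiated, so it fires only where the function is called:)

// Inside a kernel body: on a SPIR-V target this compiles and runs as usual;
// on NVPTX/AMDGCN the static_assert fires here, at the point of
// instantiation, rather than in every translation unit that merely includes
// the header.
auto g = get_opportunistic_group();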
When we say "not currently supported" is this something that we expect to fix soon, or is it a long term problem?
I think the static_assert solution also does not work in the general case. Imagine an application that has some kernels that run on Nvidia and some that run on Intel GPUs. The application uses get_opportunistic_group in a kernel that will be submitted to an Intel GPU, and the host code checks the device type before submitting the kernel. There are other kernels in the same TU that can run on Nvidia devices. Wouldn't the static_assert trigger in that case also?
We invented the "optional kernel features" concept to handle cases like this. The diagnostic occurs at runtime if the application mistakenly submits the kernel using get_opportunistic_group to an Nvidia device.
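(A minimal sketch of what the optional-kernel-features route might look like, using the SYCL 2020 [[sycl::device_has()]] attribute; the wrapper name is hypothetical and sycl::aspect::gpu is only a stand-in, since a dedicated aspect for these group types does not exist yet.)

#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Declaring the device function as requiring an aspect moves the diagnostic
// to run time: submitting a kernel that calls it to a device without the
// aspect raises errc::kernel_not_supported instead of breaking compilation
// of unrelated kernels in the same translation unit.
[[sycl::device_has(sycl::aspect::gpu)]]
syclex::opportunistic_group get_opportunistic_group_checked();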
When we say "not currently supported" is this something that we expect to fix soon, or is it a long term problem?
From my perspective, this is a complete unknown right now, and it's why everything is marked experimental.
My aim is to get things running reliably across all the devices and backends, but we need to get implementation experience. I sketched out __SPIR__ and __NVPTX__ paths here because I understand what the implementations will look like; I need somebody with AMDGCN experience to take a look at things there.
Designing aspects and optional features for these group types right now seems premature to me, because we don't know which group types we'll need to make optional. If it's not a lot of work for somebody to give us optional features for each group type then I'm not opposed to that, but I don't want to make a lot of extra work for folks if there's a chance we won't need the aspects later on.
I wanted to make sure that when this experimental feature is used on unsupported platforms, users receive a meaningful error message. Actually, I think we might already have that: I see that only the code under #if defined(__SPIR__) has a return statement, so I expect that if __SPIR__ is not defined the compiler will report an error about the missing return.
@Pennycook, do you have any tests for this API we can run on NVIDIA/AMD GPUs?
The tests in intel/llvm-test-suite#1574 will work on anything supporting a sub-group size of 32 (once the NVIDIA/AMD paths are implemented). I've marked the tests as unsupported on NVIDIA/AMD for now because those paths are unimplemented, but @JackAKirk is working on implementing the missing parts.
@JackAKirk, do you know whether DPC++ with #8202 gives any diagnostic on NVIDIA/AMD GPUs for using the new classes? I hope the compiler returns an error. Am I right?
John, a meaningful diagnostic is nice to have, but it should not block committing an experimental API. I suppose at this point all users of this feature know how to use it properly and what to do if it doesn't work as intended. We just need to make sure that when it's ready for widespread use, the behavior of the implementation is user-friendly.
Sorry for the inconvenience I'm causing by discussing non-critical matters.
@gmlueck - I agree, the static assertion would cause problems in cases like the one you mentioned. Having an aspect for this and marking the relevant functions as requiring that aspect is definitely an option.
For the groups like opportunistic_group that weren't implemented on NVPTX, my memory is that the tests did not return an error immediately when get_opportunistic_group was called, but I think, as @steffenlarsen points out, this could maybe be fixed with a static assert like the one sketched above.
For all of the types added here except tangle_group, I think we resolved the discussions on what the NVPTX implementation should be, so I could push what I implemented for NVPTX after this PR is merged. For AMD, I think AMD GPUs do not support independent forward progress and therefore don't have an instruction similar to __activemask from NVPTX, but I think get_opportunistic_group can be supported on AMD in essentially the same way as it is for SPIR here, by using ballot; the only catch is that AMD also needs to support 64-work-item sub-groups.
@bader - Please merge this as-is. I've made a note to myself to revisit these questions when we're done with the initial implementation and/or we've reached consensus on the experimental headers issue. And there's no need to apologize! I hadn't actually considered what warning/error behavior we might want in the final implementation, so this has given me a lot to think about.
Implements basic functionality for the new non-uniform group types (e.g. opportunistic_group and tangle_group).
This functionality includes all member functions and type traits.
Support for group functions and algorithms will follow later.