
[SYCL] Add non-uniform group classes #8202


Merged: 10 commits merged into intel:sycl, Mar 17, 2023

Conversation

Contributor

@Pennycook Pennycook commented Feb 3, 2023

Implements basic functionality for the following group types:

  • ballot_group
  • cluster_group
  • tangle_group
  • opportunistic_group

This functionality includes all member functions and type traits.
Support for group functions and algorithms will follow later.
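For orientation, here is a minimal usage sketch of the four types. This is hedged: get_cluster_group, get_tangle_group, and get_opportunistic_group are the creation functions named later in this thread, while get_ballot_group and the experimental namespace are assumptions by analogy, not confirmed API.

#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  q.parallel_for(sycl::nd_range<1>{32, 32}, [=](sycl::nd_item<1> it) {
    auto sg = it.get_sub_group();

    // ballot_group: partition by a predicate; work-items that computed the
    // same value land in the same group.
    bool pred = it.get_global_id(0) % 2 == 0;
    auto ballot = syclex::get_ballot_group(sg, pred);

    // cluster_group: static partition into contiguous clusters of 8.
    auto cluster = syclex::get_cluster_group<8>(sg);

    // tangle_group: the work-items that reached here in converged control flow.
    auto tangle = syclex::get_tangle_group(sg);

    // opportunistic_group: whichever work-items happen to be active right now.
    auto opportunistic = syclex::get_opportunistic_group();

    // Member functions mirror the other SYCL group types.
    auto lid = cluster.get_local_id();
    auto n = ballot.get_local_range();
    (void)lid; (void)n; (void)tangle; (void)opportunistic;
  }).wait();
}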

@Pennycook Pennycook added the spec extension All issues/PRs related to extensions specifications label Feb 3, 2023
@Pennycook Pennycook requested a review from a team as a code owner February 3, 2023 22:06
@Pennycook Pennycook temporarily deployed to aws February 3, 2023 22:10 — with GitHub Actions Inactive
@Pennycook Pennycook temporarily deployed to aws February 3, 2023 22:51 — with GitHub Actions Inactive
@cperkinsintel
Contributor

will there be tests for this over on llvm-test-suite?

@Pennycook
Contributor Author

will there be tests for this over on llvm-test-suite?

Yes - the tests for this basic functionality are in intel/llvm-test-suite#1574. clang-format is currently complaining about them, but otherwise I think they work.

@Pennycook Pennycook marked this pull request as draft February 7, 2023 23:42
@Pennycook
Contributor Author

Converting this to draft to prevent it from being merged. Trying to implement the group algorithms for masked_sub_group raised some interesting questions, and I'd like to discuss them some more before committing to this implementation.

@JackAKirk
Contributor

Hi @Pennycook
Now that masked_sub_group has been removed from the extension, can you remind me: are you thinking of using the sub_group_mask class directly (https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/sub_group_mask.hpp) as a replacement for masked_sub_group?

I had forgotten that sub_group_mask was already an implemented extension; in fact, we used it in our internal draft of the CUDA implementation of these non-uniform sub-group types and cooperative algorithms.
It turns out I also made use of internal CUDA builtins like __nvvm_read_ptx_sreg_lanemask_eq and __nvvm_read_ptx_sreg_lanemask_lt, as you imagined.

The branch was in a working state, but it is now very old. I can remove some of the dead wood, merge the current tip into it, and post it here so you can have a look. I'm not sure to what extent it might overlap with what you plan, but it sounds like it may be quite similar.

@Pennycook
Contributor Author

Hi @Pennycook Now that masked_sub_group has been removed from the extension, can you remind me: are you thinking of using the sub_group_mask class directly (https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/sub_group_mask.hpp) as a replacement for masked_sub_group?

I don't think it's a direct replacement, but it may be helpful to people working with these sorts of masks. There are still cases where a sub_group_mask can be used to reason about which work-items will belong to which groups, and operating on those masks may be more efficient (e.g. because doing some bit-twiddling is probably lower overhead than shuffling between work-items). The mask could be used to inform control flow that then uses one of the other group types.
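A minimal sketch of that idea, assuming a kernel context that provides a sub-group sg and a per-work-item predicate pred; group_ballot and sub_group_mask are from the published sycl_ext_oneapi_sub_group_mask extension.

namespace oneapi = sycl::ext::oneapi;

oneapi::sub_group_mask mask = oneapi::group_ballot(sg, pred);

// Bit queries on the mask are cheap compared to shuffles:
if (mask.all()) {
  // every work-item satisfied pred: take a uniform fast path
} else if (mask[sg.get_local_id()]) {
  // this work-item is in the mask: enter the non-uniform region, where one
  // of the new group types could then be constructed from the converged branch
}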

The branch was in a working state, but it is now very old. I can remove some of the dead wood, merge the current tip into it, and post it here so you can have a look. I'm not sure to what extent it might overlap with what you plan, but it sounds like it may be quite similar.

I'd definitely be interested in seeing that, thanks! I do think a lot of this will end up being useful for implementing the other group types, which are likely to use masks as an implementation detail.


Aside: I plan to re-purpose this PR to add the other new group types, rather than close it and open another one.

@JackAKirk
Contributor

JackAKirk commented Feb 14, 2023

Hi @Pennycook Now that masked_sub_group has been removed from the extension, can you remind me: are you thinking of using the sub_group_mask class directly (https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/sub_group_mask.hpp) as a replacement for masked_sub_group?

I don't think it's a direct replacement, but it may be helpful to people working with these sorts of masks. There are still cases where a sub_group_mask can be used to reason about which work-items will belong to which groups, and operating on those masks may be more efficient (e.g. because doing some bit-twiddling is probably lower overhead than shuffling between work-items). The mask could be used to inform control flow that then uses one of the other group types.

OK, I am starting to remember things better now. I think that in the CUDA backend we basically only need the sub_group_mask for e.g. "cluster_group" or "masked_sub_group", and in particular only the 32-bit mask, since e.g. the position of the calling thread with respect to the 32-bit sub-group map can be returned from the CUDA builtins I referenced above. The PTX builtins then require that only the threads that are active in the mask call the builtin (if threads that are not active in the mask call the masked PTX builtins you get UB: see e.g. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-warp-sync), such as bar.warp.sync (sycl::barrier) and redux.sync (SYCL reduction algorithms), so within the implementation we simply do e.g.:

// __nvvm_read_ptx_sreg_lanemask_eq() returns a mask with only the calling
// lane's bit set, so the test checks this lane's membership in Mask.
if (__nvvm_read_ptx_sreg_lanemask_eq() & Mask) {
  TYPE result = __nvvm_redux_sync_##REDUX_OP(x, Mask);
  *carry = result; // stash the reduction result
  return result;
} else {
  return *carry; // lanes outside the mask return the carried value
}

Additionally, for post-Volta architectures we may have to think carefully about ensuring that control flow is converged, if the spec requires this for certain operations, as you point out.

So is the main point that the design of this will mostly depend on what other hardware needs, and then Nvidia should be pretty easy to fit into whatever design is required?

The branch was in a working state, but it is now very old. I can remove some of the dead wood, merge the current tip into it, and post it here so you can have a look. I'm not sure to what extent it might overlap with what you plan, but it sounds like it may be quite similar.

I'd definitely be interested in seeing that, thanks! I do think a lot of this will end up being useful for implementing the other group types, which are likely to use masks as an implementation detail.

So actually you can get an idea of how it worked from looking at this test branch that is already available: intel/llvm-test-suite@intel...JackAKirk:llvm-test-suite:group_collectives_mask (Note that I think you can ignore all the "async_copy" tests because, as I understand it, this function is not included in the extension document as a masked variant in any form, and the implementations we made for these will be completely removed.)
You can see that we just pass a bare sub_group_mask to the "masked" "cooperative" functions, and that we had different kinds of functions that returned appropriate masks, effectively replacing different "group" types like "masked_sub_group" and "cluster_group". For example, you can see that item.ext_oneapi_partition_sub_group(partition_size); was used instead of "cluster_group". I guess your idea is that sub_group_mask will be wrapped in some group classes in order to satisfy the implementation requirements of backends other than CUDA?

@JackAKirk
Contributor

JackAKirk commented Feb 15, 2023

Aside: I plan to re-purpose this PR to add the other new group types, rather than close it and open another one.

Also here is the corresponding implementation to go with the above linked tests: sycl...JackAKirk:llvm:bitmask-cuda

A lot of this is the async stuff which you can ignore. You can completely ignore the changes made to:

libclc/generic/include/clc/async/common.h
libclc/generic/libspirv/async/async_work_group_strided_copy.inc
libclc/ptx-nvidiacl/libspirv/SOURCES
libclc/ptx-nvidiacl/libspirv/async/async_work_group_strided_copy.cl
libclc/ptx-nvidiacl/libspirv/async/wait_group_events.cl.

So I think I just need confirmation of the status of the group definitions in this document (https://github.com/AerialMantis/llvm/blob/560269af36543cd88ecd380205b4721abac15354/sycl/doc/extensions/proposed/sycl_ext_oneapi_non_uniform_groups.asciidoc), and I can update the bitmask-cuda branch linked above to use these group definitions instead of the sub_group_mask helper functions.
So, for example, I think I can replace item.ext_oneapi_partition_sub_group(partition_size) with get_cluster_group<partition_size>(sg) (obviously the cluster_group name is yet to be decided), then update the group function interfaces that were, e.g.:

version sub_group_mask:

reduce_over_group((sub)Group sg, sub_group_mask mask, T start, Operator binary_op);

with a version that just takes a specialized group

version group:

reduce_over_group((sub/cluster/...)Group sg, T start, Operator binary_op);

?

Although I remember you saying that this interface would be changed in some way, so would version group not be correct?

Thanks
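For concreteness, a hedged sketch contrasting the two interfaces discussed above. The masked signature follows the "version sub_group_mask" line as prototyped in the bitmask-cuda branch; get_cluster_group and the experimental namespace are assumptions from this thread, not settled API.

namespace syclex = sycl::ext::oneapi::experimental;

// "version sub_group_mask": the mask is passed explicitly alongside the
// sub-group (assumed signature from the prototype branch).
template <typename T>
T masked_sum(sycl::sub_group sg, sycl::ext::oneapi::sub_group_mask mask, T x) {
  return reduce_over_group(sg, mask, x, sycl::plus<T>());
}

// "version group": membership is carried by the group object itself, so the
// standard reduce_over_group signature applies unchanged.
template <typename T>
T cluster_sum(sycl::sub_group sg, T x) {
  auto cluster = syclex::get_cluster_group<8>(sg);
  return sycl::reduce_over_group(cluster, x, sycl::plus<T>());
}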

@Pennycook
Contributor Author

Pennycook commented Feb 15, 2023

The PTX builtins then require that only the threads that are active in the mask call the builtin (if threads that are not active in the mask call the masked PTX builtins you get UB: see e.g. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-warp-sync), such as bar.warp.sync (sycl::barrier) and redux.sync (SYCL reduction algorithms)

I'm not 100% sure yet, but I think we might be able to skip the "in mask" check. The groups are defined such that it's UB if the developer calls one of the algorithms using a group that the work-item doesn't belong to. We might want to keep the check around for debugging purposes, but if it impacts performance we should consider removing it.

So is the main point that the design of this will mostly depend on what other hardware needs, and then Nvidia should be pretty easy to fit into whatever design is required?

I'd phrase it slightly differently, but I think we're on the same page. The group types are being designed to be fairly high-level and portable, mapping to common use-cases. The NVIDIA intrinsics are quite flexible -- because of the arbitrary mask parameter -- so the NVIDIA implementation of each group type really just boils down to figuring out how to construct the mask.

Is that what you meant?
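To illustrate that last point, a hedged sketch of how the membership mask for a cluster of contiguous lanes might be computed on NVIDIA hardware. Plain C++; the function name is illustrative and the warp size of 32 is assumed.

#include <cstdint>

// Given the calling thread's lane index within a 32-wide warp, compute the
// 32-bit membership mask of its cluster. Divisors of 32 are powers of two,
// so the bit trick below is valid for all legal sizes.
template <unsigned ClusterSize>
uint32_t cluster_mask(unsigned lane_id) {
  static_assert(ClusterSize != 0 && 32 % ClusterSize == 0,
                "cluster size must divide the warp size");
  uint32_t cluster_bits =
      (ClusterSize == 32) ? 0xFFFFFFFFu : ((1u << ClusterSize) - 1u);
  unsigned first_lane = lane_id & ~(ClusterSize - 1u); // start of this cluster
  return cluster_bits << first_lane;
}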

I guess your idea is that sub_group_mask will be wrapped in some group classes in order to satisfy the implementation requirements of backends other than CUDA?

I think the sub_group_mask class is to satisfy the implementation requirements of CUDA, where the intrinsics always require a mask. The SPIR-V non-uniform instructions (e.g. OpGroupNonUniformShuffle) do not require a mask, and the work-items involved in a NonUniform instruction are derived from control flow.

Sorry I don't have my code ready to share. I'll try to get that uploaded by the end of this week -- it should be easier to see the differences, then.

So I think I just need confirmation of the status of the group definitions in this document

The latest version is here, and it includes the changes I alluded to before. Specifically, masked_sub_group has been removed. The design of the other group types is fixed (for now 😆). I think we can start implementing them, and then address the renaming once we're done.

I think I can replace item.ext_oneapi_partition_sub_group(partition_size) with get_cluster_group<partition_size>(sg) (obviously the cluster_group name is yet to be decided), then update the group function interfaces

Yes, exactly. The code in version group is what we're aiming for.

@JackAKirk
Contributor

I'm not 100% sure yet, but I think we might be able to skip the "in mask" check. The groups are defined such that it's UB if the developer calls one of the algorithms using a group that the work-item doesn't belong to. We might want to keep the check around for debugging purposes, but if it impacts performance we should consider removing it.

OK I see, makes sense.

The group types are being designed to be fairly high-level and portable, mapping to common use-cases. The NVIDIA intrinsics are quite flexible -- because of the arbitrary mask parameter -- so the NVIDIA implementation of each group type really just boils down to figuring out how to construct the mask.

Is that what you meant?

Yeah exactly.

I think the sub_group_mask class is to satisfy the implementation requirements of CUDA, where the intrinsics always require a mask. The SPIR-V non-uniform instructions (e.g. OpGroupNonUniformShuffle) do not require a mask, and the work-items involved in a NonUniform instruction are derived from control flow.

Sorry I don't have my code ready to share. I'll try to get that uploaded by the end of this week -- it should be easier to see the differences, then.

OK, this makes sense to me. Thanks for the explanation.

Yes, exactly. The code in version group is what we're aiming for.

Great, I think we are on the same page. I can work towards this from the cuda backend side via the bitmask-cuda branch.

Some additional NonUniform instructions operating on masks are required
to implement basic non-uniform groups functionality (e.g. group membership).

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Also includes is_fixed_topology and is_user_constructed.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook changed the title [SYCL] Add masked_sub_group class [SYCL] Add non-uniform group classes Feb 16, 2023
@Pennycook Pennycook marked this pull request as ready for review February 16, 2023 17:45
@Pennycook
Contributor Author

@JackAKirk I've just added the implementations of the classes themselves (i.e. without group function or algorithm support), and I consider those ready for review. As you'll see, they're quite straightforward really.

There are two TODOs that I need your help in addressing before things will work on the CUDA or HIP backends:

  • In get_opportunistic_group: I think all that needs to happen here is to construct a sub_group_mask containing the active mask. I know that's __activemask() in CUDA, but I'm not sure what the NVVM equivalent is (see the sketch after this list).
  • In get_tangle_group: This is where we need some experiments/research. I don't know what the mask value should be here, only that we need the compiler to help build it.
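A minimal sketch of one way to obtain the active mask on the NVPTX path, assuming inline PTX is acceptable: it wraps the activemask.b32 instruction that CUDA's __activemask() lowers to. Whether a dedicated NVVM builtin exists is exactly the open question in the first TODO above; the function name is illustrative.

#include <cstdint>

inline uint32_t active_mask() {
#if defined(__NVPTX__)
  uint32_t mask;
  // activemask.b32 sets one bit per lane that is currently active.
  asm volatile("activemask.b32 %0;" : "=r"(mask));
  return mask;
#else
  return 0; // placeholder for other targets
#endif
}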

I'm happy to defer these TODOs until later, if you'd like to address them in a later PR.

I'll work on getting the tests up this afternoon, and share a link here when they're available.

@Pennycook Pennycook temporarily deployed to aws February 16, 2023 18:10 — with GitHub Actions Inactive
Didn't like the comparison between id<1> and uint32_t, so I've
inserted a static cast.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook temporarily deployed to aws February 16, 2023 19:20 — with GitHub Actions Inactive
Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook temporarily deployed to aws February 16, 2023 19:53 — with GitHub Actions Inactive
@Pennycook
Contributor Author

/verify with intel/llvm-test-suite#1574

@Pennycook
Contributor Author

@cperkinsintel, @bader: The failures didn't seem to be related to any of the changes here (something about a missing path on the CI machines?). I can't seem to get things to re-run, either; could you please take a look?

@Pennycook Pennycook closed this Mar 7, 2023
@Pennycook Pennycook reopened this Mar 7, 2023
@Pennycook
Contributor Author

Closed and reopened in an attempt to retrigger the checks.

@Pennycook Pennycook temporarily deployed to aws March 7, 2023 16:33 — with GitHub Actions Inactive
@cperkinsintel
Contributor

cperkinsintel commented Mar 7, 2023

@Pennycook I'm probably telling you something you already know, but on Jenkins/llvm-test-suite it looks like there are failing tests.

Failed Tests (2):
[2023-03-07T16:22:47.263Z]   SYCL :: NonUniformGroups/is_fixed_topology.cpp
[2023-03-07T16:22:47.263Z]   SYCL :: NonUniformGroups/is_user_constructed.cpp

@Pennycook
Contributor Author

@Pennycook I'm probably telling you something you already know, but on Jenkins/llvm-test-suite it looks like there are failing tests.

Failed Tests (2):
[2023-03-07T16:22:47.263Z]   SYCL :: NonUniformGroups/is_fixed_topology.cpp
[2023-03-07T16:22:47.263Z]   SYCL :: NonUniformGroups/is_user_constructed.cpp

@cperkinsintel - I'd actually missed that the tests started running correctly, so thanks for letting me know. I think I've fixed things, so I'll try and re-run that job.

@Pennycook Pennycook temporarily deployed to aws March 7, 2023 23:18 — with GitHub Actions Inactive
SYCL_EXTERNAL has been renamed DPCPP_SYCL_EXTERNAL
@Pennycook Pennycook temporarily deployed to aws March 9, 2023 20:31 — with GitHub Actions Inactive
@Pennycook Pennycook temporarily deployed to aws March 9, 2023 22:43 — with GitHub Actions Inactive
@Pennycook
Contributor Author

@cperkinsintel: Things seem to be passing now.

@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1574

@cperkinsintel
Contributor

@Pennycook - failure on LevelZero GPU

Failed Tests (1):

[2023-03-10T15:18:46.792Z]   SYCL :: NonUniformGroups/tangle_group.cpp

@Pennycook
Contributor Author

@Pennycook - failure on LevelZero GPU

Failed Tests (1):

[2023-03-10T15:18:46.792Z]   SYCL :: NonUniformGroups/tangle_group.cpp

Do you know which GPU we test on? I know it's a cliché, but the test passes on my machine!

@Pennycook Pennycook temporarily deployed to aws March 13, 2023 16:54 — with GitHub Actions Inactive
@Pennycook Pennycook temporarily deployed to aws March 13, 2023 17:47 — with GitHub Actions Inactive
@Pennycook
Contributor Author

/verify with intel/llvm-test-suite#1574

@jbrodman jbrodman self-requested a review March 14, 2023 19:59
@Pennycook
Contributor Author

@intel/llvm-reviewers-runtime: @steffenlarsen and I were able to implement a workaround for the failing test. We're going to need to revisit this later on, but I think this can be merged now.

#include <sycl/ext/oneapi/experimental/cuda/barrier.hpp>
#include <sycl/ext/oneapi/experimental/opportunistic_group.hpp>
#include <sycl/ext/oneapi/experimental/tangle_group.hpp>
Contributor

The compile time is not great already. Can we avoid adding these includes to sycl.hpp, and instead ask users to include the headers for experimental APIs explicitly?

Contributor Author

I'm not opposed to that, but would it be acceptable to do that in a separate PR? There are some other experimental headers that I think we'd want to move (e.g. cuda_barrier) and some other headers that aren't obviously marked experimental but which implement experimental features (e.g. kernel properties).

I think it could be confusing if these new groups are the only experimental feature requiring users to include extra headers.

If we were to adopt this as a convention for experimental features, I think we'd need to update the documentation for our extensions too. So I'd like to hear @gmlueck's opinion about that.

Contributor

I suggest we remove all includes of headers with API extensions (not just experimental ones) and leave only headers for standard API features in sycl.hpp. I remember discussing this already, but I don't remember whether we decided to clarify it in any documentation. On the other hand, as far as I know there are no guidelines for adding includes for extension headers to the standard sycl.hpp header either.
I'm not sure we can do this easily for existing includes, as it might be considered an API-breaking change, but at least we can avoid doing it for new headers. Once breaking changes are allowed, we can clean up the sycl.hpp file.

would it be acceptable to do that in a separate PR?

I think so, but please make sure to do this before any products based on intel/llvm are released. Otherwise it's going to be an API-breaking change, and we usually keep the API stable for a long period of time.

Contributor

It is definitely an API breaking change if we start requiring the user to #include a new header in order to use an extension.

I'd like to have a consistent strategy before we start changing things. I agree with @Pennycook that it will be confusing if only a few extensions require the user to #include. Possible options are:

  • Require users to #include for experimental extensions but not supported or deprecated ones.
  • Require users to #include for all extensions. In this case, we need some plan to manage the transition. For example, can we keep existing extensions as part of sycl/sycl.hpp with a deprecation warning?

Note that some extensions will not be able to be extracted into a separate header. For example, extensions that add member functions to existing classes don't have a separate header.

What about extensions that add new types and also add new member functions to existing classes? It might be weird to have the extension half supported when the user didn't #include the extension's header.

Maybe it would be better to have all the extensions implemented in sycl/sycl.hpp, but enable them via some command line option. The option could predefine some macro which enables the extension code within the headers.

Contributor Author

Maybe it would be better to have all the extensions implemented in sycl/sycl.hpp, but enable them via some command line option. The option could predefine some macro which enables the extension code within the headers.

Since every extension already defines a macro, could we just use those?

I could then update sycl.hpp to:

#ifdef SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS
#include <sycl/ext/oneapi/experimental/opportunistic_group.hpp>
#include <sycl/ext/oneapi/experimental/tangle_group.hpp>
#endif

(Or, alternatively, put that #ifdef inside each of the group headers.)

Then we could tell developers that to use experimental extensions they have to enable the macro themselves, whereas the macros for other (non-experimental) extensions are predefined?

I'm not sure we can do this easily for existing includes, as it might be considered an API-breaking change, but at least we can avoid doing it for new headers. Once breaking changes are allowed, we can clean up the sycl.hpp file.

If we only make changes for experimental extensions, do breaking changes matter?

Contributor

Following the zero-overhead principle ("You don't pay for what you don't use"), the C++ standard library API is distributed across multiple headers: https://en.cppreference.com/w/cpp/header. Users include the <optional> header to import the std::optional class declaration.

The GNU libstdc++ implementation provides a non-standard <bits/stdc++.h> header file, which includes all the headers of the C++ standard library. Using this header simplifies use of the C++ standard library but incurs the overhead of parsing all the headers.

The SYCL specification defines a single header for the entire standard library/API. The spec doesn't define whether extensions should be included in sycl/sycl.hpp; it just says that extensions can add new headers and, if they do, that they must follow the naming scheme. The more features we add, the higher the overhead of parsing sycl/sycl.hpp.

Technically, the DPC++ library already provides separate headers for each class or feature, but we use them only for internal testing and don't promote them to our users.

I believe the SYCL specification should provide users the ability to include a reduced set of declarations, similar to the standard C++ library. Having sycl/sycl.hpp include all standard declarations and extensions is fine as long as we have an alternative that avoids the "usability overhead".

Since every extension already defines a macro, could we just use those?

I don't think so. If a feature doesn't require compiler support, the macro is usually defined in the header itself (example). The macro can then be tested after the include directive.
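For illustration, the pattern described here, using the feature-test macro from the published sub_group_mask extension (a sketch; the macro is set by the implementation's headers, not by the user):

#include <sycl/sycl.hpp>

#ifdef SYCL_EXT_ONEAPI_SUB_GROUP_MASK
// The header defined the feature-test macro, so the extension can be used:
// sycl::ext::oneapi::sub_group_mask, group_ballot(), ...
#endif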

Then we could tell developers that to use experimental extensions they have to enable the macro themselves, whereas the macros for other (non-experimental) extensions are predefined?

The macro is supposed to be set by the implementation, as it indicates whether the extension is enabled by the implementation. A SYCL standard macro defined by the user might be confusing. But we could have some controls to disable support for extensions/features (e.g. a compiler flag and/or a SYCL_EXT_DISABLE_*** macro set before the SYCL headers are included).
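A sketch of what such an opt-out control could look like; the macro name below is hypothetical, not an existing DPC++ interface.

// Hypothetical opt-out macro, set by the user before including SYCL headers.
#define SYCL_EXT_DISABLE_NON_UNIFORM_GROUPS 1
#include <sycl/sycl.hpp> // headers would then skip the extension's declarations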

We should move this discussion to the SYCL-Docs project, as we are not going to solve this problem in this PR, but I wanted to raise the issue to draw attention to the compile-time overhead problem.

Contributor Author

I'm 100% behind splitting SYCL out into multiple headers (or C++20 modules) to improve compile times. Any resistance I'm showing here is definitely self-serving: I'd like to get these initial class definitions merged so that we can move our attention to things like CUDA support and more complicated functionality (e.g. the algorithms).

You're right that the macro fix I proposed is probably just as confusing as requiring extra includes. I agree that this is likely a bigger problem that we should discuss elsewhere... I suspect that it will take some time to reach broad consensus about what to do here, so what should I do to move this PR forward?

Contributor

I wish we didn't put extension headers into sycl.hpp, and instead taught users to include the right headers.
There are >10 headers for oneAPI experimental features, but only 3 of them have snuck into sycl.hpp.

Having said that, this is not the most critical issue for this project, and we will need to address the problem for other features anyway. I won't block your patch if you want to commit it with the sycl.hpp changes.

Contributor

I wish we didn't put extension headers into sycl.hpp, and instead taught users to include the right headers.
There are >10 headers for oneAPI experimental features, but only 3 of them have snuck into sycl.hpp.

This would be a good topic to discuss in our Friday "SYCL Language Discussion" meeting. Would you like to summarize the current status of experimental extensions and their headers, and then present at the meeting?

Contributor

I wish we didn't put extension headers into sycl.hpp, and instead taught users to include the right headers.
There are >10 headers for oneAPI experimental features, but only 3 of them have snuck into sycl.hpp.

This would be a good topic to discuss in our Friday "SYCL Language Discussion" meeting. Would you like to summarize the current status of experimental extensions and their headers, and then present at the meeting?

Unfortunately, I don't track DPC++ development closely enough to feel that I can present a summary of the current status. At the same time, I don't think the current status of experimental extensions and their headers is really central to the main topic of this discussion: compile-time overhead. Experimental extensions came up as an example of uncontrolled compile-time overhead accumulation because this PR implements an experimental extension and there is an easy way to avoid the overhead (my POV).

There are multiple ways to deal with the compile-time overhead problem, and the SYCL specification can help make the solution portable across implementations. I will be glad to join the discussion at the meeting if you send me an invitation.

sycl::sub_group sg = sycl::ext::oneapi::this_sub_group();
sub_group_mask mask = sycl::ext::oneapi::group_ballot(sg, true);
return opportunistic_group(mask);
#elif defined(__NVPTX__)
Contributor

DPC++ supports AMD GPU targets in addition to these targets.
Is it possible to give a diagnostic on unsupported targets?

Contributor Author

Honestly, I'm not sure. I thought about this, but I was worried that if I emitted something like a #warning then it would trigger on every AMD GPU compilation that included the headers (rather than only compilations calling the function).

@steffenlarsen, is there a way to trigger a compile-time or run-time diagnostic when a function is invoked on an unsupported backend? The only precedent I see for this is throwing a runtime_error on the host device.

Contributor

I agree. A warning would likely leak into any AMDGCN/NVPTX compilation. Had this function been template-dependent, we could have used a static assertion on the template argument. I believe something like

template <typename T = void, typename = std::enable_if_t<std::is_same_v<T, void>>>
inline opportunistic_group get_opportunistic_group() {
...
#elif defined(__NVPTX__) || defined(__AMDGCN__)
  // Dependent on T, so it only fires when the function is actually instantiated.
  static_assert(!std::is_same_v<T, void>,
                "get_opportunistic_group() is not currently supported on the target");
#endif
...
}

would work, but I am not sure it's worth the dependence.

Contributor

When we say "not currently supported" is this something that we expect to fix soon, or is it a long term problem?

I think the static_assert solution also does not work in the general case. Imagine an application that has some kernels that run on NVIDIA devices and some that run on Intel GPUs. The application uses get_opportunistic_group in a kernel that will only be submitted to an Intel GPU, and the host code checks the device type before submitting the kernel. There are other kernels in the same TU that can run on NVIDIA devices.

Wouldn't the static_assert trigger in that case also?

We invented the "optional kernel features" concept to handle cases like this. The diagnostic occurs at runtime if the application mistakenly submits the kernel using get_opportunistic_group to an Nvidia device.

Contributor Author

When we say "not currently supported" is this something that we expect to fix soon, or is it a long term problem?

From my perspective, this is a complete unknown right now, and it's why everything is marked experimental.

My aim is to get things running reliably across all the devices and backends, but we need implementation experience first. I sketched out the __SPIR__ and __NVPTX__ paths here because I understand what those implementations will look like; I need somebody with AMDGCN experience to take a look at things there.

Designing aspects and optional features for these group types right now seems premature to me, because we don't know which group types we'll need to make optional. If it's not a lot of work for somebody to give us optional features for each group type then I'm not opposed to that, but I don't want to make a lot of extra work for folks if there's a chance we won't need the aspects later on.

Contributor

I wanted to make sure that when this experimental feature is used on unsupported platforms, users will receive a meaningful error message. Actually, I think we might already have that: I see that only the code under #if defined(__SPIR__) has a return statement, so I expect that if __SPIR__ is not defined, the compiler will report an error about the missing return.
@Pennycook, do you have any tests for this API that we can run on NVIDIA/AMD GPUs?

Contributor Author

The tests in intel/llvm-test-suite#1574 will work on anything supporting a sub-group size of 32 (once the NVIDIA/AMD paths are implemented). I've marked those tests as unsupported on NVIDIA/AMD for now because the paths are unimplemented, but @JackAKirk is working on implementing the missing parts.

Contributor

@JackAKirk, do you know whether DPC++ with #8202 gives any diagnostic on NVIDIA/AMD GPUs for using the new classes? I hope the compiler returns an error. Am I right?

John, a meaningful diagnostic is nice to have, but it should not block committing an experimental API. I suppose at this point all users of this feature know how to use it properly and what to do if it doesn't work as intended. We just need to make sure that when it's ready for widespread use, the behavior of the implementation is user-friendly.

Sorry for any inconvenience caused by discussing non-critical matters.

Contributor

@gmlueck - I agree, the static assertion would cause problems in cases like the one you mentioned. Having an aspect for this and marking the relevant functions as requiring that aspect is definitely an option.

Contributor

For the groups like opportunistic_group that weren't implemented on NVPTX, my memory is that the tests did not return an error immediately when get_opportunistic_group was called, but as @steffenlarsen points out, this could maybe be fixed with a static assert like the one suggested above.

For all the types added here except tangle_group, I think we resolved the discussions on what the NVPTX implementation should be, so I could push what I implemented for NVPTX after this PR is merged. For AMD, I think AMD GPUs do not support independent forward progress and therefore don't have an instruction similar to __activemask from NVPTX, but I think get_opportunistic_group can be supported on AMD in essentially the same way as it is for SPIR here, by using ballot; the only catch is that AMD also needs to support 64-work-item sub-groups.

@Pennycook
Contributor Author

@bader - Please merge this as-is. I've made a note to myself to revisit these questions when we're done with the initial implementation and/or we've reached consensus on the experimental headers issue.

And there's no need to apologize! I hadn't actually considered what warning/error behavior we might want in the final implementation, so this has given me a lot to think about.

@bader bader merged commit 8689420 into intel:sycl Mar 17, 2023
@Pennycook Pennycook deleted the masked_sub_group branch March 17, 2023 17:40