[SYCL] Add ballot_group support to algorithms #8784
To avoid duplicating logic and introducing even more overloads of the group algorithms, it is desirable to move some of the implementation details into the detail::spirv namespace. This commit makes a few changes to enable that to happen:
- spirv:: functions with a Group template now take a group object, to enable run-time information (e.g. group membership) to pass through.
- ControlBarrier and the OpGroup* instruction used to implement reduce/scan now forward to spirv::, similar to other group functions and algorithms.
- The calc helper used to map functors to SPIR-V instructions is updated to use the new spirv:: functions, instead of calling __spirv intrinsics.
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Nested detail namespaces cause problems for name lookup. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Enables the following functions to be used with ballot_group arguments:
- group_barrier
- group_broadcast
- any_of_group
- all_of_group
- none_of_group
- reduce_over_group
- exclusive_scan_over_group
- inclusive_scan_over_group
Signed-off-by: John Pennycook <john.pennycook@intel.com>
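As a rough illustration of what these algorithms are expected to compute over a ballot_group, here is a host-only reference model in plain C++. It is a sketch, not the SYCL API: the names `LaneResult` and `model_ballot_group` are hypothetical, and it assumes that lanes of a sub-group sharing the same predicate value form one group and that the reduction and scans use addition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only reference model (hypothetical, not the SYCL API). Each lane of a
// sub-group carries a predicate; lanes with equal predicates form one group.
struct LaneResult {
  uint32_t reduce;         // sum over all lanes in this lane's group
  uint32_t exclusive_scan; // sum over earlier lanes in the same group
  uint32_t inclusive_scan; // exclusive_scan plus this lane's own value
};

std::vector<LaneResult> model_ballot_group(const std::vector<bool> &predicate,
                                           const std::vector<uint32_t> &value) {
  assert(predicate.size() == value.size());

  // First pass: total per group (index 0 = predicate false, 1 = true).
  uint32_t total[2] = {0, 0};
  for (std::size_t lane = 0; lane < value.size(); ++lane)
    total[predicate[lane] ? 1 : 0] += value[lane];

  // Second pass: running sums give the scans in lane order within each group.
  uint32_t running[2] = {0, 0};
  std::vector<LaneResult> out(value.size());
  for (std::size_t lane = 0; lane < value.size(); ++lane) {
    const int g = predicate[lane] ? 1 : 0;
    out[lane].reduce = total[g];
    out[lane].exclusive_scan = running[g];
    running[g] += value[lane];
    out[lane].inclusive_scan = running[g];
  }
  return out;
}
```

With every lane contributing the value 1, the modelled reduction equals the size of the lane's group, which is the property the reduce_over_group test in this PR checks against get_local_linear_range().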
A few quick notes to reviewers:
Fixes compilation at -O0.
Hi @Pennycook, with intel/llvm-test-suite moved in-tree, you now need to add the tests from intel/llvm-test-suite#1698 to this PR.
Tests the ability to create an instance of each new group type, and the correctness of the core member functions. Signed-off-by: John Pennycook <john.pennycook@intel.com>
This commit adds tests for using ballot_group and the following algorithms:
- group_barrier
- group_broadcast
- any_of_group
- all_of_group
- none_of_group
- reduce_over_group
- exclusive_scan_over_group
- inclusive_scan_over_group
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Thanks, @aelovikov-intel. I've copied over the tests from intel/llvm-test-suite#1698, and also the related tests from intel/llvm-test-suite#1574 which didn't get merged before the move.
sycl::buffer<bool, 1> BarrierBuf{sycl::range{32}};
sycl::buffer<bool, 1> BroadcastBuf{sycl::range{32}};
sycl::buffer<bool, 1> AnyBuf{sycl::range{32}};
sycl::buffer<bool, 1> AllBuf{sycl::range{32}};
sycl::buffer<bool, 1> NoneBuf{sycl::range{32}};
sycl::buffer<bool, 1> ReduceBuf{sycl::range{32}};
sycl::buffer<bool, 1> ExScanBuf{sycl::range{32}};
sycl::buffer<bool, 1> IncScanBuf{sycl::range{32}};
In my debug experiments I'm using something like
constexpr int N_RESULTS = 32;
sycl::buffer<bool, 1> results(32 * N_RESULTS);
...
accessor res_acc{results, cgh};
// kernel
auto *res = res_acc.get_pointer() + WI * N_RESULTS;
...
*res++ = res1;
*res++ = res2;
...
// could be outlined to a helper and shared between different places.
host_accessor res_acc{results};
bool success = std::all_of(res_acc.begin(), res_acc.end(), [](bool r) { return r; });
if (!success) {
  for (int j = 0; j < N_RESULTS; ++j) {
    for (int i = 0; i < res_acc.size() / N_RESULTS; ++i) {
      if (i % 8 == 0)
        std::cout << " |";
      std::cout << " " << res_acc[i * N_RESULTS + j];
    }
    std::cout << std::endl;
  }
  assert(false);
}
I think it might be suitable here as well, but up to you.
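The "outlined helper" mentioned in the snippet could look roughly like the following. This is only a sketch: the name `check_results` and its signature are hypothetical, and it takes a plain `std::vector<bool>` rather than a host_accessor so that it stays independent of SYCL. It assumes the same layout as above, with `n_results` flags stored contiguously per work-item.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <ostream>
#include <vector>

// Hypothetical helper (name and signature are assumptions). `results` holds
// n_results flags per work-item, laid out as results[wi * n_results + j].
// Returns true if every flag is set. Otherwise prints one row per check, with
// a " |" separator before every 8th work-item, and returns false.
bool check_results(const std::vector<bool> &results, std::size_t n_results,
                   std::ostream &os = std::cout) {
  assert(n_results > 0 && results.size() % n_results == 0);
  if (std::all_of(results.begin(), results.end(), [](bool r) { return r; }))
    return true;

  const std::size_t n_work_items = results.size() / n_results;
  for (std::size_t j = 0; j < n_results; ++j) {
    for (std::size_t i = 0; i < n_work_items; ++i) {
      if (i % 8 == 0)
        os << " |";
      os << " " << results[i * n_results + j];
    }
    os << '\n';
  }
  return false;
}
```

A test would copy the accessor contents into a vector and finish with `assert(check_results(flags, N_RESULTS));`, so a failing run shows which check failed on which work-item.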
uint32_t ReduceResult =
    sycl::reduce_over_group(BallotGroup, 1, sycl::plus<>());
ReduceAcc[WI] =
    (ReduceResult == BallotGroup.get_local_linear_range());
Strictly speaking, the previous test only verified get_local_range() and not get_local_linear_range(), but we can leave this to CTS.
// RUN: %clangxx -fsycl -fsyntax-only -fsycl-targets=%sycl_triple %s -o %t.out

#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

static_assert(
    syclex::is_user_constructed_group_v<syclex::ballot_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<1, sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<
              syclex::cluster_group<2, sycl::sub_group>>);
static_assert(
    syclex::is_user_constructed_group_v<syclex::tangle_group<sycl::sub_group>>);
static_assert(syclex::is_user_constructed_group_v<syclex::opportunistic_group>);
I'd slightly prefer to have this fused with the previous one but won't insist on that.
Match &= (OpportunisticGroup.get_group_id() == 0);
Match &= (OpportunisticGroup.get_local_id() <
          OpportunisticGroup.get_local_range());
Match &= (OpportunisticGroup.get_group_range() == 1);
Match &= (OpportunisticGroup.get_local_linear_range() <=
          SG.get_local_linear_range());
I'd suggest writing all the ranges/WIs and verifying their sum/existence on the host.
In some cases, that requires a lot more data to be sent back to the host, though. Doing the test inside of the kernel means we have access to all the values of nd_item, sub_group and the opportunistic_group. Getting all those values on the host would require a bunch of additional accessors.
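For comparison, the host-side alternative raised in this thread might be sketched as follows. This is an illustration only, not code from the PR: `GroupRecord` and `ids_cover_range` are hypothetical names, and the sketch assumes each work-item of a single group has written its local id and local range into a buffer that was then copied to the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record written by each work-item of one group.
struct GroupRecord {
  uint32_t local_id;
  uint32_t local_range;
};

// Host-side check: every member reports the same range, the range equals the
// number of records, and no local id is out of bounds or repeated. Together
// these imply that each id in [0, range) appears exactly once.
bool ids_cover_range(const std::vector<GroupRecord> &records) {
  if (records.empty())
    return false;
  const uint32_t range = records.front().local_range;
  if (range != records.size())
    return false;

  std::vector<bool> seen(range, false);
  for (const GroupRecord &r : records) {
    if (r.local_range != range || r.local_id >= range || seen[r.local_id])
      return false;
    seen[r.local_id] = true;
  }
  return true;
}
```

The trade-off is the one described above: this catches duplicated or missing ids that a per-work-item inequality cannot, but each recorded field costs another buffer and accessor.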
All the tests passed, so I'll work on making some of these formatting changes tomorrow. One thing to possibly look at on the CI side: this got stuck waiting on "Stop AWS" for a really long time -- several hours after all the test suite runs had completed. I'm not sure what that action does, but it struck me as odd that it didn't get scheduled sooner.
That seems to be an issue with GitHub's public runners, which we use for the ultra-lightweight tasks. We've been hitting this issue elsewhere recently too.
Co-authored-by: aelovikov-intel <andrei.elovikov@intel.com>
Hi @Pennycook, I'm in the middle of implementing CUDA support for these algorithms on top of your implementation, and I'm at the point where it would be good to ask for your feedback on a few small implementation issues. I've also implemented CUDA support for The NVPTX backend could implement Other than this I don't think there are any issues. I have implemented GroupAny, GroupAll, GroupBarrier, and GroupBroadcast for Thanks
Sorry for the delayed response, @JackAKirk.
This is a good idea.
Hm. I don't think I have a strong preference, because it's not immediately obvious to me which the compiler is going to be better at optimizing. My gut says that storing the mask might be slightly easier to optimize: it's unlikely that somebody would create a
Honestly, I'm not sure. I think you're right that it would make sense for
The only thing I'm curious about is this:
res[0] = __nvvm_vote_ballot_sync(threads, predicate); // couldnt call this within intel impl because undefined behaviour if not all reach it?
I understand the comment, I think, --
Thank you again for working on this, I really appreciate it. Now that I'm back from vacation, I'll renew my efforts to get this fixed and merged in, along with the other group types.
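To make the "store the mask" option from this discussion concrete, here is a host-only model in plain C++. It is a sketch under assumptions, not the implementation in this PR or in the CUDA work: `MaskGroupModel` and `ballot_mask` are hypothetical names, the sub-group is assumed to have 32 lanes, and bit i of the mask marks lane i as a member.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

// Build the mask a ballot would produce from per-lane predicates
// (hypothetical host-side stand-in for a hardware ballot).
uint32_t ballot_mask(const std::array<bool, 32> &predicate) {
  uint32_t mask = 0;
  for (uint32_t lane = 0; lane < 32; ++lane)
    if (predicate[lane])
      mask |= uint32_t{1} << lane;
  return mask;
}

// Host-only model of a group that stores its membership mask.
struct MaskGroupModel {
  uint32_t mask;

  // Number of members: population count of the mask.
  uint32_t local_range() const {
    return static_cast<uint32_t>(std::bitset<32>(mask).count());
  }

  // Rank of `lane` among the members: count the member bits below it.
  uint32_t local_id(uint32_t lane) const {
    assert(lane < 32 && ((mask >> lane) & 1u));
    const uint32_t lower = mask & ((uint32_t{1} << lane) - 1u);
    return static_cast<uint32_t>(std::bitset<32>(lower).count());
  }
};
```

With the mask stored, both queries reduce to a population count, and the same mask can be handed to intrinsics that take an explicit member mask.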
Hi @Pennycook, could you please fix the post-commit issues for this PR? They are mostly -Werror problems:
I also see this: https://github.com/intel/llvm/actions/runs/4767700304/jobs/8476298480#step:7:3089
I'm not 100% sure it's caused by this PR, but it looks that way.
Sorry for the confusing comment. This was really a note to myself when I was considering whether I could make a CUDA impl that reused the same spirv functions used by the Intel impl that do not take a mask (I can't, even if it were desirable). The relevance is simply that you can use All sounds good. I also think it is best to store a mask for