Description
Describe the bug
When compiling a SYCL/oneAPI application ahead of time for Intel CPUs, the current version of opencl-aot (2023.2.0) fails to compile a kernel that uses a subgroup size that is not supported by the OpenCL runtime.
According to the SYCL specification, all SYCL implementations must be able to compile device code that uses these optional features (various subgroup sizes etc) regardless of whether the implementation supports the features on any of its devices.
To Reproduce
Please describe the steps to reproduce the behavior:
1. Include code snippet as short as possible:
subgroup_test.cc
#include <cstdio>
#include <iostream>
#include <sycl/sycl.hpp>
#ifdef __SYCL_DEVICE_ONLY__
# define __DEVICE_CONSTANT__ [[clang::opencl_constant]]
#else
# define __DEVICE_CONSTANT__
#endif
#define printf(FORMAT, ...) \
  do \
  { \
    static const char* __DEVICE_CONSTANT__ format = FORMAT; \
    sycl::ext::oneapi::experimental::printf(format, ##__VA_ARGS__); \
  } while(false)
template <uint32_t S>
struct do_some_work {
  void operator()(sycl::nd_item<1> item) const {
    printf(" the expected sub-group size is %d\n", S);
    printf(" the actual sub-group size is %d\n", item.get_sub_group().get_max_local_range()[0]);
  }
};
int main() {
  auto platforms = sycl::platform::get_platforms();
  for (auto const& platform : platforms) {
    std::cout << "SYCL platform: " << platform.get_info<sycl::info::platform::name>() << '\n';
    auto devices = platform.get_devices();
    for (auto const& device : devices) {
      sycl::queue queue{device};
      auto sizes = device.get_info<sycl::info::device::sub_group_sizes>();
      std::cout << " sub-group sizes supported by the device: " << sizes[0];
      for (int i = 1; i < sizes.size(); ++i) {
        std::cout << ", " << sizes[i];
      }
      std::cout << '\n';
      auto range = sycl::nd_range<1>(1, 1);
      for (int size : sizes) {
        std::cout << "\n test sub-group of " << size << " elements:\n";
        // check if the kernel should be launched with a subgroup size of 4
        if (size == 4) {
          // launch the kernel with a subgroup size of 4
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
          }).wait();
        }
        // check if the kernel should be launched with a subgroup size of 8
        if (size == 8) {
          // launch the kernel with a subgroup size of 8
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
          }).wait();
        }
        // check if the kernel should be launched with a subgroup size of 16
        if (size == 16) {
          // launch the kernel with a subgroup size of 16
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
              do_some_work<16>{}(item);
            });
          }).wait();
        }
        // check if the kernel should be launched with a subgroup size of 32
        if (size == 32) {
          // launch the kernel with a subgroup size of 32
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(32)]] {
              do_some_work<32>{}(item);
            });
          }).wait();
        }
        // check if the kernel should be launched with a subgroup size of 64
        if (size == 64) {
          // launch the kernel with a subgroup size of 64
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
              do_some_work<64>{}(item);
            });
          }).wait();
        }
        // check if the kernel should be launched with a subgroup size of 128
        if (size == 128) {
          // launch the kernel with a subgroup size of 128
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
      }
    }
    std::cout << '\n';
  }
  std::cout << '\n';
}
2. Specify the command which should be used to compile the program
$ source /opt/intel/oneapi/setvars.sh
$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=spir64_x86_64 subgroup_test.cc -o test.cpu
3. Specify the command which should be used to launch the program
ONEAPI_DEVICE_SELECTOR='opencl:cpu' ./test.cpu
4. Indicate what is wrong and what was expected
The program fails to compile, with the error
Failed to build: : -11 (CL_BUILD_PROGRAM_FAILURE)
llvm-foreach:
icpx: error: x86_64 compiler command failed with exit code 245 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).
The expected behaviour is that the program compiles successfully, building the kernel for all the supported subgroup sizes (4, 8, 16, 32, 64) and possibly issuing a warning about the unsupported one (128).
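To make the intent of the reproducer concrete, here is a minimal SYCL-free sketch of the pattern it relies on: a kernel is instantiated at compile time for every candidate size, and only those the device reports at run time are launched. The names `launch_with_sub_group_size` and `launch_supported` are hypothetical stand-ins, not part of the reproducer or of any SYCL API. In the real program the launcher would be a `queue.submit()` whose lambda carries `[[intel::reqd_sub_group_size(S)]]`.

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a kernel launch. It only records the size so the
// dispatch logic can be checked without a SYCL toolchain.
template <std::size_t S>
void launch_with_sub_group_size(std::vector<std::size_t>& launched) {
  launched.push_back(S);
}

// Instantiate the launcher for every candidate size at compile time, but run
// only the ones whose size appears in the list reported by the device.
template <std::size_t... Candidates>
std::vector<std::size_t> launch_supported(const std::vector<std::size_t>& reported) {
  std::vector<std::size_t> launched;
  auto try_one = [&](auto size_tag) {
    constexpr std::size_t S = decltype(size_tag)::value;
    if (std::find(reported.begin(), reported.end(), S) != reported.end()) {
      launch_with_sub_group_size<S>(launched);
    }
  };
  // C++17 fold expression: visits the candidates in the order they are listed.
  (try_one(std::integral_constant<std::size_t, Candidates>{}), ...);
  return launched;
}
```

With a reported list of `{4, 8, 16, 32, 64}`, calling `launch_supported<4, 8, 16, 32, 64, 128>(reported)` launches the five supported sizes and skips 128. The 128 variant is still instantiated, which is exactly the case the ahead-of-time CPU compilation rejects in this report.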
For comparison, Codeplay's NVIDIA plugin produces only a warning about the unsupported subgroup sizes, and builds the kernel correctly for the supported one:
$ icpx -std=c++17 -O2 -g -Wall -Wno-unknown-cuda-version -fsycl -fsycl-targets=nvidia_gpu_sm_86 subgroup_test.cc -o test.nv
subgroup_test.cc:56:86: warning: attribute argument 4 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
[=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
^
subgroup_test.cc:65:86: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
[=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
^
subgroup_test.cc:73:111: warning: attribute argument 16 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
^
subgroup_test.cc:93:111: warning: attribute argument 64 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
^
subgroup_test.cc:103:111: warning: attribute argument 128 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
^
5 warnings generated.
$ ONEAPI_DEVICE_SELECTOR='cuda:gpu' ./test.nv
SYCL platform: NVIDIA CUDA BACKEND
sub-group sizes supported by the device: 32
test sub-group of 32 elements:
the expected sub-group size is 32
the actual sub-group size is 32
Environment (please complete the following information):
- OS: Linux (tested on Ubuntu 22.04 and RHEL 8.7)
- Target device and vendor: Intel CPU
- DPC++ version: 2023.1.0 and 2023.2.0
- Dependencies version: n/a
Additional context
According to the latest SYCL 2020 specification:
5.7. Optional kernel features
A number of kernel features defined by this SYCL specification are optional; they may be supported on some devices but not on other devices. As described in Section 4.6.4.3, an application can test whether a device supports these features by testing whether the device has an associated aspect. The following aspects are those that correspond to optional kernel features:
- fp16
- fp64
- atomic64
In addition, the following C++ attributes from Section 5.8.1 also correspond to optional kernel features because they force the kernel to be compiled in a way that might not run on all devices:
- reqd_work_group_size()
- reqd_sub_group_size()
In order to guarantee source code portability of SYCL applications that use optional kernel features, all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.
(emphasis added)
Note: I would rate this issue as low priority, because the OpenCL CPU runtime supports a wider range of subgroup sizes (4, 8, 16, 32, 64) than any other SYCL backend.
So, while the AOT compiler does not follow the SYCL specification here, this specific issue is unlikely to cause any real-world problems, since hardly anyone is likely to use subgroup sizes smaller than 4 or larger than 64.