
opencl-aot fails to compile SYCL kernels with an unsupported subgroup size #10531

Description

@fwyzard

Describe the bug

When compiling a SYCL/oneAPI application ahead of time for Intel CPUs, the current version of opencl-aot (2023.2.0) fails to compile a kernel that uses a subgroup size that is not supported by the OpenCL runtime.

According to the SYCL specification, all SYCL implementations must be able to compile device code that uses these optional features (specific required subgroup sizes, etc.) regardless of whether the implementation supports the features on any of its devices.

To Reproduce

Please describe the steps to reproduce the behavior:

1. Include code snippet as short as possible:

subgroup_test.cc

#include <cstdio>
#include <iostream>

#include <sycl/sycl.hpp>


#ifdef __SYCL_DEVICE_ONLY__
#    define __DEVICE_CONSTANT__ [[clang::opencl_constant]]
#else
#    define __DEVICE_CONSTANT__
#endif

#define printf(FORMAT, ...)                                                                                           \
    do                                                                                                                \
    {                                                                                                                 \
        static const char* __DEVICE_CONSTANT__ format = FORMAT;                                                       \
        sycl::ext::oneapi::experimental::printf(format, ##__VA_ARGS__);                                               \
    } while(false)


template <uint32_t S>
struct do_some_work {
  void operator()(sycl::nd_item<1> item) const {
    printf("      the expected sub-group size is %d\n", S);
    printf("      the actual sub-group size is %d\n", item.get_sub_group().get_max_local_range()[0]);
  }
};


int main() {
  auto platforms = sycl::platform::get_platforms();

  for (auto const& platform : platforms) {
    std::cout << "SYCL platform: " << platform.get_info<sycl::info::platform::name>() << '\n';
    auto devices = platform.get_devices();

    for (auto const& device : devices) {
      sycl::queue queue{device};

      auto sizes = device.get_info<sycl::info::device::sub_group_sizes>();
      std::cout << "  sub-group sizes supported by the device: " << sizes[0];
      for (int i = 1; i < sizes.size(); ++i) {
        std::cout << ", " << sizes[i];
      }
      std::cout << '\n';

      auto range = sycl::nd_range<1>(1, 1);
      for (int size : sizes) {
        std::cout << "\n    test sub-group of " << size << " elements:\n";

        // check if the kernel should be launched with a subgroup size of 4
        if (size == 4) {
          // launch the kernel with a subgroup size of 4
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 8
        if (size == 8) {
          // launch the kernel with a subgroup size of 8
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 16
        if (size == 16) {
          // launch the kernel with a subgroup size of 16
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
              do_some_work<16>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 32
        if (size == 32) {
          // launch the kernel with a subgroup size of 32
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(32)]] {
              do_some_work<32>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 64
        if (size == 64) {
          // launch the kernel with a subgroup size of 64
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
              do_some_work<64>{}(item);
            });
          }).wait();
        }
 
        // check if the kernel should be launched with a subgroup size of 128
        if (size == 128) {
          // launch the kernel with a subgroup size of 128
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
      }
    }
    std::cout << '\n';
  }
  std::cout << '\n';
}

2. Specify the command which should be used to compile the program

$ source /opt/intel/oneapi/setvars.sh
$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=spir64_x86_64 subgroup_test.cc -o test.cpu

3. Specify the command which should be used to launch the program

ONEAPI_DEVICE_SELECTOR='opencl:cpu' ./test.cpu

4. Indicate what is wrong and what was expected

The program fails to compile, with the error

Failed to build: : -11 (CL_BUILD_PROGRAM_FAILURE)

llvm-foreach: 
icpx: error: x86_64 compiler command failed with exit code 245 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).

The expected behaviour is that the program compiles correctly, with the kernel built for all the subgroup sizes supported by the device (4, 8, 16, 32, 64), possibly issuing a warning about the unsupported subgroup size (128).

For completeness, Codeplay's NVIDIA plugin produces only warnings about the unsupported subgroup sizes, and builds the kernel correctly for the supported one (32):

$ icpx -std=c++17 -O2 -g -Wall -Wno-unknown-cuda-version -fsycl -fsycl-targets=nvidia_gpu_sm_86 subgroup_test.cc -o test.nv
subgroup_test.cc:56:86: warning: attribute argument 4 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
                                                                                     ^
subgroup_test.cc:65:86: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
                                                                                     ^
subgroup_test.cc:73:111: warning: attribute argument 16 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
                                                                                                              ^
subgroup_test.cc:93:111: warning: attribute argument 64 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
                                                                                                              ^
subgroup_test.cc:103:111: warning: attribute argument 128 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
                                                                                                              ^
5 warnings generated.

$ ONEAPI_DEVICE_SELECTOR='cuda:gpu' ./test.nv 
SYCL platform: NVIDIA CUDA BACKEND
  sub-group sizes supported by the device: 32

    test sub-group of 32 elements:
      the expected sub-group size is 32
      the actual sub-group size is 32

Environment (please complete the following information):

  • OS: Linux (tested on Ubuntu 22.04 and RHEL 8.7)
  • Target device and vendor: Intel CPU
  • DPC++ version: 2023.1.0 and 2023.2.0
  • Dependencies version: n/a

Additional context

According to the latest SYCL 2020 specification:

5.7. Optional kernel features

A number of kernel features defined by this SYCL specification are optional; they may be supported on some devices but not on other devices. As described in Section 4.6.4.3, an application can test whether a device supports these features by testing whether the device has an associated aspect. The following aspects are those that correspond to optional kernel features:

  • fp16
  • fp64
  • atomic64

In addition, the following C++ attributes from Section 5.8.1 also correspond to optional kernel features because they force the kernel to be compiled in a way that might not run on all devices:

  • reqd_work_group_size()
  • reqd_sub_group_size()

In order to guarantee source code portability of SYCL applications that use optional kernel features, all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.

(emphasis added)
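
To illustrate the pattern the specification has in mind (a minimal sketch written for this report, not taken from the specification text), an application queries the device for an aspect such as fp64 at run time and only submits the corresponding kernel when the aspect is present, while the implementation must still be able to compile that kernel for every target:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue queue;

  // The implementation must be able to compile this fp64 kernel for every
  // target; the application only *submits* it when the device reports the
  // corresponding aspect (SYCL 2020, Section 4.6.4.3).
  if (queue.get_device().has(sycl::aspect::fp64)) {
    double* value = sycl::malloc_shared<double>(1, queue);
    queue.single_task([=] { *value = 1.0 / 3.0; }).wait();
    sycl::free(value, queue);
  }
}

The same reasoning applies to reqd_sub_group_size(): the runtime check on sub_group_sizes in the reproducer above is the application's responsibility, while compiling the guarded kernels is the implementation's.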

Note: I would rate this issue as low priority, because the OpenCL CPU runtime already supports a wider range of subgroup sizes (4, 8, 16, 32, 64) than any other SYCL backend.
So, while the AOT compiler does not follow the SYCL specification, this specific issue is unlikely to cause any real-world problems, as nobody is likely to use subgroup sizes smaller than 4 or larger than 64.
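
In the meantime, one possible application-side workaround (a sketch only; the SUPPORTS_SUB_GROUP_128 macro is hypothetical and not part of the reproducer) is to compile the offending instantiation out of builds that target spir64_x86_64, defining the macro (e.g. -DSUPPORTS_SUB_GROUP_128) only for targets expected to support that size:

// Hypothetical guard around the last branch of the reproducer, so the AOT CPU
// compilation never sees the unsupported reqd_sub_group_size(128) kernel.
#ifdef SUPPORTS_SUB_GROUP_128
        if (size == 128) {
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
#endif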
