`opencl-aot` fails to compile SYCL kernels with an unsupported subgroup size #10531

## Describe the bug

When compiling a SYCL/oneAPI application ahead of time (AOT) for Intel CPUs, the current version of `opencl-aot` (2023.2.0) fails to compile a kernel that requests a subgroup size not supported by the OpenCL CPU runtime.

According to the SYCL specification, _all SYCL implementations must be able to compile device code that uses these optional features_ (such as a required subgroup size) _regardless of whether the implementation supports the features on any of its devices._


## To Reproduce
Please describe the steps to reproduce the behavior:

### 1. Include code snippet as short as possible:

#### `subgroup_test.cc`
```c++
#include <cstdio>
#include <iostream>

#include <sycl/sycl.hpp>


#ifdef __SYCL_DEVICE_ONLY__
#    define __DEVICE_CONSTANT__ [[clang::opencl_constant]]
#else
#    define __DEVICE_CONSTANT__
#endif

#define printf(FORMAT, ...)                                                                                           \
    do                                                                                                                \
    {                                                                                                                 \
        static const char* __DEVICE_CONSTANT__ format = FORMAT;                                                       \
        sycl::ext::oneapi::experimental::printf(format, ##__VA_ARGS__);                                               \
    } while(false)


template <uint32_t S>
struct do_some_work {
  void operator()(sycl::nd_item<1> item) const {
    printf("      the expected sub-group size is %d\n", S);
    printf("      the actual sub-group size is %d\n", static_cast<int>(item.get_sub_group().get_max_local_range()[0]));
  }
};


int main() {
  auto platforms = sycl::platform::get_platforms();

  for (auto const& platform : platforms) {
    std::cout << "SYCL platform: " << platform.get_info<sycl::info::platform::name>() << '\n';
    auto devices = platform.get_devices();

    for (auto const& device : devices) {
      sycl::queue queue{device};

      auto sizes = device.get_info<sycl::info::device::sub_group_sizes>();
      std::cout << "  sub-group sizes supported by the device: " << sizes[0];
      for (std::size_t i = 1; i < sizes.size(); ++i) {
        std::cout << ", " << sizes[i];
      }
      std::cout << '\n';

      auto range = sycl::nd_range<1>(1, 1);
      for (int size : sizes) {
        std::cout << "\n    test sub-group of " << size << " elements:\n";

        // check if the kernel should be launched with a subgroup size of 4
        if (size == 4) {
          // launch the kernel with a subgroup size of 4
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 8
        if (size == 8) {
          // launch the kernel with a subgroup size of 8
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1),
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 16
        if (size == 16) {
          // launch the kernel with a subgroup size of 16
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
              do_some_work<16>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 32
        if (size == 32) {
          // launch the kernel with a subgroup size of 32
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(32)]] {
              do_some_work<32>{}(item);
            });
          }).wait();
        }

        // check if the kernel should be launched with a subgroup size of 64
        if (size == 64) {
          // launch the kernel with a subgroup size of 64
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
              do_some_work<64>{}(item);
            });
          }).wait();
        }
 
        // check if the kernel should be launched with a subgroup size of 128
        if (size == 128) {
          // launch the kernel with a subgroup size of 128
          queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
              do_some_work<128>{}(item);
            });
          }).wait();
        }
      }
    }
    std::cout << '\n';
  }
  std::cout << '\n';
}
```
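
The six near-identical `if (size == N)` blocks in the reproducer map a run-time value to a compile-time constant. As a sketch only, the same mapping can be written once with a C++17 fold expression. The block below is plain C++ with no SYCL dependency; `dispatch_sub_group_size` and `matched_candidate` are hypothetical names, not part of SYCL or DPC++.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not a SYCL API): calls
// f(std::integral_constant<std::size_t, N>{}) for the first candidate N
// equal to `size`. Returns true if a candidate matched, false otherwise.
template <std::size_t... Candidates, typename F>
bool dispatch_sub_group_size(std::size_t size, F&& f) {
  // Fold over ||: evaluation stops at the first matching candidate.
  return ((size == Candidates
               ? (static_cast<void>(f(std::integral_constant<std::size_t, Candidates>{})), true)
               : false) ||
          ...);
}

// Example use: report which compile-time candidate was selected, or 0 if none.
inline std::size_t matched_candidate(std::size_t size) {
  std::size_t matched = 0;
  dispatch_sub_group_size<4, 8, 16, 32, 64, 128>(
      size, [&](auto n) { matched = decltype(n)::value; });
  return matched;
}
```

In the reproducer, the callable passed to the dispatcher would wrap the `queue.submit(...)` call and use `decltype(n)::value` both for `do_some_work<...>` and for the `[[intel::reqd_sub_group_size(...)]]` argument. Whether a given compiler release accepts a template-dependent value in that attribute is an assumption to verify. Restructuring the file this way would also shift the line numbers quoted in the compiler diagnostics.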

### 2. Specify the command which should be used to compile the program

```bash
$ source /opt/intel/oneapi/setvars.sh
$ icpx -std=c++17 -O2 -g -Wall -fsycl -fsycl-targets=spir64_x86_64 subgroup_test.cc -o test.cpu
```

### 3. Specify the command which should be used to launch the program
```bash
ONEAPI_DEVICE_SELECTOR='opencl:cpu' ./test.cpu
```

### 4. Indicate what is wrong and what was expected

The program fails to compile, with the following error:
```
Failed to build: : -11 (CL_BUILD_PROGRAM_FAILURE)

llvm-foreach: 
icpx: error: x86_64 compiler command failed with exit code 245 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).
```

The expected behaviour is that the program compiles successfully, building the kernel for all the supported subgroup sizes (4, 8, 16, 32, 64) and possibly issuing a warning about the unsupported one (128).

For completeness, Codeplay's NVIDIA plugin produces only warnings about the unsupported subgroup sizes, and builds the kernel correctly for the supported one:
```
$ icpx -std=c++17 -O2 -g -Wall -Wno-unknown-cuda-version -fsycl -fsycl-targets=nvidia_gpu_sm_86 subgroup_test.cc -o test.nv
subgroup_test.cc:56:86: warning: attribute argument 4 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(4)]] { do_some_work<4>{}(item); });
                                                                                     ^
subgroup_test.cc:65:86: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
                             [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(8)]] { do_some_work<8>{}(item); });
                                                                                     ^
subgroup_test.cc:73:111: warning: attribute argument 16 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
                                                                                                              ^
subgroup_test.cc:93:111: warning: attribute argument 64 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(64)]] {
                                                                                                              ^
subgroup_test.cc:103:111: warning: attribute argument 128 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
            cgh.parallel_for(sycl::nd_range<1>(1, 1), [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(128)]] {
                                                                                                              ^
5 warnings generated.

$ ONEAPI_DEVICE_SELECTOR='cuda:gpu' ./test.nv 
SYCL platform: NVIDIA CUDA BACKEND
  sub-group sizes supported by the device: 32

    test sub-group of 32 elements:
      the expected sub-group size is 32
      the actual sub-group size is 32

```
 
## Environment (please complete the following information):

- OS: Linux (tested on Ubuntu 22.04 and RHEL 8.7)
- Target device and vendor: Intel CPU
- DPC++ version: 2023.1.0 and 2023.2.0
- Dependencies version: n/a

## Additional context

According to the [latest SYCL 2020 specification](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html):

> ## [5.7. Optional kernel features](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:optional-kernel-features)
> 
> A number of kernel features defined by this SYCL specification are optional; they may be supported on some devices but not on other devices. As described in [Section 4.6.4.3](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:device-aspects), an application can test whether a device supports these features by testing whether the device has an associated aspect. The following aspects are those that correspond to optional kernel features:
>   - `fp16`
>   - `fp64`
>   - `atomic64`
> 
> In addition, the following C++ attributes from [Section 5.8.1](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:kernel.attributes) also correspond to optional kernel features because they force the kernel to be compiled in a way that might not run on all devices:
>   - `reqd_work_group_size()`
>   - `reqd_sub_group_size()`
> 
> In order to guarantee source code portability of SYCL applications that use optional kernel features, **all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.**

(emphasis added)
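
Under this rule, the choice of which kernel variants to run moves to run time: every variant is compiled, and the application launches only those whose subgroup size the device reports as supported. A minimal sketch of that split in plain C++ (no SYCL headers) follows. `launch_plan` and `plan_launches` are hypothetical names, and the `supported` vector stands in for the result of `device.get_info<sycl::info::device::sub_group_sizes>()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result type: which compiled variants to launch on a device.
struct launch_plan {
  std::vector<std::size_t> launch;  // compiled and supported by the device
  std::vector<std::size_t> skip;    // compiled but not supported by the device
};

// Splits the compiled subgroup sizes according to what the device supports.
// `supported` is assumed to come from the device's sub_group_sizes query.
inline launch_plan plan_launches(const std::vector<std::size_t>& compiled,
                                 const std::vector<std::size_t>& supported) {
  launch_plan plan;
  for (std::size_t size : compiled) {
    const bool ok =
        std::find(supported.begin(), supported.end(), size) != supported.end();
    (ok ? plan.launch : plan.skip).push_back(size);
  }
  return plan;
}
```

With the sizes from this report, compiling for 4 through 128 and running on the OpenCL CPU device would launch 4, 8, 16, 32 and 64 and skip 128, while the CUDA device would launch only 32.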

**Note**: I would rate this issue as low priority, because the OpenCL CPU runtime supports a wider range of subgroup sizes (4, 8, 16, 32, 64) than any other SYCL backend.
So, while the AOT compiler does not follow the SYCL specification, this specific issue is unlikely to cause any real-world problems, as hardly anyone will use subgroup sizes smaller than 4 or larger than 64.
