-
Notifications
You must be signed in to change notification settings - Fork 801
[SYCL][Devicelib] Implement cmath rintf wrapper with __spirv_ocl_rint #18857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
npmiller
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, we could probably remove the NVPTX/AMDGCN specific path and use the __spirv_ocl_rint path for them as well, but we can look at that in a different patch.
|
In general, there is no point in adding |
againull
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test please.
Could you point me to the wrapper that implements This PR is to support use of |
Hi, @bader and @wenju-he |
done, thanks. |
It's not actually ready yet, but I've been working on that in #18706 which will address this rint issue once completed, but it needs more work and is currently only tested for CUDA/HIP. So I think it's fine to go ahead with this PR until the header solution is ready. |
|
### Overview Currently to support C++ builtins in SYCL kernels, we rely on `libdevice` which provides implementations for standard library builtins. This library is built either to bitcode or SPIR-V and linked in our kernels. On some targets this causes issues because clang sometimes turns standard library calls into LLVM intrinsics that not all targets support. Specifically on NVPTX and AMDGCN we can't easily support these intrinsics because we currently use implementations provided by CUDA and HIP in the form of a bitcode library, which is not something we can use from the LLVM backend. In upstream LLVM for CUDA and HIP kernels, the way this is handled is that they have clang headers providing device-side overloads of C++ library functions that hook into the target specific versions of the builtins (for example `std::sin` to `__nv_sin`). This way on the device side C++ builtins are hijacked before clang can turn them to intrinsics which solves the issue mentioned above. This patch is adding the infrastructure to support handling C++ builtins in SYCL in the same way as it is done for CUDA and HIP in upstream LLVM. And is using it to support `cmath` in NVPTX and AMDGCN compilation. ## Breakdown * Add `sycl_device_only` attribute: This new attribute allows functions marked with it to be treated as device-side overload of existing functions. This is what allows us to overload C++ library functions for device in SYCL. * Remove clang hack to prevent generating LLVM intrinsics from standard library builtins for NVPTX and AMDGCN. In theory since this is only moving `cmath`, the hack could still be needed, but it looks fine in testing and if we run into issues we should just move the problematic builtins to this solution. The test `sycl-libdevice-cmath.cpp` was testing this hack, so it was removed. * `cmath` support for NVPTX and AMDGCN in `libdevice` was disabled. To limit the scope of the patch `libdevice` is still fully wired up for these targets, but it just won't provide the `cmath` functions. * Added a `cmath-fallback.h` header providing the device-side math function overloads. They are defined using SPIR-V builtins, so in theory this header could be used as-is for other targets. * Use our existing `cmath` stl wrapper to include `cmath-fallback.h` for NVPTX and AMDGCN. In upstream LLVM `clang-cuda` always includes with `-include` the header with these overloads, using the stl wrappers is a bit more selective. * Add `rint` to device lib tests and stl wrapper, this was added in #18857 but wasn't in E2E testing. ## Compile-time performance A quick check of compile-time shows that this seems to provide a small performance improvement. Using two samples, one using cmath (the E2E `cmath_test.cpp`), and a sample not using cmath, over 10 iterations, I'm getting the following results: | Run | Mean | Stdev | |:--:|:--:|:--:| |With patch, cmath sample | 4.2229s | 0.0294s | |With patch, no cmath sample | 5.7484s | 0.0525s | |Without patch, cmath sample | 4.3817s | 0.0424s | |Without patch, no cmath sample | 5.7941s | 0.0452s | Which suggest that the no cmath compile time performance is pretty much equivalent, and the cmath compile-time performance is faster by roughly ~0.12s. And this is with the whole `libdevice` setup still in place, so it's possible this approach could be even more beneficial with more work. ## Future work * Investigate commented out standard math builtins in `cmath-fallback.h`, these weren't defined in libdevice, we should either remove the commented out lines or implement them properly. * Untangle `cmath` and `math.h`, the current `cmath-fallback.h` implements both which seems to work fine, but ideally we should split it up. * Deal with `nearbyint`, this was only implemented for NVPTX and AMDGCN in `libdevice`, this patch keeps it the same, but we should look into proper support and testing for this. * Move more of `libdevice` into headers (complex, assert, crt, etc ...). * Try this approach for SPIR-V or other targets. --------- Co-authored-by: Alexey Bader <alexey.bader@intel.com>
This PR is to support the use of std::rint in device code. Currently it
is resolved to rintf symbol. With this PR, the rintf symbol is resolved
by libdevice.