[sycl_ext_oneapi_clock] implement NVPTX case#21280
[sycl_ext_oneapi_clock] implement NVPTX case#21280tdavidcl wants to merge 4 commits intointel:syclfrom
Conversation
|
Also I just found out that there is this file in LLVM We could maybe use that for both Nvidia and AMD since that's what is called within the CI. |
|
Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments. |
Co-authored-by: Alexey Bader <alexey.bader@intel.com>
It seems that the native builtins are better whenever available. I can try to replace the amd & the else branch by __builtin_readcyclecounter then ? |
|
@tdavidcl Please give a try for the amd and the else branch. Thanks. |
|
I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ? |
https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock |
|
@tdavidcl thanks for working on this! Also, these functions require device to support aspects: llvm/sycl/include/sycl/ext/oneapi/experimental/clock.hpp Lines 47 to 49 in 1a77753 That means we also need something like this but for CUDA adapter. |
| // this is due to potential higher overhead compared to a native API call | ||
| // see : https://github.com/ROCm/ROCm/issues/1288 | ||
| #if defined(__NVPTX__) | ||
| if constexpr (Scope == work_group || Scope == sub_group) { |
There was a problem hiding this comment.
| if constexpr (Scope == work_group || Scope == sub_group) { | |
| if constexpr (Scope == clock_scope::work_group || Scope == clock_scope::sub_group) { |
Note - do not apply this as is, clang-format will fail because strings should be <= 80 symbols.
There was a problem hiding this comment.
probably like:
if constexpr (Scope == clock_scope::work_group ||
Scope == clock_scope::sub_group) {
Oh perfect it looks like no changes are required in the tests beside enabling the device aspect. Additionally, in the clock test there is this snippet I have to check but i think that |
Hi after suggestion from @zjin-lcf here is a PR (context: KhronosGroup/SYCL-Docs#958).
It implements the NVPTX variant of clock() using the
%%clock64special register from PTX.https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-execution/index.html?utm_source=chatgpt.com#special-registers-clock64
So it is safe to assume that the register is supported regardless of the PTX version used since intel llvm assume >5.0 if I recall correctly.
reference for usage internally to llvm (on this repo actually, nice :) )
llvm/clang/lib/Headers/__clang_cuda_device_functions.h
Line 1558 in 2705b67
(there is a typo in the PR which is already corrected by a commit, but i don't why it is not updating in the PR ...)