Skip to content

[sycl_ext_oneapi_clock] implement NVPTX case#21280

Open
tdavidcl wants to merge 4 commits intointel:syclfrom
tdavidcl:ptx-clock
Open

[sycl_ext_oneapi_clock] implement NVPTX case#21280
tdavidcl wants to merge 4 commits intointel:syclfrom
tdavidcl:ptx-clock

Conversation

@tdavidcl
Copy link

@tdavidcl tdavidcl commented Feb 12, 2026

Hi after suggestion from @zjin-lcf here is a PR (context: KhronosGroup/SYCL-Docs#958).
It implements the NVPTX variant of clock() using the %%clock64 special register from PTX.

https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-execution/index.html?utm_source=chatgpt.com#special-registers-clock64

PTX ISA Notes
Introduced in PTX ISA version 2.0.

So it is safe to assume that the register is supported regardless of the PTX version used since intel llvm assume >5.0 if I recall correctly.

reference for usage internally to llvm (on this repo actually, nice :) )

__DEVICE__ long long clock64() { return __nvvm_read_ptx_sreg_clock64(); }

(there is a typo in the PR which is already corrected by a commit, but i don't why it is not updating in the PR ...)

@tdavidcl tdavidcl requested a review from a team as a code owner February 12, 2026 19:16
@tdavidcl
Copy link
Author

Also I just found out that there is this file in LLVM libc/src/__support/GPU/utils.h which does define
uint64_t processor_clock() { return __builtin_readcyclecounter(); }
which is used in all test apparently.

We could maybe use that for both Nvidia and AMD since that's what is called within the CI.

@zjin-lcf
Copy link
Contributor

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

Co-authored-by: Alexey Bader <alexey.bader@intel.com>
@tdavidcl
Copy link
Author

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

It seems that the native builtins are better whenever available. I can try to replace the amd & the else branch by __builtin_readcyclecounter then ?

@zjin-lcf
Copy link
Contributor

@tdavidcl Please give a try for the amd and the else branch. Thanks.

@tdavidcl
Copy link
Author

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

@KornevNikita
Copy link
Contributor

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

@KornevNikita
Copy link
Contributor

@tdavidcl thanks for working on this! Also, these functions require device to support aspects:

#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::__uses_aspects__(sycl::aspect::ext_oneapi_clock_device)]]
#endif

That means we also need something like this but for CUDA adapter.

// this is due to potential higher overhead compared to a native API call
// see : https://github.com/ROCm/ROCm/issues/1288
#if defined(__NVPTX__)
if constexpr (Scope == work_group || Scope == sub_group) {
Copy link
Contributor

@KornevNikita KornevNikita Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if constexpr (Scope == work_group || Scope == sub_group) {
if constexpr (Scope == clock_scope::work_group || Scope == clock_scope::sub_group) {

Note - do not apply this as is, clang-format will fail because strings should be <= 80 symbols.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably like:

if constexpr (Scope == clock_scope::work_group ||
              Scope == clock_scope::sub_group) {

@tdavidcl
Copy link
Author

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

Oh perfect it looks like no changes are required in the tests beside enabling the device aspect. Additionally, in the clock test there is this snippet

// UNSUPPORTED: target-native_cpu
// UNSUPPORTED-TRACKER: https://github.com/intel/llvm/issues/20142

I have to check but i think that __builtin_readcyclecounter does support the host and maybe clock() could also be enabled for target-native_cpu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants