Verification results differ across vendors' GPUs #16636

jinz2014 · 2025-01-14T22:13:53Z

icpx 2025.0 with NVIDIA/AMD plugins:

Verification of the SYCL program in https://github.com/zjin-lcf/HeCBench/tree/master/src/quantVLLM-sycl fails for some input types, and which types fail depends on the device.

Intel Max GPU:

./main 4096 5137 1000
Input type is FP16
PASS
Input type is BF16
FAIL
Input type is FP32
PASS

NVIDIA/AMD GPU:
FAIL for all three data types

The CUDA and HIP programs run successfully on the NVIDIA and AMD GPUs.

dm-vodopyanov · 2025-01-15T15:56:40Z

Hi @jinz2014, thanks for the report. Could you please also attach sycl-ls --verbose output?

I reproduced the issue, but on a different Intel GPU I got different results:

./a.out 4096 5137 1000
Input type is FP16
Average execution time of static_scaled_int8_quant kernel: 1882.331787 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 463.766968 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 137.035599 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 756.085266 (us)
FAIL
Input type is BF16
Average execution time of static_scaled_int8_quant kernel: 629.392334 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 423.225311 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 131.694229 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 708.030212 (us)
FAIL
Input type is FP32
Average execution time of static_scaled_int8_quant kernel: 258.933594 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 263.415619 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 142.391510 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 343.215698 (us)
PASS

jinz2014 · 2025-01-15T16:33:28Z

I assume that the SYCL port matches the CUDA/HIP programs. I also tried SYCLomatic (Intel(R) DPC++ Compatibility Tool version 2025.0.0), but the migrated files failed to build.

Here is the sycl-ls --verbose output:

[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Silver 4410T OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.17.26241.33]

Platforms: 2
Platform [#1]:
Version : OpenCL 3.0 LINUX
Name : Intel(R) OpenCL
Vendor : Intel(R) Corporation
Devices : 1
Device [#0]:
Type : cpu
Version : OpenCL 3.0 (Build 0)
Name : Intel(R) Xeon(R) Silver 4410T
Vendor : Intel(R) Corporation
Driver : 2024.18.12.0.05_160000
Num SubDevices : 2
Num SubSubDevices : 0
Aspects : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_matrix ext_oneapi_private_alloca
info::device::sub_group_sizes: 4 8 16 32 64
Architecture: intel_cpu_spr
Platform [#2]:
Version : OpenCL 3.0
Name : Intel(R) OpenCL Graphics
Vendor : Intel(R) Corporation
Devices : 1
Device [#1]:
Type : gpu
Version : OpenCL 3.0 NEO
Name : Intel(R) Data Center GPU Max 1100
Vendor : Intel(R) Corporation
Driver : 23.17.26241.33
UUID : 1341282181147000440000000
Num SubDevices : 0
Num SubSubDevices : 0
Aspects : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_srgb ext_intel_device_id ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_matrix ext_oneapi_private_alloca
info::device::sub_group_sizes: 16 32
Architecture: intel_gpu_pvc
default_selector() : gpu, Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.17.26241.33]
accelerator_selector() : No device of requested type available. Please chec...
cpu_selector() : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) Silver 4410T OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
gpu_selector() : gpu, Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.17.26241.33]
custom_selector(gpu) : gpu, Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO [23.17.26241.33]
custom_selector(cpu) : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) Silver 4410T OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]

KseniyaTikhomirova · 2025-02-12T18:47:18Z

I found that the verification results differ when the value being rounded lies at or near a midpoint between two integers. For example, the result before rounding evaluates to 63.5... on the host and to 63.499996 on the device. After rounding we get 64 vs. 63, and the test fails.

Floating-point math has different precision on the host (precise) and on the device. The device only has to satisfy the following accuracy requirements: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#relative-error-as-ulps
So some differences are to be expected due to precision.

We recently added a new compiler option that can improve fdiv accuracy on the device: #15836.

Without this option, the BF16 test fails on Intel GPU hardware:

Input type is FP16
Average execution time of static_scaled_int8_quant kernel: 100954.109375 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 88.623001 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 75.143005 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 86.761002 (us)
PASS
Input type is BF16
Average execution time of static_scaled_int8_quant kernel: 112547.500000 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 102.562004 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 87.197006 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 100.933006 (us)
FAIL
Input type is FP32
Average execution time of static_scaled_int8_quant kernel: 59.606003 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 56.462002 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 63.856003 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 80.416000 (us)
PASS

With the -foffload-fp32-prec-div option, all three tests pass on the same system:

Input type is FP16
Average execution time of static_scaled_int8_quant kernel: 138122.687500 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 111.836006 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 98.415001 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 74.494003 (us)
PASS
Input type is BF16
Average execution time of static_scaled_int8_quant kernel: 181717.093750 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 86.713005 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 114.898003 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 88.442001 (us)
PASS
Input type is FP32
Average execution time of static_scaled_int8_quant kernel: 69.201004 (us)
Average execution time of static_scaled_int8_quant_azp kernel: 64.743004 (us)
Average execution time of dynamic_scaled_int8_quant kernel: 69.262001 (us)
Average execution time of dynamic_scaled_int8_quant_azp kernel: 76.831001 (us)
PASS

@jinz2014, could you please verify whether this new option resolves the problem on your side?

jinz2014 · 2025-03-04T16:29:09Z

Thank you very much for your answer.

dm-vodopyanov added the cuda, hip, confirmed, bug, and Need info labels on Jan 15, 2025

dm-vodopyanov removed the Need info label on Jan 15, 2025

jinz2014 closed this as completed on Mar 4, 2025

zjin-lcf added a commit to zjin-lcf/HeCBench that referenced this issue on Mar 7, 2025:

[quantVLLM-sycl] add a compiler option in the Makefile (intel/llvm#16636) 08b126d
