[CI] Enable CUDA SYCL CTS tests #6439
Signed-off-by: Yin Yang <yin.yang@intel.com>
Here is the summary of the SYCL-CTS execution on NVIDIA GPU. Tests we can't build: Many tests fail due to the error below: @yinyangsx, I think something is wrong either with the HW or with the CUDA driver installation. @pvchupin, FYI.
export LD_LIBRARY_PATH=$PWD/toolchain/lib/:$LD_LIBRARY_PATH
export PATH=$PWD/toolchain/bin/:$PATH
# TODO make this part of container build
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/hip/lib/:/opt/rocm/lib
export SYCL_DEVICE_FILTER=${{ matrix.sycl_device_filter }}
if [ -e /runtimes/oneapi-tbb/env/vars.sh ]; then
  source /runtimes/oneapi-tbb/env/vars.sh;
elif [ -e /opt/runtimes/oneapi-tbb/env/vars.sh ]; then
  source /opt/runtimes/oneapi-tbb/env/vars.sh;
else
  echo "no TBB vars in /opt/runtimes or /runtimes";
fi
# TODO remove workaround of FPGA emu bug
mkdir -p icd
echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd
export OCL_ICD_VENDORS=$PWD/icd
Why is this needed? I expect the container to be already correctly configured.
This is what I set up to mimic the environment of the CUDA LLVM Test Suite. The CUDA LLVM Test Suite currently works well, so I think the environment should be correct. Once the tests are stable, I will move this configuration into the image build.
Ok. I'll review it once it's marked as ready for review.
NOTE: Some of the tests you disabled should be fixed by KhronosGroup/SYCL-CTS#367.
I suggest checking the status using the latest SYCL-CTS source and re-enabling the disabled tests. Ideally, we should use the filter file from the Khronos SYCL-CTS repository instead of creating a local duplicate.
bader
left a comment
I can't reproduce the issues from the latest CI test results on a machine with an NVIDIA GPU, but I don't use Docker containers, so I suspect containers might be the reason for the massive failures. Something might be wrong with how the NVIDIA GPU device/driver is configured inside the container.
devops/cts_exclude_filter
Outdated
buffer
exceptions
multi_ptr
opencl_interop
reduction
group
kernel
math_builtin_api
nd_item
optional_kernel_features
Suggested change, replacing:
buffer
exceptions
multi_ptr
opencl_interop
reduction
group
kernel
math_builtin_api
nd_item
optional_kernel_features
with:
buffer
exceptions
multi_ptr
reduction
vector_swizzles
optional_kernel_features
In addition to this list:
- accessor_legacy will abort due to unsupported image types, so I suggest excluding it too. EDIT: for NVIDIA GPU ONLY(!). Most likely the issue exists for AMD GPUs.
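The exclusion rule above (a shared list plus accessor_legacy for NVIDIA GPU only) can be sketched as a small helper. This is only an illustration: `excludeListFor`, `toFilterFile`, the `cuda` key, and the one-category-per-line file layout are assumptions, and the sketch assumes the shared list applies to every backend.

```javascript
// Categories taken from the suggested exclude list in this review.
const commonExcludes = [
  "buffer",
  "exceptions",
  "multi_ptr",
  "reduction",
  "vector_swizzles",
  "optional_kernel_features",
];

// Extra exclusions per backend. The "cuda" key is a hypothetical label;
// accessor_legacy is excluded for NVIDIA GPU only, as noted in the review.
const backendExcludes = new Map([["cuda", ["accessor_legacy"]]]);

// Hypothetical helper: shared list plus backend-specific additions, deduplicated.
function excludeListFor(backend) {
  const extra = backendExcludes.get(backend) ?? [];
  return [...new Set([...commonExcludes, ...extra])];
}

// Hypothetical helper: render the list as text, one category per line
// (assumed layout of the filter file).
function toFilterFile(categories) {
  return categories.join("\n") + "\n";
}

console.log(toFilterFile(excludeListFor("cuda")));
```

Keeping the backend-specific entries separate makes it easier to drop them once the underlying issue is fixed, without touching the shared list.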
devops/test_configs.json
Outdated
],
"image": "${{ inputs.cuda_image }}",
"container_options": "--gpus all",
"sycl_device_filter": "cuda:gpu,host",
Suggested change, replacing:
"sycl_device_filter": "cuda:gpu,host",
with:
"sycl_device_filter": "ext_oneapi_cuda:gpu,host",
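The rename suggested here can be expressed as a small rewrite over the comma-separated filter string. This is a sketch under stated assumptions: `modernizeDeviceFilter` and `renamedBackends` are hypothetical names, and the mapping contains only the `cuda` to `ext_oneapi_cuda` rename that this review confirms. Every other entry is passed through unchanged.

```javascript
// Hypothetical mapping; only the cuda entry comes from the suggestion in this review.
const renamedBackends = new Map([["cuda", "ext_oneapi_cuda"]]);

// Hypothetical helper: rewrite the backend part of each "backend:device" entry.
function modernizeDeviceFilter(filter) {
  return filter
    .split(",")
    .map((entry) => {
      const [backend, ...rest] = entry.split(":");
      const renamed = renamedBackends.get(backend) ?? backend;
      return [renamed, ...rest].join(":");
    })
    .join(",");
}

console.log(modernizeDeviceFilter("cuda:gpu,host"));
```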
This reverts commit 8fa06e9.
There is an internal error in the ptxas tool: ptxas fatal : Internal error: reference to deleted symbol. It's unclear what is causing this problem. Let's disable some tests to work around this issue.
Test aborts on NVIDIA GPU: tests/kernel_bundle/../common/../../util/proxy.h:47: FAILED: due to a fatal error condition: SIGABRT - Abort (abnormal termination) signal
@yinyangsx, @pvchupin, I addressed code review comments and enabled more tests. There are a few issues with compilation and running CTS, which I'm going to report via GitHub issues.
Current status of issues for SYCL-CTS on NVIDIA GPU.
I started with building a clean version of Khronos SYCL CTS:
Compiler commit - f553870
I've added all failing tests to the filter to exclude them from the build:
I've excluded optional_kernel_features from the build in:
I've excluded vector tests from the build in:
Next I excluded accessor and accessor_legacy tests from the build in:
I've excluded kernel_bundle tests from the build in:
I've excluded specialization_constants tests from the build in:
TODO:
const ctsConfigs = inputs.cts_config.split(';');

const enabledCTSConfigs = [];

testConfigs.cts.forEach(v => {
  if (ctsConfigs.includes(v.config)) {
    if (needsDrivers) {
      v["env"] = {
        "compute_runtime_tag" :
            driverNew["linux"]["compute_runtime"]["github_tag"],
        "igc_tag" : driverNew["linux"]["igc"]["github_tag"],
        "cm_tag" : driverNew["linux"]["cm"]["github_tag"],
        "tbb_tag" : driverNew["linux"]["tbb"]["github_tag"],
        "cpu_tag" : driverNew["linux"]["oclcpu"]["github_tag"],
        "fpgaemu_tag" : driverNew["linux"]["fpgaemu"]["github_tag"],
      };
    } else {
      v["env"] = {};
    }
    enabledCTSConfigs.push(v);
  }
});

let ctsString = JSON.stringify(enabledCTSConfigs);
console.log(ctsString);

for (let [key, value] of Object.entries(inputs)) {
  ctsString = ctsString.replaceAll("${{ inputs." + key + " }}", value);
}
if (needsDrivers) {
  ctsString = ctsString.replaceAll(
      "ghcr.io/intel/llvm/ubuntu2004_intel_drivers:latest",
      "ghcr.io/intel/llvm/ubuntu2004_base:latest");
}

core.setOutput('cts_matrix', ctsString);
Please integrate all of that into the code above (see how lts_matrix and lts_aws_matrix are handled together).
I'm not sure it makes sense to "integrate" the CTS code into llvm-test-suite. I see that @yinyangsx duplicated the configuration. I guess this was done to be able to configure two independent execution environments for running llvm-test-suite and CTS.
I suggest we try to refactor this code in a separate PR. Are you okay with it?
I'm ok with a separate PR. Thanks. I think the code will be cleaner after refactoring.
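One possible shape for that follow-up refactoring is a single helper that both the llvm-test-suite and CTS matrices call, so the filtering and placeholder substitution live in one place. This is a rough sketch, not the workflow's actual code: `buildMatrix`, its parameters, and the stub configs and image names are all hypothetical, and the driver tags are passed in as a ready-made object rather than read from a dependencies file.

```javascript
// Hypothetical helper: build the job-matrix string for one test suite.
// allConfigs: entries from the test config file; enabledNames: configs to keep;
// inputs: workflow inputs; driverEnv: driver tags object, or null when not needed.
function buildMatrix(allConfigs, enabledNames, inputs, driverEnv) {
  const enabled = allConfigs
    .filter((v) => enabledNames.includes(v.config))
    // Copy each entry so the shared config objects are not mutated.
    .map((v) => ({ ...v, env: driverEnv ? { ...driverEnv } : {} }));

  let matrix = JSON.stringify(enabled);
  // Substitute "${{ inputs.<key> }}" placeholders with the input values.
  for (const [key, value] of Object.entries(inputs)) {
    matrix = matrix.replaceAll("${{ inputs." + key + " }}", value);
  }
  return matrix;
}

// Stub data standing in for the real config file and workflow inputs.
const stubConfigs = {
  lts: [{ config: "cuda", image: "${{ inputs.cuda_image }}" }],
  cts: [
    { config: "cuda", image: "${{ inputs.cuda_image }}" },
    { config: "hip", image: "${{ inputs.hip_image }}" },
  ],
};
const stubInputs = {
  cuda_image: "example/cuda:latest",
  hip_image: "example/hip:latest",
};

const ltsMatrix = buildMatrix(stubConfigs.lts, ["cuda"], stubInputs, null);
const ctsMatrix = buildMatrix(stubConfigs.cts, ["cuda"], stubInputs, null);
console.log(ltsMatrix);
console.log(ctsMatrix);
```

Because each suite still passes its own config list and enabled names, the two execution environments stay independently configurable while the duplicated logic disappears.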
# TODO remove workaround of FPGA emu bug
mkdir -p icd
echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd
Is it needed?
It's needed for execution on OpenCL GPU/CPU devices. I can add these configurations right away. Are you okay with that?
Let's start with CUDA first, but I think we want to add Level Zero GPU or OpenCL CPU, or maybe both as a reference.
Hard to estimate impact on the runners available.
Can be done in a separate patch.
This reverts commit 1326960.
No description provided.