[CI] Enable CUDA SYCL CTS tests #6439

yinyangsx · 2022-07-14T06:55:39Z

No description provided.

bader · 2022-07-26T11:09:57Z

Here is the summary of the SYCL-CTS execution on NVIDIA GPU.

Tests we can't build:
buffer
exceptions
multi_ptr
opencl_interop
reduction
group
kernel
math_builtin_api
nd_item
optional_kernel_features

Many tests fail with the error below:

2022-07-26T03:36:28.2377097Z get_kernel_bundle_with_incompatible_kernels
2022-07-26T03:36:28.2377274Z -------------------------------------------------------------------------------
2022-07-26T03:36:28.2377387Z /__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/../../util/proxy.h:35
2022-07-26T03:36:28.2377462Z ...............................................................................
2022-07-26T03:36:28.2377466Z 
2022-07-26T03:36:28.2377584Z /__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/assertions.h:85: FAILED:
2022-07-26T03:36:28.2377646Z explicitly with message:
2022-07-26T03:36:28.2377762Z   No expected exception thrown for sycl::get_kernel_bundle(context, devices,
2022-07-26T03:36:28.2377833Z   kernel_ids) with bundle state input
2022-07-26T03:36:28.2377838Z 
2022-07-26T03:36:28.2377873Z 
2022-07-26T03:36:28.2377930Z PI CUDA ERROR:
2022-07-26T03:36:28.2377984Z                Value:           700
2022-07-26T03:36:28.2378054Z                Name:            CUDA_ERROR_ILLEGAL_ADDRESS
2022-07-26T03:36:28.2378152Z                Description:     an illegal memory access was encountered
2022-07-26T03:36:28.2378217Z                Function:        build_program
2022-07-26T03:36:28.2378326Z                Source Location: /__w/llvm/llvm/src/sycl/plugins/cuda/pi_cuda.cpp:703
2022-07-26T03:36:28.2378331Z 
2022-07-26T03:36:28.2378335Z 
2022-07-26T03:36:28.2378386Z PI CUDA ERROR:
2022-07-26T03:36:28.2378438Z                Value:           400
2022-07-26T03:36:28.2378512Z                Name:            CUDA_ERROR_INVALID_HANDLE
2022-07-26T03:36:28.2378590Z                Description:     invalid resource handle
2022-07-26T03:36:28.2378661Z                Function:        cuda_piProgramRelease
2022-07-26T03:36:28.2378770Z                Source Location: /__w/llvm/llvm/src/sycl/plugins/cuda/pi_cuda.cpp:3546
2022-07-26T03:36:28.2378775Z 
2022-07-26T03:36:28.2379013Z terminate called after throwing an instance of 'cl::sycl::runtime_error'
2022-07-26T03:36:28.2379294Z   what():  Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

@yinyangsx, I think something is wrong either with the HW or with the CUDA driver installation.
@AerialMantis, any other ideas?

@pvchupin, FYI.
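For triaging logs like the one above, the `PI CUDA ERROR` blocks can be extracted mechanically. The following is a minimal sketch, not part of the CI scripts: `parsePiCudaErrors` is a hypothetical helper, and it assumes the layout shown in the log (an optional GitHub Actions timestamp, a `PI CUDA ERROR:` line, then `Key: value` lines).

```javascript
// Hypothetical helper: collects "PI CUDA ERROR" blocks from a CI log.
// Assumes each block starts with "PI CUDA ERROR:" and is followed by
// "Key: value" lines, as in the log quoted above.
function parsePiCudaErrors(logText) {
  const errors = [];
  let current = null;
  for (const rawLine of logText.split('\n')) {
    // Drop the leading GitHub Actions timestamp, if present.
    const line = rawLine.replace(/^\d{4}-\d{2}-\d{2}T[\d:.]+Z ?/, '').trim();
    if (line === 'PI CUDA ERROR:') {
      current = {};
      errors.push(current);
      continue;
    }
    if (current === null) continue;
    const match = line.match(
        /^(Value|Name|Description|Function|Source Location):\s*(.*)$/);
    if (match) {
      current[match[1]] = match[2];
    } else if (line !== '') {
      // Any other non-empty line ends the current block.
      current = null;
    }
  }
  return errors;
}

// Usage with a few lines copied from the log above.
const sampleLog = [
  '2022-07-26T03:36:28.2377930Z PI CUDA ERROR:',
  '2022-07-26T03:36:28.2377984Z                Value:           700',
  '2022-07-26T03:36:28.2378054Z                Name:            CUDA_ERROR_ILLEGAL_ADDRESS',
  '2022-07-26T03:36:28.2378217Z                Function:        build_program',
].join('\n');
console.log(parsePiCudaErrors(sampleLog).map((e) => e.Name + ' in ' + e.Function));
```

Grouping the parsed entries by `Name` or `Function` would show whether the failures share a single root cause, such as the container setup suspected here.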

bader · 2022-07-26T11:05:01Z

devops/actions/khronos_cts_test/action.yml

+      export LD_LIBRARY_PATH=$PWD/toolchain/lib/:$LD_LIBRARY_PATH
+      export PATH=$PWD/toolchain/bin/:$PATH
+      # TODO make this part of container build
+      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/hip/lib/:/opt/rocm/lib
+      export SYCL_DEVICE_FILTER=${{ matrix.sycl_device_filter }}
+      if [ -e /runtimes/oneapi-tbb/env/vars.sh ]; then
+        source /runtimes/oneapi-tbb/env/vars.sh;
+      elif [ -e /opt/runtimes/oneapi-tbb/env/vars.sh ]; then
+        source /opt/runtimes/oneapi-tbb/env/vars.sh;
+      else
+        echo "no TBB vars in /opt/runtimes or /runtimes";
+      fi
+      # TODO remove workaround of FPGA emu bug
+      mkdir -p icd
+      echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
+      echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
+      echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd
+      export OCL_ICD_VENDORS=$PWD/icd


Why is this needed? I expect the container to be already correctly configured.

This is what I have set up to mimic the environment of the CUDA LLVM Test Suite job. That job is currently working well, so I think the environment should be correct. Once the test is stable, I will move this configuration into the image build.

Ok. I'll do review when it's marked as ready for review then.

NOTE: Some of the tests you disabled should be fixed by KhronosGroup/SYCL-CTS#367.
I suggest checking the status using the latest SYCL-CTS source and re-enabling the disabled tests. Ideally we should use the filter file from the Khronos SYCL-CTS repository instead of creating a local duplicate.

bader

I can't reproduce the issues from the latest CI test results on a machine with an NVIDIA GPU, but I don't use Docker containers, so I suspect that might be the reason for the massive failures. Something might be wrong with how the NVIDIA GPU device/driver is configured inside the container.

bader · 2022-08-01T17:50:40Z

devops/cts_exclude_filter

+buffer
+exceptions
+multi_ptr
+opencl_interop
+reduction
+group
+kernel
+math_builtin_api
+nd_item
+optional_kernel_features


Suggested change

-buffer
-exceptions
-multi_ptr
-opencl_interop
-reduction
-group
-kernel
-math_builtin_api
-nd_item
-optional_kernel_features
+buffer
+exceptions
+multi_ptr
+reduction
+vector_swizzles
+optional_kernel_features

In addition to this list:

accessor_legacy will abort due to unsupported image types, so I suggest excluding it too. EDIT: for NVIDIA GPU ONLY(!). Most likely the issue also exists for AMD GPUs.

bader · 2022-08-01T17:53:07Z

devops/test_configs.json

+      ],
+      "image": "${{ inputs.cuda_image }}",
+      "container_options": "--gpus all",
+      "sycl_device_filter": "cuda:gpu,host",


Suggested change

-      "sycl_device_filter": "cuda:gpu,host",
+      "sycl_device_filter": "ext_oneapi_cuda:gpu,host",

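The suggestion above swaps one backend name inside a comma-separated filter string. A small sketch of that rewrite follows, assuming only the `backend:device_type` entry form visible in `test_configs.json`; `renameBackend` is a hypothetical helper and not part of the CI scripts.

```javascript
// Hypothetical helper: renames one backend in a device filter string of the
// comma-separated "backend:device_type" form used in test_configs.json.
function renameBackend(filter, from, to) {
  return filter
      .split(',')
      .map((entry) => {
        const [backend, ...rest] = entry.split(':');
        // Only an exact backend match is renamed; entries such as "host"
        // (no device type) pass through unchanged.
        return backend === from ? [to, ...rest].join(':') : entry;
      })
      .join(',');
}

console.log(renameBackend('cuda:gpu,host', 'cuda', 'ext_oneapi_cuda'));
// prints: ext_oneapi_cuda:gpu,host
```

Matching on the whole backend token rather than doing a plain substring replace avoids rewriting a value that already reads `ext_oneapi_cuda`.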
bader · 2022-09-18T08:03:25Z

@yinyangsx, @pvchupin, I addressed code review comments and enabled more tests. There are a few issues with compilation and running CTS, which I'm going to report via GitHub issues.

bader · 2022-09-18T11:17:05Z

Current status of issues for SYCL-CTS on NVIDIA GPU.

I started by building a clean version of Khronos SYCL CTS:

Compiler commit - f553870
Test log - https://github.com/intel/llvm/actions/runs/3072914412/jobs/4964801578
The following tests failed at the compile step: context, device, event, exceptions, kernel, multi_ptr, platform, queue, reduction.
I think KhronosGroup/SYCL-CTS#375 is going to address a part of these failures, but not all of them.

I've added all failing tests to the filter to exclude them from the build:
Compiler commit - c25703d
Test log - https://github.com/intel/llvm/actions/runs/3073539738/jobs/4965828019
optional_kernel_features failed at the link step.

ptxas fatal   : Unresolved extern function '_Z18__spirv_AtomicIAddPyN5__spv5Scope4FlagENS0_19MemorySemanticsMask4FlagEy'
llvm-foreach: 
ptxas fatal   : Unresolved extern function '_Z18__spirv_AtomicIAddPyN5__spv5Scope4FlagENS0_19MemorySemanticsMask4FlagEy'
llvm-foreach: 
clang-16: error: ptxas command failed with exit code 255 (use -v to see invocation)

I've excluded optional_kernel_features from the build in:
Compiler commit - 2d3e84d
Test log - https://github.com/intel/llvm/actions/runs/3073804267/jobs/4966256628
The build failed with the error: ptxas fatal : Internal error: reference to deleted symbol. It was not clear which tests were causing this issue, so I started by disabling the vector_* tests.

I've excluded vector tests from the build in:
Compiler commit - 7d782f3
Test log - https://github.com/intel/llvm/actions/runs/3074051359/jobs/4966656652
Unfortunately, it didn't help and the build failed with the same error message.

Next I excluded accessor and accessor_legacy tests from the build in:
Compiler commit - cb46c14
Test log - https://github.com/intel/llvm/actions/runs/3074244346/jobs/4966997247
That fixed the build, but execution of kernel_bundle tests aborted.

/__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/../../util/proxy.h:47: FAILED:
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

I've excluded kernel_bundle tests from the build in:
Compiler commit - 6516063
Test log - https://github.com/intel/llvm/actions/runs/3074416382/jobs/4967284648
Tests were killed after 6 hours. The latest output was emitted by specialization_constants.

I've excluded specialization_constants tests from the build in:
Compiler commit - 420f24b
Test log - https://github.com/intel/llvm/actions/runs/3075941019/jobs/4969835769
Tests passed and the job took 23 minutes.

TODO:

Check if we can re-enable the vector tests and one of the accessor tests. AFAIK, these tests take a lot of time to compile and execute.
Enable CTS testing for other back-ends.
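The investigation above amounts to growing the exclude filter step by step until both the build and the run succeed. The sketch below assembles filter content from those steps, assuming the one-category-per-line format visible in the `devops/cts_exclude_filter` diff. `buildExcludeFilter` is a hypothetical helper, and the `vector_*` step is left out because the comment does not list the exact test names.

```javascript
// Hypothetical helper: merges per-step exclusion lists into filter-file
// content, one test category per line, keeping first-seen order.
function buildExcludeFilter(steps) {
  const seen = new Set();
  for (const step of steps) {
    for (const name of step) {
      const trimmed = name.trim();
      if (trimmed !== '') seen.add(trimmed);
    }
  }
  return seen.size === 0 ? '' : [...seen].join('\n') + '\n';
}

// Exclusion steps as described in the status comment above
// (the vector_* step is omitted; its exact test names are not given).
const exclusionSteps = [
  ['context', 'device', 'event', 'exceptions', 'kernel', 'multi_ptr',
   'platform', 'queue', 'reduction'],        // failed at the compile step
  ['optional_kernel_features'],              // failed at the link step
  ['accessor', 'accessor_legacy'],           // worked around the ptxas error
  ['kernel_bundle'],                         // aborted at run time
  ['specialization_constants'],              // ran into the 6 hour limit
];
console.log(buildExcludeFilter(exclusionSteps));
```

Keeping the reason next to each step makes it easier to revisit individual exclusions later, which is what the TODO items call for.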

pvchupin · 2022-09-20T14:56:45Z

devops/scripts/generate_test_matrix.js

+
+      const ctsConfigs = inputs.cts_config.split(';');
+
+      const enabledCTSConfigs = [];
+
+      testConfigs.cts.forEach(v => {
+        if (ctsConfigs.includes(v.config)) {
+          if (needsDrivers) {
+            v["env"] = {
+              "compute_runtime_tag" :
+                  driverNew["linux"]["compute_runtime"]["github_tag"],
+              "igc_tag" : driverNew["linux"]["igc"]["github_tag"],
+              "cm_tag" : driverNew["linux"]["cm"]["github_tag"],
+              "tbb_tag" : driverNew["linux"]["tbb"]["github_tag"],
+              "cpu_tag" : driverNew["linux"]["oclcpu"]["github_tag"],
+              "fpgaemu_tag" : driverNew["linux"]["fpgaemu"]["github_tag"],
+            };
+          } else {
+            v["env"] = {};
+          }
+          enabledCTSConfigs.push(v);
+        }
+      });
+
+      let ctsString = JSON.stringify(enabledCTSConfigs);
+      console.log(ctsString);
+
+      for (let [key, value] of Object.entries(inputs)) {
+        ctsString = ctsString.replaceAll("${{ inputs." + key + " }}", value);
+      }
+      if (needsDrivers) {
+        ctsString = ctsString.replaceAll(
+            "ghcr.io/intel/llvm/ubuntu2004_intel_drivers:latest",
+            "ghcr.io/intel/llvm/ubuntu2004_base:latest");
+      }
+
+      core.setOutput('cts_matrix', ctsString);


Please integrate all of that into the code above (see how lts_matrix and lts_aws_matrix are handled together).

I'm not sure if it makes sense to "integrate" CTS code into llvm-test-suite. I see that @yinyangsx duplicated the configuration. I guess it's done to be able to configure two independent execution environments for running llvm-test-suite and CTS.
I suggest we try to refactor this code in a separate PR. Are you okay with it?

I'm OK with a separate PR. Thanks. I think the code will be cleaner after refactoring.
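For reference when refactoring, the placeholder substitution in the hunk above can be isolated into a small function. This is a standalone sketch rather than the script's actual code: `substituteInputs` and the sample image name are illustrative assumptions, and only the `${{ inputs.<key> }}` placeholder form is taken from the hunk.

```javascript
// Hypothetical standalone version of the placeholder substitution step:
// serializes the configs, replaces "${{ inputs.<key> }}" with input values,
// and parses the result back into objects.
function substituteInputs(configs, inputs) {
  let text = JSON.stringify(configs);
  for (const [key, value] of Object.entries(inputs)) {
    const placeholder = '${{ inputs.' + key + ' }}';
    // Escape the value so that quotes or backslashes keep the JSON valid.
    const escaped = JSON.stringify(String(value)).slice(1, -1);
    // A function replacer avoids special handling of "$" in the replacement.
    text = text.replaceAll(placeholder, () => escaped);
  }
  return JSON.parse(text);
}

// Usage: the image name below is a made-up example value.
const ctsConfigs = [{
  config: 'cuda',
  image: '${{ inputs.cuda_image }}',
  container_options: '--gpus all',
}];
const resolved = substituteInputs(ctsConfigs, { cuda_image: 'example/cuda-image:tag' });
console.log(resolved[0].image);
```

A shared helper along these lines could serve both the llvm-test-suite and the CTS matrices, which is one way to remove the duplication discussed in this thread.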

pvchupin · 2022-09-20T14:57:31Z

devops/actions/khronos_cts_test/action.yml

+      # TODO remove workaround of FPGA emu bug
+      mkdir -p icd
+      echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
+      echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
+      echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd


Is it needed?

It's needed for execution on OpenCL GPU/CPU devices. I can add this configuration right away. Are you okay with that?

Let's start with CUDA first, but I think we want to add Level Zero GPU or OpenCL CPU, or maybe both as a reference.
Hard to estimate impact on the runners available.
Can be done in a separate patch.

Yin Yang added 6 commits July 14, 2022 14:49

enable SYCL CTS test

9380b6e

Signed-off-by: Yin Yang <yin.yang@intel.com>

add exclude filter

f4e6f74

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

2a770cb

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

2bd0da9

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

365d677

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

852330b

Signed-off-by: Yin Yang <yin.yang@intel.com>

keryell reviewed Jul 14, 2022

devops/cts_exclude_filter (outdated, resolved)

Yin Yang added 10 commits July 15, 2022 00:43

fix

a88c951

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

6669347

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

52820f7

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

03fa149

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

0332b63

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

ca8a13c

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

28028f1

Signed-off-by: Yin Yang <yin.yang@intel.com>

fix

7324ca3

Signed-off-by: Yin Yang <yin.yang@intel.com>

Add sycl cts as post comment test

64248af

Signed-off-by: Yin Yang <yin.yang@intel.com>

remove host_task/host_task_interop_api.cpp

2f76907

Signed-off-by: Yin Yang <yin.yang@intel.com>

bader mentioned this pull request Jul 21, 2022

[SYCL][ABI-break] Remove kernel::get_work_group_info #6414

Merged

remove group

9b98571

Signed-off-by: Yin Yang <yin.yang@intel.com>

yinyangsx force-pushed the enable_cts branch from 6d1e43f to 9b98571 Compare July 22, 2022 02:50

Yin Yang added 4 commits July 22, 2022 14:35

remove kernel

a84cfe8

Signed-off-by: Yin Yang <yin.yang@intel.com>

remove math_builtin_api due to compile fail

84f0078

Signed-off-by: Yin Yang <yin.yang@intel.com>

remove nd_item and optional_kernel_features due to compile fail

406be61

Signed-off-by: Yin Yang <yin.yang@intel.com>

remove accessor_api_image_core.cpp

7f40660

Signed-off-by: Yin Yang <yin.yang@intel.com>

bader changed the title ~~Enable CUDA SYCL CTS tests~~ [CI] Enable CUDA SYCL CTS tests Jul 26, 2022

bader reviewed Jul 26, 2022

fix

da24487

Signed-off-by: Yin Yang <yin.yang@intel.com>

bader reviewed Aug 1, 2022

fix

ad2e277

Signed-off-by: Yin Yang <yin.yang@intel.com>

bader added 10 commits September 15, 2022 08:39

Fix test matrix generation job.

8fa06e9

Revert "Fix test matrix generation job."

8d9186e

This reverts commit 8fa06e9.

Add dpcpp toolchain to the PATH to find sycl-ls.

e41485c

Clear filter file to check the status of all tests

f553870

Add failing tests to the filter file

c25703d

optional_kernel_features fails at linking

2d3e84d

Filter vector tests

7d782f3

There is an internal error in the ptxas tool: ptxas fatal : Internal error: reference to deleted symbol. It's unclear what is causing this problem. Let's disable some tests to work around this issue.

Exclude accessor tests.

cb46c14

Disable kernel_bundle test

6516063

Test aborts on NVIDIA GPU: tests/kernel_bundle/../common/../../util/proxy.h:47: FAILED: due to a fatal error condition: SIGABRT - Abort (abnormal termination) signal

Exclude spec constants test to reduce execution time

420f24b

bader previously approved these changes Sep 18, 2022

bader requested review from keryell and pvchupin and removed request for keryell September 20, 2022 07:36

pvchupin reviewed Sep 20, 2022

Apply code review.

bcbc050

bader dismissed their stale review via bcbc050 September 20, 2022 18:17

bader requested review from keryell and pvchupin and removed request for keryell September 20, 2022 18:19

pvchupin previously approved these changes Sep 20, 2022

Attempt to fix CI.

1326960

bader dismissed pvchupin’s stale review via 1326960 September 20, 2022 19:18

bader added 2 commits September 21, 2022 09:16

Revert "Attempt to fix CI."

50dd318

This reverts commit 1326960.

Disable device_selector due to failed assertions.

36ec191

pvchupin reviewed Sep 21, 2022

devops/cts_exclude_filter (resolved)

bader merged commit 4a440da into intel:sycl Sep 21, 2022