Conversation

@yinyangsx

No description provided.

Yin Yang added 6 commits July 14, 2022 14:49
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
Yin Yang added 10 commits July 15, 2022 00:43
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
Yin Yang added 4 commits July 22, 2022 14:35
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
Signed-off-by: Yin Yang <yin.yang@intel.com>
@bader bader changed the title Enable CUDA SYCL CTS tests [CI] Enable CUDA SYCL CTS tests Jul 26, 2022
@bader
Contributor

bader commented Jul 26, 2022

Here is the summary of the SYCL-CTS execution on NVIDIA GPU.

Tests we can't build:
buffer
exceptions
multi_ptr
opencl_interop
reduction
group
kernel
math_builtin_api
nd_item
optional_kernel_features

Many tests fail with the error below:

2022-07-26T03:36:28.2377097Z get_kernel_bundle_with_incompatible_kernels
2022-07-26T03:36:28.2377274Z -------------------------------------------------------------------------------
2022-07-26T03:36:28.2377387Z /__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/../../util/proxy.h:35
2022-07-26T03:36:28.2377462Z ...............................................................................
2022-07-26T03:36:28.2377466Z 
2022-07-26T03:36:28.2377584Z /__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/assertions.h:85: FAILED:
2022-07-26T03:36:28.2377646Z explicitly with message:
2022-07-26T03:36:28.2377762Z   No expected exception thrown for sycl::get_kernel_bundle(context, devices,
2022-07-26T03:36:28.2377833Z   kernel_ids) with bundle state input
2022-07-26T03:36:28.2377838Z 
2022-07-26T03:36:28.2377873Z 
2022-07-26T03:36:28.2377930Z PI CUDA ERROR:
2022-07-26T03:36:28.2377984Z                Value:           700
2022-07-26T03:36:28.2378054Z                Name:            CUDA_ERROR_ILLEGAL_ADDRESS
2022-07-26T03:36:28.2378152Z                Description:     an illegal memory access was encountered
2022-07-26T03:36:28.2378217Z                Function:        build_program
2022-07-26T03:36:28.2378326Z                Source Location: /__w/llvm/llvm/src/sycl/plugins/cuda/pi_cuda.cpp:703
2022-07-26T03:36:28.2378331Z 
2022-07-26T03:36:28.2378335Z 
2022-07-26T03:36:28.2378386Z PI CUDA ERROR:
2022-07-26T03:36:28.2378438Z                Value:           400
2022-07-26T03:36:28.2378512Z                Name:            CUDA_ERROR_INVALID_HANDLE
2022-07-26T03:36:28.2378590Z                Description:     invalid resource handle
2022-07-26T03:36:28.2378661Z                Function:        cuda_piProgramRelease
2022-07-26T03:36:28.2378770Z                Source Location: /__w/llvm/llvm/src/sycl/plugins/cuda/pi_cuda.cpp:3546
2022-07-26T03:36:28.2378775Z 
2022-07-26T03:36:28.2379013Z terminate called after throwing an instance of 'cl::sycl::runtime_error'
2022-07-26T03:36:28.2379294Z   what():  Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

@yinyangsx, I think something is wrong either with the HW or with the CUDA driver installation.
@AerialMantis, any other ideas?

@pvchupin, FYI.
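For triage, the "PI CUDA ERROR" records in a log like the one above can be pulled out mechanically. The following is only a sketch: `parsePiCudaErrors` is a hypothetical helper, not part of the CI scripts, and it assumes the record layout seen in this log (a `PI CUDA ERROR:` line followed by indented `Field: value` lines, each prefixed with a GitHub Actions timestamp).

```javascript
// Hypothetical helper: collect "PI CUDA ERROR" records from a CI log.
// The field names (Value, Name, Function, ...) come from the log above;
// the parsing rules are an assumption based on that one sample.
function parsePiCudaErrors(logText) {
  const errors = [];
  let current = null;
  for (const rawLine of logText.split("\n")) {
    // Drop the leading timestamp that GitHub Actions adds to each line.
    const line = rawLine.replace(/^\d{4}-\d{2}-\d{2}T[\d:.]+Z ?/, "");
    if (line.trim() === "PI CUDA ERROR:") {
      current = {};
      errors.push(current);
      continue;
    }
    if (!current) continue;
    // Indented "Field: value" lines belong to the current record.
    const match = line.match(/^\s+([A-Za-z ]+?):\s+(.*)$/);
    if (match) {
      current[match[1]] = match[2].trim();
    } else if (line.trim() !== "") {
      current = null; // any other non-empty line ends the record
    }
  }
  return errors;
}

// Usage with a shortened copy of the log above.
const sampleLog = [
  "2022-07-26T03:36:28.2377930Z PI CUDA ERROR:",
  "2022-07-26T03:36:28.2377984Z                Value:           700",
  "2022-07-26T03:36:28.2378054Z                Name:            CUDA_ERROR_ILLEGAL_ADDRESS",
  "2022-07-26T03:36:28.2378217Z                Function:        build_program",
  "2022-07-26T03:36:28.2378331Z ",
  "2022-07-26T03:36:28.2378386Z PI CUDA ERROR:",
  "2022-07-26T03:36:28.2378438Z                Value:           400",
  "2022-07-26T03:36:28.2378512Z                Name:            CUDA_ERROR_INVALID_HANDLE",
  "2022-07-26T03:36:28.2378661Z                Function:        cuda_piProgramRelease",
  "2022-07-26T03:36:28.2379013Z terminate called after throwing an instance of 'cl::sycl::runtime_error'",
].join("\n");

const piErrors = parsePiCudaErrors(sampleLog);
console.log(piErrors.map((e) => `${e.Name} in ${e.Function}`));
```

Grouping the records by `Function` would show whether the failures cluster in `build_program`, which is what the log above suggests.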

Comment on lines 88 to 105
export LD_LIBRARY_PATH=$PWD/toolchain/lib/:$LD_LIBRARY_PATH
export PATH=$PWD/toolchain/bin/:$PATH
# TODO make this part of container build
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/hip/lib/:/opt/rocm/lib
export SYCL_DEVICE_FILTER=${{ matrix.sycl_device_filter }}
if [ -e /runtimes/oneapi-tbb/env/vars.sh ]; then
  source /runtimes/oneapi-tbb/env/vars.sh;
elif [ -e /opt/runtimes/oneapi-tbb/env/vars.sh ]; then
  source /opt/runtimes/oneapi-tbb/env/vars.sh;
else
  echo "no TBB vars in /opt/runtimes or /runtimes";
fi
# TODO remove workaround of FPGA emu bug
mkdir -p icd
echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd
export OCL_ICD_VENDORS=$PWD/icd
Contributor

Why is this needed? I expect the container to be already correctly configured.

Author

@yinyangsx yinyangsx Jul 29, 2022

I set this up to mimic the environment of the CUDA LLVM Test Suite job. That job currently works well, so I think the environment should be correct. Once the tests are stable, I will move this configuration into the image build.

Contributor

OK. I'll review it once it's marked as ready for review.

NOTE: Some of the tests you disabled should be fixed by KhronosGroup/SYCL-CTS#367.
I suggest checking the status with the latest SYCL-CTS sources and re-enabling the disabled tests. Ideally, we should use the filter file from the Khronos SYCL-CTS repository instead of creating a local duplicate.

fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
Contributor

@bader bader left a comment

I can't reproduce the issues from the latest CI test results on a machine with an NVIDIA GPU, but I don't use Docker containers, so I suspect the container setup is the reason for the massive failures. Something might be wrong with how the NVIDIA GPU device/driver is configured inside the container.

Comment on lines 1 to 10
buffer
exceptions
multi_ptr
opencl_interop
reduction
group
kernel
math_builtin_api
nd_item
optional_kernel_features
Contributor

@bader bader Aug 1, 2022

Suggested change
- buffer
- exceptions
- multi_ptr
- opencl_interop
- reduction
- group
- kernel
- math_builtin_api
- nd_item
- optional_kernel_features
+ buffer
+ exceptions
+ multi_ptr
+ reduction
+ vector_swizzles
+ optional_kernel_features

In addition to this list:

  • accessor_legacy will abort due to unsupported image types, so I suggest excluding it too. EDIT: for NVIDIA GPU ONLY(!). Most likely the issue exists for AMD GPUs.
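One way to keep a backend-specific exclusion such as accessor_legacy separate from the shared list is to generate the filter file per backend. This is a sketch under assumptions: the one-category-per-line format matches the file under review, but `buildFilter` and the per-backend map are hypothetical and not part of the PR.

```javascript
// Shared exclusions, taken from the suggested change above.
const baseExcludes = [
  "buffer",
  "exceptions",
  "multi_ptr",
  "reduction",
  "vector_swizzles",
  "optional_kernel_features",
];

// Hypothetical per-backend additions. Only the CUDA entry comes from the
// review comment; other backends would need their own investigation.
const backendExcludes = new Map([["cuda", ["accessor_legacy"]]]);

// Returns the filter file contents: one test category per line.
function buildFilter(backend) {
  const extra = backendExcludes.get(backend) ?? [];
  // A Set drops duplicates while keeping first-seen order.
  return [...new Set([...baseExcludes, ...extra])].join("\n") + "\n";
}

console.log(buildFilter("cuda"));
```

With this split, removing a category once an upstream SYCL-CTS fix lands touches only one list.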

],
"image": "${{ inputs.cuda_image }}",
"container_options": "--gpus all",
"sycl_device_filter": "cuda:gpu,host",
Contributor

Suggested change
- "sycl_device_filter": "cuda:gpu,host",
+ "sycl_device_filter": "ext_oneapi_cuda:gpu,host",
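If several configs carry the old backend name, the rename can be applied to the filter string programmatically. A minimal sketch, assuming the filter is a comma-separated list of `backend:device_type` entries as in the line above; `normalizeDeviceFilter` and the rename table are hypothetical, and only the `cuda` entry is taken from this suggestion.

```javascript
// Assumed rename table. Only "cuda" -> "ext_oneapi_cuda" comes from the
// suggestion above; add other entries only after checking the compiler docs.
const renamedBackends = new Map([["cuda", "ext_oneapi_cuda"]]);

function normalizeDeviceFilter(filter) {
  return filter
    .split(",")
    .map((entry) => {
      // The backend name is the part before the first ":".
      const [backend, ...rest] = entry.trim().split(":");
      const renamed = renamedBackends.get(backend) ?? backend;
      return [renamed, ...rest].join(":");
    })
    .join(",");
}

console.log(normalizeDeviceFilter("cuda:gpu,host"));
```

Entries that already use the new name pass through unchanged, so the function is safe to run more than once.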

fix
Signed-off-by: Yin Yang <yin.yang@intel.com>
There is an internal error in ptxas tool.

ptxas fatal   : Internal error: reference to deleted symbol

It's unclear what is causing this problem.
Let's disable some tests to work around this issue.
Test aborts on NVIDIA GPU

tests/kernel_bundle/../common/../../util/proxy.h:47: FAILED:
due to a fatal error condition:
SIGABRT - Abort (abnormal termination) signal
@bader
Contributor

bader commented Sep 18, 2022

@yinyangsx, @pvchupin, I addressed code review comments and enabled more tests. There are a few issues with compilation and running CTS, which I'm going to report via GitHub issues.

bader
bader previously approved these changes Sep 18, 2022
@bader
Contributor

bader commented Sep 18, 2022

Current status of issues for SYCL-CTS on NVIDIA GPU.

I started by building a clean version of Khronos SYCL CTS:

Compiler commit - f553870
Test log - https://github.com/intel/llvm/actions/runs/3072914412/jobs/4964801578
Following tests failed at compile step: context, device, event, exceptions, kernel, multi_ptr, platform, queue, reduction.
I think KhronosGroup/SYCL-CTS#375 is going to address a part of these failures, but not all of them.

I've added all failing tests to the filter to exclude them from the build:
Compiler commit - c25703d
Test log - https://github.com/intel/llvm/actions/runs/3073539738/jobs/4965828019
optional_kernel_features failed at link step.

ptxas fatal   : Unresolved extern function '_Z18__spirv_AtomicIAddPyN5__spv5Scope4FlagENS0_19MemorySemanticsMask4FlagEy'
llvm-foreach: 
ptxas fatal   : Unresolved extern function '_Z18__spirv_AtomicIAddPyN5__spv5Scope4FlagENS0_19MemorySemanticsMask4FlagEy'
llvm-foreach: 
clang-16: error: ptxas command failed with exit code 255 (use -v to see invocation)

I've excluded optional_kernel_features from the build in:
Compiler commit - 2d3e84d
Test log - https://github.com/intel/llvm/actions/runs/3073804267/jobs/4966256628
The build failed with "ptxas fatal : Internal error: reference to deleted symbol". It was not clear which tests were causing this issue, so I started by disabling the vector_* tests.

I've excluded vector tests from the build in:
Compiler commit - 7d782f3
Test log - https://github.com/intel/llvm/actions/runs/3074051359/jobs/4966656652
Unfortunately, that didn't help: the build failed with the same error message.

Next I excluded accessor and accessor_legacy tests from the build in:
Compiler commit - cb46c14
Test log - https://github.com/intel/llvm/actions/runs/3074244346/jobs/4966997247
That fixed the build, but execution of kernel_bundle tests aborted.

/__w/llvm/llvm/khronos_sycl_cts/tests/kernel_bundle/../common/../../util/proxy.h:47: FAILED:
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

I've excluded kernel_bundle tests from the build in:
Compiler commit - 6516063
Test log - https://github.com/intel/llvm/actions/runs/3074416382/jobs/4967284648
Tests were killed after 6 hours. The latest output was emitted by specialization_constants.

I've excluded specialization_constants tests from the build in:
Compiler commit - 420f24b
Test log - https://github.com/intel/llvm/actions/runs/3075941019/jobs/4969835769
Tests passed and the job took 23 minutes.

TODO:

  1. Check if we can re-enable vector tests and one of accessor tests. AFAIK, these tests take a lot of time to compile and execute.
  2. Enable CTS testing for other back-ends
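The exclude-and-retry procedure in this status report can be summarized as a loop. The sketch below is illustrative only: `excludeUntilGreen` and `fakeRunCts` are hypothetical, each real "run" was a full CI job, and the actual investigation was less tidy (excluding the vector tests, for example, did not change the outcome).

```javascript
// Hypothetical driver for the "exclude and retry" loop described above.
// `runCts` stands in for a full CI build-and-run; it is not a real API.
function excludeUntilGreen(runCts, initialExcludes, maxRounds = 10) {
  const excludes = [...initialExcludes];
  const history = [];
  for (let round = 0; round < maxRounds; round++) {
    const result = runCts(excludes);
    history.push({ excludes: [...excludes], outcome: result.outcome });
    if (result.outcome === "pass") return { excludes, history };
    // Exclude whatever the failed run points at, then try again.
    excludes.push(...result.suspects);
  }
  throw new Error("CTS still failing after " + maxRounds + " rounds");
}

// Stub that fails until every category in `knownBad` is excluded.
const knownBad = [
  "optional_kernel_features",
  "accessor",
  "kernel_bundle",
  "specialization_constants",
];
function fakeRunCts(excludes) {
  const next = knownBad.find((category) => !excludes.includes(category));
  return next === undefined
    ? { outcome: "pass", suspects: [] }
    : { outcome: "fail", suspects: [next] };
}

// Start from the categories that failed at the compile step.
const bisect = excludeUntilGreen(fakeRunCts, [
  "context", "device", "event", "exceptions", "kernel",
  "multi_ptr", "platform", "queue", "reduction",
]);
console.log(bisect.excludes);
```

Keeping the per-round history makes it easier to see later which exclusions were actually necessary and which (like the vector tests in TODO item 1) might be re-enabled.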

@bader bader requested review from keryell and pvchupin and removed request for keryell September 20, 2022 07:36
Comment on lines +72 to +108

const ctsConfigs = inputs.cts_config.split(';');

const enabledCTSConfigs = [];

testConfigs.cts.forEach(v => {
  if (ctsConfigs.includes(v.config)) {
    if (needsDrivers) {
      v["env"] = {
        "compute_runtime_tag" :
            driverNew["linux"]["compute_runtime"]["github_tag"],
        "igc_tag" : driverNew["linux"]["igc"]["github_tag"],
        "cm_tag" : driverNew["linux"]["cm"]["github_tag"],
        "tbb_tag" : driverNew["linux"]["tbb"]["github_tag"],
        "cpu_tag" : driverNew["linux"]["oclcpu"]["github_tag"],
        "fpgaemu_tag" : driverNew["linux"]["fpgaemu"]["github_tag"],
      };
    } else {
      v["env"] = {};
    }
    enabledCTSConfigs.push(v);
  }
});

let ctsString = JSON.stringify(enabledCTSConfigs);
console.log(ctsString);

for (let [key, value] of Object.entries(inputs)) {
  ctsString = ctsString.replaceAll("${{ inputs." + key + " }}", value);
}
if (needsDrivers) {
  ctsString = ctsString.replaceAll(
      "ghcr.io/intel/llvm/ubuntu2004_intel_drivers:latest",
      "ghcr.io/intel/llvm/ubuntu2004_base:latest");
}

core.setOutput('cts_matrix', ctsString);
Contributor

Please integrate all of that into the code above (see how lts_matrix and lts_aws_matrix are handled together).

Contributor

I'm not sure it makes sense to "integrate" the CTS code into llvm-test-suite. I see that @yinyangsx duplicated the configuration. I guess that was done to be able to configure two independent execution environments for running llvm-test-suite and CTS.
I suggest we try to refactor this code in a separate PR. Are you okay with that?

Contributor

I'm OK with a separate PR. Thanks. I think the code will be cleaner after refactoring.
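For whoever picks up that refactoring, the placeholder substitution used in the snippet under review can be exercised in isolation. This is a sketch, not the workflow code: `expandInputs` is a hypothetical wrapper around the same `replaceAll` loop, and the config and image name below are made up for illustration.

```javascript
// Replace "${{ inputs.<key> }}" placeholders in a JSON string, mirroring the
// loop in the reviewed snippet. Requires a runtime with String.replaceAll.
function expandInputs(jsonText, inputs) {
  let text = jsonText;
  for (const [key, value] of Object.entries(inputs)) {
    text = text.replaceAll("${{ inputs." + key + " }}", value);
  }
  return text;
}

// Made-up config and input value, for illustration only.
const configs = [{ config: "cuda", image: "${{ inputs.cuda_image }}" }];
const expanded = expandInputs(JSON.stringify(configs), {
  cuda_image: "example.invalid/cuda:latest",
});
console.log(JSON.parse(expanded)[0].image);
```

Because the replacement runs on the serialized JSON, an input value containing a double quote or a backslash would produce invalid JSON. Substituting on the parsed objects before calling `JSON.stringify` would avoid that, at the cost of walking the structure.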

Comment on lines +91 to +95
# TODO remove workaround of FPGA emu bug
mkdir -p icd
echo /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so > icd/gpu.icd
echo /runtimes/oclcpu/x64/libintelocl.so > icd/cpu.icd
echo /opt/runtimes/oclcpu/x64/libintelocl.so > icd/cpu2.icd
Contributor

Is it needed?

Contributor

It's needed for execution on OpenCL GPU/CPU devices. I can add this configuration right away. Are you okay with that?

Contributor

Let's start with CUDA first, but I think we want to add Level Zero GPU or OpenCL CPU, or maybe both, as a reference.
It's hard to estimate the impact on the available runners.
This can be done in a separate patch.

@bader bader requested review from keryell and pvchupin and removed request for keryell September 20, 2022 18:19
pvchupin
pvchupin previously approved these changes Sep 20, 2022
@bader bader merged commit 4a440da into intel:sycl Sep 21, 2022

6 participants