LIT tests are hanging on CUDA sporadically #1919

Closed
vladimirlaz opened this issue Jun 18, 2020 · 5 comments · Fixed by #1974
Labels
cuda CUDA back-end

Comments

@vladimirlaz
Contributor

The problem is seen about once per 20 runs. The NVIDIA card goes into a faulty state after the hang.

e.g.

{noformat}
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:01:00.0 Off | N/A |
|ERR! 48C P0 ERR! / 280W | 288MiB / 24190MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31576 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
| 0 31586 C - 19MiB |
| 0 31593 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 12MiB |
+-----------------------------------------------------------------------------+
{noformat}

Other faulty processes seen over the last week:
| 0 28178 C ...s/buffer/Output/reinterpret.cpp.tmp.out 123MiB |
| 0 21227 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 5074 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 9151 C ...sts/Output/access_to_subset.cpp.tmp.out 123MiB |
| 0 30549 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
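
A CI job could spot this state automatically by scanning the `nvidia-smi` text for `ERR!` fields and listing the compute processes still attached to the GPU. The following is only a minimal sketch: the helper names `gpu_looks_faulty` and `lingering_processes` are hypothetical, the sample text is abridged from the output above, and the row pattern assumes the table layout printed by driver 418.87.00.

```python
import re

# Abridged from the nvidia-smi output quoted in this issue.
SAMPLE = """\
|   0  TITAN RTX           On   | 00000000:01:00.0 Off |                  N/A |
|ERR!   48C    P0    ERR! / 280W |    288MiB / 24190MiB |      0%      Default |
|    0     31576      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests   123MiB |
|    0     31586      C   -                                             19MiB |
"""

# Assumed layout of a process row: GPU index, PID, type, process name, memory.
PROCESS_ROW = re.compile(
    r"^\|\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s*\|\s*$"
)


def gpu_looks_faulty(smi_text):
    """Return True if any line of the report contains an ERR! field."""
    return any("ERR!" in line for line in smi_text.splitlines())


def lingering_processes(smi_text):
    """Return (pid, process name) pairs for every process row in the report."""
    procs = []
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line)
        if match:
            procs.append((int(match.group(2)), match.group(4)))
    return procs


if __name__ == "__main__":
    if gpu_looks_faulty(SAMPLE):
        for pid, name in lingering_processes(SAMPLE):
            print(pid, name)
```

Taking the report as a string keeps the check independent of how the text is captured, so it can be run against saved logs as well as live output.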

@bjoernknafla
Contributor

I have not seen anything like this locally. Could it be connected to the driver version? I am using 440.33.01 (supports CUDA 10.1 and 10.2) locally instead of 418.87.00.

@bader
Contributor

bader commented Jun 23, 2020

I have not seen anything like this locally. Could it be connected to the driver version? I am using 440.33.01 (supports CUDA 10.1 and 10.2) locally instead of 418.87.00.

@vladimirlaz, @tfzhu, could you check if updating CUDA driver to 440.33.01 fixes the issue, please?

@vladimirlaz, @pvchupin, can we document the CUDA driver version in https://github.com/intel/llvm/blob/sycl/buildbot/dependency.conf?

@vladimirlaz
Contributor Author

We uplifted the CUDA driver to the latest one (450.36.06-1) and the CUDA toolchain to 10.2.

I have not seen hangs since the change, but I now see this warning at compile time:
clang-11: warning: Unknown CUDA version 10.2. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]

It looks like this is hardcoded in ./clang/include/clang/Basic/Cuda.h:
LATEST_SUPPORTED = CUDA_101
Would it be safe to uplift it to CUDA_102 or CUDA_110?

$ apt list --installed | grep cuda

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-command-line-tools-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-compiler-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-cudart-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cudart-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cufft-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cufft-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cuobjdump-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cupti-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cupti-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-curand-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-curand-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cusolver-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cusolver-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cusparse-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-cusparse-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-documentation-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-driver-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-drivers/unknown,now 450.36.06-1 amd64 [installed]
cuda-drivers-450/unknown,now 450.36.06-1 amd64 [installed,automatic]
cuda-gdb-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-libraries-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-libraries-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-license-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-memcheck-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-misc-headers-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-npp-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-npp-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nsight-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nsight-compute-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nsight-systems-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvcc-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvdisasm-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvgraph-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvgraph-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvjpeg-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvjpeg-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvml-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvprof-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvprune-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-nvrtc-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvrtc-dev-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvtx-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-nvvp-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01/now 1.0-1 amd64 [installed,local]
cuda-samples-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-sanitizer-api-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]
cuda-toolkit-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,auto-removable]
cuda-tools-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed]
cuda-visual-tools-10-2/unknown,unknown,now 10.2.89-1 amd64 [installed,automatic]

@vladimirlaz
Contributor Author

vladimirlaz commented Jun 25, 2020

I have updated the dependency.conf file and getStartedGuide.md to reflect the recent changes in the CUDA CI environment: https://github.com/intel/llvm/pull/1974/files.

No hangs since uplifting the driver version.

@bjoernknafla
Contributor

Clang officially only supports CUDA 10.1 (for example for its own PTX output) and therefore prints a warning when using a newer CUDA version but doesn't block its use.
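
The warn-but-continue behaviour described here can be modelled in a few lines. This is an illustrative sketch only, not Clang's actual code: `effective_cuda_version` is a hypothetical helper, versions are represented as `(major, minor)` tuples by assumption, and the constant simply mirrors the `LATEST_SUPPORTED = CUDA_101` value quoted earlier in the thread.

```python
import warnings

# Assumption: mirrors LATEST_SUPPORTED = CUDA_101 as quoted in this thread.
LATEST_SUPPORTED = (10, 1)


def effective_cuda_version(detected):
    """Clamp a newer, unknown version to the latest supported one.

    A warning is issued, but the build is not blocked.
    """
    if detected > LATEST_SUPPORTED:
        warnings.warn(
            "Unknown CUDA version {}.{}. Assuming the latest supported "
            "version {}.{}".format(*detected, *LATEST_SUPPORTED)
        )
        return LATEST_SUPPORTED
    return detected


if __name__ == "__main__":
    # Emits a warning and falls back to the latest supported version.
    print(effective_cuda_version((10, 2)))
```

Under this model, a 10.2 toolchain is still used for the build; only the version the compiler assumes is capped at 10.1.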

@bader bader linked a pull request Jun 25, 2020 that will close this issue
bb-sycl pushed a commit that referenced this issue Apr 4, 2023
Ensure that ExprLB is non-NULL before using it.

Signed-off-by: Lu, John <john.lu@intel.com>

Original commit:
KhronosGroup/SPIRV-LLVM-Translator@b30a2d2