LIT tests are hanging on CUDA sporadically #1919
I have not seen anything like this locally. Wondering if it could be connected to the driver version, as I am using …
@vladimirlaz, @tfzhu, could you check if updating the CUDA driver to … helps?

@vladimirlaz, @pvchupin, can we document the CUDA driver version in https://github.com/intel/llvm/blob/sycl/buildbot/dependency.conf?
We uplifted the CUDA driver to the latest one (450.36.06-1) and the CUDA toolchain to 10.2. I have not seen hangs after the change, but since then I see a warning at compilation time; the supported CUDA version looks to be hardcoded in ./clang/include/clang/Basic/Cuda.h.

$ apt list --installed | grep cuda
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
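A quick way to cross-check the installed driver and toolkit against what clang supports is sketched below; it assumes nvidia-smi and nvcc are on PATH and that the commands are run from the intel/llvm checkout.

```sh
# Report the installed NVIDIA driver and CUDA toolkit versions.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version

# The newest CUDA release clang knows about is part of the CudaVersion
# enum; grep the header mentioned above to see which versions are listed.
grep -n "CUDA_10" clang/include/clang/Basic/Cuda.h
```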
I have updated the dependency.conf file and getStartedGuide.md to reflect the recent changes in the CUDA CI environment: https://github.com/intel/llvm/pull/1974/files. No hangs since uplifting the driver version.
Clang officially supports only CUDA 10.1 (for example, for its own PTX output) and therefore prints a warning when a newer CUDA version is used, but it does not block its use.
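If the warning is just noise on an otherwise working setup, it can usually be suppressed per compilation; the sketch below assumes a clang recent enough to have the -Wunknown-cuda-version diagnostic group and uses the CUDA SYCL target triple from the getting-started guide of that time (the source and output file names are hypothetical).

```sh
# Suppress the "Unknown CUDA version" warning when building for the CUDA backend.
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
        -Wno-unknown-cuda-version simple-sycl-app.cpp -o simple-sycl-app
```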
The problem is seen roughly once per 20 runs. The NVIDIA card goes into a faulty state after the hang.
For example, the next test job shows the following status:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:01:00.0 Off | N/A |
|ERR! 48C P0 ERR! / 280W | 288MiB / 24190MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31576 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
| 0 31586 C - 19MiB |
| 0 31593 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 12MiB |
+-----------------------------------------------------------------------------+
```
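To reproduce the roughly one-in-twenty hang without wedging an entire CI run, the lit tests can be repeated with a per-test timeout; this is only a sketch, the build-tree paths are assumptions, and lit's --timeout option requires the psutil Python package.

```sh
# Re-run the SYCL lit tests several times; a hanging test is killed and
# reported as failed instead of blocking the whole job.
for i in $(seq 1 20); do
  ./bin/llvm-lit --timeout=600 -sv tools/sycl/test || break
done
```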
Other faulty processes observed over the last week:
```
| 0 28178 C ...s/buffer/Output/reinterpret.cpp.tmp.out 123MiB |
| 0 21227 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 5074 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 9151 C ...sts/Output/access_to_subset.cpp.tmp.out 123MiB |
| 0 30549 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
```
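When a job leaves the card in this state, nvidia-smi can also be used to spot the leftover compute processes and attempt a device reset; this is a sketch, assuming root access, that the stuck processes have already been killed, and that the GPU supports --gpu-reset (on some machines reloading the nvidia kernel modules or rebooting is the only recovery).

```sh
# List compute processes still holding GPU memory after a hang.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Attempt to reset GPU 0 once no clients are left.
sudo nvidia-smi --gpu-reset -i 0
```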