LIT tests are hanging on CUDA sporadically #1919
I have not seen anything like this locally. Wondering if it could be connected to the driver version, as I am using …
@vladimirlaz, @tfzhu, could you check if updating the CUDA driver to … helps?

@vladimirlaz, @pvchupin, can we document the CUDA driver version in https://github.com/intel/llvm/blob/sycl/buildbot/dependency.conf?
We uplifted the CUDA driver to the latest one (450.36.06-1) and the CUDA toolchain to 10.2. I have not seen hangs after the change, but since then I see a warning at compilation time; the supported CUDA version looks to be hardcoded in ./clang/include/clang/Basic/Cuda.h.

$ apt list --installed | grep cuda
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
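A quick way to cross-check the installed driver and toolkit against what clang supports is sketched below; it assumes nvidia-smi and nvcc are on PATH and that the commands are run from the intel/llvm checkout.

```sh
# Report the installed NVIDIA driver and CUDA toolkit versions.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version

# The newest CUDA release clang knows about is part of the CudaVersion
# enum; grep the header mentioned above to see which versions are listed.
grep -n "CUDA_10" clang/include/clang/Basic/Cuda.h
```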
I have updated the dependency.conf file and getStartedGuide.md to reflect the recent changes in the CUDA CI environment: https://github.com/intel/llvm/pull/1974/files. No hangs since uplifting the driver version.
Clang officially supports only CUDA 10.1 (for example, for its own PTX output) and therefore prints a warning when a newer CUDA version is used, but it does not block its use.
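If the warning is just noise on an otherwise working setup, it can usually be suppressed per compilation; the sketch below assumes a clang recent enough to have the -Wunknown-cuda-version diagnostic group and uses the CUDA SYCL target triple from the getting-started guide of that time (the source and output file names are hypothetical).

```sh
# Suppress the "Unknown CUDA version" warning when building for the CUDA backend.
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
        -Wno-unknown-cuda-version simple-sycl-app.cpp -o simple-sycl-app
```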
The problem is seen roughly once per 20 runs. The NVIDIA card goes into a faulty state after the hang.
For example, the next test job shows the following status:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:01:00.0 Off | N/A |
|ERR! 48C P0 ERR! / 280W | 288MiB / 24190MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31576 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
| 0 31586 C - 19MiB |
| 0 31593 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 12MiB |
+-----------------------------------------------------------------------------+
```
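To reproduce the roughly one-in-twenty hang without wedging an entire CI run, the lit tests can be repeated with a per-test timeout; this is only a sketch, the build-tree paths are assumptions, and lit's --timeout option requires the psutil Python package.

```sh
# Re-run the SYCL lit tests several times; a hanging test is killed and
# reported as failed instead of blocking the whole job.
for i in $(seq 1 20); do
  ./bin/llvm-lit --timeout=600 -sv tools/sycl/test || break
done
```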
Other faulty processes observed over the last week:
```
| 0 28178 C ...s/buffer/Output/reinterpret.cpp.tmp.out 123MiB |
| 0 21227 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 5074 C ...c_tests/Output/device_event.cpp.tmp.run 123MiB |
| 0 9151 C ...sts/Output/access_to_subset.cpp.tmp.out 123MiB |
| 0 30549 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
```
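When a job leaves the card in this state, nvidia-smi can also be used to spot the leftover compute processes and attempt a device reset; this is a sketch, assuming root access, that the stuck processes have already been killed, and that the GPU supports --gpu-reset (on some machines reloading the nvidia kernel modules or rebooting is the only recovery).

```sh
# List compute processes still holding GPU memory after a hang.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Attempt to reset GPU 0 once no clients are left.
sudo nvidia-smi --gpu-reset -i 0
```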