
[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version #40911

Closed
iarspider opened this issue Mar 1, 2023 · 17 comments


@iarspider
Contributor

Log: link

E0301 02:32:36.968122 25 metrics.cc:506] Unable to get device UUID: Bad parameter passed to function
E0301 02:32:36.968144 25 metrics.cc:506] Unable to get device UUID: Bad parameter passed to function
(...)
E0301 02:32:37.097455 25 model_repository_manager.cc:1215] failed to load 'gat_test' version 1: Internal: failed to load model 'gat_test': CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. CUDA error: unknown error
Exception raised from currentStreamCaptureStatusMayInitCtx at ../c10/cuda/CUDAGraphsC10Utils.h:83 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x2b161f78178c in /opt/tritonserver/backends/pytorch/libc10.so)
frame #1: <unknown function> + 0x16d05 (0x2b161fa59d05 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #2: <unknown function> + 0x1b938 (0x2b161fa5e938 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #3: <unknown function> + 0x1d21c (0x2b161fa6021c in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #4: <unknown function> + 0x1d815 (0x2b161fa60815 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #5: THCStorage_resizeBytes(THCState*, c10::StorageImpl*, long) + 0x96 (0x2b162c635586 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #6: <unknown function> + 0x231cc15 (0x2b162b466c15 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #7: at::native::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x86 (0x2b162c397896 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #8: <unknown function> + 0x336c0b9 (0x2b162c4b60b9 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #9: <unknown function> + 0x336c154 (0x2b162c4b6154 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #10: <unknown function> + 0x1c0fd71 (0x2b162167fd71 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #11: at::empty_strided(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions) + 0x237 (0x2b16211bf2b7 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #12: <unknown function> + 0x13c1325 (0x2b1620e31325 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #13: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x218 (0x2b1620e32728 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #14: <unknown function> + 0x1e0b84d (0x2b162187b84d in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #15: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x17a (0x2b1621ad045a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #16: torch::jit::Unpickler::readInstruction() + 0x18cf (0x2b1623a62a9f in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #17: torch::jit::Unpickler::run() + 0xa8 (0x2b1623a633f8 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #18: torch::jit::Unpickler::parse_ivalue() + 0x32 (0x2b1623a635a2 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #19: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x456 (0x2b1623a234a6 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #20: <unknown function> + 0x3fac535 (0x2b1623a1c535 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #21: <unknown function> + 0x3faedf1 (0x2b1623a1edf1 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #22: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1ba (0x2b1623a1fd9a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #23: torch::jit::load(std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc2 (0x2b1623a21d72 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #24: torch::jit::load(std::istream&, c10::optional<c10::Device>) + 0x6a (0x2b1623a21e5a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #25: <unknown function> + 0x16d7b (0x2b16fa0dcd7b in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #26: <unknown function> + 0x1915a (0x2b16fa0df15a in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #27: <unknown function> + 0x19622 (0x2b16fa0df622 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #28: TRITONBACKEND_ModelInstanceInitialize + 0x374 (0x2b16fa0df9e4 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #29: <unknown function> + 0x3085bc (0x2b161e51a5bc in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #30: <unknown function> + 0x30a0c4 (0x2b161e51c0c4 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #31: <unknown function> + 0x3053ee (0x2b161e5173ee in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #32: <unknown function> + 0x18ca4b (0x2b161e39ea4b in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #33: <unknown function> + 0x19aaf1 (0x2b161e3acaf1 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #34: <unknown function> + 0xd6d84 (0x2b161f2c5d84 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #35: <unknown function> + 0x9609 (0x2b161ee6a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #36: clone + 0x43 (0x2b161f65c293 in /lib/x86_64-linux-gnu/libc.so.6)
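
As an aside, the driver/runtime mismatch named in the title can be checked independently of Triton with a minimal CUDA runtime sketch (illustrative only, not part of the unit test):

// Minimal sketch: compare the newest CUDA version the installed driver
// supports with the CUDA runtime version this binary was built against.
// Build (assumption): nvcc check_cuda_versions.cu -o check_cuda_versions
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // stays 0 if no driver is visible
  cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 11040 means CUDA 11.4
  std::printf("driver supports CUDA <= %d.%d, runtime is CUDA %d.%d\n",
              driverVersion / 1000, (driverVersion % 100) / 10,
              runtimeVersion / 1000, (runtimeVersion % 100) / 10);
  if (driverVersion < runtimeVersion)
    std::printf("driver is insufficient for this runtime\n");
  return 0;
}
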
@cmsbuild
Contributor

cmsbuild commented Mar 1, 2023

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider changed the title from "[GPU] RelVal 10805.31 failed: failed to load model 'gat_test'" to "[GPU] Unit test HeterogeneousCore/SonicTriton failed: failed to load model 'gat_test'" on Mar 1, 2023
@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Mar 1, 2023

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@makortel
Contributor

makortel commented Mar 1, 2023

@kpedro88 Please take a look

@kpedro88
Contributor

kpedro88 commented Mar 1, 2023

I notice this is an A100 GPU: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-8b5aff3f-0f0c-f75d-2c23-218268379d98)

Is this GPU partitioned with MIG?
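
For reference, both the device UUID that metrics.cc fails to read and the MIG mode can be queried directly through NVML; a minimal sketch, assuming the NVML headers and library are available (link with -lnvidia-ml) — illustrative, not Triton's actual code:

// Minimal NVML sketch: print the UUID and MIG mode of GPU 0.
#include <cstdio>
#include <nvml.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) { std::printf("NVML init failed\n"); return 1; }
  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
    if (nvmlDeviceGetUUID(dev, uuid, NVML_DEVICE_UUID_BUFFER_SIZE) == NVML_SUCCESS)
      std::printf("GPU 0 UUID: %s\n", uuid);  // the query that fails in metrics.cc above
    unsigned int current = 0, pending = 0;
    // returns NVML_ERROR_NOT_SUPPORTED on GPUs without MIG capability
    if (nvmlDeviceGetMigMode(dev, &current, &pending) == NVML_SUCCESS)
      std::printf("MIG mode: current=%u pending=%u\n", current, pending);
  }
  nvmlShutdown();
  return 0;
}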

@iarspider
Contributor Author

@kpedro88 The actual issue with that error seems to be this:

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use Docker with NVIDIA Container Toolkit to start this container; see
   https://github.com/NVIDIA/nvidia-docker.

Since IB CMSSW_13_1_ROOT6_X_2023-03-02-2300 (the first IB after the update of SonicTriton in #40814), the error is different:

----- Begin Fatal Exception 03-Mar-2023 06:48:22 CET-----------------------
An exception of category 'TritonFailure' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'p'
   [2] Calling method for module TritonGraphProducer/'TritonGraphProducer'
Exception Message:
unable to register CUDA shared memory region: 3353496_input3: failed to register CUDA shared memory region '3353496_input3': failed to open CUDA IPC handle: CUDA driver version is insufficient for CUDA runtime version
----- End Fatal Exception -------------------------------------------------
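
For context, SonicTriton's CUDA shared-memory path relies on the CUDA IPC mechanism: the client exports a handle for a device allocation and the server maps it in its own process. A minimal sketch of that handshake (illustrative names, not SonicTriton's actual code):

// Minimal sketch of the CUDA IPC handshake behind
// "register CUDA shared memory region".
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void* devPtr = nullptr;
  if (cudaMalloc(&devPtr, 1 << 20) != cudaSuccess) return 1;

  // Client side: export an IPC handle for the device allocation.
  cudaIpcMemHandle_t handle;
  cudaError_t err = cudaIpcGetMemHandle(&handle, devPtr);
  std::printf("cudaIpcGetMemHandle: %s\n", cudaGetErrorString(err));

  // Server side (in another process) would map it with:
  //   void* mapped = nullptr;
  //   cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
  // It is this open step that the exception above reports as failing.

  cudaFree(devPtr);
  return 0;
}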

@aandvalenzuela
Contributor

Hello,

This issue is still present in the latest IBs (e.g. CMSSW_13_1_GPU_X_2023-03-08-2300).

Thanks!
Andrea

@iarspider changed the title from "[GPU] Unit test HeterogeneousCore/SonicTriton failed: failed to load model 'gat_test'" to "[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version" on Mar 9, 2023
@kpedro88
Contributor

@iarspider is it possible to access one of the GPU machines used to run the unit test? It's hard to debug potential driver version incompatibilities without seeing what the machines actually use.

@iarspider
Contributor Author

@kpedro88 this IB is built (and tested) on lxplus8-gpu and lxplus9-gpu. The most recent failed unit test was executed on lxplus8s05.

@kpedro88
Contributor

This seems to be caused by a nested apptainer bug that just appeared: apptainer/apptainer#1205. (If I run the tests without starting inside a container, logging directly into lxplus8s05 and using the host OS, they work fine.)

In fact, the test now fails even in older releases, e.g. CMSSW_13_0_GPU_X_2023-03-13-2300, where the only recent change in the SonicTriton package is s/singularity/apptainer/, indicating that the cause of the failure is external.

@smuzaffar
Contributor

wf 10804.31 again failed for 13.0.X with the exception unable to register shared memory region: 26373_input17: Socket closed. Hopefully apptainer version 1.1.7 will fix this issue; it is tagged but not yet available for installation.

@smuzaffar
Contributor

@kpedro88, apptainer/apptainer#1205 looks like it affects the GPU libs, but we are getting these errors on non-GPU nodes, so I am not sure the fix for apptainer/apptainer#1205 is going to solve this issue.

@kpedro88
Contributor

kpedro88 commented Apr 4, 2023

The problem discussed in this GitHub issue relates to GPUs and was identified as being caused by a failure of apptainer to forward symlinked GPU drivers in nested containers. That problem will be fixed once apptainer version 1.1.7 is deployed. A failure on CPU nodes is a different problem. The particular failure you point out (Socket closed) seems to be sporadic/transient (most IBs for most SCRAM_ARCHs in the past week do not show it).
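
A quick way to confirm this failure mode from inside the (nested) container, independent of CMSSW or Triton, is to check whether the driver library is loadable at all; a minimal sketch (illustrative only, link with -ldl):

// Minimal sketch: check whether the host NVIDIA driver library was
// forwarded into the container (e.g. via apptainer's --nv option).
// A missing or dangling-symlink libcuda.so.1 reproduces the
// "driver not detected" / insufficient-driver symptoms.
#include <cstdio>
#include <dlfcn.h>

int main() {
  void* handle = dlopen("libcuda.so.1", RTLD_NOW);
  if (handle == nullptr) {
    std::printf("driver library not usable: %s\n", dlerror());
    return 1;
  }
  std::printf("libcuda.so.1 loaded OK\n");
  dlclose(handle);
  return 0;
}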

@kpedro88
Contributor

The failure no longer occurs now that apptainer 1.1.7 is deployed.

@makortel
Contributor

+heterogeneous

@makortel
Contributor

@cmsbuild, please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
