
[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version #40911

Closed
iarspider opened this issue Mar 1, 2023 · 17 comments


@iarspider
Contributor

Log: link

E0301 02:32:36.968122 25 metrics.cc:506] Unable to get device UUID: Bad parameter passed to function
E0301 02:32:36.968144 25 metrics.cc:506] Unable to get device UUID: Bad parameter passed to function
(...)
E0301 02:32:37.097455 25 model_repository_manager.cc:1215] failed to load 'gat_test' version 1: Internal: failed to load model 'gat_test': CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. CUDA error: unknown error
Exception raised from currentStreamCaptureStatusMayInitCtx at ../c10/cuda/CUDAGraphsC10Utils.h:83 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x2b161f78178c in /opt/tritonserver/backends/pytorch/libc10.so)
frame #1: <unknown function> + 0x16d05 (0x2b161fa59d05 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #2: <unknown function> + 0x1b938 (0x2b161fa5e938 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #3: <unknown function> + 0x1d21c (0x2b161fa6021c in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #4: <unknown function> + 0x1d815 (0x2b161fa60815 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #5: THCStorage_resizeBytes(THCState*, c10::StorageImpl*, long) + 0x96 (0x2b162c635586 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #6: <unknown function> + 0x231cc15 (0x2b162b466c15 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #7: at::native::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x86 (0x2b162c397896 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #8: <unknown function> + 0x336c0b9 (0x2b162c4b60b9 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #9: <unknown function> + 0x336c154 (0x2b162c4b6154 in /opt/tritonserver/backends/pytorch/libtorch_cuda.so)
frame #10: <unknown function> + 0x1c0fd71 (0x2b162167fd71 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #11: at::empty_strided(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions) + 0x237 (0x2b16211bf2b7 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #12: <unknown function> + 0x13c1325 (0x2b1620e31325 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #13: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x218 (0x2b1620e32728 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #14: <unknown function> + 0x1e0b84d (0x2b162187b84d in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #15: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x17a (0x2b1621ad045a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #16: torch::jit::Unpickler::readInstruction() + 0x18cf (0x2b1623a62a9f in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #17: torch::jit::Unpickler::run() + 0xa8 (0x2b1623a633f8 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #18: torch::jit::Unpickler::parse_ivalue() + 0x32 (0x2b1623a635a2 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #19: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x456 (0x2b1623a234a6 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #20: <unknown function> + 0x3fac535 (0x2b1623a1c535 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #21: <unknown function> + 0x3faedf1 (0x2b1623a1edf1 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #22: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1ba (0x2b1623a1fd9a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #23: torch::jit::load(std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc2 (0x2b1623a21d72 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #24: torch::jit::load(std::istream&, c10::optional<c10::Device>) + 0x6a (0x2b1623a21e5a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #25: <unknown function> + 0x16d7b (0x2b16fa0dcd7b in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #26: <unknown function> + 0x1915a (0x2b16fa0df15a in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #27: <unknown function> + 0x19622 (0x2b16fa0df622 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #28: TRITONBACKEND_ModelInstanceInitialize + 0x374 (0x2b16fa0df9e4 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #29: <unknown function> + 0x3085bc (0x2b161e51a5bc in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #30: <unknown function> + 0x30a0c4 (0x2b161e51c0c4 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #31: <unknown function> + 0x3053ee (0x2b161e5173ee in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #32: <unknown function> + 0x18ca4b (0x2b161e39ea4b in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #33: <unknown function> + 0x19aaf1 (0x2b161e3acaf1 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #34: <unknown function> + 0xd6d84 (0x2b161f2c5d84 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #35: <unknown function> + 0x9609 (0x2b161ee6a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #36: clone + 0x43 (0x2b161f65c293 in /lib/x86_64-linux-gnu/libc.so.6)
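
As an aside, the driver/runtime mismatch named in the title can be checked independently of Triton with a minimal CUDA runtime sketch (illustrative only, not part of the unit test):

// Minimal sketch: compare the newest CUDA version the installed driver
// supports with the CUDA runtime version this binary was built against.
// Build (assumption): nvcc check_cuda_versions.cu -o check_cuda_versions
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // stays 0 if no driver is visible
  cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 11040 means CUDA 11.4
  std::printf("driver supports CUDA <= %d.%d, runtime is CUDA %d.%d\n",
              driverVersion / 1000, (driverVersion % 100) / 10,
              runtimeVersion / 1000, (runtimeVersion % 100) / 10);
  if (driverVersion < runtimeVersion)
    std::printf("driver is insufficient for this runtime\n");
  return 0;
}
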
@cmsbuild
Contributor

cmsbuild commented Mar 1, 2023

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider changed the title from "[GPU] RelVal 10805.31 failed: failed to load model 'gat_test'" to "[GPU] Unit test HeterogeneousCore/SonicTriton failed: failed to load model 'gat_test'" on Mar 1, 2023
@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Mar 1, 2023

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@makortel
Contributor

makortel commented Mar 1, 2023

@kpedro88 Please take a look

@kpedro88
Contributor

kpedro88 commented Mar 1, 2023

I notice this is an A100 GPU: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-8b5aff3f-0f0c-f75d-2c23-218268379d98)

Is this GPU partitioned with MIG?
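
For reference, both the device UUID that metrics.cc fails to read and the MIG mode can be queried directly through NVML; a minimal sketch, assuming the NVML headers and library are available (link with -lnvidia-ml) — illustrative, not Triton's actual code:

// Minimal NVML sketch: print the UUID and MIG mode of GPU 0.
#include <cstdio>
#include <nvml.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) { std::printf("NVML init failed\n"); return 1; }
  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
    if (nvmlDeviceGetUUID(dev, uuid, NVML_DEVICE_UUID_BUFFER_SIZE) == NVML_SUCCESS)
      std::printf("GPU 0 UUID: %s\n", uuid);  // the query that fails in metrics.cc above
    unsigned int current = 0, pending = 0;
    // returns NVML_ERROR_NOT_SUPPORTED on GPUs without MIG capability
    if (nvmlDeviceGetMigMode(dev, &current, &pending) == NVML_SUCCESS)
      std::printf("MIG mode: current=%u pending=%u\n", current, pending);
  }
  nvmlShutdown();
  return 0;
}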

@iarspider
Contributor Author

@kpedro88 The actual issue with that error seems to be this:

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use Docker with NVIDIA Container Toolkit to start this container; see
   https://github.com/NVIDIA/nvidia-docker.

Since IB CMSSW_13_1_ROOT6_X_2023-03-02-2300 (the first IB after the update of SonicTriton in #40814), the error is different:

----- Begin Fatal Exception 03-Mar-2023 06:48:22 CET-----------------------
An exception of category 'TritonFailure' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'p'
   [2] Calling method for module TritonGraphProducer/'TritonGraphProducer'
Exception Message:
unable to register CUDA shared memory region: 3353496_input3: failed to register CUDA shared memory region '3353496_input3': failed to open CUDA IPC handle: CUDA driver version is insufficient for CUDA runtime version
----- End Fatal Exception -------------------------------------------------
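
For context, SonicTriton's CUDA shared-memory path relies on the CUDA IPC mechanism: the client exports a handle for a device allocation and the server maps it in its own process. A minimal sketch of that handshake (illustrative names, not SonicTriton's actual code):

// Minimal sketch of the CUDA IPC handshake behind
// "register CUDA shared memory region".
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void* devPtr = nullptr;
  if (cudaMalloc(&devPtr, 1 << 20) != cudaSuccess) return 1;

  // Client side: export an IPC handle for the device allocation.
  cudaIpcMemHandle_t handle;
  cudaError_t err = cudaIpcGetMemHandle(&handle, devPtr);
  std::printf("cudaIpcGetMemHandle: %s\n", cudaGetErrorString(err));

  // Server side (in another process) would map it with:
  //   void* mapped = nullptr;
  //   cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
  // It is this open step that the exception above reports as failing.

  cudaFree(devPtr);
  return 0;
}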

@aandvalenzuela
Contributor

Hello,

This issue is still present in the latest IBs (e.g. CMSSW_13_1_GPU_X_2023-03-08-2300).

Thanks!
Andrea

@iarspider changed the title from "[GPU] Unit test HeterogeneousCore/SonicTriton failed: failed to load model 'gat_test'" to "[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version" on Mar 9, 2023
@kpedro88
Contributor

@iarspider is it possible to access one of the GPU machines used to run the unit test? It's hard to debug potential driver version incompatibilities without seeing what the machines actually use.

@iarspider
Contributor Author

@kpedro88 this IB is built (and tested) on lxplus8-gpu and lxplus9-gpu. The most recent failed unit test was executed on lxplus8s05.

@kpedro88
Contributor

This seems to be caused by a nested apptainer bug that just appeared: apptainer/apptainer#1205. (If I run the tests without starting inside a container, logging directly into lxplus8s05 and using the host OS, they work fine.)

In fact, the test now fails even in older releases, e.g. CMSSW_13_0_GPU_X_2023-03-13-2300, where the only recent change in the SonicTriton package is s/singularity/apptainer/, indicating that the cause of the failure is external.

@smuzaffar
Contributor

wf 10804.31 again failed for 13.0.X with the exception unable to register shared memory region: 26373_input17: Socket closed. Hopefully apptainer version 1.1.7 will fix this issue; it is tagged but not yet available for installation.

@smuzaffar
Contributor

@kpedro88, apptainer/apptainer#1205 looks like it affects the GPU libs, but we are getting these errors on non-GPU nodes, so I am not sure the fix for apptainer/apptainer#1205 is going to solve this issue.

@kpedro88
Contributor

kpedro88 commented Apr 4, 2023

The problem discussed in this GitHub issue relates to GPUs and was identified as being caused by a failure of apptainer to forward symlinked GPU drivers in nested containers. That problem will be fixed once apptainer version 1.1.7 is deployed. A failure on CPU nodes is a different problem. The particular failure you point out (Socket closed) seems to be sporadic/transient (most IBs for most SCRAM_ARCHs in the past week do not show it).
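
A quick way to confirm this failure mode from inside the (nested) container, independent of CMSSW or Triton, is to check whether the driver library is loadable at all; a minimal sketch (illustrative only, link with -ldl):

// Minimal sketch: check whether the host NVIDIA driver library was
// forwarded into the container (e.g. via apptainer's --nv option).
// A missing or dangling-symlink libcuda.so.1 reproduces the
// "driver not detected" / insufficient-driver symptoms.
#include <cstdio>
#include <dlfcn.h>

int main() {
  void* handle = dlopen("libcuda.so.1", RTLD_NOW);
  if (handle == nullptr) {
    std::printf("driver library not usable: %s\n", dlerror());
    return 1;
  }
  std::printf("libcuda.so.1 loaded OK\n");
  dlclose(handle);
  return 0;
}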

@kpedro88
Contributor

The failure no longer occurs now that apptainer 1.1.7 is deployed.

@makortel
Contributor

+heterogeneous

@makortel
Contributor

@cmsbuild, please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
