[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version #40911
Comments
A new Issue was created by @iarspider. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign heterogeneous |
@kpedro88 Please take a look |
I notice this is an A100 GPU: Is this GPU partitioned with MIG? |
@kpedro88 The actual issue with that error seems to be this:
Since IB CMSSW_13_1_ROOT6_X_2023-03-02-2300 (which is the 1st IB after update of SonicTriton in #40814), the error is different:
|
Hello, this issue is still present in the latest IBs (CMSSW_13_1_GPU_X_2023-03-08-2300). Thanks! |
@iarspider is it possible to access one of the GPU machines used to run the unit test? It's hard to debug potential driver version incompatibilities without seeing what the machines actually use. |
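For reference, one way to see what a machine actually uses is to read the driver's supported CUDA version and compare it with the runtime version the software was built against. The following is a minimal sketch, not something taken from the CMSSW test: the helper names are hypothetical, the library name `libcuda.so.1` is an assumption about the node, and the runtime version passed in is illustrative. The only external call is `cuDriverGetVersion` from the CUDA driver API, and CUDA encodes versions as `1000 * major + 10 * minor`.

```python
import ctypes


def decode_cuda_version(version):
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    return version // 1000, (version % 1000) // 10


def driver_supports_runtime(driver_version, runtime_version):
    """The driver must be at least as new as the runtime it is asked to serve."""
    return driver_version >= runtime_version


def query_driver_version(libname="libcuda.so.1"):
    """Return the driver's CUDA version integer, or None if it cannot be read.

    Assumption: the driver library is visible under `libname`. Inside a
    container it may be missing even when the host has a working driver.
    """
    try:
        lib = ctypes.CDLL(libname)
        version = ctypes.c_int(0)
        status = lib.cuDriverGetVersion(ctypes.byref(version))
    except (OSError, AttributeError):
        return None
    # 0 is CUDA_SUCCESS in the driver API.
    return version.value if status == 0 else None


if __name__ == "__main__":
    runtime_version = 11080  # illustrative value only, i.e. CUDA 11.8
    driver_version = query_driver_version()
    if driver_version is None:
        print("No usable CUDA driver library found")
    else:
        print("driver:", decode_cuda_version(driver_version))
        print("sufficient:", driver_supports_runtime(driver_version, runtime_version))
```

Running this both on the host and inside the (nested) container would show whether the driver is too old or simply not visible in the container.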
@kpedro88 this IB is built (and tested) on |
This seems to be caused by a nested apptainer bug that just appeared: apptainer/apptainer#1205. In fact, the test now fails even in older releases, e.g. CMSSW_13_0_GPU_X_2023-03-13-2300, where the only recent change in the SonicTriton package is |
wf 10804.31 again failed for 13.0.X with exception |
@kpedro88, apptainer/apptainer#1205 looks like it affects the GPU libs, but we are getting these errors on non-GPU nodes, so I am not sure that the fix for apptainer/apptainer#1205 is going to solve this issue |
The problem being discussed in this github issue relates to GPUs and was identified as being caused by a failure of apptainer to forward symlinked GPU drivers in nested containers. When apptainer version 1.1.7 is deployed, that problem will be fixed. A failure on CPU nodes is a different problem. The particular failure you point out ( |
The failure no longer occurs now that apptainer 1.1.7 is deployed. |
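To check whether a node is still affected by the symlink-forwarding problem described in this thread, one option is to look for dangling symlinks among the GPU libraries bound into the container. This is a rough sketch under stated assumptions: the function name is hypothetical, and the default directory `/.singularity.d/libs` is assumed to be where apptainer places the libraries it binds for GPU support. It is not the check used by the CMSSW unit test.

```python
import os


def find_dangling_symlinks(directory="/.singularity.d/libs"):
    """Return sorted paths of symlinks in `directory` whose targets do not exist.

    Assumption: `directory` is where the container runtime binds GPU driver
    libraries. A missing directory yields an empty list.
    """
    if not os.path.isdir(directory):
        return []
    dangling = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # os.path.exists follows symlinks, so it is False for a broken link.
        if os.path.islink(path) and not os.path.exists(path):
            dangling.append(path)
    return dangling


if __name__ == "__main__":
    for path in find_dangling_symlinks():
        print("dangling symlink:", path, "->", os.readlink(path))
```

An empty result inside the nested container would be consistent with the driver libraries being forwarded correctly, though it does not prove that the right driver version is loaded.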
+heterogeneous |
@cmsbuild, please close |
This issue is fully signed and ready to be closed. |
Log: link