
[ENH] Utility for catching GPUs in exclusive mode #697

Closed
IAlibay opened this issue May 24, 2023 · 16 comments · Fixed by #699

Comments

@IAlibay
Contributor

IAlibay commented May 24, 2023

After discussions with @mikemhenry we think this might need to live here.

We have had issues with folks trying to use a multistatesampler-derived class (repex, sams, etc...) and things failing in a non-clean manner because their GPU was set to exclusive mode (tagging @ijpulidos who found this issue at the OMSF workshop).

It would be good to have a utility here that does a call to nvidia-smi --query-gpu="compute_mode" --format=csv or similar to get the compute mode and check that we aren't in exclusive mode.
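
For reference, a minimal sketch of what such a utility could look like (the function name, warning text, and placement are all just illustrative, not a settled API):

```python
# Illustrative sketch only: warn if any visible NVIDIA GPU reports an
# exclusive compute mode, using the nvidia-smi query suggested above.
import subprocess
import warnings


def warn_if_exclusive_mode_gpus():
    """Emit a warning if nvidia-smi reports any GPU in an exclusive compute mode."""
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
            text=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No nvidia-smi / no NVIDIA driver available; nothing to check.
        return
    modes = [line.strip() for line in output.splitlines() if line.strip()]
    if any(mode.startswith("Exclusive") for mode in modes):
        warnings.warn(
            "At least one GPU is in an exclusive compute mode "
            f"(reported modes: {modes}). Multistate samplers keep several "
            "CUDA contexts alive and may fail on such a device."
        )
```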

@ijpulidos
Contributor

Yes, this is something we can do: detect if the device is in Exclusive_Process compute mode and stop early if that's the case, because the subsequent errors are kinda cryptic.

On the other hand, maybe there is something else going on here. This issue seems to suggest that we are running more than one job on the GPU and, if so, why do we need to run more than one job on the device? Just trying to think whether this is a symptom of us not deleting/removing contexts properly on the device.

ijpulidos modified the milestones: 0.23.0, 0.22.2 on May 24, 2023
@ijpulidos
Contributor

FYI: when I monitor GPU usage while running simulations, I only see one process, as expected.

@IAlibay
Contributor Author

IAlibay commented May 24, 2023

Just trying to think whether this is a symptom of us not deleting/removing contexts properly on the device.

Yeah, so my thought here was that maybe we're keeping GPU contexts for the replicas alive on the CUDA device (even though only one process is running at a given time), so it's being registered as a multi-process execution.

@mikemhenry
Contributor

mikemhenry commented May 24, 2023

I think it is maybe one process with more than one context? I can't work on this right now, but when I (or someone else) get around to it: if you have a GPU on a headless box, you can put it in Exclusive_Process mode and see what happens.

Like, one process will show up in Linux, but you can have multiple threads.

@ijpulidos
Contributor

ijpulidos commented May 24, 2023

@mikemhenry I did that; in case we need it, the error is as follows (this is the same error folks were getting at the OMSF meeting): https://gist.github.com/ijpulidos/ac6e59ee30471154f857ff2fb6635961

In the end it is an openmm.OpenMMException: No compatible CUDA device is available

@mikemhenry
Contributor

Ah okay I misunderstood your comment.

I'm thinking we make a separate util function that does this check and raises an error.

@ijpulidos can you run the tests with Exclusive_Process to see what GPU code paths still work? I think we should be careful where we throw the error. We could also just throw a warning and call the function when we make the first GPU context.

@jchodera
Member

+1 for adding intelligent error checking to warn the user if they are trying to use thread- or process-exclusive mode with CUDA.

The overhead with creating CUDA contexts is large enough that we generally cache multiple OpenMM Context objects to avoid having to keep creating and destroying them, but many CUDA installations are set up to use thread- or process-exclusive mode by default, which won't allow more than one CUDA context without triggering an exception.
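
As an illustration of that point, a minimal reproduction sketch (assuming OpenMM's CUDA platform and a device switched to exclusive mode beforehand, e.g. with nvidia-smi -c EXCLUSIVE_PROCESS as root) would be expected to fail on the second Context:

```python
# Reproduction sketch: creating more than one Context on a CUDA device that is
# in Exclusive_Process mode should trigger the exception discussed above.
import openmm

platform = openmm.Platform.getPlatformByName("CUDA")

# A trivial one-particle system is enough to build a Context.
system = openmm.System()
system.addParticle(1.0)

integrator1 = openmm.VerletIntegrator(0.001)
integrator2 = openmm.VerletIntegrator(0.001)

context1 = openmm.Context(system, integrator1, platform)  # first context: fine
context2 = openmm.Context(system, integrator2, platform)  # second: openmm.OpenMMException
```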

@mikemhenry: Would you be able to tackle this?

@mikemhenry
Contributor

Yes!

@ijpulidos
Contributor

The only thing that may be difficult to handle is that I don't know if there's an easy way to know which device is actually being used. Think, for example, of the case where we are running on a machine with more than one GPU and only one of them is in Exclusive_Process mode: with nvidia-smi we can tell which device is in that mode, but we cannot easily tell which device is actually going to be used, as far as I know.

On the other hand, we already have a function that calls nvidia-smi, _display_cuda_devices(). @mikemhenry, just in case you think that's also the best place to try implementing this logic.
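
For example, a sketch of the per-device query that such a helper could run (the function name below is made up; only the nvidia-smi call itself comes from the discussion above):

```python
# Illustrative helper: report index, name, and compute mode for each GPU that
# nvidia-smi can see, so an exclusive-mode device can at least be pointed out.
import subprocess


def _query_cuda_device_modes():
    """Return a list of (index, name, compute_mode) tuples from nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name,compute_mode", "--format=csv,noheader"],
        text=True,
    )
    devices = []
    for line in output.splitlines():
        index, name, mode = (field.strip() for field in line.split(",", 2))
        devices.append((int(index), name, mode))
    return devices
```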

@ijpulidos
Contributor

We can probably tackle #693 in the same set of changes.

@IAlibay
Contributor Author

IAlibay commented May 25, 2023

@ijpulidos could looking at CUDA_VISIBLE_DEVICES work? If multiple devices are present and defined in the env var, or the env var is not defined, then I think the assumption of only needing to default to 0 should be fine? (IIRC OpenMM defaults to visible device 0, right?)
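
A rough sketch of that heuristic (assuming, as above, that an unset or empty variable means device 0; the helper name is hypothetical):

```python
# Heuristic sketch: guess the candidate device indices from CUDA_VISIBLE_DEVICES,
# falling back to device 0 when the variable is unset or empty.
import os


def _candidate_device_indices():
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not visible:
        return [0]  # assume OpenMM ends up on the default device
    # Keep only plain integer entries; UUID-style entries would need extra handling.
    return [int(entry) for entry in visible.split(",") if entry.strip().isdigit()]
```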

@ijpulidos
Contributor

@ijpulidos could looking at CUDA_VISIBLE_DEVICES work?

Maybe, but it gets kinda tricky. CUDA_VISIBLE_DEVICES uses the API order, which orders devices by compute capabilities, whereas nvidia-smi orders devices by PCI ID (which depends on the OS). So we cannot guarantee correspondence there.

@mikemhenry
Contributor

Think, for example, of the case where we are running on a machine with more than one GPU and only one of them is in Exclusive_Process mode

This is why I will throw a warning and not an error: if our heuristic is wrong, then either the warning gets printed and everything still works, OR the warning isn't printed and everything still works as before.

@ijpulidos
Contributor

We could also use the nvidia UUID (not just the index) to work around this limitation.

We can try/catch the openmm.OpenMMException ("No compatible CUDA device is available") and try debugging that once it's caught.
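
A sketch of that try/catch idea (the wrapper name is hypothetical; the string match is based on the error reported earlier in this thread):

```python
# Illustrative wrapper: re-raise the cryptic context-creation failure with a hint
# about exclusive compute mode.
import openmm


def _create_context_with_hint(system, integrator, platform):
    try:
        return openmm.Context(system, integrator, platform)
    except openmm.OpenMMException as error:
        if "No compatible CUDA device is available" in str(error):
            raise RuntimeError(
                "Creating a CUDA context failed. This can happen when the GPU is in "
                "an exclusive compute mode; check "
                "`nvidia-smi --query-gpu=compute_mode --format=csv`."
            ) from error
        raise
```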

@mikemhenry
Contributor

mikemhenry commented May 25, 2023

"serial" or "gpu_serial"
This number matches the serial number physically printed on each board. It is a globally unique immutable alphanumeric value.

"uuid" or "gpu_uuid"
This value is the globally unique immutable alphanumeric identifier of the GPU. It does not correspond to any physical label on the board.

Do we want to log the uuid or the serial number?

I was thinking serial since they are both unique, but if someone needed to remove a card causing issues, the serial would be more helpful.

@mikemhenry
Contributor

Ah, if the GPU is integrated there is no serial number printed on the board and it comes back as N/A, so I will use the uuid.
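
A sketch of what logging those identifiers could look like (the logger setup is illustrative; integrated boards report the serial as unavailable, as noted above):

```python
# Illustrative logging of the per-GPU uuid and serial fields quoted above.
import logging
import subprocess

logger = logging.getLogger(__name__)


def _log_gpu_identifiers():
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=uuid,serial", "--format=csv,noheader"],
        text=True,
    )
    for line in output.splitlines():
        uuid, serial = (field.strip() for field in line.split(",", 1))
        logger.info("GPU uuid=%s serial=%s", uuid, serial)
```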
