
[ENH] Utility for catching GPUs in exclusive mode #697

Closed
IAlibay opened this issue May 24, 2023 · 16 comments · Fixed by #699

Comments

@IAlibay
Contributor

IAlibay commented May 24, 2023

After discussions with @mikemhenry we think this might need to live here.

We have had issues with folks trying to use a multistatesampler-derived class (repex, sams, etc...) and things failing in a non-clean manner because their GPU was set to exclusive mode (tagging @ijpulidos who found this issue at the OMSF workshop).

It would be good to have a utility here that does a call to nvidia-smi --query-gpu="compute_mode" --format=csv or similar to get the compute mode and check that we aren't in exclusive mode.
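
For reference, a minimal sketch of what such a utility could look like (the function name, warning text, and placement are all just illustrative, not a settled API):

```python
# Illustrative sketch only: warn if any visible NVIDIA GPU reports an
# exclusive compute mode, using the nvidia-smi query suggested above.
import subprocess
import warnings


def warn_if_exclusive_mode_gpus():
    """Emit a warning if nvidia-smi reports any GPU in an exclusive compute mode."""
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
            text=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No nvidia-smi / no NVIDIA driver available; nothing to check.
        return
    modes = [line.strip() for line in output.splitlines() if line.strip()]
    if any(mode.startswith("Exclusive") for mode in modes):
        warnings.warn(
            "At least one GPU is in an exclusive compute mode "
            f"(reported modes: {modes}). Multistate samplers keep several "
            "CUDA contexts alive and may fail on such a device."
        )
```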

@ijpulidos
Contributor

Yes, this is something we can do: detect if the device is in Exclusive_Process compute mode and stop early if that's the case, because the subsequent errors are kinda cryptic.

On the other hand, maybe there is something else going on here. This issue seems to suggest that we are running more than one job on the GPU and, if so, why do we need to run more than one job on the device? Just trying to think whether this is a symptom of us not deleting/removing contexts properly on the device.

ijpulidos modified the milestones: 0.23.0, 0.22.2 on May 24, 2023
@ijpulidos
Contributor

FYI: when I monitor GPU usage while running simulations, I only see one process, as expected.

@IAlibay
Contributor Author

IAlibay commented May 24, 2023

Just trying to think whether this is a symptom of us not deleting/removing contexts properly on the device.

Yeah, so my thought here was that maybe we're keeping GPU contexts for the replicas alive on the CUDA device (even though only one process is running at a given time), so it's being registered as a multi-process execution.

@mikemhenry
Contributor

mikemhenry commented May 24, 2023

I think it is maybe one process with more than one context? I can't work on this right now, but when I (or someone else) get around to it: if you have a GPU on a headless box, you can put it in Exclusive_Process mode and see what happens.

Like, one process will show up in Linux, but you can have multiple threads.

@ijpulidos
Contributor

ijpulidos commented May 24, 2023

@mikemhenry I did that; in case we need it, the error is as follows (this is the same error folks were getting at the OMSF meeting): https://gist.github.com/ijpulidos/ac6e59ee30471154f857ff2fb6635961

In the end it is an openmm.OpenMMException: No compatible CUDA device is available

@mikemhenry
Contributor

Ah okay I misunderstood your comment.

I'm thinking we make a separate util function that does this check and raises an error.

@ijpulidos can you run the tests with Exclusive_Process to see what GPU code paths still work? I think we should be careful where we throw the error. We could also just throw a warning and call the function when we make the first GPU context.

@jchodera
Member

+1 for adding intelligent error checking to warn the user if they are trying to use thread- or process-exclusive mode with CUDA.

The overhead with creating CUDA contexts is large enough that we generally cache multiple OpenMM Context objects to avoid having to keep creating and destroying them, but many CUDA installations are set up to use thread- or process-exclusive mode by default, which won't allow more than one CUDA context without triggering an exception.
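
As an illustration of that point, a minimal reproduction sketch (assuming OpenMM's CUDA platform and a device switched to exclusive mode beforehand, e.g. with nvidia-smi -c EXCLUSIVE_PROCESS as root) would be expected to fail on the second Context:

```python
# Reproduction sketch: creating more than one Context on a CUDA device that is
# in Exclusive_Process mode should trigger the exception discussed above.
import openmm

platform = openmm.Platform.getPlatformByName("CUDA")

# A trivial one-particle system is enough to build a Context.
system = openmm.System()
system.addParticle(1.0)

integrator1 = openmm.VerletIntegrator(0.001)
integrator2 = openmm.VerletIntegrator(0.001)

context1 = openmm.Context(system, integrator1, platform)  # first context: fine
context2 = openmm.Context(system, integrator2, platform)  # second: openmm.OpenMMException
```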

@mikemhenry: Would you be able to tackle this?

@mikemhenry
Contributor

Yes!

@ijpulidos
Contributor

The only thing that may be difficult to handle is that I don't know if there's an easy way to know which device is actually being used. Think, for example, of the case where we are running on a machine with more than one GPU and only one of them is in Exclusive_Process mode: with nvidia-smi we can tell which device is in that mode, but we cannot easily tell which device is actually going to be used, as far as I know.

On the other hand, we already have a function that calls nvidia-smi, _display_cuda_devices(). @mikemhenry, just in case you think that's also the best place to try implementing this logic.
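
For example, a sketch of the per-device query that such a helper could run (the function name below is made up; only the nvidia-smi call itself comes from the discussion above):

```python
# Illustrative helper: report index, name, and compute mode for each GPU that
# nvidia-smi can see, so an exclusive-mode device can at least be pointed out.
import subprocess


def _query_cuda_device_modes():
    """Return a list of (index, name, compute_mode) tuples from nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name,compute_mode", "--format=csv,noheader"],
        text=True,
    )
    devices = []
    for line in output.splitlines():
        index, name, mode = (field.strip() for field in line.split(",", 2))
        devices.append((int(index), name, mode))
    return devices
```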

@ijpulidos
Contributor

We can probably tackle #693 in the same set of changes.

@IAlibay
Contributor Author

IAlibay commented May 25, 2023

@ijpulidos could looking at CUDA_VISIBLE_DEVICES work? If multiple devices are present and defined in the env var, or the env var is not defined, then I think the assumption of only needing to default to 0 should be fine? (IIRC OpenMM defaults to visible device 0, right?)
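
A rough sketch of that heuristic (assuming, as above, that an unset or empty variable means device 0; the helper name is hypothetical):

```python
# Heuristic sketch: guess the candidate device indices from CUDA_VISIBLE_DEVICES,
# falling back to device 0 when the variable is unset or empty.
import os


def _candidate_device_indices():
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not visible:
        return [0]  # assume OpenMM ends up on the default device
    # Keep only plain integer entries; UUID-style entries would need extra handling.
    return [int(entry) for entry in visible.split(",") if entry.strip().isdigit()]
```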

@ijpulidos
Contributor

@ijpulidos could looking at CUDA_VISIBLE_DEVICES work?

Maybe, but it gets kinda tricky. CUDA_VISIBLE_DEVICES uses the API order, which orders devices by compute capabilities, whereas nvidia-smi orders devices by PCI ID (which depends on the OS). So we cannot guarantee correspondence there.

@mikemhenry
Contributor

Think, for example, of the case where we are running on a machine with more than one GPU and only one of them is in Exclusive_Process mode

This is why I will throw a warning and not an error: if our heuristic is wrong, then either the warning gets printed and everything still works, OR the warning isn't printed and everything still works as before.

@ijpulidos
Contributor

We could also use the nvidia UUID (not just the index) to work around this limitation.

We can try/catch the openmm.OpenMMException ("No compatible CUDA device is available") and try debugging that once it's caught.
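
A sketch of that try/catch idea (the wrapper name is hypothetical; the string match is based on the error reported earlier in this thread):

```python
# Illustrative wrapper: re-raise the cryptic context-creation failure with a hint
# about exclusive compute mode.
import openmm


def _create_context_with_hint(system, integrator, platform):
    try:
        return openmm.Context(system, integrator, platform)
    except openmm.OpenMMException as error:
        if "No compatible CUDA device is available" in str(error):
            raise RuntimeError(
                "Creating a CUDA context failed. This can happen when the GPU is in "
                "an exclusive compute mode; check "
                "`nvidia-smi --query-gpu=compute_mode --format=csv`."
            ) from error
        raise
```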

@mikemhenry
Contributor

mikemhenry commented May 25, 2023

"serial" or "gpu_serial"
This number matches the serial number physically printed on each board. It is a globally unique immutable alphanumeric value.

"uuid" or "gpu_uuid"
This value is the globally unique immutable alphanumeric identifier of the GPU. It does not correspond to any physical label on the board.

Do we want to log the uuid or the serial number?

I was thinking serial since they are both unique, but if someone needed to remove a card causing issues, the serial would be more helpful.

@mikemhenry
Contributor

Ah, if the GPU is integrated there is no serial number printed on the board and it comes back as N/A, so I will use the uuid.
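
A sketch of what logging those identifiers could look like (the logger setup is illustrative; integrated boards report the serial as unavailable, as noted above):

```python
# Illustrative logging of the per-GPU uuid and serial fields quoted above.
import logging
import subprocess

logger = logging.getLogger(__name__)


def _log_gpu_identifiers():
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=uuid,serial", "--format=csv,noheader"],
        text=True,
    )
    for line in output.splitlines():
        uuid, serial = (field.strip() for field in line.split(",", 1))
        logger.info("GPU uuid=%s serial=%s", uuid, serial)
```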
