[ENH] Utility for catching GPUs in exclusive mode #697
Comments
Yes, this is something we can do: detect if the device is in exclusive mode. On the other hand, maybe there could be something else going on here. This issue seems to suggest that we are running more than one job on the GPU and, if so, why do we need to run more than one job on the device? Just trying to think whether this is a symptom of us not deleting/removing contexts properly on the device.
FYI. When I monitor the usage of the GPU when running simulations I only see one process, as expected.
Yeah, so my thought here was that maybe we're keeping GPU contexts for the replicas alive on the CUDA device (even though only one process is running at a given time), so it's being registered as a multi-process execution.
I think it is one process with more than one context, maybe? I can't work on this right now, but when I get around to it (or someone else does): if you have a GPU on a headless box, you can put it in exclusive mode and test. Like, one proc will show up in Linux but you can have multiple threads.
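For reference, a minimal sketch of how one might flip the compute mode on a test box. Assumptions: `nvidia-smi` is on the PATH, the caller has admin rights, and device index 0 is just an example.

```python
import subprocess

# Put GPU 0 into exclusive-process compute mode for testing (needs admin rights).
subprocess.run(["nvidia-smi", "-i", "0", "-c", "EXCLUSIVE_PROCESS"], check=True)

# ... run the multistate sampler tests here ...

# Restore the default compute mode afterwards.
subprocess.run(["nvidia-smi", "-i", "0", "-c", "DEFAULT"], check=True)
```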
@mikemhenry I did that; in case we need it, the error is as follows (this is the same error folks were getting at the OMSF meeting): https://gist.github.com/ijpulidos/ac6e59ee30471154f857ff2fb6635961 In the end it is a …
Ah okay, I misunderstood your comment. I'm thinking we make a separate util function that does this check and raises an error. @ijpulidos can you run the tests with …?
+1 for adding intelligent error checking to warn the user if they are trying to use thread- or process-exclusive mode with CUDA. The overhead of creating CUDA contexts is large enough that we generally cache multiple OpenMM Context objects. @mikemhenry: Would you be able to tackle this?
Yes!
The only thing that may be difficult to handle is that I don't know if there's an easy way to know which device is actually being used. Think, for example, of the case where we are running on a machine with more than one GPU and only one of them is in exclusive_process mode. On the other hand, we have a function that already calls …
We probably can tackle #693 in the same set of changes.
@ijpulidos could looking at CUDA_VISIBLE_DEVICES work? If multiple devices are present and defined in the env var, or the env var is not defined, then I think the assumption of only needing to default to 0 should be fine? (iirc OpenMM defaults to visible device 0, right?)
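A minimal sketch of that heuristic; the parsing and the default to index 0 come from this thread and are assumptions, not documented OpenMM behaviour.

```python
import os

def visible_device_indices():
    """Return the CUDA device indices/UUIDs this process can see.

    Heuristic from the discussion above: if CUDA_VISIBLE_DEVICES is unset or
    empty, assume device 0 is the one that will be used; otherwise return the
    entries listed in the variable, in order.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not visible:
        return ["0"]  # assumption: fall back to the first device
    return [token.strip() for token in visible.split(",") if token.strip()]
```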
Maybe, but it gets kinda tricky.
This is why I will throw a warning and not an error: if our heuristic is wrong, then either the warning is printed and everything still works, or the warning isn't printed and everything still works.
We could also use the NVIDIA UUID (not just the index) to work around this limitation. We can try-catch the …
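Something along these lines could map UUIDs to compute modes; the query fields are standard nvidia-smi ones, but the function name is just illustrative.

```python
import subprocess

def compute_mode_by_uuid():
    """Map each GPU UUID to its compute mode via nvidia-smi.

    Returns an empty dict if nvidia-smi is missing or errors out, so callers
    can fall back gracefully on CPU-only or non-NVIDIA machines.
    """
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=uuid,compute_mode", "--format=csv,noheader"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return {}
    modes = {}
    for line in output.strip().splitlines():
        uuid, mode = (field.strip() for field in line.split(",", 1))
        modes[uuid] = mode
    return modes
```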
Do we want to log the …? I was thinking …
Ah, if the GPU is integrated there is no serial number printed and it comes back as NA, so I will use the UUID instead.
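Roughly what that fallback could look like; `serial` and `uuid` are standard nvidia-smi query fields, while the exact N/A placeholder string is an assumption.

```python
import subprocess

def gpu_identifiers():
    """One stable identifier per GPU: the serial if available, else the UUID."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=serial,uuid", "--format=csv,noheader"],
        text=True,
    )
    identifiers = []
    for line in output.strip().splitlines():
        serial, uuid = (field.strip() for field in line.split(",", 1))
        # Integrated/consumer GPUs often report no serial (an N/A-style placeholder).
        identifiers.append(uuid if not serial or "N/A" in serial else serial)
    return identifiers
```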
After discussions with @mikemhenry we think this might need to live here.
We have had issues with folks trying to use a MultiStateSampler-derived class (repex, SAMS, etc.) and things failing in a non-clean manner because their GPU setup was set to exclusive mode (tagging @ijpulidos who found this issue at the OMSF workshop).
It would be good to have a utility here that does a call to
nvidia-smi --query-gpu="compute_mode" --format=csv
or similar to get the compute mode and check that we aren't in exclusive mode.
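A minimal sketch of such a utility, using the query above and the warn-rather-than-raise behaviour discussed in the comments; the function name and warning text are placeholders.

```python
import subprocess
import warnings

def warn_if_gpu_exclusive_mode():
    """Warn if any visible NVIDIA GPU is in an exclusive compute mode.

    Exclusive_Process / Exclusive_Thread modes break samplers that keep
    several OpenMM contexts on one device. This is a heuristic check only:
    if nvidia-smi is not available, do nothing.
    """
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return  # no NVIDIA tooling found; nothing to check
    modes = [line.strip() for line in output.strip().splitlines()]
    if any("Exclusive" in mode for mode in modes):
        warnings.warn(
            f"GPU compute modes {modes} include an exclusive mode; "
            "multistate samplers may fail. Consider switching back with "
            "'nvidia-smi -c DEFAULT' (needs admin rights)."
        )
```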