Optionally release GIL when calling run on simulations #1463
Comments
Yes, supporting multiple simultaneous simulations within a single HOOMD-blue script is something I would like to do. In addition to the excellent points you bring up, one also needs to reacquire the GIL every time HOOMD may use the Python interpreter, such as when invoking custom triggers, custom actions, and other Python callbacks.
The pybind11 docs on the GIL (https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil) suggest that pybind11 manages reacquiring the GIL when calling back into Python, but I don't know how much we can rely on that. The alternative solution is to selectively release the GIL only in the known expensive, C++-only kernel loops. This solution modifies more code, but the failure mode is a lack of performance scaling instead of undefined behavior / segmentation faults.
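For concreteness, here is a minimal pybind11 sketch of both approaches; the function names are hypothetical stand-ins, not HOOMD's actual API:

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Stand-in for a C++-only hot loop that never touches the interpreter.
void expensive_cpp_loop(unsigned int steps) {
    volatile double x = 0;
    for (unsigned int i = 0; i < steps; ++i)
        x += i;
}

// Fine-grained: release the GIL only around the known-expensive section.
void run_fine_grained(unsigned int steps, py::object callback) {
    {
        py::gil_scoped_release release;
        expensive_cpp_loop(steps);
    } // the GIL is reacquired here when `release` is destroyed
    callback(); // safe: we hold the GIL again for the Python callback
}

// Coarse-grained: the binding below releases the GIL for the whole call,
// so any path that re-enters Python must reacquire it explicitly.
void run_coarse_grained(unsigned int steps) {
    expensive_cpp_loop(steps);
    py::gil_scoped_acquire acquire; // required before touching Python
    py::print("batch complete");    // interpreter use needs the GIL
}

PYBIND11_MODULE(example, m) {
    m.def("run_fine_grained", &run_fine_grained);
    m.def("run_coarse_grained", &run_coarse_grained,
          py::call_guard<py::gil_scoped_release>());
}
```

The failure modes differ exactly as described above: a missing `gil_scoped_acquire` in the coarse-grained version is undefined behavior, while a missing release in the fine-grained version only costs scaling.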
I'm certain we already do this. It is an error to add an Operation instance to more than one Simulation.
We additionally need to consider GPU synchronization more carefully to ensure that users' threads are accessing the data they intend, and not data from previous steps while asynchronous kernels are still in flight. Also, each independent simulation should be run in a separate CUDA stream so that there is the possibility for kernels to overlap. If this worked, it could enable better throughput when batching many small simulations on one GPU.
I believe that it can be relied on. The custom triggers already rely on this behavior when calling back into Python.
I agree, but I hope that wouldn't need to be the case. I think the API surface that goes from C++ to Python is small enough that it can be understood as an encapsulated whole. Code with numpy arrays, like you mention, will need attention (various properties and accessors), as will classes that hold onto Python objects.
I think
Yea, I was gonna add that, but I wanted to get y'all on board first, haha. That would require us to add a stream parameter to every GPU driver, a bit of work, but we might be able to regex-match and replace for the most part. I'd assume we would add an additional parameter to device initialization. I'm happy that there is interest here! I'll get started on a draft PR.
I've found that a
I would prefer to make the stream per-simulation. As you say, this complicates how parallel streams would be assigned, but let's not get too far ahead. I'm not 100% certain that this is feasible. It is something I have had in mind since day 1 with HOOMD: to enable better use of the parallel resources on the GPU for small simulations. However, the early generations of GPUs were not capable of concurrent kernels, or had severe limitations on what could run concurrently. I have not recently reviewed the limitations on the latest GPUs.
Understood, though you would still imagine at most one stream per simulation, right? If that's all, I don't think the streams addition is that complicated code-wise; it's just that stream partitioning would need to happen automatically. I don't think the streams refactor is essential to the GIL release being useful, so that can be discussed more, and maybe turned into a separate PR. Like you said, the biggest benefit would come from many small sims run in parallel, masking the overhead of kernel launches.
Yes exactly. Conceptually, the stream should be owned by the simulation, not the device, so that all kernels for that simulation are processed in order. However, the
The CUDA/HIP runtime API handles creating a new stream each time one is requested.
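A minimal CUDA sketch of the per-simulation stream model under discussion (the kernel and buffer names are hypothetical, not HOOMD's drivers):

```cpp
#include <cuda_runtime.h>

// Stand-in for one simulation's integration kernel.
__global__ void step_kernel(float* pos, unsigned int N) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        pos[i] += 0.1f;
}

int main() {
    const unsigned int N = 1 << 20;
    float *pos_a, *pos_b;
    cudaMalloc(&pos_a, N * sizeof(float));
    cudaMalloc(&pos_b, N * sizeof(float));

    // One stream per simulation: each simulation's kernels stay ordered
    // relative to each other, but kernels may overlap across streams.
    cudaStream_t stream_a, stream_b;
    cudaStreamCreate(&stream_a);
    cudaStreamCreate(&stream_b);

    step_kernel<<<(N + 255) / 256, 256, 0, stream_a>>>(pos_a, N);
    step_kernel<<<(N + 255) / 256, 256, 0, stream_b>>>(pos_b, N);

    // Each simulation synchronizes only its own stream before
    // handing data back to Python.
    cudaStreamSynchronize(stream_a);
    cudaStreamSynchronize(stream_b);

    cudaStreamDestroy(stream_a);
    cudaStreamDestroy(stream_b);
    cudaFree(pos_a);
    cudaFree(pos_b);
    return 0;
}
```

Because a simulation only ever waits on its own stream, independent simulations never serialize on each other's kernel launches.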
Yes, of course. The PRs for GIL and streams should be separate. I am planning on replacing all the HIP calls with another interface layer in #1063. Since I will be touching all the kernel calls, I could add streams at that time.
Aaahh, I hadn't noticed that! Looking at the stream API, super simple! I think either model is fine, though naturally I would have thought the stream belongs with the device. I don't know if there are any cases where you would want everything to be forced into the same stream, but if so, that is a reason to opt for keeping those details with the device.
Yes, at the C++ level it is more natural to put the stream in the device class. Regarding the GIL, the pybind11 docs suggest testing in Debug mode. I do have one job in the CI matrix doing just this: https://github.com/glotzerlab/hoomd-blue/actions/runs/3904935986/jobs/6672203929 . That should help catch some issues with the GIL.
Okay cool! I'll get separate PRs going for these!
@ianrgraham it may be worth waiting on doing this. Python 3.12 will have a compile-time option to remove the GIL entirely. In even later releases it may be removed by default. I think that this is the easiest and most promising approach, as it is done by the CPython developers directly. Edit: linking the PEP, https://peps.python.org/pep-0703/
Ahh, well that's cool to hear! But that's still quite a ways away, probably an October 2023 release? I don't mind spending some time with this, since I need the behavior for some of my plugins to run efficiently. I can keep the no-GIL changes in my personal fork, but would you all mind if I intend for them to be merged? I don't intend to expose the "no-GIL" path as the default, just as an option that should be really hard for novice users to notice and use. Possibly as a flag on the run call.
Separately, I've made quite a bit of progress with refactoring a unique stream per simulation into the code.
Have you profiled with Nsight Systems and verified that you get parallel execution on the separate streams?
No, not yet, but I certainly will. A couple more regex-replaces before I'm there.
I strongly prefer not to merge experimental code, even when activated as an opt-in. Doing so introduces maintenance challenges for many years to come. Most Python packages (e.g. numpy) release the GIL without any user intervention; HOOMD-blue should do so as well if possible. I'd be happy to merge your changes that release the GIL if they do so by default and you have tested and are reasonably sure that the GIL is reacquired when needed. The release can come with a warning that multithreaded simulations are experimental, but just releasing the GIL can enable workflows that don't involve that (e.g. analysis in parallel with simulation).
Of course, I'll do my best to ensure everything is airtight. I had one other question for the both of you, regarding existing usage of streams in HOOMD. Would you prefer that I not refactor the existing cases that already use a dedicated stream? (See hoomd/hpmc/IntegratorHPMCMonoGPUDepletantsTypes.cuh, lines 60 to 77 at 0b4b6b5.)
Is there an intention down the road to phase out that code, since plain old MPI is good enough to scale HOOMD to multi-GPU systems? At the moment I have left these kernels alone, and there doesn't seem to be any harm in them running on a different stream, since that was already happening in the code before. Another case where I've noticed the usage of streams is in
I've also found some places where kernel drivers are not found in a kernel namespace.
Both the depletants code and
Feel free to fix any kernel namespace issues you find. Sorry I didn't see this earlier, but take a look at: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#default-stream
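For reference, the per-thread default stream behavior described at that link can be opted into without plumbing a stream argument through every driver. A standalone illustration (not HOOMD code):

```cpp
// Compile with: nvcc --default-stream per-thread example.cu
// (equivalently, define CUDA_API_PER_THREAD_DEFAULT_STREAM before
// including cuda_runtime.h). Each host thread then gets its own
// default stream, so kernels launched from different threads may overlap.
#include <cuda_runtime.h>
#include <thread>

__global__ void busy_kernel(float* x, unsigned int N) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        x[i] = x[i] * 2.0f + 1.0f;
}

void run_one_simulation() {
    const unsigned int N = 1 << 20;
    float* x;
    cudaMalloc(&x, N * sizeof(float));
    // No explicit stream argument: this launch goes to the calling
    // thread's per-thread default stream, not the global legacy stream.
    busy_kernel<<<(N + 255) / 256, 256>>>(x, N);
    cudaStreamSynchronize(cudaStreamPerThread);
    cudaFree(x);
}

int main() {
    std::thread t1(run_one_simulation);
    std::thread t2(run_one_simulation);
    t1.join();
    t2.join();
    return 0;
}
```

This maps naturally onto one simulation per host thread without touching each kernel launch site.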
Got it. Oh, and it's no problem. That may be a better solution, since it might also apply to the existing stream usage as well.
I would note on the GIL front that, while that work is very exciting, there are still a lot of concerns to be hammered out. Based on the ongoing discussions, I would not be at all surprised by any of the following happening (or not happening):
Any of the above would delay its usability for HOOMD, potentially significantly. Moreover, removing the GIL isn't a magic button that will suddenly make the code thread-safe, so that work needs to be done regardless. Given how tight HOOMD 3.0's Python-C++ integration is, there are probably quite a few places to think about how HOOMD uses Python (pybind) objects in C++ to ensure that they're all thread-safe. Once the core HOOMD code is thread-safe, actually releasing the GIL is pretty straightforward with pybind. As such, if getting multithreaded HOOMD support is of sufficient interest, I wouldn't advise waiting on GIL removal for that.
My intention here is not to release the GIL to make a single simulation multi-threaded, but rather to be able to run many simulations concurrently in one Python script and use my own locks to orchestrate reads and writes across systems. So users can't create undefined behavior unless they were to write some C++ plugin code that purposely shared data across threads and didn't orchestrate it properly. Trying to make all of HOOMD thread-safe is out of scope here. And it actually wasn't too bad to patch all the holes that are opened by releasing the GIL during calls to run.
Indeed, I would have no intention to provide any type of thread safety within a single Simulation object. Advanced sampling techniques do not require threading. You can use MPI partitions and communicate across the partitions instead.
That makes sense; I'm fine with that limited scope. My only point was to say that even if Python removes the GIL (or makes it possible to turn it off), that won't actually make the code thread-safe, and there's no indication that a no-GIL Python will be the default any time soon, so I would proceed with making any necessary changes to HOOMD now. In a future with a no-GIL Python, the only change would then be to remove the pybind code for releasing the GIL, since that would become redundant.
I'll put together the PR for releasing the GIL soon; that part was completed a while ago. I'm less confident about the overhaul of the kernel launches to independent CUDA streams, so I'd omit that for the sake of just getting the primary feature in.
This issue has been automatically closed because it has not had recent activity. |
Description
There are a number of research applications where it would be great to be able to run multiple simulation objects concurrently and safely share data among them. A few that I can think of: the Gibbs ensemble method, nudged elastic band minimization, and parallel tempering.
One obstacle to performing these methods efficiently is the GIL. Fortunately, the vast majority of code in HOOMD does not directly talk to the Python C API, and thus in theory the GIL can be released.
This certainly opens up the chance of abuse and bugs galore when C++ objects are mistakenly shared across threads. I ran into this in my personal testing by creating a single `device = hoomd.device.GPU()`, instantiating simulations with it, and then sending those simulations to a bunch of threads. This led to random segfaults in the internal `CachedAllocator` when trying to allocate a temporary buffer. It was easily fixed by attaching a distinct device object to each simulation, but this example highlights the possibility of misuse that this feature would allow.

There would certainly be some thorny bits in allowing the GIL to be released, but I think the freedom it gives to plugin authors greatly outweighs the chances of misuse. For example, I was able to quickly implement an efficient nudged elastic band plugin by sharing the particle data arrays (on host and device) across threads. Each simulation object was able to work independently until the `HalfStepHook` was reached, wait until all threads reached an atomic barrier, share positional information (without copying!) to modify the net force, wait at an atomic barrier once more, and then continue independently.
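A rough sketch of that barrier pattern (assuming C++20); the hook class, spans, and force update here are illustrative stand-ins, not the actual plugin:

```cpp
#include <barrier>
#include <cstddef>
#include <span>

// Hypothetical HalfStepHook-style callback run on each replica's thread.
class NEBHook {
public:
    NEBHook(std::barrier<>& sync, std::span<float> pos,
            std::span<const float> neighbor_pos, std::span<float> force)
        : m_sync(sync), m_pos(pos), m_neighbor_pos(neighbor_pos),
          m_force(force) {}

    // Called once per step, after each replica completes its half step.
    void update() {
        m_sync.arrive_and_wait(); // wait until every replica arrives

        // Borrow the neighbor's positions in place (no copy) and use
        // them to adjust this replica's net force.
        for (std::size_t i = 0; i < m_pos.size(); ++i)
            m_force[i] += m_neighbor_pos[i] - m_pos[i];

        m_sync.arrive_and_wait(); // everyone done reading; continue
    }

private:
    std::barrier<>& m_sync;
    std::span<float> m_pos;
    std::span<const float> m_neighbor_pos;
    std::span<float> m_force;
};
```

Each replica's thread constructs the hook over shared position buffers with one `std::barrier<> sync(n_replicas)`, so no positional data is copied between simulations.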
Proposed solution

The implementation of the release itself is simple. Knowing whether there are any parts of the C++ code that require the GIL (like writing to stdout without introducing formatting errors) is much harder, and I'm not 100% familiar with the entire code base.
I think the unsafety that this feature introduces can be mitigated by keeping it experimental and/or warning developers that undefined behavior can easily be triggered if they are not careful.
Some undefined behavior can be mitigated by (easiest) ensuring in Python that objects are attached to at most one simulation, or (hardest) rewriting the implementations to be thread-safe. For example, the current implementation of `hoomd.Device` is able to be shared by multiple simulations, but it is not internally thread-safe, and internal C++ classes like the `CachedAllocator` were written to be called by one thread at a time.

One other detail: for useful information to be shared across simulation threads (like positions, velocities, etc.), the access logic for `GlobalArray` should be updated to (at a minimum) allow multiple threads to borrow immutably if the array is not already borrowed mutably by the owning thread. This access could also be made atomic, but that's not totally necessary. Other concurrency mechanisms (like barriers) can be used by plugin authors to ensure safety is maintained.
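A minimal sketch of that borrow discipline using a readers-writer lock (the `SharedArray` name and structure are assumptions for illustration; `GlobalArray`'s real access logic is more involved):

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Any number of threads may borrow immutably at once; a mutable
// borrow is exclusive, mirroring the "many readers or one writer" rule.
template <typename T>
class SharedArray {
public:
    explicit SharedArray(std::size_t n) : m_data(n) {}

    template <typename F>
    void read(F&& f) const {
        std::shared_lock<std::shared_mutex> lock(m_mutex); // shared borrow
        f(m_data); // f sees a const reference to the data
    }

    template <typename F>
    void write(F&& f) {
        std::unique_lock<std::shared_mutex> lock(m_mutex); // exclusive borrow
        f(m_data);
    }

private:
    mutable std::shared_mutex m_mutex;
    std::vector<T> m_data;
};
```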