[FEA] A pool memory resource backed by virtual memory #1109
Comments
I will be happy to port rapidsai/dask-cuda#998 to C++ when we get it to work.
It's something we have discussed. However, I wonder if just using the MR we have that uses …
We have considered …
We are currently trying an approach where our RMM resource maintains a pool of physical memory blocks that are mapped to virtual addresses at …
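For reference, here is a minimal sketch of what that kind of mapping looks like with the CUDA virtual memory management driver APIs (cuMemAddressReserve / cuMemCreate / cuMemMap). This is only an illustration of the underlying mechanism, not the resource being prototyped; device 0 is assumed and error handling is reduced to a placeholder macro.

```cpp
// Minimal illustration of mapping physical memory blocks into a reserved
// virtual address range with the CUDA virtual memory management driver APIs
// (CUDA 10.2+). Device 0 is assumed; error handling is reduced to a macro.
#include <cuda.h>
#include <cstddef>
#include <vector>

#define CU_CHECK(call)                                        \
  do {                                                        \
    CUresult status = (call);                                 \
    if (status != CUDA_SUCCESS) { /* handle/report error */ } \
  } while (0)

int main() {
  CU_CHECK(cuInit(0));
  CUcontext ctx{};
  CU_CHECK(cuDevicePrimaryCtxRetain(&ctx, 0));
  CU_CHECK(cuCtxSetCurrent(ctx));

  // Properties of the physical allocations (device-local, device 0).
  CUmemAllocationProp prop{};
  prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id   = 0;

  size_t granularity = 0;
  CU_CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                         CU_MEM_ALLOC_GRANULARITY_MINIMUM));

  size_t const block_size = granularity;  // one physical block per granule
  size_t const num_blocks = 4;
  size_t const va_size    = block_size * num_blocks;

  // Reserve a contiguous virtual address range big enough for all blocks.
  CUdeviceptr base = 0;
  CU_CHECK(cuMemAddressReserve(&base, va_size, 0 /*alignment*/, 0 /*addr hint*/, 0));

  // Create the physical blocks and map each one into the reserved range.
  std::vector<CUmemGenericAllocationHandle> handles(num_blocks);
  for (size_t i = 0; i < num_blocks; ++i) {
    CU_CHECK(cuMemCreate(&handles[i], block_size, &prop, 0));
    CU_CHECK(cuMemMap(base + i * block_size, block_size, 0, handles[i], 0));
  }

  // Enable read/write access; `base` now behaves like one contiguous allocation.
  CUmemAccessDesc access{};
  access.location = prop.location;
  access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CU_CHECK(cuMemSetAccess(base, va_size, &access, 1));

  // ... use the memory ...

  // Teardown: unmap, release the physical handles, free the VA reservation.
  CU_CHECK(cuMemUnmap(base, va_size));
  for (auto h : handles) { CU_CHECK(cuMemRelease(h)); }
  CU_CHECK(cuMemAddressFree(base, va_size));
  return 0;
}
```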
On the Spark side we use bounce buffers for UCX communication. Generally, RMM does not play nicely with UCX, so we allocate several bounce buffers using regular …
We normally use 4MB buffers, and we set aside up to ~500MB for these buffers (except for T4s, where we normally set aside ~200MiB). The main benefit of bounce buffers is that they allow us to work around BAR issues (we don't have to worry about 256MB BAR spaces, since we hardcode how much BAR we will ever register), and they also allow us to have a regular …

Another benefit is that we can use really fast D2D copies to pack these buffers with what would have been many small calls to send/recv. All of this requires a metadata layer that defines what is in a buffer, so we do have to send that metadata message ahead of our actual message.

The main drawback is that we now have a hard upper limit on the number of bytes in flight, and that we have extra D2D copies and complexity to manage.
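To make the scheme concrete, here is a rough sketch of a bounce-buffer pool and the D2D packing step described above. The names (`BounceBufferPool`, `pack`) and sizes are hypothetical, and error checking as well as the UCX registration/transfer calls are omitted.

```cpp
// Rough sketch of the bounce-buffer scheme: a small set of fixed-size buffers
// allocated with plain cudaMalloc, plus a packing step that coalesces many
// small device buffers into one bounce buffer using D2D copies.
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

struct BounceBufferPool {
  static constexpr std::size_t buffer_size = 4 << 20;  // 4MiB per bounce buffer

  explicit BounceBufferPool(std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
      void* p = nullptr;
      // Allocated outside the RMM pool so each buffer can be registered with
      // UCX once, which bounds how much BAR space is ever registered.
      cudaMalloc(&p, buffer_size);
      free_buffers.push_back(p);
    }
  }
  // (empty-pool handling elided)
  void* acquire()        { void* p = free_buffers.back(); free_buffers.pop_back(); return p; }
  void  release(void* p) { free_buffers.push_back(p); }

  std::vector<void*> free_buffers;
};

// Pack many small device buffers into one bounce buffer with fast D2D copies.
// The returned offsets are the metadata that is sent ahead of the payload.
std::vector<std::size_t> pack(void* bounce,
                              std::vector<std::pair<const void*, std::size_t>> const& srcs,
                              cudaStream_t stream) {
  std::vector<std::size_t> offsets;
  std::size_t offset = 0;
  for (auto const& [src, len] : srcs) {
    offsets.push_back(offset);
    cudaMemcpyAsync(static_cast<char*>(bounce) + offset, src, len,
                    cudaMemcpyDeviceToDevice, stream);
    offset += len;
  }
  return offsets;
}
```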
Thanks @revans2 and @abellina for the comments. That was indeed one of our ideas: just use a bounce buffer for UCX and the remaining memory for our regular pool. It is something we could explore more, and it is still on @madsbk's and my TODO list.

As a more complex but potentially more performant solution, Mads and I were prototyping a pool that uses small memory blocks backed by physical memory, which we can distribute to the user application upon request, either by delivering a piece of one of the blocks if the allocation is smaller, or by combining multiple blocks for larger allocations. In this way it seems possible to not use a bounce buffer at all, given that we could pre-register all of the allocations (except for devices with small BAR sizes) and, for now, let the application (e.g., UCX-Py) deal with a buffer that spans multiple blocks by transferring them in multiple steps.
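As a rough illustration of that idea (not the actual prototype), an allocation could either be served as a slice of a single block or assembled from several whole blocks, with the multi-block case handed back to the caller to transfer piece by piece. The `BlockPool` and `Allocation` names below are hypothetical.

```cpp
// Hypothetical sketch of the block-based idea: an allocation is served either
// as a slice of one free block, or as several whole blocks that the caller
// (e.g. UCX-Py) must transfer in multiple steps.
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

struct Block { void* ptr; std::size_t size; };

// The result may span several physically backed blocks.
struct Allocation { std::vector<Block> pieces; };

class BlockPool {
 public:
  explicit BlockPool(std::deque<Block> blocks) : free_{std::move(blocks)} {}

  Allocation allocate(std::size_t bytes) {
    Allocation out;
    if (!free_.empty() && bytes <= free_.front().size) {
      // Small request: deliver a piece of a single block
      // (remainder bookkeeping elided).
      Block b = free_.front(); free_.pop_front();
      out.pieces.push_back({b.ptr, bytes});
    } else {
      // Large request: combine whole blocks until the request is satisfied
      // (out-of-memory handling elided).
      std::size_t remaining = bytes;
      while (remaining > 0 && !free_.empty()) {
        Block b = free_.front(); free_.pop_front();
        std::size_t const take = remaining < b.size ? remaining : b.size;
        out.pieces.push_back({b.ptr, take});
        remaining -= take;
      }
    }
    return out;
  }

 private:
  std::deque<Block> free_;
};
```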
I am not familiar with the APIs to get physical memory, so I can't comment on that at this point, but I would think that if you can allocate from a different pool for data that is destined for UCX from the beginning, that is clearly better. It removes the D2D copy, and potentially brings more benefit with the physical addresses (cheaper to register?).

In Spark, we use UCX much later, after the original results are produced. We cache a bunch of GPU buffers that will eventually be transferred. This means we want access to the whole GPU, because if everything fits, it's really great; if it doesn't fit, we have to spill to host memory or disk. Regardless, the UCX transfer happens after all of the writes have completed, so at that point we would need to send from random addresses on the GPU (assuming it all fit), or copy to the bounce buffer and then send.
A common issue with the current `rmm::pool_memory_resource` is fragmentation. Is it possible to provide a pool memory resource that is backed by a virtual address space to hide fragmentation? @madsbk is experimenting with this using the RMM Python API here. It would be great if this could eventually be upstreamed to C++.
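A minimal sketch of how such a resource could hide fragmentation, assuming the CUDA virtual memory management APIs (CUDA 10.2+): reserve one large virtual address range up front and back it with physical chunks only as the pool grows, so the usable range always stays contiguous. The class below is hypothetical, assumes a current CUDA context, and leaves out deallocation, stream handling, error checking, and the actual `rmm::mr::device_memory_resource` interface.

```cpp
// Hypothetical sketch (not an existing RMM class): reserve a large virtual
// address range once, then commit physical chunks at the end of the mapped
// region on demand. All mapped memory stays contiguous in the reservation,
// which is what hides fragmentation from users of the pool.
#include <cuda.h>
#include <cstddef>
#include <vector>

class virtual_memory_pool {
 public:
  virtual_memory_pool(int device, std::size_t max_size) {
    prop_.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop_.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop_.location.id   = device;
    cuMemGetAllocationGranularity(&granularity_, &prop_,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    reserved_ = align_up(max_size, granularity_);
    // Reserve address space only; no physical memory is committed yet.
    cuMemAddressReserve(&base_, reserved_, 0, 0, 0);
  }

  // Commit another physical chunk at the end of the mapped region.
  void grow(std::size_t bytes) {
    std::size_t const sz = align_up(bytes, granularity_);
    CUmemGenericAllocationHandle handle{};
    cuMemCreate(&handle, sz, &prop_, 0);
    cuMemMap(base_ + mapped_, sz, 0, handle, 0);
    CUmemAccessDesc access{};
    access.location = prop_.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base_ + mapped_, sz, &access, 1);
    handles_.push_back(handle);  // kept so chunks can be unmapped/released later
    mapped_ += sz;
  }

  void* base() const { return reinterpret_cast<void*>(base_); }
  std::size_t mapped() const { return mapped_; }

 private:
  static std::size_t align_up(std::size_t v, std::size_t g) { return (v + g - 1) / g * g; }

  CUmemAllocationProp prop_{};
  std::size_t granularity_{};
  CUdeviceptr base_{};
  std::size_t reserved_{};
  std::size_t mapped_{};  // portion of the reservation backed by physical memory
  std::vector<CUmemGenericAllocationHandle> handles_{};
};
```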