
[FEA] A pool memory resource backed by virtual memory #1109

Open
shwina opened this issue Sep 28, 2022 · 7 comments
Labels: ? - Needs Triage, feature request

Comments

@shwina (Contributor) commented Sep 28, 2022

A common issue with the current rmm::pool_memory_resource is fragmentation. Is it possible to provide a pool memory resource that is backed by a virtual address space to hide fragmentation?

@madsbk is experimenting with this using the RMM Python API here. It would be great if this could eventually be upstreamed to C++.

@shwina added the feature request and ? - Needs Triage labels on Sep 28, 2022
@madsbk (Member) commented Sep 28, 2022

I will be happy to port rapidsai/dask-cuda#998 to C++ when we get it to work.
Currently, @pentschev and I are working on UCX support.

@harrism (Member) commented Sep 28, 2022

It's something we have discussed. However, I wonder whether the MR we already have that uses cudaMallocAsync would solve the same problem. I believe cudaMallocAsync is backed by the same virtual memory APIs and should do a pretty good job of minimizing fragmentation.

https://github.com/rapidsai/rmm/blob/branch-22.10/include/rmm/mr/device/cuda_async_memory_resource.hpp
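
For reference, a minimal sketch of using that resource via RMM's C++ API (the buffer size and stream here are illustrative):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // Pool managed by the driver via cudaMallocAsync; the driver builds this
  // pool on the CUDA virtual memory APIs, which is what helps with fragmentation.
  rmm::mr::cuda_async_memory_resource mr{};

  // Route all RMM allocations on the current device through it.
  rmm::mr::set_current_device_resource(&mr);

  // This 1 MiB buffer now comes from the driver-managed async pool.
  rmm::device_buffer buf{1 << 20, rmm::cuda_stream_default};
  return 0;
}
```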

@madsbk (Member) commented Sep 30, 2022

We have considered cudaMallocAsync, but the problem is UCX support, which requires lower-level control of the physical-to-virtual memory mapping.
As far as we can tell, UCX requires:

  • Once used in communication, physical memory should never be freed.
  • Once used in communication, mapped virtual addresses should never be unmapped.

We are currently trying an approach where our RMM resource maintains a pool of physical memory blocks that are mapped to virtual addresses at mr.allocate(), such that the user sees one contiguous memory allocation.
To support UCX, we split a user allocation back into its underlying physical memory blocks and translate UCX operations into a series of operations on those blocks.
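
For readers unfamiliar with the mechanism, a minimal sketch of that mapping step using the CUDA driver's virtual memory management API (error handling, granularity checks, and the pooling logic are elided; this illustrates the technique, not the dask-cuda implementation):

```cpp
#include <cuda.h>
#include <cstddef>
#include <vector>

// Create one physical block; done once per block, when the pool grows.
// block_size must be a multiple of the granularity reported by
// cuMemGetAllocationGranularity for these properties.
CUmemGenericAllocationHandle make_block(std::size_t block_size, int device)
{
  CUmemAllocationProp prop{};
  prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id   = device;

  CUmemGenericAllocationHandle handle{};
  cuMemCreate(&handle, block_size, &prop, 0);
  return handle;
}

// At mr.allocate(): stitch pooled physical blocks into one contiguous
// virtual address range, so the caller sees a single allocation. The
// physical blocks themselves live in the pool and are never freed.
CUdeviceptr map_blocks(const std::vector<CUmemGenericAllocationHandle>& blocks,
                       std::size_t block_size, int device)
{
  CUdeviceptr va{};
  std::size_t total = blocks.size() * block_size;

  // Reserve a contiguous virtual address range covering all blocks.
  cuMemAddressReserve(&va, total, 0 /*alignment*/, 0 /*addr hint*/, 0);

  // Back each slice of the range with one physical block.
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    cuMemMap(va + i * block_size, block_size, 0 /*offset*/, blocks[i], 0);
  }

  // Enable read/write access to the whole range from the device.
  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id   = device;
  access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(va, total, &access, 1);

  return va;
}
```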

@revans2 (Contributor) commented Sep 30, 2022

On the Spark side we use bounce buffers for UCX communication. Generally, RMM does not play nicely with UCX, so we allocate several bounce buffers using regular cudaMalloc that we can use to send/receive data. Having several of them allows us to be filling one while another is being sent, and so on. I know it is not zero-copy, but the performance impact is relatively small. @abellina might be able to comment more about how we tuned it all and the sizes, but it turned out that the size needed was not that big.

@abellina (Contributor) commented Sep 30, 2022

We normally use 4MB buffers, and we set aside up to ~500MB for these buffers (except for T4s, where we normally set aside ~200MiB). The main benefit of bounce buffers is that they let us work around BAR issues (we don't have to worry about 256MB BAR spaces; we hardcode how much BAR we will ever register), and they let us keep a regular cudaMallocAsync pool for the remainder of our app (we can copy from GPU memory allocated in the pool to a bounce buffer). Additionally, memory registration (ibv_reg_mr, and opening of IPC mem handles) happens once, early during startup, not at task time.

Another benefit is that we can use really fast D2D copies to pack these buffers with what would have been many small calls to send/recv. All of this requires a metadata layer that defines what is in a buffer, so we do have to send that metadata message ahead of our actual message.

The main drawback is that we now have a hard upper limit on the number of bytes in flight, plus the extra D2D copies and the complexity of managing all this.
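
For concreteness, a simplified sketch of such a bounce-buffer ring (the buffer count, completion signaling, and transport hand-off are illustrative, not the actual Spark plugin code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

// A fixed ring of bounce buffers allocated once at startup (and registered
// once with the transport, e.g. via ibv_reg_mr). Small device buffers are
// packed into one staging buffer with fast D2D copies, turning many small
// send/recv calls into one large transfer.
class BounceBufferRing {
 public:
  static constexpr std::size_t kBufSize = 4 << 20;  // 4MB, as described above

  explicit BounceBufferRing(std::size_t count) : bufs_(count), free_(count) {
    for (std::size_t i = 0; i < count; ++i) {
      cudaMalloc(&bufs_[i], kBufSize);
      cudaEventCreateWithFlags(&free_[i], cudaEventDisableTiming);
      cudaEventRecord(free_[i]);  // all buffers start out free
    }
  }

  // Pack `srcs` (ptr, size pairs) into the next buffer on `stream`. The
  // transport re-records free_[i] when it is done with the buffer
  // (completion hookup not shown).
  void* pack(const std::vector<std::pair<const void*, std::size_t>>& srcs,
             cudaStream_t stream) {
    std::size_t i = next_;
    next_ = (next_ + 1) % bufs_.size();
    cudaEventSynchronize(free_[i]);  // wait until the transport released it

    auto* dst = static_cast<char*>(bufs_[i]);
    std::size_t off = 0;
    for (const auto& [ptr, n] : srcs) {
      cudaMemcpyAsync(dst + off, ptr, n, cudaMemcpyDeviceToDevice, stream);
      off += n;
    }
    return dst;  // caller sends [dst, dst + off) plus a metadata message
  }

 private:
  std::vector<void*> bufs_;
  std::vector<cudaEvent_t> free_;
  std::size_t next_ = 0;
};
```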

@pentschev (Member)

Thanks @revans2 and @abellina for the comments. That was indeed one of our ideas: just use a bounce buffer for UCX, and the remaining memory for our regular pool. It is something we could explore more, and it is still on @madsbk's and my TODO list.

As a more complex but potentially more performant solution, Mads and I were prototyping a pool of small, physical-memory-backed blocks that we hand out on request, either by delivering a piece of one block when the allocation is smaller, or by combining multiple blocks for larger allocations. This way it seems possible to avoid a bounce buffer entirely, given that we could pre-register all of the allocations (except on devices with small BAR sizes) and, for now, let the application (e.g., UCX-Py) deal with a buffer that spans multiple blocks by transferring it in multiple steps.
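
To illustrate that last point, the splitting step might look roughly like this (the pool layout, `base`, and the per-block `send_block` hook are hypothetical names, not UCX-Py API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

// Walk a user allocation block by block, so each underlying physical block
// (registered independently with the transport) can be sent in its own step.
// `base` is the start of the pool's VA range; `send_block` is a hypothetical
// per-block transfer hook.
void transfer_in_blocks(std::uintptr_t base, const void* ptr, std::size_t size,
                        std::size_t block_size,
                        const std::function<void(const void*, std::size_t)>& send_block)
{
  auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  std::size_t head = (addr - base) % block_size;  // offset of ptr within its block

  while (size > 0) {
    std::size_t chunk = std::min(block_size - head, size);
    send_block(reinterpret_cast<const void*>(addr), chunk);
    addr += chunk;
    size -= chunk;
    head = 0;  // later pieces start on block boundaries
  }
}
```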

@abellina (Contributor)

I am not familiar with the APIs to get physical memory, so I can't comment on that at this point, but I would think that if you can allocate from a different pool for data that is destined for UCX from the beginning, that is clearly better. It removes the D2D copy, and potentially brings more benefit with the physical addresses (cheaper to register?).

In Spark, we use UCX much later after the original results are produced. We cache a bunch of GPU buffers that will eventually be transferred. This means we want to get access to the whole GPU, because if it fits, it's really great. If it doesn't fit, we have to spill to host memory or disk. Regardless, the UCX transfer happens after all of the writes have completed, so at that point we would need to send from random addresses on the GPU (assuming it all fit), or copy to the bounce buffer and then send.
