
[FEA] A pool memory resource backed by virtual memory #1109

Open
shwina opened this issue Sep 28, 2022 · 7 comments
Labels: ? - Needs Triage, feature request

Comments

@shwina (Contributor) commented Sep 28, 2022

A common issue with the current rmm::pool_memory_resource is fragmentation. Is it possible to provide a pool memory resource that is backed by a virtual address space to hide fragmentation?

@madsbk is experimenting with this using the RMM Python API here. It would be great if this could eventually be upstreamed to C++.

@shwina added the feature request and ? - Needs Triage labels on Sep 28, 2022
@madsbk (Member) commented Sep 28, 2022

I will be happy to port rapidsai/dask-cuda#998 to C++ when we get it to work.
Currently, @pentschev and I are working on UCX support.

@harrism (Member) commented Sep 28, 2022

It's something we have discussed. However, I wonder whether the MR we already have that uses cudaMallocAsync would solve the same problem. I believe cudaMallocAsync is backed by the same virtual memory APIs and should do a pretty good job of minimizing fragmentation.

https://github.com/rapidsai/rmm/blob/branch-22.10/include/rmm/mr/device/cuda_async_memory_resource.hpp
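
For reference, a minimal sketch of using that resource via RMM's C++ API (the buffer size and stream here are illustrative):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // Pool managed by the driver via cudaMallocAsync; the driver builds this
  // pool on the CUDA virtual memory APIs, which is what helps with fragmentation.
  rmm::mr::cuda_async_memory_resource mr{};

  // Route all RMM allocations on the current device through it.
  rmm::mr::set_current_device_resource(&mr);

  // This 1 MiB buffer now comes from the driver-managed async pool.
  rmm::device_buffer buf{1 << 20, rmm::cuda_stream_default};
  return 0;
}
```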

@madsbk (Member) commented Sep 30, 2022

We have considered cudaMallocAsync, but the problem is UCX support, which requires lower-level control of the physical-to-virtual memory mapping.
As far as we can tell, UCX requires:

  • Once used in communication, physical memory should never be freed.
  • Once used in communication, mapped virtual addresses should never be unmapped.

We are currently trying an approach where our RMM resource maintains a pool of physical memory blocks that are mapped to virtual addresses at mr.allocate(), such that the user sees one contiguous memory allocation.
To support UCX, we split a user allocation back into its underlying physical memory blocks and translate UCX operations into a series of operations on those blocks.
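
For readers unfamiliar with the mechanism, a minimal sketch of that mapping step using the CUDA driver's virtual memory management API (error handling, granularity checks, and the pooling logic are elided; this illustrates the technique, not the dask-cuda implementation):

```cpp
#include <cuda.h>
#include <cstddef>
#include <vector>

// Create one physical block; done once per block, when the pool grows.
// block_size must be a multiple of the granularity reported by
// cuMemGetAllocationGranularity for these properties.
CUmemGenericAllocationHandle make_block(std::size_t block_size, int device)
{
  CUmemAllocationProp prop{};
  prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id   = device;

  CUmemGenericAllocationHandle handle{};
  cuMemCreate(&handle, block_size, &prop, 0);
  return handle;
}

// At mr.allocate(): stitch pooled physical blocks into one contiguous
// virtual address range, so the caller sees a single allocation. The
// physical blocks themselves live in the pool and are never freed.
CUdeviceptr map_blocks(const std::vector<CUmemGenericAllocationHandle>& blocks,
                       std::size_t block_size, int device)
{
  CUdeviceptr va{};
  std::size_t total = blocks.size() * block_size;

  // Reserve a contiguous virtual address range covering all blocks.
  cuMemAddressReserve(&va, total, 0 /*alignment*/, 0 /*addr hint*/, 0);

  // Back each slice of the range with one physical block.
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    cuMemMap(va + i * block_size, block_size, 0 /*offset*/, blocks[i], 0);
  }

  // Enable read/write access to the whole range from the device.
  CUmemAccessDesc access{};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id   = device;
  access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(va, total, &access, 1);

  return va;
}
```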

@revans2 (Contributor) commented Sep 30, 2022

On the Spark side we use bounce buffers for UCX communication. Generally, RMM does not play nicely with UCX, so we allocate several bounce buffers using regular cudaMalloc that we can use to send/receive data. Having several of them allows us to be filling one while another is being sent, and so on. I know it is not zero-copy, but the performance impact is relatively small. @abellina might be able to comment more about how we tuned it all and the sizes, but it turned out that the size needed was not that big.

@abellina (Contributor) commented Sep 30, 2022

We normally use 4MB buffers, and we set aside up to ~500MB for these buffers (except for T4s, where we normally set aside ~200MiB). The main benefit of bounce buffers is that they let us work around BAR issues (we don't have to worry about 256MB BAR spaces; we hardcode how much BAR we will ever register), and they let us keep a regular cudaMallocAsync pool for the remainder of our app (we can copy from GPU memory allocated in the pool to a bounce buffer). Additionally, memory registration (ibv_reg_mr, and opening of IPC mem handles) happens once, early during startup, not at task time.

Another benefit is that we can use really fast D2D copies to pack these buffers with what would have been many small calls to send/recv. All of this requires a metadata layer that defines what is in a buffer, so we do have to send that metadata message ahead of our actual message.

The main drawback is that we now have a hard upper limit on the number of bytes in flight, plus the extra D2D copies and the complexity of managing all this.
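
For concreteness, a simplified sketch of such a bounce-buffer ring (the buffer count, completion signaling, and transport hand-off are illustrative, not the actual Spark plugin code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

// A fixed ring of bounce buffers allocated once at startup (and registered
// once with the transport, e.g. via ibv_reg_mr). Small device buffers are
// packed into one staging buffer with fast D2D copies, turning many small
// send/recv calls into one large transfer.
class BounceBufferRing {
 public:
  static constexpr std::size_t kBufSize = 4 << 20;  // 4MB, as described above

  explicit BounceBufferRing(std::size_t count) : bufs_(count), free_(count) {
    for (std::size_t i = 0; i < count; ++i) {
      cudaMalloc(&bufs_[i], kBufSize);
      cudaEventCreateWithFlags(&free_[i], cudaEventDisableTiming);
      cudaEventRecord(free_[i]);  // all buffers start out free
    }
  }

  // Pack `srcs` (ptr, size pairs) into the next buffer on `stream`. The
  // transport re-records free_[i] when it is done with the buffer
  // (completion hookup not shown).
  void* pack(const std::vector<std::pair<const void*, std::size_t>>& srcs,
             cudaStream_t stream) {
    std::size_t i = next_;
    next_ = (next_ + 1) % bufs_.size();
    cudaEventSynchronize(free_[i]);  // wait until the transport released it

    auto* dst = static_cast<char*>(bufs_[i]);
    std::size_t off = 0;
    for (const auto& [ptr, n] : srcs) {
      cudaMemcpyAsync(dst + off, ptr, n, cudaMemcpyDeviceToDevice, stream);
      off += n;
    }
    return dst;  // caller sends [dst, dst + off) plus a metadata message
  }

 private:
  std::vector<void*> bufs_;
  std::vector<cudaEvent_t> free_;
  std::size_t next_ = 0;
};
```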

@pentschev (Member)

Thanks @revans2 and @abellina for the comments. That was indeed one of our ideas: just use a bounce buffer for UCX, and the remaining memory for our regular pool. It is something we could explore more, and it is still on @madsbk's and my TODO list.

As a more complex but potentially more performant solution, Mads and I were prototyping a pool of small, physical-memory-backed blocks that we hand out on request, either by delivering a piece of one block when the allocation is smaller, or by combining multiple blocks for larger allocations. This way it seems possible to avoid a bounce buffer entirely, given that we could pre-register all of the allocations (except on devices with small BAR sizes) and, for now, let the application (e.g., UCX-Py) deal with a buffer that spans multiple blocks by transferring it in multiple steps.
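
To illustrate that last point, the splitting step might look roughly like this (the pool layout, `base`, and the per-block `send_block` hook are hypothetical names, not UCX-Py API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

// Walk a user allocation block by block, so each underlying physical block
// (registered independently with the transport) can be sent in its own step.
// `base` is the start of the pool's VA range; `send_block` is a hypothetical
// per-block transfer hook.
void transfer_in_blocks(std::uintptr_t base, const void* ptr, std::size_t size,
                        std::size_t block_size,
                        const std::function<void(const void*, std::size_t)>& send_block)
{
  auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  std::size_t head = (addr - base) % block_size;  // offset of ptr within its block

  while (size > 0) {
    std::size_t chunk = std::min(block_size - head, size);
    send_block(reinterpret_cast<const void*>(addr), chunk);
    addr += chunk;
    size -= chunk;
    head = 0;  // later pieces start on block boundaries
  }
}
```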

@abellina (Contributor)

I am not familiar with the APIs to get physical memory, so I can't comment on that at this point, but I would think that if you can allocate from a different pool for data that is destined for UCX from the beginning, that is clearly better. It removes the D2D copy, and potentially brings more benefit with the physical addresses (cheaper to register?).

In Spark, we use UCX much later after the original results are produced. We cache a bunch of GPU buffers that will eventually be transferred. This means we want to get access to the whole GPU, because if it fits, it's really great. If it doesn't fit, we have to spill to host memory or disk. Regardless, the UCX transfer happens after all of the writes have completed, so at that point we would need to send from random addresses on the GPU (assuming it all fit), or copy to the bounce buffer and then send.
