
Support pinned memory pooling #7

Open
revans2 opened this issue Apr 27, 2021 · 4 comments

@revans2

revans2 commented Apr 27, 2021

I read your article at https://pingcap.com/blog/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times

Really great work.

I work on a related project, https://github.com/NVIDIA/spark-rapids/

In the article you mentioned that you cannot get above 3 GiB/s on PCIe transfers. This is probably because you are not using pinned memory for the transfers. For the Java API we added a pinned memory pool similar to RMM's device pool. It is not a great implementation, and rapidsai/rmm#618 tracks having RMM handle pinned memory itself at some point. The pinned pool lets us get about 12-13 GiB/s on PCIe gen 3.

Be aware that it is rather expensive to allocate the pinned memory, which is why we pool it. If you really have a 1:1 ratio of data transfer to CPU compute, this could be a really big win.
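For anyone following along, here is a minimal, self-contained CUDA sketch of the effect (my illustration, not the spark-rapids pool itself; the 256 MiB buffer size and the timing harness are arbitrary choices). It times the same host-to-device copy from a pageable buffer and from a pinned buffer:

```cpp
// Sketch: compare H2D copy bandwidth from pageable vs. pinned host memory.
// Build with e.g.: nvcc -O2 pinned_vs_pageable.cu -o pinned_vs_pageable
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static double h2d_gib_per_s(void* dst, const void* src, size_t bytes, cudaStream_t stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, stream);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return (bytes / (ms / 1000.0)) / double(1ull << 30);
}

int main() {
  const size_t bytes = 256ull << 20;  // 256 MiB, arbitrary
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  void* dev;
  cudaMalloc(&dev, bytes);

  // Pageable host memory: the driver stages the copy through its own small
  // pinned bounce buffer, so effective PCIe bandwidth is much lower.
  void* pageable = malloc(bytes);
  memset(pageable, 1, bytes);
  printf("pageable: %.1f GiB/s\n", h2d_gib_per_s(dev, pageable, bytes, stream));

  // Pinned (page-locked) host memory: the GPU can DMA from it directly.
  // cudaMallocHost itself is expensive, which is why pooling and reusing
  // these buffers pays off.
  void* pinned;
  cudaMallocHost(&pinned, bytes);
  memset(pinned, 1, bytes);
  printf("pinned:   %.1f GiB/s\n", h2d_gib_per_s(dev, pinned, bytes, stream));

  cudaFreeHost(pinned);
  free(pageable);
  cudaFree(dev);
  cudaStreamDestroy(stream);
  return 0;
}
```

On a PCIe gen 3 x16 system the pinned number should land near the 12-13 GiB/s figure quoted above, while the pageable one typically comes out several times lower.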

@zanmato1984
Owner

Hi Bobby, thank you for the comments. That's very helpful!

Yes I believe a pinned memory pool could definitely be a big win. I'll keep watching the progress of rmm's implementation.

Besides the expensiveness of pinned memory allocation that you mentioned, a pool also matters a lot for our pipeline model. We intend to achieve higher GPU utilization by running multiple CPU threads concurrently, each executing a pipeline of cuDF calls. IIRC, a raw cudaMalloc or cudaFree call forces a device-wide synchronization and thus kills the potential parallelism among the GPU jobs submitted by these concurrent CPU threads.
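(For readers of this thread, a rough sketch of what the pooled setup looks like with RMM's C++ API; exact headers and constructor arguments have shifted between RMM releases, so treat the details below as approximate rather than as our actual configuration.)

```cpp
// Sketch: route device allocations through an RMM pool so steady-state
// allocate/free calls are served from the pool instead of hitting raw
// cudaMalloc/cudaFree and their implicit device-wide synchronization.
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
  // Upstream resource that ultimately calls cudaMalloc/cudaFree.
  rmm::mr::cuda_memory_resource cuda_mr;

  // Carve out an initial pool (1 GiB here, arbitrary) with one upfront
  // allocation; later requests are sub-allocated without device syncs.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
      &cuda_mr, 1ull << 30};

  // Make it the default resource so cuDF calls (and the buffer below) use it.
  rmm::mr::set_current_device_resource(&pool_mr);

  // This allocation now comes out of the pool, not a raw cudaMalloc.
  rmm::device_buffer buf(64ull << 20, rmm::cuda_stream_default);
  return 0;
}
```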

@revans2
Author

revans2 commented Apr 28, 2021

Having multiple threads helps, especially for larger devices that have enough memory and compute to do multiple things at once. I am not sure how deep you can get your pipelined calls to be. For a lot of operations in cuDF (like aggregations and joins) the output size is not known ahead of time. This means they have to do a calculation on the GPU, pull a result back to the CPU, allocate memory on the GPU, and finally compute the answer. This can result in a lot of synchronization. This is one of the main reasons we put in per-thread default streams, so each thread is not stuck waiting on the other threads.

Also, in CUDA an asyncMemCopy is not actually asynchronous unless both sides are either on the GPU or against pinned memory. If you use non-pinned memory the CUDA driver will use a small pinned bounce buffer to do the transfer (because the GPU can only access pinned host memory directly). This is why the transfers are slower when not using pinned memory, but it also means that it is harder to get a deep pipeline, because the thread cannot start to issue other calls while doing the transfer.
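To make that last point concrete, here is a tiny sketch (illustrative only, not spark-rapids code; the 512 MiB size is arbitrary) that measures how long the calling CPU thread stays stuck inside cudaMemcpyAsync for a pageable versus a pinned source buffer:

```cpp
// Sketch: measure how long the *host thread* is blocked inside cudaMemcpyAsync.
// With a pinned source the call returns almost immediately; with a pageable
// source the driver keeps the CPU busy staging through its bounce buffer, so
// the thread cannot go on issuing more work and the pipeline stays shallow.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static double host_blocked_ms(void* dst, const void* src, size_t bytes, cudaStream_t s) {
  auto t0 = std::chrono::steady_clock::now();
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, s);
  auto t1 = std::chrono::steady_clock::now();  // time until control returns to the CPU
  cudaStreamSynchronize(s);                    // drain the copy before reusing the buffer
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  const size_t bytes = 512ull << 20;  // 512 MiB, arbitrary
  cudaStream_t s;
  cudaStreamCreate(&s);
  void* dev;
  cudaMalloc(&dev, bytes);

  void* pageable = malloc(bytes);
  memset(pageable, 1, bytes);
  void* pinned;
  cudaMallocHost(&pinned, bytes);
  memset(pinned, 1, bytes);

  printf("host blocked, pageable src: %.2f ms\n", host_blocked_ms(dev, pageable, bytes, s));
  printf("host blocked, pinned src:   %.2f ms\n", host_blocked_ms(dev, pinned, bytes, s));

  cudaFreeHost(pinned);
  free(pageable);
  cudaFree(dev);
  cudaStreamDestroy(s);
  return 0;
}
```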

I think there is a lot of overlap between what you are trying to do here and what we have done with our plugin to Apache Spark. Let me know if you have any questions or just want to meet up some time to talk.

@zanmato1984
Owner

I am not sure how deep you can get your pipelined calls to be.

Since our pipeline model aggressively pipes as many cuDF calls as possible, a pipeline could be very deep. Consider an N-way join: the optimizer may decide to build N-1 hash tables and do a chained probe against each of them. The pipeline executing this chained probe will then make N-1 consecutive calls to cuDF's hash table probe function.

This can result in a lot of synchronization. This is one of the main reasons we put in per-thread default streams, so each thread is not stuck waiting on the other threads.

Yes, we observed a lot of these synchronizations as well, and we used PTDS too. Our story is: when we first enabled PTDS, we saw that the GPU tasks from different CPU threads were decoupled as expected. But we were still using cuda_memory_resource back then, so the device syncs introduced by the raw cudaMalloc and cudaFree APIs were still in the way. It only got better after we switched to a pooled memory resource. My point is just that both PTDS and a pooled memory resource are critical to achieving the utilization level we are expecting.
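(For completeness, the PTDS half of that can be sketched like this; the kernel and sizes are made up, and the raw cudaMalloc/cudaFree below are exactly the calls a pooled memory resource would take out of the picture.)

```cuda
// Sketch: two host threads each launching work on their own per-thread
// default stream. Build with: nvcc --default-stream per-thread ptds_demo.cu
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busy_kernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float x = data[i];
    for (int k = 0; k < 1000; ++k) x = x * 1.000001f + 0.5f;
    data[i] = x;
  }
}

void worker() {
  const int n = 1 << 20;
  float* d = nullptr;
  // NOTE: with cuda_memory_resource this is the raw cudaMalloc/cudaFree whose
  // implicit device sync serializes the threads; a pooled resource avoids it.
  cudaMalloc(&d, n * sizeof(float));
  // No explicit stream: under --default-stream per-thread this launches on
  // this thread's own default stream, so the two workers can overlap.
  busy_kernel<<<(n + 255) / 256, 256>>>(d, n);
  cudaStreamSynchronize(cudaStreamPerThread);
  cudaFree(d);
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 2; ++t) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  return 0;
}
```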

Also, in CUDA an asyncMemCopy is not actually asynchronous unless both sides are either on the GPU or against pinned memory.

I mean, WOW, this might be the answer to a question we haven't been able to figure out. If I understand correctly, a copy to/from unpinned memory still requires the CPU's involvement to move the data between the "bounce buffer" and the real host memory? We expect the CPU to move on immediately after making an asyncMemCopy call, but it somehow gets busy managing the implicit bounce buffer? If that's the case, it probably explains why we weren't getting high enough GPU utilization.

I think there is a lot of overlap between what you are trying to do here and what we have done with our plugin to Apache Spark. Let me know if you have any questions or just want to meet up some time to talk.

Yes, I can imagine that we have a lot of work in common and that you must have dealt with many problems I haven't even run into yet. Your help is really appreciated. Thanks a lot, Bobby!

@revans2
Author

revans2 commented Apr 29, 2021

If I understand correctly, a copy to/from unpinned memory still requires the CPU's involvement to move the data between the "bounce buffer" and the real host memory?

Yes that is 100% correct.

The documentation is not clear about this at all, and it might not be that way for all systems, but from what we have tested it is the case for x86 servers running Linux.
