Support pinned memory pooling #7
Hi Bobby, thank you for the comments. That's very helpful! Yes, I believe a pinned memory pool could definitely be a big win. I'll keep watching the progress of rmm's implementation. Besides what you've mentioned about the expense of pinned memory allocation, a pool is also significant for our pipeline model: we intend to achieve higher GPU utilization by concurrently executing multiple CPU threads, each running a pipeline of cuDF calls. IIRC, a raw
Having multiple threads helps, especially for larger devices that have enough memory and compute to do multiple things at once. I am not sure how deep you can get your pipelined calls to be. For a lot of operations in cuDF (like aggregations and joins) the output size is not known ahead of time. This means they have to do a calculation on the GPU, pull a result back to the CPU, allocate memory on the GPU, and finally compute the answer. This can result in a lot of synchronization. This is one of the main reasons we put in per-thread default streams, so each thread is not stuck waiting on the other threads. Also, in CUDA an asynchronous copy to or from unpinned host memory is not fully asynchronous; the CPU still has to move the data through a pinned bounce buffer. I think there is a lot of overlap between what you are trying to do here and what we have done with our plugin for Apache Spark. Let me know if you have any questions or just want to meet up some time to talk.
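To make the per-thread default stream point concrete, here is a minimal CUDA sketch (not from the thread; the kernel, thread count, and sizes are made up for illustration). Built with `nvcc --default-stream per-thread`, each host thread's launches go to that thread's own default stream instead of serializing on the single legacy default stream:

```cpp
// Minimal PTDS sketch (illustrative only; kernel and sizes are made up).
// Build with: nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// With per-thread default streams, work submitted by different host
// threads goes on different streams instead of serializing on the one
// legacy default stream.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

void worker(int n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // No explicit stream: with PTDS this lands on this thread's own
    // default stream, so kernels from different workers can overlap.
    busy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaStreamSynchronize(cudaStreamPerThread);
    cudaFree(d);
}

int main() {
    const int n = 1 << 20;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) workers.emplace_back(worker, n);
    for (auto& w : workers) w.join();
    return 0;
}
```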
Since our pipeline model aggressively pipelines as many cuDF calls as possible, a pipeline can be very deep. Consider an N-way join: the optimizer may decide to compute N-1 hash tables and do a chained probe against each of them. The pipeline executing this chained probe then makes N-1 chained calls to cuDF's hash table probe function.
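As a rough illustration of that shape, here is a CPU-only C++ sketch (not cuDF code; table contents and keys are hypothetical) showing why each probe row flows through N-1 chained probe calls:

```cpp
// CPU-only sketch of the chained-probe shape of an N-way equi-join
// (not cuDF code): build N-1 hash tables once, then each probe row
// flows through N-1 chained probe calls before producing output.
#include <unordered_map>
#include <vector>
#include <cstdio>

using HashTable = std::unordered_multimap<int, int>;  // key -> payload

int main() {
    // Build side: N-1 = 2 hash tables (hypothetical data).
    std::vector<HashTable> builds(2);
    builds[0] = {{1, 10}, {2, 20}};
    builds[1] = {{10, 100}, {20, 200}};

    // Probe side: the driving table.
    std::vector<int> probe_keys = {1, 2, 3};

    // The "pipeline": each probe row is pushed through every hash
    // table in turn, which is why the call chain is N-1 deep.
    for (int key : probe_keys) {
        auto r0 = builds[0].equal_range(key);
        for (auto it0 = r0.first; it0 != r0.second; ++it0) {
            auto r1 = builds[1].equal_range(it0->second);
            for (auto it1 = r1.first; it1 != r1.second; ++it1) {
                std::printf("joined: %d -> %d -> %d\n",
                            key, it0->second, it1->second);
            }
        }
    }
    return 0;
}
```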
Yes, we observed a lot of these synchronizations as well, and we used PTDS too. Our story is: when we first enabled PTDS, we saw that the GPU tasks from different CPU threads were decoupled as expected. But we were still using
I mean, WOW, this might be the answer to the question that we haven't figured out. If I understand correctly, a copy to/from unpinned memory still requires the CPU's involvement to move the data between the "bounce buffer" and the real host memory? We expect the CPU to go forward immediately after issuing an asynchronous copy.
Yes, I can imagine that we have a lot of work in common and that you must have dealt with many problems that I haven't even run into yet. Your experience will be really helpful. Thanks a lot, Bobby! I appreciate it!
Yes, that is 100% correct. The documentation is not clear about this at all, and it might not be that way for all systems, but from what we have tested it is the case for x86 servers running Linux.
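A minimal CUDA sketch of that difference (illustrative only; buffer sizes are made up): the same `cudaMemcpyAsync` call behaves very differently depending on whether the host buffer is pageable or pinned. With pageable memory the driver stages the data through an internal pinned bounce buffer and the CPU is involved; with pinned memory the copy can be a true asynchronous DMA.

```cpp
// Sketch: pageable vs pinned host memory with cudaMemcpyAsync.
// Pageable source: the driver stages data through an internal pinned
// "bounce buffer", so the CPU does work and the call may not overlap
// with other activity. Pinned source: the copy is a DMA and the CPU
// returns immediately.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;  // 64 MiB

    float* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // 1) Pageable host memory: plain malloc.
    float* h_pageable = static_cast<float*>(std::malloc(bytes));
    cudaMemcpyAsync(d_buf, h_pageable, bytes,
                    cudaMemcpyHostToDevice, stream);  // CPU does staging work

    // 2) Pinned (page-locked) host memory: cudaMallocHost / cudaHostAlloc.
    float* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);  // expensive call -- pool/reuse it
    cudaMemcpyAsync(d_buf, h_pinned, bytes,
                    cudaMemcpyHostToDevice, stream);  // true async DMA

    cudaStreamSynchronize(stream);
    std::printf("copies done\n");

    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```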
I read your article at https://pingcap.com/blog/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times
Really great work.
I work on a related project: https://github.com/NVIDIA/spark-rapids/
In the article you mentioned that you cannot get above 3 GiB/s on PCIe transfers. This is probably because you are not using pinned memory for the transfers. For the Java API we added a pinned memory pool similar to RMM. It is not a great implementation, and rapidsai/rmm#618 is to have RMM also handle pinned memory at some point. The pinned pool lets us get about 12-13 GiB/s on PCIe gen 3.
Be aware that it is rather expensive to allocate pinned memory, which is why we pool it. If you really have a 1:1 ratio of data transfer to CPU compute, this could be a really big win.
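To show why pooling amortizes that cost, here is a toy pinned-memory pool in C++/CUDA. This is not the spark-rapids or RMM implementation, just a bump allocator over one big `cudaHostAlloc` slab; the class name, pool size, and alignment are made up for illustration, and a real pool would also handle frees, fragmentation, and thread safety.

```cpp
// Toy pinned-memory pool (illustrative only): pay the expensive
// cudaHostAlloc once for a big slab, then hand out aligned
// sub-allocations from it, so per-transfer allocations stay cheap.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

class PinnedPool {
public:
    explicit PinnedPool(size_t pool_bytes) : size_(pool_bytes) {
        // One expensive pinning operation up front.
        cudaHostAlloc(&base_, size_, cudaHostAllocDefault);
    }
    ~PinnedPool() { cudaFreeHost(base_); }

    // Bump-allocate a 256-byte-aligned chunk; nullptr when exhausted.
    void* allocate(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);
        if (offset_ + aligned > size_) return nullptr;
        void* p = static_cast<char*>(base_) + offset_;
        offset_ += aligned;
        return p;
    }

private:
    void*  base_   = nullptr;
    size_t size_   = 0;
    size_t offset_ = 0;
};

int main() {
    PinnedPool pool(256 << 20);  // pin 256 MiB once at startup

    // Cheap: sub-allocations reuse the already-pinned slab.
    void* staging = pool.allocate(16 << 20);
    std::printf("staging buffer at %p\n", staging);

    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 16 << 20);
    // Copies from the pooled pinned buffer can run as async DMA.
    cudaMemcpyAsync(d_buf, staging, 16 << 20, cudaMemcpyHostToDevice, 0);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}
```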