
Use the new host memory allocation API #11671

Merged (3 commits, Nov 1, 2024)

Conversation

@revans2 (Collaborator) commented Oct 29, 2024

This is step 2 in getting all host memory allocations to go through a single API. rapidsai/cudf#17197 added a new raw allocation API, and this PR starts to use it. I will then put up another PR to CUDF that switches HostMemoryBuffer.allocate to always call the default allocator's allocate API, which should make it so that effectively all large host memory allocations go through our default allocator, HostAlloc.

After that I will start to clean up the code and remove the places where we pass HostAlloc in as a separate allocator to CUDF APIs.
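The "single allocation API" idea described above can be sketched as a point of indirection that every host allocation funnels through. This is a minimal illustrative sketch, not the actual cudf/spark-rapids API; the names `HostAllocator`, `DefaultHostAllocator`, and `HostMem` are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical allocator interface; the real cudf API differs.
interface HostAllocator {
  ByteBuffer allocate(long bytes);
}

// A stand-in for a default allocator such as the plugin's HostAlloc.
final class DefaultHostAllocator implements HostAllocator {
  @Override
  public ByteBuffer allocate(long bytes) {
    // A real implementation would enforce limits and support retries;
    // here we just allocate host memory directly.
    return ByteBuffer.allocateDirect(Math.toIntExact(bytes));
  }
}

public final class HostMem {
  // Single point of indirection: every host allocation goes through
  // whatever allocator is currently installed.
  private static final AtomicReference<HostAllocator> ALLOCATOR =
      new AtomicReference<>(new DefaultHostAllocator());

  public static void setAllocator(HostAllocator a) { ALLOCATOR.set(a); }

  public static ByteBuffer allocate(long bytes) {
    return ALLOCATOR.get().allocate(bytes);
  }

  public static void main(String[] args) {
    ByteBuffer b = HostMem.allocate(64);
    assert b.capacity() == 64;
  }
}
```

Once every caller uses `HostMem.allocate`, swapping the installed allocator changes the behavior of all host allocations at once, which is the effect the PR is working toward.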

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@revans2 (Collaborator, Author) commented Oct 29, 2024

Please note that this will not compile until Oct 30th or so, when its dependency finishes going through CI.

@abellina (Collaborator) left a review:

Copyrights need updating.

@revans2 (Collaborator, Author) commented Oct 31, 2024

I updated the code to also fix an issue that was showing up with row-to-columnar exec. The integration tests were failing because we are now injecting retry OOMs in places where we were not before, since all of the processing goes through a single API now.

I think there may be some more issues that I need to debug. I am seeing an issue with the time zone DB too, but it is a bit harder to reproduce and the fix for it would likely be in spark-rapids-jni. So feel free to review this.
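The retry-OOM injection mentioned above works because, with a single allocation API, a simulated OOM can be injected at that one choke point and every caller must be prepared to retry. This is an illustrative sketch of that pattern only; `RetryOOM` and `withRetry` are made-up names, not the actual spark-rapids retry API.

```java
import java.util.function.Supplier;

// Stand-in for the unchecked exception a retry framework might inject.
final class RetryOOM extends RuntimeException {}

public final class Retry {
  // Run `work`, retrying when a (possibly injected) RetryOOM is thrown.
  public static <T> T withRetry(int maxRetries, Supplier<T> work) {
    RetryOOM last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return work.get();
      } catch (RetryOOM oom) {
        last = oom; // a real implementation would release memory or spill here
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    int[] failures = {2}; // inject two OOMs, then succeed
    int result = withRetry(5, () -> {
      if (failures[0]-- > 0) throw new RetryOOM();
      return 42;
    });
    assert result == 42;
  }
}
```

Code paths that never saw injected OOMs before (like the row-to-columnar exec above) can start failing tests as soon as their allocations route through the instrumented API, which is the class of bug being fixed here.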

On this code:

    }
    }
    } else if (builders[i] == null && wipHostColumns[i] == null) {
      throw new IllegalStateException("buildHostColumns cannot be called more than once");
Collaborator comment:

this exception message seems to go against the comment at the top of this function (buildHostColumns), but I can see why you need to throw here. Maybe "buildHostColumns is in an invalid state due to a retry"?

@revans2 (Collaborator, Author) replied:

So here is where things are odd. The old code had some APIs that could be called multiple times and other APIs that could not, but in reality none of them could be called multiple times. I fixed try build, but this is called from try build, and try build caches the result of calling this if it succeeded. So this odd function needs to be callable again if an exception was thrown while it was running, but if it succeeded it cannot be called again, partly because the builders themselves follow this same behavior.

I can fix all of them to be able to be called as many times as we want and even add more rows in between building new HostColumnVectors. The problem is that is a much larger change and does not provide any real value at this time.
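The contract described above amounts to a small state machine: the build step may be re-invoked after a failure (such as a retry OOM), but never after it has succeeded. A minimal sketch of that behavior, with hypothetical names (`OnceBuilder` is not the real class):

```java
public final class OnceBuilder {
  private enum State { BUILDING, DONE }
  private State state = State.BUILDING;

  // `failThisTime` simulates an exception (e.g. an injected OOM) thrown
  // mid-build, so callers can exercise the retry path.
  public String build(boolean failThisTime) {
    if (state == State.DONE) {
      // Success is cached by the caller, so a second successful call is a bug.
      throw new IllegalStateException("build cannot be called again after it succeeded");
    }
    if (failThisTime) {
      // Failure leaves us in BUILDING so the caller may try again.
      throw new RuntimeException("simulated OOM");
    }
    state = State.DONE;
    return "result";
  }

  public static void main(String[] args) {
    OnceBuilder b = new OnceBuilder();
    try { b.build(true); } catch (RuntimeException e) { /* retryable */ }
    assert "result".equals(b.build(false)); // retry after failure is allowed
  }
}
```

This matches the semantics in the comment: retry after an exception is fine, but calling again after a cached success raises the `IllegalStateException` the reviewer flagged.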

Collaborator reply:

The fact that you have this comment here on github is good enough for me.

@abellina (Collaborator) left a review:

LGTM, small nit

@revans2 (Collaborator, Author) commented Nov 1, 2024

build

@revans2 revans2 merged commit 372ca80 into NVIDIA:branch-24.12 Nov 1, 2024
46 of 47 checks passed
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Nov 4, 2024
This is step 3 in a process of getting Java host memory allocation to be pluggable under a single allocation API. This is really only used for large memory allocations, which is what matters.

This changes the most common Java host memory allocation API to call into the pluggable host memory allocation API. The reason this had to be done in multiple steps is that the Spark plugin code was calling into the common memory allocation API, so memory allocation would end up calling itself recursively.

Step 1. Create a new API that will not be called recursively (#17197)
Step 2. Have the Java plugin use that new API instead of the old one to avoid any recursive invocations (NVIDIA/spark-rapids#11671)
Step 3. Update the common API to use the new backend (this)

There are likely to be more steps after this that involve cleaning up and removing APIs that are no longer needed.
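The three steps above exist to break a recursion: a plugin allocator that itself called the common `allocate` entry point would re-enter the dispatcher. A raw API that never dispatches back gives plugins a safe path. This is a hedged sketch under that assumption; `HostBuffers`, `allocateRaw`, and `PluginHostAllocator` are illustrative names, not the actual cudf API.

```java
import java.nio.ByteBuffer;

// Hypothetical plugin-facing allocator interface.
interface PluginHostAllocator {
  ByteBuffer allocate(long bytes);
}

public final class HostBuffers {
  private static PluginHostAllocator pluggable = HostBuffers::allocateRaw;

  // Step 1: a raw allocation path that never dispatches back to the
  // pluggable allocator, so plugin allocators can call it safely.
  static ByteBuffer allocateRaw(long bytes) {
    return ByteBuffer.allocateDirect(Math.toIntExact(bytes));
  }

  public static void setAllocator(PluginHostAllocator a) { pluggable = a; }

  // Step 3: the common entry point now routes through the pluggable
  // allocator instead of allocating directly.
  public static ByteBuffer allocate(long bytes) {
    return pluggable.allocate(bytes);
  }

  public static void main(String[] args) {
    // Step 2: a plugin allocator built on the raw API. Calling the common
    // allocate() now reaches allocateRaw without re-entering allocate().
    setAllocator(bytes -> allocateRaw(bytes));
    assert allocate(32).capacity() == 32;
  }
}
```

Had the plugin allocator called `allocate` instead of `allocateRaw`, the dispatch would loop forever; ordering the merges (raw API first, plugin switch second, common-API rewire last) keeps every intermediate state free of that cycle.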

This is marked as breaking even though it does not break any APIs; it changes the semantics enough that it feels like a breaking change.

This is blocked and should not be merged in until Step 2 is merged in, to avoid breaking the Spark plugin.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Alessandro Bellina (https://github.com/abellina)

URL: #17204
@sameerz added the "reliability" label (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Nov 5, 2024