Use the new host memory allocation API #11671
Conversation
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Please note that this will not compile until Oct 30th or so, when its dependency finishes going through CI.
Copyrights need updating.
I updated the code to also fix an issue that was showing up with row-to-columnar exec. It is a bug where the integration tests were failing because we are now injecting retry OOMs in places that we were not before, since all of the processing goes through a single API now. I think there may be some more issues that I need to debug. I am seeing an issue with the time zone DB too, but it is a bit harder to reproduce and the fix for it would likely be in spark-rapids-jni. So feel free to review this.
}
}
} else if (builders[i] == null && wipHostColumns[i] == null) {
  throw new IllegalStateException("buildHostColumns cannot be called more than once");
This exception message seems to go against the comment at the top of this function (`buildHostColumns`), but I can see why you need to throw here. Maybe "`buildHostColumns` is in an invalid state due to a retry"?
So here is where things are odd. The old code had some APIs that could nominally be called multiple times and others that could not, but in reality none of them could be called multiple times. I fixed try build, but this is called from try build, and try build caches the result of calling this if it succeeded. So this odd function needs to be callable again if an exception was thrown while it was running, but if it succeeded it cannot be called again, partly because the builders themselves follow this same behavior.

I could fix all of them to be callable as many times as we want, and even allow adding more rows in between building new HostColumnVectors. The problem is that is a much larger change and does not provide any real value at this time.
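The retry semantics described above can be sketched as a small class. This is an illustrative sketch only (the class and field names are invented, not the actual plugin code): a build step that may be re-invoked after a failure, but must throw if it is invoked again after it has already succeeded and cached its result.

```java
// Hypothetical sketch of the "retryable until success, then one-shot" contract
// described in the comment above. Names are illustrative, not the real code.
class OneShotBuilder {
    private Object result = null; // cached result of a successful build

    Object build() {
        if (result != null) {
            // A successful build was already cached; the underlying builders
            // cannot be reused, so a second call is a programming error.
            throw new IllegalStateException("build cannot be called again after it succeeded");
        }
        // Stand-in for allocating/building host columns. If an OOM/retry
        // exception were thrown here, result would stay null, so build()
        // could safely be called again.
        Object built = new Object();
        result = built;
        return result;
    }
}
```

The key point is that the result is only cached after the fallible work completes, which is what makes a retry after an exception safe while a retry after success is not.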
The fact that you have this comment here on github is good enough for me.
LGTM, small nit
build
This is step 3 in the process of making Java host memory allocation pluggable under a single allocation API. This is really only used for large memory allocations, which is what matters. It changes the most common Java host memory allocation API to call into the pluggable host memory allocation API. This had to be done in multiple steps because the Spark plugin code was calling into the common memory allocation API, so memory allocation would end up calling itself recursively.

Step 1. Create a new API that will not be called recursively (#17197)
Step 2. Have the Java plugin use that new API instead of the old one to avoid any recursive invocations (NVIDIA/spark-rapids#11671)
Step 3. Update the common API to use the new backend (this)

There are likely to be more steps after this that involve cleaning up and removing APIs that are no longer needed. This is marked as breaking even though it does not break any APIs; it changes the semantics enough that it feels like a breaking change. It is blocked and should not be merged until Step 2 is merged, to avoid breaking the Spark plugin.

Authors:
- Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Alessandro Bellina (https://github.com/abellina)

URL: #17204
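The "pluggable under a single allocation API" idea can be sketched as a small registry holding the current allocator. This is a hedged illustration, not the actual cudf API: the `HostAllocator`, `DefaultHostAllocator`, and `HostAllocRegistry` names are assumptions made up for this sketch.

```java
// Hypothetical sketch of a single pluggable entry point for large host
// allocations. All names here are illustrative, not the real cudf API.
interface HostAllocator {
    long allocate(long bytes); // returns an opaque non-zero handle on success
    void free(long handle);
}

class DefaultHostAllocator implements HostAllocator {
    @Override public long allocate(long bytes) {
        // The real library would call into native host allocation here;
        // this sketch just hands back a fake non-zero handle.
        return bytes > 0 ? 1L : 0L;
    }
    @Override public void free(long handle) { /* no-op in this sketch */ }
}

class HostAllocRegistry {
    private static volatile HostAllocator current = new DefaultHostAllocator();
    static void set(HostAllocator a) { current = a; }
    static HostAllocator get() { return current; }
}
```

With a single registry like this, a Spark plugin can install its own allocator once, and every large host allocation that routes through the common API picks it up.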
This is step 2 in getting all host memory allocations to go through a single API. rapidsai/cudf#17197 added a new raw allocation API, and this starts to use that new API. I will then put up another PR to CUDF that will switch HostMemoryBuffer.allocate to always call the default allocator's allocate API, which should make it so that effectively all large host memory allocations go through our default allocator `HostAlloc`. After that I will start to clean up the code and remove places where we pass in HostAlloc as a separate allocation API to CUDF APIs.
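The reason the raw API from step 1 matters can be shown in a short sketch: if the installed allocator itself calls back into the common allocation path, it recurses; giving it a raw path that never consults the plugin breaks the cycle. Every name in this sketch (`HostMemory`, `allocateRaw`, `HostAllocatorPlugin`) is an invented stand-in, not the real cudf or spark-rapids API.

```java
// Hedged sketch of why a separate "raw" allocation entry point avoids
// recursion. Names are illustrative assumptions, not the actual API.
class HostMemory {
    interface HostAllocatorPlugin { long allocate(long bytes); }

    static HostAllocatorPlugin plugin = null;

    // Common API: routes through the plugin when one is installed.
    static long allocate(long bytes) {
        return plugin != null ? plugin.allocate(bytes) : allocateRaw(bytes);
    }

    // Raw API: never calls back into the plugin, so a plugin that needs
    // host memory for its own bookkeeping can use it without recursing.
    static long allocateRaw(long bytes) {
        return bytes; // fake "handle" in this sketch
    }
}
```

If the plugin had called `HostMemory.allocate` instead of `allocateRaw`, the common API would re-enter the plugin and recurse forever, which is exactly the problem the multi-step migration avoids.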