
Use the new host memory allocation API #11671

Merged (3 commits, Nov 1, 2024)

Conversation

@revans2 (Collaborator) commented Oct 29, 2024

This is step 2 in getting all host memory allocations to go through a single API. rapidsai/cudf#17197 added a new raw allocation API, and this PR starts to use it. I will then put up another PR to CUDF that switches HostMemoryBuffer.allocate to always call the default allocator's allocate API, which should make it so that effectively all large host memory allocations go through our default allocator, HostAlloc.

After that I will start to clean up the code and remove the places where we pass HostAlloc in as a separate allocator to CUDF APIs.
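The "single allocation API" idea described above can be sketched as a point of indirection that every host allocation funnels through. This is a minimal illustrative sketch, not the actual cudf/spark-rapids API; the names `HostAllocator`, `DefaultHostAllocator`, and `HostMem` are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical allocator interface; the real cudf API differs.
interface HostAllocator {
  ByteBuffer allocate(long bytes);
}

// A stand-in for a default allocator such as the plugin's HostAlloc.
final class DefaultHostAllocator implements HostAllocator {
  @Override
  public ByteBuffer allocate(long bytes) {
    // A real implementation would enforce limits and support retries;
    // here we just allocate host memory directly.
    return ByteBuffer.allocateDirect(Math.toIntExact(bytes));
  }
}

public final class HostMem {
  // Single point of indirection: every host allocation goes through
  // whatever allocator is currently installed.
  private static final AtomicReference<HostAllocator> ALLOCATOR =
      new AtomicReference<>(new DefaultHostAllocator());

  public static void setAllocator(HostAllocator a) { ALLOCATOR.set(a); }

  public static ByteBuffer allocate(long bytes) {
    return ALLOCATOR.get().allocate(bytes);
  }

  public static void main(String[] args) {
    ByteBuffer b = HostMem.allocate(64);
    assert b.capacity() == 64;
  }
}
```

Once every caller uses `HostMem.allocate`, swapping the installed allocator changes the behavior of all host allocations at once, which is the effect the PR is working toward.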

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@revans2 (Collaborator, Author) commented Oct 29, 2024

Please note that this will not compile until Oct 30th or so, when its dependency finishes going through CI.

@abellina (Collaborator) left a review:

Copyrights need updating.

@revans2 (Collaborator, Author) commented Oct 31, 2024

I updated the code to also fix an issue that was showing up with row-to-columnar exec. The integration tests were failing because we are now injecting retry OOMs in places where we were not before, since all of the processing goes through a single API now.

I think there may be some more issues that I need to debug. I am seeing an issue with the time zone DB too, but it is a bit harder to reproduce and the fix for it would likely be in spark-rapids-jni. So feel free to review this.
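The retry-OOM injection mentioned above works because, with a single allocation API, a simulated OOM can be injected at that one choke point and every caller must be prepared to retry. This is an illustrative sketch of that pattern only; `RetryOOM` and `withRetry` are made-up names, not the actual spark-rapids retry API.

```java
import java.util.function.Supplier;

// Stand-in for the unchecked exception a retry framework might inject.
final class RetryOOM extends RuntimeException {}

public final class Retry {
  // Run `work`, retrying when a (possibly injected) RetryOOM is thrown.
  public static <T> T withRetry(int maxRetries, Supplier<T> work) {
    RetryOOM last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return work.get();
      } catch (RetryOOM oom) {
        last = oom; // a real implementation would release memory or spill here
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    int[] failures = {2}; // inject two OOMs, then succeed
    int result = withRetry(5, () -> {
      if (failures[0]-- > 0) throw new RetryOOM();
      return 42;
    });
    assert result == 42;
  }
}
```

Code paths that never saw injected OOMs before (like the row-to-columnar exec above) can start failing tests as soon as their allocations route through the instrumented API, which is the class of bug being fixed here.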

On this code:

    }
    }
    } else if (builders[i] == null && wipHostColumns[i] == null) {
      throw new IllegalStateException("buildHostColumns cannot be called more than once");
Collaborator comment:

this exception message seems to go against the comment at the top of this function (buildHostColumns), but I can see why you need to throw here. Maybe "buildHostColumns is in an invalid state due to a retry"?

@revans2 (Collaborator, Author) replied:

So here is where things are odd. The old code had some APIs that could be called multiple times and other APIs that could not, but in reality none of them could be called multiple times. I fixed try build, but this is called from try build, and try build caches the result of calling this if it succeeded. So this odd function needs to be callable again if an exception was thrown while it was running, but if it succeeded it cannot be called again, partly because the builders themselves follow this same behavior.

I can fix all of them to be able to be called as many times as we want and even add more rows in between building new HostColumnVectors. The problem is that is a much larger change and does not provide any real value at this time.
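The contract described above amounts to a small state machine: the build step may be re-invoked after a failure (such as a retry OOM), but never after it has succeeded. A minimal sketch of that behavior, with hypothetical names (`OnceBuilder` is not the real class):

```java
public final class OnceBuilder {
  private enum State { BUILDING, DONE }
  private State state = State.BUILDING;

  // `failThisTime` simulates an exception (e.g. an injected OOM) thrown
  // mid-build, so callers can exercise the retry path.
  public String build(boolean failThisTime) {
    if (state == State.DONE) {
      // Success is cached by the caller, so a second successful call is a bug.
      throw new IllegalStateException("build cannot be called again after it succeeded");
    }
    if (failThisTime) {
      // Failure leaves us in BUILDING so the caller may try again.
      throw new RuntimeException("simulated OOM");
    }
    state = State.DONE;
    return "result";
  }

  public static void main(String[] args) {
    OnceBuilder b = new OnceBuilder();
    try { b.build(true); } catch (RuntimeException e) { /* retryable */ }
    assert "result".equals(b.build(false)); // retry after failure is allowed
  }
}
```

This matches the semantics in the comment: retry after an exception is fine, but calling again after a cached success raises the `IllegalStateException` the reviewer flagged.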

Collaborator reply:

The fact that you have this comment here on github is good enough for me.

@abellina (Collaborator) left a review:

LGTM, small nit

@revans2 (Collaborator, Author) commented Nov 1, 2024

build

@revans2 revans2 merged commit 372ca80 into NVIDIA:branch-24.12 Nov 1, 2024
46 of 47 checks passed
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Nov 4, 2024
This is step 3 in a process of getting Java host memory allocation to be pluggable under a single allocation API. This is really only used for large memory allocations, which is what matters.

This changes the most common Java host memory allocation API to call into the pluggable host memory allocation API. The reason this had to be done in multiple steps is that the Spark plugin code was calling into the common memory allocation API, so memory allocation would end up calling itself recursively.

Step 1. Create a new API that will not be called recursively (#17197)
Step 2. Have the Java plugin use that new API instead of the old one to avoid any recursive invocations (NVIDIA/spark-rapids#11671)
Step 3. Update the common API to use the new backend (this)

There are likely to be more steps after this that involve cleaning up and removing APIs that are no longer needed.
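The three steps above exist to break a recursion: a plugin allocator that itself called the common `allocate` entry point would re-enter the dispatcher. A raw API that never dispatches back gives plugins a safe path. This is a hedged sketch under that assumption; `HostBuffers`, `allocateRaw`, and `PluginHostAllocator` are illustrative names, not the actual cudf API.

```java
import java.nio.ByteBuffer;

// Hypothetical plugin-facing allocator interface.
interface PluginHostAllocator {
  ByteBuffer allocate(long bytes);
}

public final class HostBuffers {
  private static PluginHostAllocator pluggable = HostBuffers::allocateRaw;

  // Step 1: a raw allocation path that never dispatches back to the
  // pluggable allocator, so plugin allocators can call it safely.
  static ByteBuffer allocateRaw(long bytes) {
    return ByteBuffer.allocateDirect(Math.toIntExact(bytes));
  }

  public static void setAllocator(PluginHostAllocator a) { pluggable = a; }

  // Step 3: the common entry point now routes through the pluggable
  // allocator instead of allocating directly.
  public static ByteBuffer allocate(long bytes) {
    return pluggable.allocate(bytes);
  }

  public static void main(String[] args) {
    // Step 2: a plugin allocator built on the raw API. Calling the common
    // allocate() now reaches allocateRaw without re-entering allocate().
    setAllocator(bytes -> allocateRaw(bytes));
    assert allocate(32).capacity() == 32;
  }
}
```

Had the plugin allocator called `allocate` instead of `allocateRaw`, the dispatch would loop forever; ordering the merges (raw API first, plugin switch second, common-API rewire last) keeps every intermediate state free of that cycle.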

This is marked as breaking even though it does not break any APIs; it changes the semantics enough that it feels like a breaking change.

This is blocked and should not be merged in until Step 2 is merged in, to avoid breaking the Spark plugin.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Alessandro Bellina (https://github.com/abellina)

URL: #17204
@sameerz added the "reliability" label (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Nov 5, 2024