Use managed memory for NDSH benchmarks #17039
cpp/benchmarks/ndsh/utilities.cpp
Outdated
auto old_mr = cudf::get_current_device_resource();  // fixme: already pool takes 50% of free memory
// TODO: release it, and restore it later?
auto managed_pool_mr = make_managed_pool();
cudf::set_current_device_resource(managed_pool_mr.get());
Why not pass the new mr all the way through instead of resetting the current one?
All the libcudf APIs should take an mr parameter now.
Thanks @davidwendt. I've asked @karthikeyann to separate the MR used for data generation versus the one used for timed query runs. We need managed memory to avoid OOM in the generator, but we mostly care about async and pool for timed runs.
@davidwendt This brings us to an old question. Right now, we can only pass an mr for the output values of a libcudf function. All intermediate allocations happen using cudf::get_current_device_resource(). So, if we are targeting data larger than GPU memory, the intermediate allocations might run out of GPU memory if cudf::get_current_device_resource() is not managed memory. Right now, libcudf functions do not have a way to pass an mr for intermediate allocations; it is set globally using cudf::set_current_device_resource. Hence cudf::set_current_device_resource is updated here.
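The two-resource split described above can be sketched in plain C++ without a GPU. This is only an illustration under stated assumptions: `counting_resource`, `get_current_resource`, `set_current_resource`, `fake_operation` and `scoped_current_resource` are hypothetical stand-ins written for this sketch, not the libcudf or RMM API. It shows the output going to the caller-supplied resource, the temporaries going to the globally set one, and a guard that restores the previous global resource on scope exit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device memory resource; it only counts requested bytes.
struct counting_resource {
  std::size_t bytes_requested = 0;
  void allocate(std::size_t bytes) { bytes_requested += bytes; }
};

// Hypothetical stand-in for the process-wide "current device resource".
inline counting_resource*& current_resource_slot()
{
  static counting_resource fallback;
  static counting_resource* current = &fallback;
  return current;
}

inline counting_resource* get_current_resource() { return current_resource_slot(); }

// Installs `mr` as the current resource and returns the previous one.
inline counting_resource* set_current_resource(counting_resource* mr)
{
  counting_resource* old  = current_resource_slot();
  current_resource_slot() = mr;
  return old;
}

// Models the shape described in the comment above: the caller picks the
// resource for the returned data, temporaries use the current resource.
inline void fake_operation(std::size_t output_bytes,
                           std::size_t temp_bytes,
                           counting_resource* output_mr)
{
  get_current_resource()->allocate(temp_bytes);  // intermediate allocation
  output_mr->allocate(output_bytes);             // returned data
}

// RAII guard: installs `mr` as current, restores the previous one on scope exit.
class scoped_current_resource {
 public:
  explicit scoped_current_resource(counting_resource* mr)
    : previous_{set_current_resource(mr)}
  {
    assert(mr != nullptr);
  }
  ~scoped_current_resource() { set_current_resource(previous_); }
  scoped_current_resource(scoped_current_resource const&)            = delete;
  scoped_current_resource& operator=(scoped_current_resource const&) = delete;

 private:
  counting_resource* previous_;
};
```

In this model, wrapping the data generation in a block that holds a `scoped_current_resource` routes the temporaries to the managed stand-in and leaves the original resource in place for the timed runs afterwards.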
Ok, I thought that might be the case but wanted to make sure. It seems like we should purposely pass an mr for the returned objects, if only for illustration purposes, to highlight that there are 2 mrs in play here.
Also, it may be worth adding a detailed comment in the code similar to what you responded here.
I will add it to comments.
If we have a way to solve intermediate allocations, that would be great.
Also, old_mr already holds some memory (50% by default) if it is a pooled memory resource, so the managed memory resource will spill more often. It would be better if that memory could also be reclaimed until data generation is over. I am still working on this part; a shrink function in the pooled memory resource would be great, but it is not available right now. Releasing the pool memory would be dangerous (all old allocations would become dangling pointers). I am looking into other memory resources for this.
You could just request these benchmarks be run using a different default pool (command-line parameter rmm_mode) to start with. These benchmarks are not run in CI. I'm not sure it is worth circumventing the pool since that logic would need to account for the parameter setting the default to not-pool in any case.
std::string rmm_mode{"pool"};
Perhaps you could even check the default mr somehow and, if it is set to pool, then throw an exception.
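A minimal sketch of such a check, assuming it is done on the rmm_mode string rather than on the resource object itself. The helper names `is_managed_mode` and `require_managed_mode` are hypothetical and not part of the benchmark fixture; only the mode names "pool", "managed" and "managed_pool" come from this thread.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: true only for the managed modes named in this thread.
inline bool is_managed_mode(std::string const& rmm_mode)
{
  return rmm_mode == "managed" || rmm_mode == "managed_pool";
}

// Hypothetical helper: rejects any other mode (for example the default "pool")
// before data generation starts.
inline void require_managed_mode(std::string const& rmm_mode)
{
  if (!is_managed_mode(rmm_mode)) {
    throw std::invalid_argument(
      "NDSH data generation expects rmm_mode=managed or managed_pool, got: " + rmm_mode);
  }
}
```

Failing early like this trades flexibility for a clear error message, instead of silently swapping the resource behind the user's chosen mode.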
I updated the code to create a managed pool if the existing mr is not managed or managed_pool.
It has a drawback when pool memory is used, which is the default: data generation may be slow. If that's not acceptable, the only way we could fix it is by creating a new nvbench_fixture for the ndsh benchmarks alone.
@GregoryKimball can we limit rmm_mode to managed/managed_pool only for running these benchmarks? If we can limit it to managed_pool only, no fix is required for the mr; just use managed_pool or managed mode in the CLI. Alternatively, if rmm_mode could be anything and we still want the data generator to be fastest, we should fix this PR by creating a new nvbench_fixture.
@GregoryKimball We can run up to SF=30 on a 48 GB GPU machine. Is that sufficient?
Can we merge this PR?
@karthikeyann and I had an offline discussion about the NDS-H-cpp benchmarks. We agreed to collect some data on the max scale factor for the simple single-MR case and the more complex two-MR case.
This PR's change had no effect on the benchmarked query's runtime (tested with Q05). This is the overall runtime for a single run with data generation, e.g.
If this data generation time does not matter, we can merge this PR.
Lgtm!
on 48 GB GPU.
The final version in this PR is hss + chunked pq + datagen=managed_pool.
Thank you @karthikeyann for studying this. It looks like the chunked PQ writer has a big impact as well - thank you for identifying that. I'm happy to proceed with the current state of
/merge
Description
Fixes #16987
Use managed memory to generate the parquet data, and write parquet data to host buffer.
Replace use of parquet_device_buffer with cuio_source_sink_pair