New benchmark compares concurrent throughput of device_vector and device_uvector #981
Conversation
LGTM 👍

Only a couple of small, trivial changes. Also, we seem to swap between using `int` and `int32_t`. Not sure if it is worth switching the `int`s to `int32_t`s.
```cpp
auto num_elements = state.range(0);
int block_size = 256;
int num_blocks = 16;
```
Suggested change:

```diff
-auto num_elements = state.range(0);
-int block_size = 256;
-int num_blocks = 16;
+auto const num_elements = state.range(0);
+int constexpr block_size = 256;
+int constexpr num_blocks = 16;
```
Thanks. Fixed. Most `int`s can be `auto`.
Hmmm, gpuCI doesn't seem to be rerunning automatically. rerun tests.

Perhaps just a browser caching issue? I occasionally require a hard refresh to see that tests have started running if I already had the page open in a tab for a while before a push.

I could see the comments from today in the browser, but the linked CI results were from yesterday.

That happens to me sometimes. It will update the comment list, but the checks list won't update until I hard refresh the web page. It's a possible explanation at least, but no way to know unless it happens again.
@gpucibot merge |
Adds a new benchmark in `device_uvector_benchmark.cpp` that compares using multiple streams and concurrent kernels interleaved with vector creation. This is then parameterized on the type of the vector:

1. `thrust::device_vector` -- uses cudaMalloc allocation
2. `rmm::device_vector` -- uses RMM allocation
3. `rmm::device_uvector` -- uses RMM allocation and an uninitialized vector

The benchmark uses the `cuda_async_memory_resource` so that cudaMallocAsync is used for allocation of the `rmm::` vector types.

The performance on V100 demonstrates that option 1 is slowest due to allocation bottlenecks. Option 2 alleviates these by using `cudaMallocFromPoolAsync`, but there is no concurrency among the kernels because `thrust::device_vector` synchronizes the default stream. Option 3 is fastest and achieves full concurrency (verified in `nsight-sys`).
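For readers unfamiliar with the pattern being benchmarked, the interleaved allocate-then-launch loop might look roughly like the sketch below. This is a minimal illustration, not the actual benchmark code: the `compute_kernel` name, the loop structure, and the stream-pool usage are assumptions made for the example.

```cuda
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Hypothetical kernel standing in for the benchmark's workload.
__global__ void compute_kernel(int32_t* data, std::size_t n) { /* ... */ }

void run_concurrent(std::size_t num_elements, std::size_t num_streams)
{
  // Route rmm:: allocations through cudaMallocAsync, as the benchmark does.
  rmm::mr::cuda_async_memory_resource mr;
  rmm::mr::set_current_device_resource(&mr);

  int constexpr block_size = 256;
  int constexpr num_blocks = 16;

  rmm::cuda_stream_pool pool(num_streams);
  for (std::size_t i = 0; i < num_streams; ++i) {
    auto stream = pool.get_stream();
    // device_uvector does not value-initialize its elements, so construction
    // enqueues only a stream-ordered allocation and never touches the
    // default stream -- kernels on different streams can overlap.
    rmm::device_uvector<int32_t> vec(num_elements, stream);
    compute_kernel<<<num_blocks, block_size, 0, stream.value()>>>(vec.data(), vec.size());
    // The deallocation is stream-ordered on `stream` when vec goes out of
    // scope, so it is safely enqueued after the kernel.
  }
}
```

Replacing `rmm::device_uvector` with `thrust::device_vector` in this loop would reintroduce default-stream synchronization (from element initialization), which is exactly the serialization the benchmark measures.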