Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New benchmark compares concurrent throughput of device_vector and device_uvector #981

Merged

Conversation

harrism
Copy link
Member

@harrism harrism commented Feb 15, 2022

Adds a new benchmark in device_uvector_benchmark.cpp that compares using multiple streams and concurrent kernels interleaved with vector creation. This is then parameterized on the type of the vector:

  1. thrust::device_vector -- uses cudaMalloc allocation
  2. rmm::device_vector -- uses RMM allocation
  3. rmm::device_uvector -- uses RMM allocation and uninitialized vector

The benchmark uses the cuda_async_memory_resource so that cudaMallocAsync is used for allocation of the rmm:: vector types.

The performance on V100 demonstrates that option 1. is slowest due to allocation bottlenecks. 2. alleviates these by using cudaMallocFromPoolAsync, but there is no concurrency among the kernels because thrust::device_vector synchronizes the default stream. 3. Is fastest and achieves full concurrency (verified in nsight-sys).

Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000/manual_time           242 us          267 us         2962 bytes_per_second=13.8375G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/1000000/manual_time         1441 us         1465 us          472 bytes_per_second=23.273G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/10000000/manual_time       10483 us        10498 us           68 bytes_per_second=31.9829G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000000/manual_time      63583 us        63567 us           12 bytes_per_second=52.7303G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000/manual_time             82.0 us          105 us         8181 bytes_per_second=40.8661G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/1000000/manual_time             502 us          527 us         1357 bytes_per_second=66.8029G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/10000000/manual_time           4714 us         4746 us          148 bytes_per_second=71.1222G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000000/manual_time         46451 us        46478 us           13 bytes_per_second=72.1784G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000/manual_time            39.0 us         59.9 us        17970 bytes_per_second=85.8733G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/1000000/manual_time            135 us          159 us         5253 bytes_per_second=248.987G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/10000000/manual_time          1319 us         1351 us          516 bytes_per_second=254.169G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000000/manual_time        12841 us        12865 us           54 bytes_per_second=261.099G/s

@harrism harrism added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Feb 15, 2022
@harrism harrism requested a review from jrhemstad February 15, 2022 02:15
@harrism harrism self-assigned this Feb 15, 2022
@harrism harrism requested a review from a team as a code owner February 15, 2022 02:15
@harrism harrism requested a review from codereport February 15, 2022 02:15
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Only couple small trivial changes. Also, we seem to swap between using int and int32_t. Not sure if it is worth switching the ints to int32_ts.

Comment on lines 144 to 146
auto num_elements = state.range(0);
int block_size = 256;
int num_blocks = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto num_elements = state.range(0);
int block_size = 256;
int num_blocks = 16;
auto const num_elements = state.range(0);
int constexpr block_size = 256;
int constexpr num_blocks = 16;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed. Most ints can be auto.

@harrism
Copy link
Member Author

harrism commented Feb 17, 2022

Hmmm gpuCI doesn't seem to be rerunning automatically. rerun tests.

@vyasr
Copy link
Contributor

vyasr commented Feb 17, 2022

Hmmm gpuCI doesn't seem to be rerunning automatically. rerun tests.

Perhaps just a browser caching issue? I occasionally require a hard refresh to see that tests have started running if I already had the page open in a tab for a while before a push.

@harrism
Copy link
Member Author

harrism commented Feb 17, 2022

I could see the comments from today in the browser, but the linked CI results were from yesterday.

@vyasr
Copy link
Contributor

vyasr commented Feb 17, 2022

That happens to me sometimes. It will update the comment list, but the checks list won't update until I hard refresh the web page. It's a possible explanation at least, but no way to know unless it happens again.

@harrism
Copy link
Member Author

harrism commented Feb 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cf33a5a into rapidsai:branch-22.04 Feb 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants