Add throughput metrics for REDUCTION_BENCH/REDUCTION_NVBENCH benchmarks #16126
Conversation
…s or GlobalMem BW for nvbench, for reduction benchmarks
Thanks for working on this @jihoonson
Thanks @davidwendt, I will fix the copyrights. Do you think it's a good idea to fix them in the reduction benchmark files as well in this PR? I can do that if so.
Any existing file you change should have 2024 added/updated to its range if it does not already have it.
Thanks @davidwendt. Fixed the copyrights as suggested.
Looks good. 👍
Minor updates with new function argument usage.
cpp/benchmarks/reduction/anyall.cpp (Outdated)

```cpp
// The benchmark takes a column and produces one scalar.
set_items_processed(state, column_size + 1);
set_bytes_processed(state, estimate_size(std::move(values)) + cudf::size_of(output_dtype));
```
Suggested change:

```diff
- set_bytes_processed(state, estimate_size(std::move(values)) + cudf::size_of(output_dtype));
+ set_bytes_processed(state, estimate_size(*values) + cudf::size_of(output_dtype));
```

Similarly at other places.
I think `values->view()` is clearer, but I'll leave it up to you if you'd rather use `*values`.
Hi @karthikeyann, thanks for the review. I just want to better understand your comment. You seem to be suggesting passing a `column_view` instead of moving the column. This has been done in 40804e2. Or are you suggesting using the `*`?
Just saw David's comment above. I also find `values->view()` more explicit and clear, so I'd like to keep this pattern unless you feel strongly about it.
/ok to test
/ok to test
@davidwendt @karthikeyann thanks for the review! This PR seems to have passed all checks. What will be the next step?
Nice one!
/ok to test
Hmm, I'm not sure why the job
Looks like something got stuck. I kicked off a re-run.
Thanks! It's all green now 🙂
/merge
Description

This PR addresses #13735 for reduction benchmarks. Three new utilities are added:

- `int64_t estimate_size(cudf::table_view)` returns a size estimate for the given table. "Add `bytes_per_second` to groupby max benchmark" (#13984) was a previous attempt to add a similar utility, but this implementation uses `cudf::row_bit_count()` as suggested in a #13984 review comment instead of manually estimating the size.
- `void set_items_processed(State& state, int64_t items_processed_per_iteration)` is a thin wrapper around `State.SetItemsProcessed()`. It takes `items_processed_per_iteration` as a parameter instead of `total_items_processed`, which avoids repeating `state.iterations() * items_processed_per_iteration` in each benchmark class.
- `void set_throughputs(nvbench::state& state)` is added as a workaround for "Throughput statistics are not calculated when reads/writes are declared after `state.exec()`" (NVIDIA/nvbench#175). We sometimes want to set throughput statistics after the `state.exec()` call, especially when it is hard to estimate the result size upfront.

Here are snippets of reduction benchmarks after this change.
Note that when the data type is a 1-byte-wide type, `bytes_per_second` appears smaller than `items_per_second` in the Google Benchmark result summary. This is because the former is scaled in multiples of 1024 whereas the latter is scaled in multiples of 1000; the underlying counts are in fact the same number.

Implementation-wise, here are the decisions I'm not sure were the best.
Checklist