Add libcudf example with large strings #15983

davidwendt · 2024-06-11T21:23:02Z

Description

This PR adds an example that shows how to read large strings columns. It uses the 1 billion row challenge input data and provides three ways of loading this data:

brc: uses the CSV reader to load the input file in one call and aggregates the results using groupby.
brc_chunks: uses the CSV reader to load the input file in chunks, aggregates each chunk, and then combines the per-chunk aggregates into the final results.
brc_pipeline: same as brc_chunks, but the input chunks are processed in separate threads/streams.
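
The chunked variants rely on the fact that min, max, sum, and count can each be computed per chunk and merged afterwards, with the mean derived at the end as sum / count. The following is a minimal CPU-only sketch of that idea in standard C++. It is not the code from this PR and makes no libcudf calls; every name in it (`station_stats`, `aggregate_chunk`, `merge_into`, `run_chunks`, `mean`) is hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical partial aggregate for one station. It keeps sum and count
// (not the mean) so that partial results from different chunks can be merged.
struct station_stats {
  double min;
  double max;
  double sum;
  long count;
};

using reading     = std::pair<std::string, double>;  // (station, temperature)
using partial_map = std::map<std::string, station_stats>;

// Aggregate a single chunk of rows into per-station partial results.
partial_map aggregate_chunk(std::vector<reading> const& chunk)
{
  partial_map result;
  for (auto const& [name, temp] : chunk) {
    auto [it, inserted] = result.try_emplace(name, station_stats{temp, temp, temp, 1});
    if (!inserted) {
      auto& s = it->second;
      s.min   = std::min(s.min, temp);
      s.max   = std::max(s.max, temp);
      s.sum += temp;
      s.count += 1;
    }
  }
  return result;
}

// Fold one chunk's partial results into the running total.
void merge_into(partial_map& total, partial_map const& part)
{
  for (auto const& [name, p] : part) {
    auto [it, inserted] = total.try_emplace(name, p);
    if (!inserted) {
      auto& s = it->second;
      s.min   = std::min(s.min, p.min);
      s.max   = std::max(s.max, p.max);
      s.sum += p.sum;
      s.count += p.count;
    }
  }
}

// Chunk-at-a-time processing: aggregate each chunk, then merge.
partial_map run_chunks(std::vector<std::vector<reading>> const& chunks)
{
  partial_map total;
  for (auto const& chunk : chunks) {
    merge_into(total, aggregate_chunk(chunk));
  }
  return total;
}

// The mean is only computed once all chunks have been merged.
double mean(station_stats const& s)
{
  assert(s.count > 0);
  return s.sum / static_cast<double>(s.count);
}
```

Because each call to `aggregate_chunk` depends only on its own chunk, the per-chunk work in this sketch could also run concurrently, which is the general shape of the pipeline variant described above.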

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


GregoryKimball · 2024-08-30T22:32:00Z

On GH200, the brc_pipeline example seems to be calling cudaHostRegister on the entire file for each chunk.

And it ends up much slower. I'm looking to see if anything in the data source handling could be changed.

vuule · 2024-08-30T22:33:36Z

> On GH200, the brc_pipeline example seems to be calling cudaHostRegister on the entire file for each chunk.

That's unexpected. I'll check the code and update here.


bdice

Approving -- only a few small suggestions.

bdice · 2024-09-04T22:36:05Z

cpp/examples/1billion/CMakeLists.txt

+rapids_cuda_set_architectures(RAPIDS)
+
+project(
+  billion


Can we rename this to brc? Or does the project need a unique name?

It's a little confusing to have three unique names for this example:

The directory is named 1billion

The project is named billion

The executable is named brc (and variations thereof)

Let's consolidate these.

davidwendt · 2024-09-05

I agree with giving the directory and project the same name. I want to use something easily readable/discoverable from the perspective of someone looking at the examples folder. My suggestion is to use billion_rows for the directory and project name.
I would like to keep the executable names shorter, since they are built in the context of the parent directory and would look less cumbersome in my opinion. I'd also like to keep the brc variations because the blog uses those names in charts that would otherwise need to be regenerated.

bdice · 2024-09-05

That’s fine! Let’s do that.

cpp/examples/1billion/README.md

vuule

Looks good, just a few non-blocking suggestions

cpp/examples/1billion/brc.cpp

cpp/examples/1billion/brc_chunks.cpp

cpp/examples/1billion/brc_pipeline.cpp

ttnghia · 2024-09-05T03:06:22Z

Nit:

IMO the folder name 1billion is vague/not clean. Typically I would avoid naming anything starting with a number.
I'm kind of OCD with formatting. For printing information, I would prefer to see the printed sentences as "First letter of each sentence is capitalized" instead of "everything is in lower-case".

vuule

Lovely examples. Expected more complex code, especially for the pipeline.

karthikeyann

Looks great

karthikeyann · 2024-09-05T20:11:26Z

cpp/examples/billion_rows/brc.cpp

+  auto const mr_name = std::string("pool");
+  auto resource      = create_memory_resource(mr_name);


nit: to keep it simple,

Suggested change

-  auto const mr_name = std::string("pool");
-  auto resource      = create_memory_resource(mr_name);
+  auto resource = create_memory_resource("pool");

davidwendt · 2024-09-05T22:16:21Z

/merge

Add libcudf example with large strings

6b902ca

davidwendt added the 2 - In Progress, libcudf, improvement, and non-breaking labels Jun 11, 2024

davidwendt self-assigned this Jun 11, 2024

github-actions bot added the CMake label Jun 11, 2024

davidwendt added 23 commits June 11, 2024 20:34

Merge branch 'branch-24.08' into example-1billion

99ec768

Merge branch 'branch-24.08' into example-1billion

2a18176

add load_text_chunks

8f9e8ca

Merge branch 'branch-24.08' into example-1billion

bc3b260

Merge branch 'example-1billion' of github.com:davidwendt/cudf into ex…

b9cb419

…ample-1billion

fix load_text_chunks chunks

2770cb6

Merge branch 'branch-24.08' into example-1billion

880e65c

Merge branch 'branch-24.08' into example-1billion

c1c5494

Merge branch 'branch-24.08' into example-1billion

e9237ef

Merge branch 'branch-24.08' into example-1billion

8258e8d

Merge branch 'branch-24.08' into example-1billion

8036203

fix merge conflict

5c8f5fd

empty commit to trigger CI

f435a82

Merge branch 'example-1billion' of github.com:davidwendt/cudf into ex…

b730d4e

…ample-1billion

Merge branch 'branch-24.08' into example-1billion

16cc11b

Merge branch 'branch-24.08' into example-1billion

fc2bb3b

Merge branch 'branch-24.08' into example-1billion

7cc29af

Merge branch 'branch-24.08' into example-1billion

768fe30

Merge branch 'branch-24.08' into example-1billion

5f58e9f

Merge branch 'branch-24.08' into example-1billion

5ef13f6

Merge branch 'branch-24.08' into example-1billion

36f29e2

Merge branch 'branch-24.08' into example-1billion

3f55ee8

Merge branch 'branch-24.08' into example-1billion

1bd1370

davidwendt added 6 commits August 28, 2024 11:10

add README.md

35c1a8b

remove commented out line

0bed67a

update readme.md

d409f4d

Merge branch 'branch-24.10' into example-1billion

49aef7d

Merge branch 'example-1billion' of github.com:davidwendt/cudf into ex…

44cdb3f

…ample-1billion

Merge branch 'branch-24.10' into example-1billion

2fdfd1f

davidwendt requested a review from bdice August 29, 2024 20:38

GregoryKimball requested review from vuule and removed request for srinivasyadav18 August 30, 2024 15:54

GregoryKimball mentioned this pull request Aug 31, 2024

[FEA] Add multi-threaded Parquet read example #16717

Closed

davidwendt added 4 commits September 3, 2024 08:42

Merge branch 'branch-24.10' into example-1billion

9ef26ea

re-align load_chunk utility

392e0e9

Merge branch 'example-1billion' of github.com:davidwendt/cudf into ex…

68aeb07

…ample-1billion

Merge branch 'branch-24.10' into example-1billion

cf48a08

GregoryKimball approved these changes Sep 3, 2024


bdice approved these changes Sep 4, 2024


vuule reviewed Sep 4, 2024


cpp/examples/1billion/brc.cpp (outdated)

cpp/examples/1billion/brc_chunks.cpp (outdated)

cpp/examples/1billion/brc_pipeline.cpp (outdated)

davidwendt added 3 commits September 5, 2024 08:20

Merge branch 'branch-24.10' into example-1billion

a8f559f

change folder/project name

53df838

update examples build.sh

8a5076e

davidwendt requested a review from vuule September 5, 2024 18:08

vuule approved these changes Sep 5, 2024


ttnghia approved these changes Sep 5, 2024


karthikeyann approved these changes Sep 5, 2024


rapids-bot bot merged commit 715677e into rapidsai:branch-24.10 Sep 5, 2024
86 checks passed

davidwendt deleted the example-1billion branch September 5, 2024 22:16
