Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON tree algorithm code reorg #16836

Merged
merged 5 commits into from
Sep 20, 2024

Conversation

karthikeyann
Copy link
Contributor

Description

This PR moves JSON host tree algorithms to separate file.
This code movement will help #16545 review easier.

The code is moved to new file and reorganized for code reuse.
Very long function make_device_json_column is split into

  • code block with reduce_to_column_tree call
  • code moved to function build_tree
  • code moved to function scatter_offsets

No new functionality is added in this PR.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 19, 2024
@karthikeyann karthikeyann self-assigned this Sep 19, 2024
@karthikeyann karthikeyann requested review from a team as code owners September 19, 2024 05:28
@github-actions github-actions bot added the CMake CMake build issue label Sep 19, 2024
Copy link
Contributor

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved trivial CMake changes

@shrshi shrshi self-requested a review September 19, 2024 16:41
bitmask_type* validity;
};

std::pair<cudf::detail::host_vector<uint8_t>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change ignore_vals returned here to a bool vector? We can also move this change to #16545
EDIT: Similar suggestion for is_pruned and is_str_column_all_nulls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I've seen concerns raised in the past regarding the use of bool vectors, and recommendations to use int8_t/uint8_t instead. I didn't think host_vector<bool> would be plagued with the same problems as std::vector<bool>, but I could be wrong.

Watching this conversation for @karthikeyann's opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mythrocks That's right. 💯
std::vector<bool> stores each element as a bit (to save memory). so, iteration and writing to data pointer causes issues.
cudf::detail::host_vector is based on thrust::host_vector which are simple vectors. (without any specialization for bool). So, that should be suitable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thrust::host_vector<bool> will use byte-packing (not bits).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason why I used uint8_t is because using 0 and 1 is easier to read. (😄)

  • Updating ignore_vals to host_vector is fine since it needs to be copied to device, so pinned memory will help. I will update it in next PR [REVIEW] JSON host tree algorithms #16545
  • is_pruned and is_str_column_all_nulls are accessed only in host, and not copied to device. So, no benefit in using pinned memory, also allocation is slower. (not sure if CPU read time or write time is slower still).

The only real drawbacks to pinning memory are the reduction in available physical ram to the host demand-paging system, and the time it takes to pin memory (which is significantly longer than an ordinary malloc of the same size).
https://forums.developer.nvidia.com/t/advantages-disadvantages-of-using-pinned-memory/34422/2

https://forums.developer.nvidia.com/t/pinned-memory-slower-than-pageable-memory/18821
Note: These posts are old.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The re-org looks good to me, FWIW. But I'm curious about @shrshi's suggestion regarding bool vectors.

Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f71f53a into rapidsai:branch-24.10 Sep 20, 2024
102 checks passed
rapids-bot bot pushed a commit that referenced this pull request Sep 26, 2024
Depends on #16836
This change adds a new host tree building algorithms for JSON reader and utf8 field name support.

This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema. `mark_is_pruned`
Tree is constructed from pruned adjacency list, (with mixed types handling). `construct_tree`

utf8 field name support added: (spark requested)
utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column.

All unit tests passes, 1 unit test added where old algorithm fails.
This code is kept under experimental flag.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16545
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants