-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON tree algorithm code reorg #16836
JSON tree algorithm code reorg #16836
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved trivial CMake changes
bitmask_type* validity; | ||
}; | ||
|
||
std::pair<cudf::detail::host_vector<uint8_t>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change ignore_vals
returned here to a bool vector? We can also move this change to #16545
EDIT: Similar suggestion for is_pruned
and is_str_column_all_nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I've seen concerns raised in the past regarding the use of bool
vectors, and recommendations to use int8_t
/uint8_t
instead. I didn't think host_vector<bool>
would be plagued with the same problems as std::vector<bool>
, but I could be wrong.
Watching this conversation for @karthikeyann's opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mythrocks That's right. 💯
std::vector<bool>
stores each element as a bit (to save memory). so, iteration and writing to data pointer causes issues.
cudf::detail::host_vector
is based on thrust::host_vector
which are simple vectors. (without any specialization for bool). So, that should be suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thrust::host_vector<bool>
will use byte-packing (not bits).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason why I used uint8_t
is because using 0
and 1
is easier to read. (😄)
- Updating
ignore_vals
to host_vector is fine since it needs to be copied to device, so pinned memory will help. I will update it in next PR [REVIEW] JSON host tree algorithms #16545 is_pruned
andis_str_column_all_nulls
are accessed only in host, and not copied to device. So, no benefit in using pinned memory, also allocation is slower. (not sure if CPU read time or write time is slower still).
The only real drawbacks to pinning memory are the reduction in available physical ram to the host demand-paging system, and the time it takes to pin memory (which is significantly longer than an ordinary malloc of the same size).
https://forums.developer.nvidia.com/t/advantages-disadvantages-of-using-pinned-memory/34422/2
https://forums.developer.nvidia.com/t/pinned-memory-slower-than-pageable-memory/18821
Note: These posts are old.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The re-org looks good to me, FWIW. But I'm curious about @shrshi's suggestion regarding bool
vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
/merge |
Depends on #16836 This change adds a new host tree building algorithms for JSON reader and utf8 field name support. This constructs the device_column_tree using an adjacency list created from parent information. This adjacency list is pruned based on input schema, and also types are enforced as per schema. `mark_is_pruned` Tree is constructed from pruned adjacency list, (with mixed types handling). `construct_tree` utf8 field name support added: (spark requested) utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column. All unit tests passes, 1 unit test added where old algorithm fails. This code is kept under experimental flag. Authors: - Shruti Shivakumar (https://github.com/shrshi) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) URL: #16545
Description
This PR moves JSON host tree algorithms to separate file.
This code movement will help #16545 review easier.
The code is moved to new file and reorganized for code reuse.
Very long function
make_device_json_column
is split intoreduce_to_column_tree
callbuild_tree
scatter_offsets
No new functionality is added in this PR.
Checklist