-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1264: [C++] Add a writer option to align compression block with row group #2005
Conversation
…ow group boundary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM.
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
c++/include/orc/Reader.hh
Outdated
* Get the row group positions of the specified column in the current stripe. | ||
* @return the position entries for the specified columns. | ||
*/ | ||
virtual std::vector<RowGroupPositions> getPositionEntries(int columnId) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I may have provided the wrong suggestion.
After reading the code base, we should move this function to Reader
instead of RowReader
. The main reason is that RowReader
is used to read data and Reader
is used to read metadata. A common use case is to read row group positions before creating the RowReader during the planning phase. I have also added std::map<uint32_t, BloomFilterIndex> Reader::getBloomFilters(uint32_t stripeIndex, const std::set<uint32_t>& included) const
in the past, we may follow the same thing.
From orc_proto.proto
file, we can find the exact type used for columnId and positions:
message RowIndexEntry {
repeated uint64 positions = 1 [packed=true];
}
message Stream {
optional uint32 column = 2;
}
Therefore my final suggestion would be as follows:
// Row group index of a single column in a stripe.
struct RowGroupIndex {
// Positions are represented as a two-dimensional array where the first
// dimension is row group index and the second dimension is the position
// list of the row group. The size of the second dimension should be equal
// among all row groups.
std::vector<std::vector<uint64_t>> positions;
};
/**
* Get row group index of all selected columns in the specified stripe
* @param stripeIndex index of the stripe to be read for row group index.
* @param included index of selected columns to return (if not specified,
* all columns will be returned).
* @return map of row group index keyed by its column index.
*/
virtual std::vector<uint32_t, RowGroupIndex> Reader::getRowGroupIndex(
uint32_t stripeIndex, const std::set<uint32_t>& included) const = 0;
In the future, we may also add std::vector<std::unique_ptr<ColumnStatistics>>
to RowGroupIndex without breaking public API.
c++/src/Reader.cc
Outdated
|
||
proto::RowIndex pbRowIndex; | ||
if (!pbRowIndex.ParseFromZeroCopyStream(pbStream.get())) { | ||
throw ParseError("Failed to parse RowIndex"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: provide more debug info, like stripeIndex and columnId?
c++/src/Reader.cc
Outdated
const proto::Stream& stream = currentStripeFooter.streams(i); | ||
uint32_t column = static_cast<uint32_t>(stream.column()); | ||
uint64_t length = static_cast<uint64_t>(stream.length()); | ||
RowGroupIndex rowGroupIndex = ret[column]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowGroupIndex rowGroupIndex = ret[column]; | |
RowGroupIndex& rowGroupIndex = ret[column]; |
This should break the test, or we need to add at least a test case to cover this new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the RowGroupIndex size check in TestWriter.cc:97.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @luffy-zh for making the changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged to main for 2.1.0. Thank you again. |
What changes were proposed in this pull request?
Add support for the ORC writer to ensure that the compression block is aligned with the row group boundary。
Why are the changes needed?
To reduce unnecessary I/O and decompression when PPD is in effect, we can enforce the compression block to be aligned with the row group boundary. For more detail, see link
How was this patch tested?
Uts in TestWriter.cc can convert this patch.
Was this patch authored or co-authored using generative AI tooling?
NO