Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1264: [C++] Add a writer option to align compression block with row group #2005

Closed
wants to merge 18 commits into from

Conversation

luffy-zh
Copy link
Contributor

@luffy-zh luffy-zh commented Aug 16, 2024

What changes were proposed in this pull request?

Add support for the ORC writer to ensure that the compression block is aligned with the row group boundary。

Why are the changes needed?

To reduce unnecessary I/O and decompression when PPD is in effect, we can enforce the compression block to be aligned with the row group boundary. For more detail, see link

How was this patch tested?

Uts in TestWriter.cc can convert this patch.

Was this patch authored or co-authored using generative AI tooling?

NO

@github-actions github-actions bot added the CPP label Aug 16, 2024
c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
c++/include/orc/Writer.hh Show resolved Hide resolved
c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Show resolved Hide resolved
c++/test/TestWriter.cc Outdated Show resolved Hide resolved
@wgtmac wgtmac changed the title ORC-1264: [C++] Add a writer option to align compression block with r… ORC-1264: [C++] Add a writer option to align compression block with row group Aug 29, 2024
@luffy-zh luffy-zh requested a review from wgtmac August 29, 2024 08:30
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM.

c++/src/ColumnWriter.hh Outdated Show resolved Hide resolved
c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
luffy-zh and others added 4 commits September 5, 2024 10:21
@dongjoon-hyun dongjoon-hyun added this to the 2.1.0 milestone Sep 20, 2024
* Get the row group positions of the specified column in the current stripe.
* @return the position entries for the specified columns.
*/
virtual std::vector<RowGroupPositions> getPositionEntries(int columnId) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I may have provided the wrong suggestion.

After reading the code base, we should move this function to Reader instead of RowReader. The main reason is that RowReader is used to read data and Reader is used to read metadata. A common use case is to read row group positions before creating the RowReader during the planning phase. I have also added std::map<uint32_t, BloomFilterIndex> Reader::getBloomFilters(uint32_t stripeIndex, const std::set<uint32_t>& included) const in the past, we may follow the same thing.

From orc_proto.proto file, we can find the exact type used for columnId and positions:

message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];
}

message Stream {
  optional uint32 column = 2;
}

Therefore my final suggestion would be as follows:

// Row group index of a single column in a stripe.
struct RowGroupIndex {
  // Positions are represented as a two-dimensional array where the first
  // dimension is row group index and the second dimension is the position
  // list of the row group. The size of the second dimension should be equal
  // among all row groups.
  std::vector<std::vector<uint64_t>> positions;
};

/**
 * Get row group index of all selected columns in the specified stripe
 * @param stripeIndex index of the stripe to be read for row group index.
 * @param included index of selected columns to return (if not specified,
 *        all columns will be returned).
 * @return map of row group index keyed by its column index.
 */
virtual std::vector<uint32_t, RowGroupIndex> Reader::getRowGroupIndex(
    uint32_t stripeIndex, const std::set<uint32_t>& included) const = 0;

In the future, we may also add std::vector<std::unique_ptr<ColumnStatistics>> to RowGroupIndex without breaking public API.


proto::RowIndex pbRowIndex;
if (!pbRowIndex.ParseFromZeroCopyStream(pbStream.get())) {
throw ParseError("Failed to parse RowIndex");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: provide more debug info, like stripeIndex and columnId?

const proto::Stream& stream = currentStripeFooter.streams(i);
uint32_t column = static_cast<uint32_t>(stream.column());
uint64_t length = static_cast<uint64_t>(stream.length());
RowGroupIndex rowGroupIndex = ret[column];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RowGroupIndex rowGroupIndex = ret[column];
RowGroupIndex& rowGroupIndex = ret[column];

This should break the test, or we need to add at least a test case to cover this new API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the RowGroupIndex size check in TestWriter.cc:97.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @luffy-zh for making the changes!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @luffy-zh and @wgtmac .
It's great to have this in Apache ORC 2.1.0.

@dongjoon-hyun
Copy link
Member

Merged to main for 2.1.0. Thank you again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants