-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype #14147
Conversation
|
@kou Could you please review this change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix style by cmake --build BUILD_DIR --config Debug --target format
?
Ack. Ran |
Hey @kou , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to support encoder as well?
Could you open a new Jira issue for it? I think that it's better that we work on it in a separated pull request. |
Hi @kou , |
Could you rebase and resolve the current conflict? |
Update to static_cast Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Update to auto Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Done. Rebased and resolved merge conflicts. |
Could you confirm this failure? https://github.com/apache/arrow/actions/runs/3155990217/jobs/5136453754#step:6:3035
|
@kou |
Hi @kou , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks!
Benchmark runs are scheduled for baseline = 97ca1d2 and contender = 4660180. 4660180 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
public: | ||
explicit PlainBooleanDecoder(const ColumnDescriptor* descr); | ||
void SetData(int num_values, const uint8_t* data, int len) override; | ||
|
||
// Two flavors of bool decoding | ||
int Decode(uint8_t* buffer, int max_values) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kou . Why is this public function removed? This breaks the downstream where is depending on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #14147 (comment) .
Could you share your downstream code that uses this?
@sfc-gh-nthimmegowda Can we keep backward compatibility for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou I believe we can. Setting out a PR later today for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #14147 (comment) . Could you share your downstream code that uses this?
@sfc-gh-nthimmegowda Can we keep backward compatibility for this?
It is pretty straight-forward. The downstream code can use a vector of uint8_t instead of bool to hold a vector of decoded boolean values.
@sfc-gh-nthimmegowda Thanks for keeping backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac
Can you share the code pointer , abstract is fine.
Because this method / function was an abstract function which was overriden, just wondering if you called in base class or called directly via PlainBooleanEncoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used the template TypedDecoder class which is created by MakeTypedDecoder function defined here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.h#L447. Then it is casted to BooleanDecoder to use the uint8_t variant.
This PR triggers undefined behavior in fuzzing builds:
|
…pache#14147) Currently, parquet-cpp does not support columns encoded with RLE. Although the users of RLE are quite sparse with uses of one of the 3 types [Repetition and definition levels, dictionary indices and boolean values in data pages], [Parquet-encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/). Some implementations do encode this directly on boolean columns (Athena on AWS). Even though there is encoding and decoding support for repetition and definition levels, there is no support for boolean column with RLE. This PR integrates the column scanning to support columns with RLE. The first 4 bytes of the data length are size of the encoded data, which is parsed first and then passes to decoder. Added two tests with RLE boolean encoded parquet file to validate that values can be parsed individually and in a batch. Lead-authored-by: Nishanth Thimmegowda <nishanth.thimmegowda@snowflake.com> Co-authored-by: sfc-gh-nthimmegowda <nishanth.thimmegowda@snowflake.com> Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-62.us-west-2.compute.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-79.us-west-2.compute.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-6.us-west-2.compute.internal> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Currently, parquet-cpp does not support columns encoded with RLE. Although the users of RLE are quite sparse with uses of one of the 3 types [Repetition and definition levels, dictionary indices and boolean values in data pages], Parquet-encodings. Some implementations do encode this directly on boolean columns (Athena on AWS). Even though there is encoding and decoding support for repetition and definition levels, there is no support for boolean column with RLE.
This PR integrates the column scanning to support columns with RLE. The first 4 bytes of the data length are size of the encoded data, which is parsed first and then passes to decoder.
Added two tests with RLE boolean encoded parquet file to validate that values can be parsed individually and in a batch.