ARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype #14147

Merged
merged 16 commits into from
Oct 4, 2022

Conversation

sfc-gh-nthimmegowda
Contributor

@sfc-gh-nthimmegowda sfc-gh-nthimmegowda commented Sep 15, 2022

Currently, parquet-cpp does not support boolean columns encoded with RLE. According to the Parquet encodings documentation (https://parquet.apache.org/docs/file-format/data-pages/encodings/), RLE is used for only three kinds of data: repetition and definition levels, dictionary indices, and boolean values in data pages. RLE on boolean columns is rare, but some implementations (Athena on AWS, for example) do write boolean columns this way. parquet-cpp already has encoding and decoding support for repetition and definition levels, but it has no support for RLE-encoded boolean columns.

This PR integrates RLE decoding into column scanning. The first 4 bytes of the page data hold the size of the encoded data; this length is parsed first, and the encoded bytes are then passed to the decoder.

Added two tests that read an RLE-encoded boolean Parquet file to validate that values can be parsed individually and in a batch.
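
For readers unfamiliar with the layout described above, here is a minimal, self-contained sketch of decoding such a page body: a 4-byte little-endian length prefix followed by RLE/bit-packed hybrid runs with a bit width of 1. The function name `DecodeRleBooleans` and its signature are hypothetical and are not the parquet-cpp API; the actual change in this PR builds on Arrow's existing `RleDecoder`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the parquet-cpp API): decodes `num_values`
// booleans from a page body laid out as
//   [4-byte little-endian length][RLE/bit-packed hybrid runs, bit width 1].
std::vector<uint8_t> DecodeRleBooleans(const uint8_t* data, std::size_t len,
                                       std::size_t num_values) {
  assert(data != nullptr);
  if (len < 4) throw std::runtime_error("page too short for length prefix");
  // Parse the size of the encoded runs first.
  const uint32_t encoded_len = static_cast<uint32_t>(data[0]) |
                               (static_cast<uint32_t>(data[1]) << 8) |
                               (static_cast<uint32_t>(data[2]) << 16) |
                               (static_cast<uint32_t>(data[3]) << 24);
  if (encoded_len > len - 4) {
    throw std::runtime_error("length prefix exceeds page size");
  }
  const uint8_t* p = data + 4;
  const uint8_t* const end = p + encoded_len;

  std::vector<uint8_t> out;
  out.reserve(num_values);
  while (out.size() < num_values) {
    // Each run starts with an unsigned LEB128 varint header.
    uint64_t header = 0;
    for (int shift = 0;; shift += 7) {
      if (p == end || shift >= 64) throw std::runtime_error("bad run header");
      const uint8_t byte = *p++;
      header |= static_cast<uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) break;
    }
    const uint64_t count = header >> 1;
    if ((header & 1) == 0) {
      // RLE run: `count` repetitions of the value stored in the next byte.
      if (p == end) throw std::runtime_error("truncated RLE run");
      const uint8_t value = *p++ & 1;
      const std::size_t n = static_cast<std::size_t>(
          std::min<uint64_t>(count, num_values - out.size()));
      out.insert(out.end(), n, value);
    } else {
      // Bit-packed run: `count` groups of 8 values, one byte per group,
      // least significant bit first.
      for (uint64_t g = 0; g < count && out.size() < num_values; ++g) {
        if (p == end) throw std::runtime_error("truncated bit-packed run");
        const uint8_t bits = *p++;
        for (int b = 0; b < 8 && out.size() < num_values; ++b) {
          out.push_back((bits >> b) & 1);
        }
      }
    }
  }
  return out;
}
```

As a usage example, the 8-byte page `04 00 00 00 10 01 03 05` requested with `num_values = 11` decodes to eight `true` values (an RLE run of length 8) followed by `true, false, true` (the first three bits of one bit-packed group).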

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@sfc-gh-nthimmegowda
Contributor Author

@kou Could you please review this change?

@kou changed the title from "ARROW-17450 : Support RLE encoding for boolean datatype" to "ARROW-17450 : [C++][Parquet] Support RLE encoding for boolean datatype" on Sep 16, 2022
Member

@kou kou left a comment

Could you fix style by cmake --build BUILD_DIR --config Debug --target format?

@sfc-gh-nthimmegowda
Contributor Author

Could you fix style by cmake --build BUILD_DIR --config Debug --target format?

Ack. Ran --target format. It also changed 2 other files, which I included since the changes were minor.

@sfc-gh-nthimmegowda
Contributor Author

Hey @kou ,
Really appreciate you taking the time to review this. Let me know if there are any more issues, and I will address all of them.
Thanks

Member

@wgtmac wgtmac left a comment

Is it possible to support encoder as well?

@kou
Member

kou commented Sep 22, 2022

Could you open a new Jira issue for it? I think it's better to work on it in a separate pull request.

@kou changed the title from "ARROW-17450 : [C++][Parquet] Support RLE encoding for boolean datatype" to "ARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype" on Sep 22, 2022
@sfc-gh-nthimmegowda
Contributor Author

Hi @kou ,
Updated the new test file and refactored the changes. Please have a look.
Thanks

@kou
Member

kou commented Sep 30, 2022

Could you rebase and resolve the current conflict?

@sfc-gh-nthimmegowda
Contributor Author

Could you rebase and resolve the current conflict?

Done. Rebased and resolved merge conflicts.

@kou
Member

kou commented Sep 30, 2022

Could you confirm this failure? https://github.com/apache/arrow/actions/runs/3155990217/jobs/5136453754#step:6:3035

[----------] 2 tests from TestBooleanRLE
[ RUN      ] TestBooleanRLE.TestBooleanScanner
[       OK ] TestBooleanRLE.TestBooleanScanner (4 ms)
[ RUN      ] TestBooleanRLE.TestBatchRead
=================================================================
==24088==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd9a55539f at pc 0x7f4c6b9f033d bp 0x7ffd9a5528b0 sp 0x7ffd9a5528a8
WRITE of size 1 at 0x7ffd9a55539f thread T0
    #0 0x7f4c6b9f033c in void arrow::bit_util::detail::GetValue_<bool>(int, bool*, int, unsigned char const*, int*, int*, unsigned long*) /arrow/cpp/src/arrow/util/bit_stream_utils.h:279:6
    #1 0x7f4c6b9edb05 in int arrow::bit_util::BitReader::GetBatch<bool>(int, bool*, int) /arrow/cpp/src/arrow/util/bit_stream_utils.h:338:7
    #2 0x7f4c6b9ec0e6 in int arrow::util::RleDecoder::GetBatch<bool>(bool*, int) /arrow/cpp/src/arrow/util/rle_encoding.h:320:37
    #3 0x7f4c6b9dd98d in parquet::(anonymous namespace)::RleBooleanDecoder::Decode(bool*, int) /arrow/cpp/src/parquet/encoding.cc:2368:19
    #4 0x7f4c6b676a35 in parquet::(anonymous namespace)::ColumnReaderImplBase<parquet::PhysicalType<(parquet::Type::type)0> >::ReadValues(long, bool*) /arrow/cpp/src/parquet/column_reader.cc:576:45
    #5 0x7f4c6b65c99c in parquet::(anonymous namespace)::TypedColumnReaderImpl<parquet::PhysicalType<(parquet::Type::type)0> >::ReadBatch(long, short*, short*, bool*, long*) /arrow/cpp/src/parquet/column_reader.cc:1044:24
    #6 0x7a3c21 in parquet::TestBooleanRLE_TestBatchRead_Test::TestBody() /arrow/cpp/src/parquet/reader_test.cc:240:14
    #7 0x7f4c6d436a3a in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2607:10
    #8 0x7f4c6d41b569 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2643:14
    #9 0x7f4c6d3f4c42 in testing::Test::Run() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2682:5
    #10 0x7f4c6d3f5a08 in testing::TestInfo::Run() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2861:11
    #11 0x7f4c6d3f6223 in testing::TestSuite::Run() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:3015:28
    #12 0x7f4c6d407004 in testing::internal::UnitTestImpl::RunAllTests() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:5855:44
    #13 0x7f4c6d4398da in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2607:10
    #14 0x7f4c6d41dd89 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2643:14
    #15 0x7f4c6d406b60 in testing::UnitTest::Run() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:5438:10
    #16 0x7f4c6d471210 in RUN_ALL_TESTS() /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/include/gtest/gtest.h:2490:46
    #17 0x7f4c6d4711ec in main /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc:52:10
    #18 0x7f4c4efc5082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #19 0x47330d in _start (/build/cpp/debug/parquet-reader-test+0x47330d)

Address 0x7ffd9a55539f is located in stack of thread T0 at offset 1503 in frame
    #0 0x79e64f in parquet::TestBooleanRLE_TestBatchRead_Test::TestBody() /arrow/cpp/src/parquet/reader_test.cc:188

  This frame has 104 object(s):
    [32, 36) 'nvalues' (line 189)
    [48, 52) 'num_row_groups' (line 190)
    [64, 68) 'metadata_size' (line 191)
    [80, 96) 'group' (line 193)
    [112, 128) 'col' (line 196)
    [144, 160) 'ref.tmp' (line 196)
    [176, 192) 'gtest_ar' (line 199)
    [208, 216) 'ref.tmp9' (line 199)
    [240, 256) 'ref.tmp10' (line 199)
    [272, 280) 'ref.tmp30' (line 199)
    [304, 312) 'ref.tmp33' (line 199)
    [336, 352) 'gtest_ar47' (line 201)
    [368, 372) 'ref.tmp48' (line 201)
    [384, 400) 'ref.tmp49' (line 201)
    [416, 424) 'ref.tmp73' (line 201)
    [448, 456) 'ref.tmp76' (line 201)
    [480, 496) 'gtest_ar94' (line 203)
    [512, 516) 'ref.tmp95' (line 203)
    [528, 544) 'ref.tmp96' (line 203)
    [560, 568) 'ref.tmp120' (line 203)
    [592, 600) 'ref.tmp123' (line 203)
    [624, 640) 'gtest_ar141' (line 205)
    [656, 664) 'ref.tmp142' (line 205)
    [688, 696) 'ref.tmp160' (line 205)
    [720, 728) 'ref.tmp163' (line 205)
    [752, 760) 'col_chunk' (line 208)
    [784, 800) 'gtest_ar_' (line 209)
    [816, 817) 'ref.tmp190' (line 209)
    [832, 840) 'ref.tmp191' (line 209)
    [864, 872) 'agg.tmp'
    [896, 904) 'agg.tmp201'
    [928, 932) 'ref.tmp211' (line 209)
    [944, 952) 'ref.tmp218' (line 209)
    [976, 984) 'ref.tmp240' (line 209)
    [1008, 1016) 'ref.tmp243' (line 209)
    [1040, 1072) 'ref.tmp244' (line 209)
    [1104, 1120) 'gtest_ar_265' (line 213)
    [1136, 1137) 'ref.tmp266' (line 213)
    [1152, 1160) 'ref.tmp281' (line 213)
    [1184, 1192) 'ref.tmp284' (line 213)
    [1216, 1248) 'ref.tmp285' (line 213)
    [1280, 1288) 'curr_batch_read' (line 214)
    [1312, 1314) 'batch_size' (line 216)
    [1328, 1362) 'def_levels' (line 218)
    [1408, 1442) 'rep_levels' (line 219)
    [1488, 1503) 'values' (line 220) <== Memory access at offset 1503 overflows this variable
    [1520, 1528) 'levels_read' (line 222)
    [1552, 1568) 'gtest_ar316' (line 224)
    [1584, 1592) 'ref.tmp324' (line 224)
    [1616, 1624) 'ref.tmp327' (line 224)
    [1648, 1664) 'gtest_ar345' (line 228)
    [1680, 1684) 'ref.tmp346' (line 228)
    [1696, 1704) 'ref.tmp355' (line 228)
    [1728, 1736) 'ref.tmp358' (line 228)
    [1760, 1776) 'gtest_ar376' (line 231)
    [1792, 1860) 'ref.tmp377' (line 231)
    [1904, 1972) 'agg.tmp378'
    [2016, 2020) 'ref.tmp379' (line 231)
    [2032, 2036) 'ref.tmp380' (line 231)
    [2048, 2052) 'ref.tmp381' (line 231)
    [2064, 2068) 'ref.tmp382' (line 231)
    [2080, 2084) 'ref.tmp383' (line 231)
    [2096, 2100) 'ref.tmp384' (line 231)
    [2112, 2116) 'ref.tmp385' (line 231)
    [2128, 2132) 'ref.tmp386' (line 231)
    [2144, 2148) 'ref.tmp387' (line 231)
    [2160, 2164) 'ref.tmp388' (line 231)
    [2176, 2180) 'ref.tmp389' (line 231)
    [2192, 2196) 'ref.tmp390' (line 231)
    [2208, 2212) 'ref.tmp391' (line 231)
    [2224, 2228) 'ref.tmp392' (line 231)
    [2240, 2244) 'ref.tmp393' (line 231)
    [2256, 2260) 'ref.tmp394' (line 231)
    [2272, 2276) 'ref.tmp395' (line 231)
    [2288, 2296) 'ref.tmp423' (line 231)
    [2320, 2328) 'ref.tmp426' (line 231)
    [2352, 2368) 'gtest_ar444' (line 235)
    [2384, 2444) 'ref.tmp445' (line 235)
    [2480, 2540) 'agg.tmp446'
    [2576, 2580) 'ref.tmp447' (line 235)
    [2592, 2596) 'ref.tmp448' (line 235)
    [2608, 2612) 'ref.tmp449' (line 235)
    [2624, 2628) 'ref.tmp450' (line 235)
    [2640, 2644) 'ref.tmp451' (line 235)
    [2656, 2660) 'ref.tmp452' (line 235)
    [2672, 2676) 'ref.tmp453' (line 235)
    [2688, 2692) 'ref.tmp454' (line 235)
    [2704, 2708) 'ref.tmp455' (line 235)
    [2720, 2724) 'ref.tmp456' (line 235)
    [2736, 2740) 'ref.tmp457' (line 235)
    [2752, 2756) 'ref.tmp458' (line 235)
    [2768, 2772) 'ref.tmp459' (line 235)
    [2784, 2788) 'ref.tmp460' (line 235)
    [2800, 2804) 'ref.tmp461' (line 235)
    [2816, 2824) 'ref.tmp487' (line 235)
    [2848, 2856) 'ref.tmp490' (line 235)
    [2880, 2896) 'gtest_ar519' (line 241)
    [2912, 2920) 'ref.tmp527' (line 241)
    [2944, 2952) 'ref.tmp530' (line 241)
    [2976, 2992) 'gtest_ar_552' (line 245)
    [3008, 3009) 'ref.tmp553' (line 245)
    [3024, 3032) 'ref.tmp570' (line 245)
    [3056, 3064) 'ref.tmp573' (line 245)
    [3088, 3120) 'ref.tmp574' (line 245)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow /arrow/cpp/src/arrow/util/bit_stream_utils.h:279:6 in void arrow::bit_util::detail::GetValue_<bool>(int, bool*, int, unsigned char const*, int*, int*, unsigned long*)
Shadow bytes around the buggy address:
  0x1000334a2a20: f8 f2 f2 f2 00 f2 f2 f2 00 f2 f2 f2 f8 f2 f8 f2
  0x1000334a2a30: f2 f2 f8 f2 f2 f2 f8 f2 f2 f2 f8 f8 f8 f8 f2 f2
  0x1000334a2a40: f2 f2 f8 f8 f2 f2 f8 f2 f8 f2 f2 f2 f8 f2 f2 f2
  0x1000334a2a50: f8 f8 f8 f8 f2 f2 f2 f2 00 f2 f2 f2 02 f2 00 00
  0x1000334a2a60: 00 00 02 f2 f2 f2 f2 f2 00 00 00 00 02 f2 f2 f2
=>0x1000334a2a70: f2 f2 00[07]f2 f2 00 f2 f2 f2 f8 f8 f2 f2 f8 f2
  0x1000334a2a80: f2 f2 f8 f2 f2 f2 f8 f8 f2 f2 f8 f2 f8 f2 f2 f2
  0x1000334a2a90: f8 f2 f2 f2 f8 f8 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8
  0x1000334a2aa0: f8 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 04 f2
  0x1000334a2ab0: f2 f2 f2 f2 f8 f2 f8 f2 f8 f2 f8 f2 f8 f2 f8 f2
  0x1000334a2ac0: f8 f2 f8 f2 f8 f2 f8 f2 f8 f2 f8 f2 f8 f2 f8 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==24088==ABORTING
/build/cpp/src/parquet
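
One way to read the report above: `values` occupies frame offsets [1488, 1503), which is 15 bytes, and the faulting 1-byte write lands at offset 1503, the first byte past its end. In other words, the decoder was asked for (or produced) at least one more value than the destination could hold. The sketch below shows a generic defensive pattern for that class of bug, clamping the request to the buffer's capacity. `FakeDecode` and `ReadBatchBounded` are hypothetical names used only for illustration, and this is not necessarily the fix that was applied in this PR.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a batch decoder: writes `max_values` alternating
// booleans into `out` and returns how many it wrote.
int FakeDecode(bool* out, int max_values) {
  assert(out != nullptr);
  for (int i = 0; i < max_values; ++i) out[i] = (i % 2 == 0);
  return max_values;
}

// Clamp the requested batch size to the destination's capacity before
// decoding, so a batch size larger than the buffer cannot write past its end.
template <std::size_t N>
int ReadBatchBounded(std::array<bool, N>& values, int batch_size) {
  const int capacity = static_cast<int>(N);
  const int to_read = std::max(0, std::min(batch_size, capacity));
  return FakeDecode(values.data(), to_read);
}
```

With a 15-element `std::array<bool, 15>`, a request for 17 values is reduced to 15 and the call returns 15 instead of writing a 16th element.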

@sfc-gh-nthimmegowda
Contributor Author

sfc-gh-nthimmegowda commented Oct 2, 2022

Could you confirm this failure? https://github.com/apache/arrow/actions/runs/3155990217/jobs/5136453754#step:6:3035

@kou
Thanks for the callout. Hmmm, this was weird. I identified where it was failing in ASAN and fixed it. Validated in my fork that all C++ gates pass as well.
https://github.com/sfc-gh-nthimmegowda/arrow/actions/runs/3167603155

@sfc-gh-nthimmegowda
Contributor Author

Hi @kou ,
Can you please take a look? There are no merge conflicts and all the gates are green.

Member

@kou kou left a comment

+1

Thanks!

@kou kou merged commit 4660180 into apache:master Oct 4, 2022
@ursabot

ursabot commented Oct 4, 2022

Benchmark runs are scheduled for baseline = 97ca1d2 and contender = 4660180. 4660180 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 46601808 ec2-t3-xlarge-us-east-2
[Failed] 46601808 test-mac-arm
[Failed] 46601808 ursa-i9-9960x
[Finished] 46601808 ursa-thinkcentre-m75q
[Finished] 97ca1d25 ec2-t3-xlarge-us-east-2
[Failed] 97ca1d25 test-mac-arm
[Failed] 97ca1d25 ursa-i9-9960x
[Finished] 97ca1d25 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

 public:
  explicit PlainBooleanDecoder(const ColumnDescriptor* descr);
  void SetData(int num_values, const uint8_t* data, int len) override;

  // Two flavors of bool decoding
  int Decode(uint8_t* buffer, int max_values) override;
Member

Hi @kou. Why was this public function removed? This breaks downstream code that depends on it.

Member

See #14147 (comment) .
Could you share your downstream code that uses this?

@sfc-gh-nthimmegowda Can we keep backward compatibility for this?

Contributor Author

@kou I believe we can. I will send out a PR for this later today.

Member

@wgtmac wgtmac Oct 6, 2022

See #14147 (comment) . Could you share your downstream code that uses this?

@sfc-gh-nthimmegowda Can we keep backward compatibility for this?

It is pretty straightforward. The downstream code uses a vector of uint8_t instead of a vector of bool to hold the decoded boolean values.

@sfc-gh-nthimmegowda Thanks for keeping backward compatibility.
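
To illustrate why a downstream caller might prefer the `uint8_t` overload: `std::vector<bool>` is a bit-packed specialization that does not expose a `bool*` through `data()`, so a decoder that writes through a raw pointer cannot fill it directly, whereas a `std::vector<uint8_t>` offers contiguous, pointer-addressable storage. The sketch below is an assumption-laden illustration; `DecodeBits` is a hypothetical stand-in modelled on the `Decode(uint8_t* buffer, int max_values)` signature quoted above, not the parquet-cpp implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decoder's uint8_t overload: unpacks
// `max_values` bits (least significant bit first) into one byte per value.
int DecodeBits(const uint8_t* packed, uint8_t* buffer, int max_values) {
  assert(packed != nullptr && buffer != nullptr);
  for (int i = 0; i < max_values; ++i) {
    buffer[i] = (packed[i / 8] >> (i % 8)) & 1;
  }
  return max_values;
}

// Downstream-style wrapper: std::vector<uint8_t> provides the contiguous
// storage the decoder needs, which std::vector<bool> cannot.
std::vector<uint8_t> DecodeToBytes(const std::vector<uint8_t>& packed,
                                   int num_values) {
  assert(num_values >= 0 &&
         static_cast<std::size_t>(num_values) <= packed.size() * 8);
  if (num_values == 0) return {};
  std::vector<uint8_t> values(static_cast<std::size_t>(num_values));
  DecodeBits(packed.data(), values.data(), num_values);
  return values;
}
```

For instance, decoding 3 values from the single packed byte `0x05` gives the bytes `1, 0, 1`.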

Contributor Author

@wgtmac
Can you share a pointer to the code? An abstract description is fine.
Because this method was an abstract function that was overridden, I am wondering whether you called it through the base class or directly via PlainBooleanEncoder.

Member

I used the TypedDecoder class template, an instance of which is created by the MakeTypedDecoder function defined here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.h#L447. It is then cast to BooleanDecoder in order to use the uint8_t variant.

@pitrou
Member

pitrou commented Oct 13, 2022

This PR triggers undefined behavior in fuzzing builds:

/build/build-fuzz/debug/parquet-arrow-fuzz: Running 1 inputs 1 time(s) each.
Running: clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5017972913864704
/home/antoine/arrow/dev/cpp/src/arrow/util/bit_stream_utils.h:415:42: runtime error: load of value 32, which is not a valid value for type 'bool'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/antoine/arrow/dev/cpp/src/arrow/util/bit_stream_utils.h:415:42 in 
==34310== ERROR: libFuzzer: deadly signal
    #0 0x55771d20cef1 in __sanitizer_print_stack_trace (/build/build-fuzz/debug/parquet-arrow-fuzz+0xe5ef1) (BuildId: 7423ef34b89be7542c884a94acdc1490cca072b4)
    #1 0x55771d17fe77 in fuzzer::PrintStackTrace() crtstuff.c
    #2 0x55771d1659a3 in fuzzer::Fuzzer::CrashCallback() crtstuff.c
    #3 0x7f9ccb2e741f  (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f) (BuildId: 7b4536f41cdaa5888408e82d0836e33dcf436466)
    #4 0x7f9ccb10900a in __libc_signal_restore_set /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3
    #5 0x7f9ccb10900a in raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3
    #6 0x7f9ccb0e8858 in abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79:7
    #7 0x55771d227176 in __sanitizer::Abort() crtstuff.c
    #8 0x55771d225000 in __sanitizer::Die() crtstuff.c
    #9 0x55771d238d3b in __ubsan::ScopedReport::~ScopedReport() crtstuff.c
    #10 0x55771d23baea in handleLoadInvalidValue(__ubsan::InvalidValueData*, unsigned long, __ubsan::ReportOptions) crtstuff.c
    #11 0x55771d23bb2d in __ubsan_handle_load_invalid_value_abort (/build/build-fuzz/debug/parquet-arrow-fuzz+0x114b2d) (BuildId: 7423ef34b89be7542c884a94acdc1490cca072b4)
    #12 0x7f9ce5fbe8fa in bool arrow::bit_util::BitReader::GetAligned<bool>(int, bool*) /home/antoine/arrow/dev/cpp/src/arrow/util/bit_stream_utils.h:415:42
    #13 0x7f9ce5fbc533 in bool arrow::util::RleDecoder::NextCounts<bool>() /home/antoine/arrow/dev/cpp/src/arrow/util/rle_encoding.h:663:22
    #14 0x7f9ce5fb9087 in int arrow::util::RleDecoder::GetBatch<bool>(bool*, int) /home/antoine/arrow/dev/cpp/src/arrow/util/rle_encoding.h:329:12
    #15 0x7f9ce5faa13d in parquet::(anonymous namespace)::RleBooleanDecoder::Decode(bool*, int) /home/antoine/arrow/dev/cpp/src/parquet/encoding.cc:2388:19
    #16 0x7f9ce5d179a8 in parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0> >::ReadValuesDense(long) /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1531:33
    #17 0x7f9ce5d1ac37 in parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0> >::ReadRecordData(long) /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1575:7
    #18 0x7f9ce5d139b4 in parquet::internal::(anonymous namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0> >::ReadRecords(long) /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1331:25
    #19 0x7f9ce572137d in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:479:46
    #20 0x7f9ce571821e in parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr<arrow::ChunkedArray>*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:109:5
    #21 0x7f9ce5785828 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector<int, std::allocator<int> > const&, parquet::arrow::ColumnReader*, std::shared_ptr<arrow::ChunkedArray>*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:285:20
    #22 0x7f9ce57fe70b in parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::$_4::operator()(unsigned long, std::shared_ptr<parquet::arrow::ColumnReaderImpl>) const /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1236:5
    #23 0x7f9ce57fa00c in arrow::Future<std::vector<std::shared_ptr<arrow::ChunkedArray>, std::allocator<std::shared_ptr<arrow::ChunkedArray> > > > arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::$_4&, std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::shared_ptr<arrow::ChunkedArray> >(bool, std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) /home/antoine/arrow/dev/cpp/src/arrow/util/parallel.h:95:7
    #24 0x7f9ce57f89bb in parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1254:10
    #25 0x7f9ce56f9266 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1215:14
    #26 0x7f9ce56f7e57 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:322:12
    #27 0x7f9ce56f83ab in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::shared_ptr<arrow::Table>*) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:326:12
    #28 0x7f9ce56e85d1 in parquet::arrow::internal::FuzzReader(std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1338:37
    #29 0x7f9ce56e9c35 in parquet::arrow::internal::FuzzReader(unsigned char const*, long) /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1355:10
    #30 0x55771d2402e7 in LLVMFuzzerTestOneInput /home/antoine/arrow/dev/cpp/src/parquet/arrow/fuzz.cc:22:17
    #31 0x55771d1670b3 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) crtstuff.c
    #32 0x55771d15146f in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) crtstuff.c
    #33 0x55771d157176 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) crtstuff.c
    #34 0x55771d1807b2 in main (/build/build-fuzz/debug/parquet-arrow-fuzz+0x597b2) (BuildId: 7423ef34b89be7542c884a94acdc1490cca072b4)
    #35 0x7f9ccb0ea082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
    #36 0x55771d14bc5d in _start (/build/build-fuzz/debug/parquet-arrow-fuzz+0x24c5d) (BuildId: 7423ef34b89be7542c884a94acdc1490cca072b4)
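
The UBSan message above says that a byte with value 32 was loaded as a `bool`. In C++, reading a `bool` object whose representation is anything other than 0 or 1 is undefined behavior, so untrusted bytes should not be copied directly into `bool` storage. A minimal sketch of two alternatives follows: read the raw byte as `uint8_t` and either reject or normalize it before producing a `bool`. The function names are hypothetical, and this is an illustration of the general technique rather than the fix adopted in Arrow.

```cpp
#include <cassert>
#include <cstdint>

// Strict variant: accepts only bytes 0 and 1. Returns false for anything
// else so the caller can report corrupt input; `*out` is written only on
// success and never holds an invalid bool representation.
bool ReadBoolStrict(const uint8_t* src, bool* out) {
  assert(src != nullptr && out != nullptr);
  const uint8_t raw = *src;  // read as an integer, not as bool
  if (raw > 1) return false;
  *out = (raw == 1);
  return true;
}

// Lenient variant: treats any nonzero byte as true. The comparison yields
// a well-formed bool regardless of the input byte.
bool ReadBoolLenient(const uint8_t* src) {
  assert(src != nullptr);
  return *src != 0;
}
```

Given the byte 32 from the fuzzer report, `ReadBoolStrict` returns `false` (invalid input) while `ReadBoolLenient` returns `true`; which behavior is appropriate depends on whether malformed files should be rejected or tolerated.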

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…pache#14147)

Lead-authored-by: Nishanth Thimmegowda <nishanth.thimmegowda@snowflake.com>
Co-authored-by: sfc-gh-nthimmegowda <nishanth.thimmegowda@snowflake.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-62.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-79.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-6.us-west-2.compute.internal>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
5 participants