Simplify parquet arror `RecordReader` #1021

tustvold · 2021-12-10T14:02:09Z

Which issue does this PR close?

Closes #1020. Related to #171 (better performance reading dictionary encoded strings)

Rationale for this change

See ticket

What changes are included in this PR?

This alters RecordReader to remove some shared mutable state, along with the concept of being in the middle of a record.

Are there any user-facing changes?

No

codecov-commenter · 2021-12-10T14:18:17Z

Codecov Report

Merging #1021 (cd0f759) into master (e0abda2) will decrease coverage by 0.00%.
The diff coverage is 82.60%.

@@            Coverage Diff             @@
##           master    #1021      +/-   ##
==========================================
- Coverage   82.31%   82.30%   -0.01%     
==========================================
  Files         168      168              
  Lines       49031    49026       -5     
==========================================
- Hits        40359    40350       -9     
- Misses       8672     8676       +4

Impacted Files	Coverage Δ
parquet/src/arrow/record_reader.rs	`92.77% <82.60%> (-0.96%)`	⬇️
parquet/src/encodings/encoding.rs	`93.52% <0.00%> (-0.20%)`	⬇️
arrow/src/array/transform/mod.rs	`85.10% <0.00%> (-0.14%)`	⬇️
parquet_derive/src/parquet_field.rs	`66.21% <0.00%> (ø)`
arrow/src/datatypes/datatype.rs	`66.38% <0.00%> (+0.42%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0abda2...cd0f759. Read the comment docs.

alamb · 2021-12-10T14:24:21Z

Filed #1022 to track CI failure in "nightly" builds

alamb · 2021-12-10T15:13:50Z

I fixed the nightly failures in #1023 -- will merge to this PR to get that to pass too

alamb · 2021-12-10T15:40:37Z

I think we should run the parquet performance benchmark for this change -- I will do so

alamb

I read the code carefully and looks good to me. I am running the benchmarks on a GCP machine and will report the numbers shortly

alamb · 2021-12-10T15:49:30Z

parquet/src/arrow/record_reader.rs

@@ -75,9 +73,7 @@ impl<T: DataType> RecordReader<T> {
            column_desc: column_schema,
            num_records: 0,
            num_values: 0,
-            values_seen: 0,


These fields look like they have been here since the initial implementation by @liurenjie1024 in apache/arrow#4292

alamb · 2021-12-10T19:11:33Z

My performance tests showed no significant performance difference

tustvold/simplify-record-reader @ 290b24f
apache/master @ e0abda2

Test command

cargo bench -p parquet --bench arrow_array_reader --features=test_common -- --save-baseline <name>

Result:

alamb@instance-1:/data/arrow-rs$ critcmp master1 simplify-record-reader1
group                                                                                  master1                                simplify-record-reader1
-----                                                                                  -------                                -----------------------
arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - new      1.00    109.2±0.33µs        ? ?/sec    1.00    109.0±0.31µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - old      1.00     37.5±0.15µs        ? ?/sec    1.00     37.6±0.19µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - new     1.02    279.6±0.59µs        ? ?/sec    1.00    275.0±1.73µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - old     1.00    258.4±0.74µs        ? ?/sec    1.12    290.2±1.68µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - new       1.01    132.6±0.39µs        ? ?/sec    1.00    130.9±0.66µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - old       1.00    126.4±0.74µs        ? ?/sec    1.02    128.9±0.57µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - new           1.00      3.6±0.18µs        ? ?/sec    1.03      3.7±0.23µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - old           1.00      5.8±0.40µs        ? ?/sec    1.01      5.8±0.41µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - new          1.03    225.6±0.93µs        ? ?/sec    1.00    219.8±1.14µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - old          1.00    242.1±0.84µs        ? ?/sec    1.13    272.6±1.12µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - new            1.05     26.8±0.32µs        ? ?/sec    1.00     25.4±0.32µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - old            1.00     96.0±0.86µs        ? ?/sec    1.02     98.2±1.41µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - new     1.00    155.6±1.06µs        ? ?/sec    1.01    157.1±1.18µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - old     1.00   1201.5±3.71µs        ? ?/sec    1.00   1197.0±4.59µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new    1.00    358.7±1.41µs        ? ?/sec    1.00    358.4±2.82µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old    1.01   1086.9±3.57µs        ? ?/sec    1.00   1080.2±3.87µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - new      1.01    181.7±1.17µs        ? ?/sec    1.00    179.4±1.03µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - old      1.01   1273.7±7.95µs        ? ?/sec    1.00   1265.2±8.85µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - new          1.00    176.6±0.96µs        ? ?/sec    1.01    177.6±1.47µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - old          1.00   1377.9±7.25µs        ? ?/sec    1.02   1399.2±5.47µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new         1.00    380.6±1.63µs        ? ?/sec    1.00    380.0±2.58µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old         1.00   1179.4±4.56µs        ? ?/sec    1.00   1180.2±4.77µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - new           1.00    206.2±1.45µs        ? ?/sec    1.00    205.2±1.94µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - old           1.00  1452.5±17.75µs        ? ?/sec    1.00   1445.8±5.62µs        ? ?/sec

alamb · 2021-12-10T19:12:07Z

It looks like this PR needs some clippy appeasement: https://github.com/apache/arrow-rs/runs/4485244206?check_suite_focus=true

But otherwise looks good from my perspective

alamb

This looks like a nice simplification @tustvold 👍 I didn't see any discernable performance difference.

cc @nevi-me @andygrove @sunchao

sunchao · 2021-12-13T19:05:20Z

parquet/src/arrow/record_reader.rs

+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;


nit: maybe we can update this only once before returning from the method?

I think this would leave RecordReader in a strange state if read_one_batch returned an error, as self.num_values would have been updated and not self.num? I can't pull self.num_values out to match as it is used by count_records.

sunchao · 2021-12-13T19:30:08Z

parquet/src/arrow/record_reader.rs

+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {


Hmm, what if you haven't finished the current repeated list, and it continues to the next batch? seems we'll return here and count as if the repeated list has been read completely (since we'll increment the records_read here)?

what if you haven't finished the current repeated list

I'm not sure I follow, buf[current] == 0 implies we've reached the end of the list. Perhaps it would be more clear if the second condition were current != self.num_values it's only false on the first iteration? 🤔

Ah sorry, my bad. yea this looks OK. I think the downside is we could potentially read a batch of repLevels multiple times if, say, the repLevels are all non-zero values.

It's also strange that we initialize the repLevels to be the min batch size but keep growing it as we read more batches, until it hit the total number of levels for the entire column chunk.

Users of RecordReader call read_records and then call consume_rep_levels and friends to split data out. The result being it should only buffer a little bit more than the batch_size passed to read_records.

I agree this API is not particularly intuitive, I created #1032 in part because I felt these APIs were clearly not designed for external consumption. I believe the funky arises because ArrayReader wants to be able to stitch together multiple column chunks from different row groups (i.e. PageReader) into the same RecordBatch.

Thanks for the context. Yea I think consume_rep_levels and the friends are for assembling complex records like array, list and map. It'd be nice if we can simplify the APIs.

tustvold · 2021-12-13T20:53:35Z

Further context for this PR can be found in #1041 as it was what motivated me to juggle the logic a bit, so that I could traitify it

sunchao · 2021-12-13T21:16:14Z

parquet/src/arrow/record_reader.rs

            }

-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {
+                // Since page reader contains complete records, if we reached end of a


I'm not sure if this is true though. Take parquet-mr as example, this is true for the latest version but in versions before 1.11.0, it seems there is no such guarantee: https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.1/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L106, and a repeated list could span multiple pages.

See comment below, page reader is a column chunk. So this is effectively saying that a record can't be split across row groups, which I think is guaranteed?

sunchao · 2021-12-13T21:17:41Z

parquet/src/arrow/record_reader.rs

            }

-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {


I'm wondering if this should be called end_of_page since read_records consumes at most a page? a new page is set in ArrayReader.next_batch.

Ehehe, PageReader is actually a column chunk... So the end of a PageReader is the end of a row group, not the end of a page. Confusingly PageIterator is an iterator of PageReader which are themselves iterators of Page 😆

Ah got it, thanks 🤦 . It all makes sense now!

sunchao

LGTM

sunchao · 2021-12-13T21:44:54Z

Merged, thanks!

Simplify record reader

9930e50

github-actions bot added the parquet Changes to the parquet crate label Dec 10, 2021

Merge remote-tracking branch 'apache/master' into simplify-record-reader

290b24f

alamb reviewed Dec 10, 2021

View reviewed changes

alamb approved these changes Dec 10, 2021

View reviewed changes

alamb changed the title ~~Simplify record reader~~ Simplify parquet arror RecordReader Dec 10, 2021

Fix clippy lints

cd0f759

tustvold mentioned this pull request Dec 13, 2021

Generify ColumnReaderImpl and RecordReader (#1040) #1041

Merged

sunchao reviewed Dec 13, 2021

View reviewed changes

Tweak count_records predicate

a47bff5

sunchao reviewed Dec 13, 2021

View reviewed changes

sunchao approved these changes Dec 13, 2021

View reviewed changes

sunchao merged commit 07660c6 into apache:master Dec 13, 2021

tustvold mentioned this pull request Dec 17, 2021

Improve parquet reading performance for columns with nulls by preserving bitmask when possible (#1037) #1054

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify parquet arror `RecordReader` #1021

Simplify parquet arror `RecordReader` #1021

tustvold commented Dec 10, 2021 •

edited by alamb

Loading

codecov-commenter commented Dec 10, 2021 •

edited

Loading

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb left a comment

alamb Dec 10, 2021

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb left a comment

sunchao Dec 13, 2021

tustvold Dec 13, 2021

sunchao Dec 13, 2021

tustvold Dec 13, 2021 •

edited

Loading

tustvold Dec 13, 2021

sunchao Dec 13, 2021

sunchao Dec 13, 2021

tustvold Dec 13, 2021 •

edited

Loading

sunchao Dec 13, 2021

tustvold commented Dec 13, 2021

sunchao Dec 13, 2021 •

edited

Loading

tustvold Dec 13, 2021

sunchao Dec 13, 2021

tustvold Dec 13, 2021

sunchao Dec 13, 2021

sunchao left a comment

sunchao commented Dec 13, 2021

Simplify parquet arror RecordReader #1021

Simplify parquet arror RecordReader #1021

Conversation

tustvold commented Dec 10, 2021 • edited by alamb Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter commented Dec 10, 2021 • edited Loading

Codecov Report

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Dec 10, 2021

alamb commented Dec 10, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Dec 13, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Dec 13, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold commented Dec 13, 2021

sunchao Dec 13, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sunchao left a comment

Choose a reason for hiding this comment

sunchao commented Dec 13, 2021

Simplify parquet arror `RecordReader` #1021

Simplify parquet arror `RecordReader` #1021

tustvold commented Dec 10, 2021 •

edited by alamb

Loading

codecov-commenter commented Dec 10, 2021 •

edited

Loading

tustvold Dec 13, 2021 •

edited

Loading

tustvold Dec 13, 2021 •

edited

Loading

sunchao Dec 13, 2021 •

edited

Loading