Fixed using lower limit than size of first parquet row group #1046

arxra · 2022-06-03T15:06:25Z

If the first row group of the parquet data had more rows than the limit, no data was returned, even when a chunk of the requested chunk size fit within both the limit and the first row group.

jorgecarleitao

Thanks a lot for the PR!

I agree that there is an issue here. I left a comment as I think we should avoid advancing the iterator on try_new. Is there any way around that?

jorgecarleitao · 2022-06-03T16:22:17Z

src/io/parquet/read/file.rs

            reader,
            schema,
            groups_filter,
            metadata.row_groups.clone(),
            chunk_size,
            limit,
        );
+        let current_row_group = row_groups.next().transpose()?;


I think we should consider something different here - this causes try_new to be O(N) since it advances the iterator.
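To make the concern concrete, here is a minimal sketch (not the arrow2 code) of the difference between a constructor that advances the row-group iterator and one that defers it. `CountingGroups`, `EagerReader`, `LazyReader`, and `loads_after_construction` are hypothetical names made up for this illustration; the counter simply records how many row groups were loaded by the time the constructor returns.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical row-group source that counts how many groups it has loaded.
struct CountingGroups {
    groups: std::vec::IntoIter<Vec<i32>>,
    loads: Rc<Cell<usize>>,
}

impl Iterator for CountingGroups {
    type Item = Vec<i32>;

    fn next(&mut self) -> Option<Vec<i32>> {
        let group = self.groups.next()?;
        // In a real reader this is where the I/O and decoding cost would be paid.
        self.loads.set(self.loads.get() + 1);
        Some(group)
    }
}

/// Advances the iterator inside the constructor.
#[allow(dead_code)]
struct EagerReader {
    row_groups: CountingGroups,
    current: Option<Vec<i32>>,
}

impl EagerReader {
    fn new(mut row_groups: CountingGroups) -> Self {
        let current = row_groups.next();
        Self { row_groups, current }
    }
}

/// Leaves the iterator untouched until the caller starts iterating.
#[allow(dead_code)]
struct LazyReader {
    row_groups: CountingGroups,
    current: Option<Vec<i32>>,
}

impl LazyReader {
    fn new(row_groups: CountingGroups) -> Self {
        Self { row_groups, current: None }
    }
}

/// Returns how many row groups were loaded by the time the constructor returned.
fn loads_after_construction(eager: bool) -> usize {
    let loads = Rc::new(Cell::new(0));
    let source = CountingGroups {
        groups: vec![vec![1, 2, 3], vec![4, 5]].into_iter(),
        loads: Rc::clone(&loads),
    };
    if eager {
        let _reader = EagerReader::new(source);
    } else {
        let _reader = LazyReader::new(source);
    }
    loads.get()
}
```

In this toy model the eager constructor has already loaded one row group before the caller asks for any data, while the lazy one has loaded none.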

codecov · 2022-06-03T16:30:38Z

Codecov Report

Merging #1046 (93fda32) into main (06f8f36) will decrease coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1046      +/-   ##
==========================================
- Coverage   81.36%   81.29%   -0.07%     
==========================================
  Files         360      363       +3     
  Lines       34386    34651     +265     
==========================================
+ Hits        27978    28170     +192     
- Misses       6408     6481      +73

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/io/parquet/read/file.rs | `84.27% <100.00%> (+0.20%)` | ⬆️ |
| src/io/parquet/read/row_group.rs | `99.40% <100.00%> (+<0.01%)` | ⬆️ |
| src/io/json_integration/mod.rs | `72.72% <0.00%> (-27.28%)` | ⬇️ |
| src/io/json/read/deserialize.rs | `72.62% <0.00%> (-6.67%)` | ⬇️ |
| src/array/binary/iterator.rs | `80.95% <0.00%> (-5.72%)` | ⬇️ |
| src/array/equal/mod.rs | `83.20% <0.00%> (-4.00%)` | ⬇️ |
| src/io/parquet/read/deserialize/boolean/nested.rs | `67.05% <0.00%> (-0.80%)` | ⬇️ |
| src/io/parquet/read/schema/convert.rs | `94.21% <0.00%> (-0.74%)` | ⬇️ |
| src/array/utf8/mod.rs | `80.12% <0.00%> (-0.19%)` | ⬇️ |
| src/ffi/schema.rs | `91.85% <0.00%> (-0.02%)` | ⬇️ |

... and 21 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

arxra · 2022-06-04T16:53:18Z

@jorgecarleitao This alternative instead checks current_row_group: if it is None, the number of remaining rows should not be updated, since no rows are read on that iteration. This still works when given an empty row group, but most importantly, it is None on initialization, before the first row group is read.
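As a rough illustration of that idea (a simplified model, not the actual change to src/io/parquet/read/file.rs), the sketch below keeps the current row group as `None` at construction and only decrements the remaining-row count when rows are actually yielded. `LimitedReader` and `read_with_limit` are hypothetical names, and row groups are modelled as plain `Vec<i32>` values.

```rust
/// Hypothetical, simplified stand-in for a reader over row groups.
struct LimitedReader<I: Iterator<Item = Vec<i32>>> {
    row_groups: I,
    /// `None` until the first call to `next`, and again between row groups.
    current: Option<std::vec::IntoIter<i32>>,
    chunk_size: usize,
    remaining_rows: usize,
}

impl<I: Iterator<Item = Vec<i32>>> LimitedReader<I> {
    /// Does not advance `row_groups`; the first group is loaded lazily.
    fn new(row_groups: I, chunk_size: usize, limit: usize) -> Self {
        Self { row_groups, current: None, chunk_size, remaining_rows: limit }
    }
}

impl<I: Iterator<Item = Vec<i32>>> Iterator for LimitedReader<I> {
    type Item = Vec<i32>;

    fn next(&mut self) -> Option<Vec<i32>> {
        loop {
            if self.remaining_rows == 0 {
                return None;
            }
            match self.current.as_mut() {
                None => {
                    // No rows are read on this step, so `remaining_rows`
                    // is left untouched; we only load the next row group.
                    self.current = Some(self.row_groups.next()?.into_iter());
                }
                Some(rows) => {
                    let take = self.chunk_size.min(self.remaining_rows);
                    let chunk: Vec<i32> = rows.by_ref().take(take).collect();
                    if chunk.is_empty() {
                        // Row group exhausted (or empty): move on to the next one.
                        self.current = None;
                    } else {
                        // Only rows actually yielded count against the limit.
                        self.remaining_rows -= chunk.len();
                        return Some(chunk);
                    }
                }
            }
        }
    }
}

/// Convenience wrapper that collects every chunk the reader yields.
fn read_with_limit(groups: Vec<Vec<i32>>, chunk_size: usize, limit: usize) -> Vec<Vec<i32>> {
    LimitedReader::new(groups.into_iter(), chunk_size, limit).collect()
}
```

With a first row group of 10 rows, a chunk size of 4, and a limit of 6, this model yields a chunk of 4 rows followed by a chunk of 2, rather than returning nothing.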

jorgecarleitao

Great solution to this! Thanks a lot again.

(minor fmt error - let me know if you would like me to fix it)

arxra · 2022-06-05T06:05:07Z

Fixed :)

Fixed using lower limit than size of first parquet row group

6398333

jorgecarleitao added the "bug" label (Something isn't working) on Jun 3, 2022

jorgecarleitao reviewed Jun 3, 2022

Do not initiallize the iterator on creation, instead handle None on row groups

753a4ce

jorgecarleitao approved these changes Jun 4, 2022

fmt

93fda32

jorgecarleitao merged commit 745c199 into jorgecarleitao:main Jun 5, 2022
