Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fixed using lower limit than size of first parquet row group #1046

Merged
merged 3 commits into from
Jun 5, 2022

Conversation

arxra
Copy link
Contributor

@arxra arxra commented Jun 3, 2022

If the first row-group of parquet data had more rows than the limit, no data would be returned even if the chunk size was available in that limit and first row group

@jorgecarleitao jorgecarleitao added the bug Something isn't working label Jun 3, 2022
Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR!

I agree that there is an issue here. I left a comment as I think we should avoid advancing the iterator on try_new. Is there any way around that?

reader,
schema,
groups_filter,
metadata.row_groups.clone(),
chunk_size,
limit,
);
let current_row_group = row_groups.next().transpose()?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider something different here - this causes try_new to be O(N) since it advances the iterator.

@codecov
Copy link

codecov bot commented Jun 3, 2022

Codecov Report

Merging #1046 (93fda32) into main (06f8f36) will decrease coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1046      +/-   ##
==========================================
- Coverage   81.36%   81.29%   -0.07%     
==========================================
  Files         360      363       +3     
  Lines       34386    34651     +265     
==========================================
+ Hits        27978    28170     +192     
- Misses       6408     6481      +73     
Impacted Files Coverage Δ
src/io/parquet/read/file.rs 84.27% <100.00%> (+0.20%) ⬆️
src/io/parquet/read/row_group.rs 99.40% <100.00%> (+<0.01%) ⬆️
src/io/json_integration/mod.rs 72.72% <0.00%> (-27.28%) ⬇️
src/io/json/read/deserialize.rs 72.62% <0.00%> (-6.67%) ⬇️
src/array/binary/iterator.rs 80.95% <0.00%> (-5.72%) ⬇️
src/array/equal/mod.rs 83.20% <0.00%> (-4.00%) ⬇️
src/io/parquet/read/deserialize/boolean/nested.rs 67.05% <0.00%> (-0.80%) ⬇️
src/io/parquet/read/schema/convert.rs 94.21% <0.00%> (-0.74%) ⬇️
src/array/utf8/mod.rs 80.12% <0.00%> (-0.19%) ⬇️
src/ffi/schema.rs 91.85% <0.00%> (-0.02%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06f8f36...93fda32. Read the comment docs.

@arxra
Copy link
Contributor Author

arxra commented Jun 4, 2022

@jorgecarleitao This alternative instead looks at the current_row_group, which if None should not update the amount of remaining rows as no rows will be read on that iteration! This would still work if given a empty row group, but most importantly its None on initialization for reading the first one.

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great solution to this! Thanks a lot again.

(minor fmt error - let me know if you would like me to fix it)

@arxra
Copy link
Contributor Author

arxra commented Jun 5, 2022

Fixed :)

@jorgecarleitao jorgecarleitao merged commit 745c199 into jorgecarleitao:main Jun 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants