Expose ArrowWriter row group flush in public API #1626

Closed
Cheappie opened this issue Apr 28, 2022 · 3 comments · Fixed by #1634
Labels
enhancement Any new improvement worthy of an entry in the changelog parquet Changes to the parquet crate

Comments

@Cheappie
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
From what I have read, predicate pushdown filtering in Parquet works at the row-group level, so in my case I should be able to optimize reads by manually closing row groups.

Describe the solution you'd like
Expose a flush_row_group method in the ArrowWriter API that flushes all buffered rows.

Describe alternatives you've considered
I have considered using SerializedFileWriter; however, since I do not fully understand definition and repetition levels, I would prefer to use a high-level API like ArrowWriter.
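
To make the request concrete, here is a minimal std-only sketch of the behaviour being asked for. It is a toy model, not the parquet crate: `ToyWriter`, its fields, and `row_group_sizes` are hypothetical names invented for illustration, and the real ArrowWriter may buffer and flush differently.

```rust
/// Toy stand-in for a buffering writer. NOT the parquet crate's ArrowWriter;
/// every name here is hypothetical and exists only for illustration.
struct ToyWriter {
    max_row_group_size: usize,
    buffered: Vec<i64>,
    row_groups: Vec<Vec<i64>>,
}

impl ToyWriter {
    fn new(max_row_group_size: usize) -> Self {
        assert!(max_row_group_size > 0, "row group size must be positive");
        Self {
            max_row_group_size,
            buffered: Vec::new(),
            row_groups: Vec::new(),
        }
    }

    /// Buffer rows, closing a row group automatically once the buffer is full.
    fn write(&mut self, rows: &[i64]) {
        for &row in rows {
            self.buffered.push(row);
            if self.buffered.len() == self.max_row_group_size {
                self.flush_row_group();
            }
        }
    }

    /// Close the current row group early. Does nothing if no rows are buffered.
    fn flush_row_group(&mut self) {
        if !self.buffered.is_empty() {
            self.row_groups.push(std::mem::take(&mut self.buffered));
        }
    }

    /// Flush whatever is left and hand back the finished row groups.
    fn close(mut self) -> Vec<Vec<i64>> {
        self.flush_row_group();
        self.row_groups
    }
}

/// Helper: number of rows in each row group.
fn row_group_sizes(groups: &[Vec<i64>]) -> Vec<usize> {
    groups.iter().map(Vec::len).collect()
}

fn main() {
    let mut writer = ToyWriter::new(4);
    writer.write(&[1, 2, 3]); // one unit of related rows
    writer.flush_row_group(); // manual boundary before the next unit
    writer.write(&[10, 11, 12, 13, 14]); // auto-flush after 4 rows, 1 left over
    let groups = writer.close();
    println!("{:?}", row_group_sizes(&groups)); // prints [3, 4, 1]
}
```

The point of the manual flush is that the first three related rows end up alone in their own row group instead of sharing one with the first row of the next unit.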

@Cheappie Cheappie added the enhancement Any new improvement worthy of an entry in the changelog label Apr 28, 2022
@tustvold
Contributor

I don't see any issue with exposing this; more power to the user. However, some thoughts:

  • I wonder if you could just set the max row group size smaller if you want greater row group granularity
  • For compressible data, more row groups will likely lead to larger files, which might actually be slower to read
  • Similar to the above, the reader is designed to amortise per-row-group costs over many rows. This works less well with smaller row groups
  • It is possible to prune at a more granular level; it just hasn't been implemented yet - Parquet Scan Filter #1191
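
The trade-off in the points above can be sketched with a toy model of statistics-based pruning. This is an assumption-laden illustration, not the parquet reader: `RowGroupStats`, `build_row_groups`, and `rows_scanned` are hypothetical names, the only statistics modelled are min and max, and the data is sorted, which is the best case for pruning.

```rust
/// Hypothetical per-row-group statistics. Real Parquet metadata carries more
/// than this; min/max is just enough to illustrate pruning.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Compute min/max statistics for one row group (None if it is empty).
fn stats_for(rows: &[i64]) -> Option<RowGroupStats> {
    let min = *rows.iter().min()?;
    let max = *rows.iter().max()?;
    Some(RowGroupStats { min, max, num_rows: rows.len() })
}

/// Split `rows` into row groups of at most `size` rows and record their stats.
/// `size` must be non-zero.
fn build_row_groups(rows: &[i64], size: usize) -> Vec<RowGroupStats> {
    rows.chunks(size).filter_map(stats_for).collect()
}

/// Rows that still have to be read for `value == target`: only row groups
/// whose [min, max] range could contain the target survive pruning.
fn rows_scanned(groups: &[RowGroupStats], target: i64) -> usize {
    groups
        .iter()
        .filter(|g| g.min <= target && target <= g.max)
        .map(|g| g.num_rows)
        .sum()
}

fn main() {
    let rows: Vec<i64> = (0..1000).collect();
    let coarse = build_row_groups(&rows, 500);
    let fine = build_row_groups(&rows, 50);
    // Fewer rows are scanned with small row groups, but there are ten times
    // as many metadata entries to store and inspect.
    println!("coarse: {} groups, {} rows scanned", coarse.len(), rows_scanned(&coarse, 123));
    println!("fine:   {} groups, {} rows scanned", fine.len(), rows_scanned(&fine, 123));
}
```

With 500-row groups the lookup reads 500 rows from 2 groups; with 50-row groups it reads 50 rows but has 20 sets of statistics to carry. The sketch does not model compression or I/O, which is where the larger-file and amortisation concerns come from.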

@Cheappie
Contributor Author

Cheappie commented Apr 30, 2022

Thank you for sharing your thoughts with me, I will keep them in mind.

In my case I have a data structure similar to a row group, and within each such row group I keep related data. On the access path, by manually sizing row groups I would be able to fetch exactly the data I am interested in. Maybe I could achieve a similar outcome with PageIndex, but it's not ready yet and I am a bit short on time.

Could you tell me more about what the per-row-group costs are? I recall the statistics that make pruning possible, and there is certainly some metadata.

@Cheappie
Contributor Author

Cheappie commented May 1, 2022

Hi @tustvold, I wonder whether we need a test case for the new function, or whether it can go without one because it has a single-line body that delegates all logic to other functions?

@alamb alamb changed the title Expose ArrowWriter row group flush in public API Expose ArrowWriter row group flush in public API May 12, 2022
@alamb alamb added the parquet Changes to the parquet crate label May 12, 2022
3 participants