Expose `ArrowWriter` row group flush in public API #1626

Cheappie · 2022-04-28T12:17:40Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
From what I have read, predicate pushdown filtering in Parquet works at the row-group level, so in my case I should be able to optimize reads by manually closing row groups.

Describe the solution you'd like
Simply expose a `flush_row_group` method in the `ArrowWriter` API that flushes all buffered rows.
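To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is a toy model built only on the standard library, not the parquet crate: `ToyWriter`, its fields, and its `i64`-only rows are hypothetical stand-ins. Only the idea comes from this issue: rows are buffered, and a `flush_row_group` call closes the current row group early.

```rust
/// Toy stand-in for a row-group-buffering writer.
/// Hypothetical type for illustration; this is NOT the parquet crate's `ArrowWriter`.
#[derive(Debug)]
pub struct ToyWriter {
    /// Assumed size limit after which a row group is closed automatically.
    max_row_group_size: usize,
    /// Rows written since the last row group was closed.
    buffered: Vec<i64>,
    /// Row groups that have already been closed.
    row_groups: Vec<Vec<i64>>,
}

impl ToyWriter {
    pub fn new(max_row_group_size: usize) -> Self {
        assert!(max_row_group_size > 0, "row group size must be positive");
        Self {
            max_row_group_size,
            buffered: Vec::new(),
            row_groups: Vec::new(),
        }
    }

    /// Buffer rows, closing a row group whenever the buffer reaches the limit.
    pub fn write(&mut self, rows: &[i64]) {
        for &row in rows {
            self.buffered.push(row);
            if self.buffered.len() == self.max_row_group_size {
                self.flush_row_group();
            }
        }
    }

    /// Close the current row group early. Does nothing if no rows are buffered.
    pub fn flush_row_group(&mut self) {
        if !self.buffered.is_empty() {
            self.row_groups.push(std::mem::take(&mut self.buffered));
        }
    }

    /// Flush whatever is still buffered and return all row groups.
    pub fn close(mut self) -> Vec<Vec<i64>> {
        self.flush_row_group();
        self.row_groups
    }
}
```

With such a method, a caller can write one batch of related rows, call `flush_row_group`, and then write the next batch, so that each batch lands in its own row group regardless of the size limit.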

Describe alternatives you've considered
I have considered using `SerializedFileWriter`; however, because I do not fully understand definition and repetition levels, I would prefer to use a high-level API like `ArrowWriter`.

tustvold · 2022-04-29T22:13:52Z

I don't see any issue with exposing this; more power to the user. However, some thoughts:

- I wonder if you could just set the max row group size smaller if you want greater row group granularity.
- For compressible data, more row groups will likely lead to larger files, which might actually be slower to read.
- Similar to the above, the reader is designed to amortise per-row group costs over many rows. This works less well with smaller row groups.
- It is possible to prune at a more granular level, it just hasn't been implemented yet - Parquet Scan Filter #1191
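The granularity trade-off in these points can be sketched with a toy model of statistics-based pruning. Everything here is an assumption made for illustration: `RowGroupStats`, `stats_for`, `may_match`, and `rows_scanned` are hypothetical names, rows are plain `i64` values, and the model ignores the compression and per-row-group overheads mentioned above, which are exactly the costs that push in the other direction.

```rust
/// Hypothetical per-row-group statistics: min and max of a single i64 column.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
    pub num_rows: usize,
}

/// Compute statistics for one row group; returns None for an empty group.
pub fn stats_for(rows: &[i64]) -> Option<RowGroupStats> {
    let min = *rows.iter().min()?;
    let max = *rows.iter().max()?;
    Some(RowGroupStats { min, max, num_rows: rows.len() })
}

/// A row group can be skipped only if its [min, max] range
/// does not overlap the predicate range [lo, hi].
pub fn may_match(stats: &RowGroupStats, lo: i64, hi: i64) -> bool {
    stats.min <= hi && stats.max >= lo
}

/// Count the rows a reader would still have to scan for `lo <= value <= hi`
/// after pruning whole row groups by their statistics.
pub fn rows_scanned(row_groups: &[Vec<i64>], lo: i64, hi: i64) -> usize {
    row_groups
        .iter()
        .filter_map(|group| stats_for(group))
        .filter(|stats| may_match(stats, lo, hi))
        .map(|stats| stats.num_rows)
        .sum()
}
```

In this model, a single row group holding the values 0 through 7 has to be scanned in full for the predicate `2 <= value <= 3`, while the same data split into groups of two rows lets the reader skip everything except the group `[2, 3]`. Finer pruning is the upside; the larger files and less amortised per-group work described above are the downside.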

Cheappie · 2022-04-30T12:55:58Z

Thank you for sharing your thoughts with me, I will keep them in mind.

In my case I have a data structure similar to a row group, and within each such group I keep related data. By manually sizing row groups, my access path would be able to grab exactly the data I am interested in. Maybe I could achieve a similar outcome with PageIndex, but it's not ready yet and I am a bit short on time.

Could you tell me more about what the per-row-group costs are? I recall the statistics that make pruning possible, and surely there is some metadata as well.

Cheappie · 2022-05-01T06:27:44Z

Hi @tustvold, I wonder whether we need a test case for the new function, or whether it can go without one because it has a single-line body that delegates all logic to other functions?

Cheappie added the enhancement (Any new improvement worthy of an entry in the changelog) label Apr 28, 2022

Cheappie mentioned this issue May 1, 2022

expose row-group flush in public api #1634

Merged

tustvold closed this as completed in #1634 May 2, 2022

alamb changed the title ~~Expose ArrowWriter row group flush in public API~~ Expose ArrowWriter row group flush in public API May 12, 2022

alamb added the parquet (Changes to the parquet crate) label May 12, 2022
