Remove internal buffering from AsyncArrowWriter (#5484) #5485
Conversation
pub fn try_new_with_options(
    writer: W,
    arrow_schema: SchemaRef,
    buffer_size: usize,
I debated keeping this around as a `capacity` argument, but decided this was likely a premature optimisation. We can always add a `with_capacity` function down the line if necessary.

As an aside, my hope is that #5458 will remove the need for using …
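For orientation, a rough sketch of the API surface being discussed; the post-change `try_new` signature is taken from this PR, while `with_capacity` is purely hypothetical and does not exist:

```rust
use arrow_schema::SchemaRef;
use parquet::arrow::AsyncArrowWriter;
use parquet::errors::Result;
use parquet::file::properties::WriterProperties;
use tokio::io::AsyncWrite;

// Pre-PR (for comparison): try_new(writer, schema, buffer_size, props).
// Post-PR: the buffer_size argument is gone. The commented-out line sketches
// the hypothetical `with_capacity` mentioned above; it is not a real API.
fn make_writer<W: AsyncWrite + Unpin + Send>(
    writer: W,
    schema: SchemaRef,
    props: Option<WriterProperties>,
) -> Result<AsyncArrowWriter<W>> {
    // AsyncArrowWriter::with_capacity(writer, schema, 1024 * 1024, props)
    AsyncArrowWriter::try_new(writer, schema, props)
}
```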
This PR doesn't make sense to me -- it seems like it changes the writer so that it does an object store write after every row group. That seems like it could result in a significant regression in performance compared to buffering up multiple row groups.
Maybe we can just update the docs to make it clear that `buffer_size` is not a cap, but simply a minimum before the data is flushed. Though it seems already pretty clear.
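A rough sketch, not the writer's actual code, of the semantics described here and in the old doc comment quoted in the diff below: `buffer_size` acted as a flush threshold rather than a cap, so memory was still governed by how much data arrived per row group:

```rust
use std::io::Write;

/// Illustrative only: the names and the "at least half full" heuristic come
/// from the old doc comment quoted below; this is not the real implementation.
struct ThresholdBuffer<W: Write> {
    inner: W,
    staged: Vec<u8>,
    buffer_size: usize,
}

impl<W: Write> ThresholdBuffer<W> {
    fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        // Incoming data is staged unconditionally, so peak memory is set by
        // how much arrives between flush checks (in practice, the row group
        // size), not by buffer_size.
        self.staged.extend_from_slice(data);
        if self.staged.len() >= self.buffer_size / 2 {
            self.inner.write_all(&self.staged)?;
            self.staged.clear();
        }
        Ok(())
    }
}
```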
@@ -168,14 +143,18 @@ impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> {
    /// After every sync write by the inner [ArrowWriter], the inner buffer will be
    /// checked and flush if at least half full
this comment no longer seems correct
The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other `AsyncWrite` implementations, e.g. tokio `File`: we should flush data as soon as we can and let them make a judgement over how best to perform the actual IO. I also intend to remove the use of `AsyncWrite` from `object_store`, as it's a problematic interface.
The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other `AsyncWrite` implementations, e.g. tokio `File`: we should flush data as soon as we can and let them make a judgement over how best to perform the actual IO.
I think I was confused about why we are talking about `object_store` here. This API isn't in terms of an `object_store`, it is in terms of `AsyncWrite`: https://docs.rs/tokio/latest/tokio/io/trait.AsyncWrite.html
Leaving buffering to the underlying writer makes a lot of sense to me and follows the other Rust I/O conventions.
Maybe we can add a note to the documentation explaining this rationale -- I left a suggestion on how to reword the description
Thank you @tustvold
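To make the convention concrete, here is a minimal sketch of a caller opting into buffering by wrapping the `AsyncWrite` in `tokio::io::BufWriter`, rather than relying on the parquet writer to buffer internally. The schema, file name, and data are made up for illustration, and the constructor call assumes the post-change signature:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::AsyncArrowWriter;
use tokio::fs::File;
use tokio::io::BufWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // The caller decides how (and whether) to buffer: a tokio BufWriter here
    // coalesces the writer's eager writes into larger file I/O operations.
    let file = File::create("example.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(BufWriter::new(file), schema, None)?;

    writer.write(&batch).await?;
    writer.close().await?;
    Ok(())
}
```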
/// The columnar nature of parquet forces buffering data for an entire row group, as such
/// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
/// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
/// flushing the row group, although this will have implications for file size and query
/// performance. See [ArrowWriter] for more information.
I think it would help to focus this comment on the rationale and implications for users, rather than implementation.
Something like:
/// Similar to the standard Rust I/O API such as `std::fs::File`, this writer eagerly writes
/// data to the underlying `AsyncWrite` as soon as possible. This permits fine-grained control
/// over buffering and I/O scheduling.
///
/// Note that the columnar nature of parquet forces buffering an entire row group
/// before flushing it to the provided [`AsyncWrite`]. Depending on the data and the configured
/// row group size, the buffer required may be substantial. Memory usage can be limited by
/// calling [`Self::flush`] to flush the in-progress row group, although this will likely
/// increase overall file size and reduce query performance. See [ArrowWriter] for more information.
///
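A minimal usage sketch of what that doc comment describes; the batch contents, file name, and flush threshold are made up, and a real application would pick the flush cadence based on its own memory budget:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::AsyncArrowWriter;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let file = File::create("bounded.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(file, schema.clone(), None)?;

    let mut rows_since_flush = 0;
    for chunk in 0..10i64 {
        let column: ArrayRef =
            Arc::new(Int64Array::from_iter_values(chunk * 1_000..(chunk + 1) * 1_000));
        let batch = RecordBatch::try_new(schema.clone(), vec![column])?;
        writer.write(&batch).await?;
        rows_since_flush += batch.num_rows();

        // Flushing early bounds the memory held by the in-progress row group,
        // at the cost of more, smaller row groups in the resulting file.
        if rows_since_flush >= 5_000 {
            writer.flush().await?;
            rows_since_flush = 0;
        }
    }

    writer.close().await?;
    Ok(())
}
```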
The argument was removed. See: apache/arrow-rs#5485
…apache#5485)" This reverts commit 19a3bb0.
Which issue does this PR close?
Closes #5484
Rationale for this change
Having a separate buffer size argument is confusing, especially when it doesn't actually limit memory consumption in practice, which is instead bounded by the row group size.
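As a concrete illustration (the row count below is arbitrary), the knob that actually bounds memory is the row group size configured through `WriterProperties`:

```rust
use parquet::file::properties::WriterProperties;

/// Memory is bounded per row group, so the setting that matters is
/// `max_row_group_size` (in rows), not a writer-side buffer size.
fn row_group_limited_props() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(64 * 1024)
        .build()
}
```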
What changes are included in this PR?
Are there any user-facing changes?
Yes