API to get memory usage for parquet ArrowWriter #5851
Comments
BTW there is already https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#method.in_progress_size but that only accounts for the actual parquet data in progress, not any internal buffering structures in the writer itself
can I take it?
That would be amazing -- thank you @Rachelint
I'll be on the lookout for a PR -- please ping me when you are ready for feedback
ok!
Thanks @Rachelint -- maybe you could help review #5967 if you are still interested
Draft PR is up and undergoing code review. Please assign to me.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When writing parquet files, depending on the writer settings and the data being written, we have observed the ArrowWriter consuming large amounts of memory (10s of GB) -- see #5828
The memory usage of parquet writers also often comes up in the context of proposals for new parquet formats
There is already a discussion about how to limit memory when writing here https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-limiting
However, there is currently no way to get a measurement of the actual current memory use (that we could use to abort the write, for example).
Describe the solution you'd like
I would like some way to get visibility into the current memory usage of the internal buffering in the parquet writer
Describe alternatives you've considered
I propose adding a function to ArrowWriter modeled on Array::get_array_memory_size
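To make the idea concrete, here is a minimal sketch of the distinction I have in mind. It uses only `std`; `ColumnBuffer`, `SketchWriter`, `memory_size`, and `check_memory_limit` are hypothetical names for illustration, not the parquet crate's actual types or API. The point is that an `in_progress_size`-style estimate counts encoded bytes, while a `get_array_memory_size`-style measurement counts everything the writer has allocated, including index structures and spare buffer capacity.

```rust
/// Hypothetical stand-in for one column's in-memory state.
/// This is NOT a real parquet type; it only illustrates the accounting.
struct ColumnBuffer {
    /// Encoded bytes that have not been flushed yet.
    encoded: Vec<u8>,
    /// Auxiliary structures (e.g. dictionary or index entries) kept in memory.
    indices: Vec<u64>,
}

impl ColumnBuffer {
    /// Size of the encoded data only (what is written so far).
    fn in_progress_size(&self) -> usize {
        self.encoded.len()
    }

    /// Allocated memory: uses capacity, not length, and includes
    /// the auxiliary structures.
    fn memory_size(&self) -> usize {
        self.encoded.capacity() + self.indices.capacity() * std::mem::size_of::<u64>()
    }
}

/// Hypothetical writer that owns several column buffers.
struct SketchWriter {
    columns: Vec<ColumnBuffer>,
}

impl SketchWriter {
    /// Sum of encoded bytes across all columns.
    fn in_progress_size(&self) -> usize {
        self.columns.iter().map(|c| c.in_progress_size()).sum()
    }

    /// Sum of allocated memory across all columns.
    fn memory_size(&self) -> usize {
        self.columns.iter().map(|c| c.memory_size()).sum()
    }
}

/// Returns an error if the writer's memory use exceeds `limit` bytes,
/// so that the caller can abort the write.
fn check_memory_limit(writer: &SketchWriter, limit: usize) -> Result<(), String> {
    let used = writer.memory_size();
    if used > limit {
        Err(format!("writer uses {used} bytes, limit is {limit}"))
    } else {
        Ok(())
    }
}
```

A caller could run `check_memory_limit` after each batch and abort (or flush) when it returns an error. Whether the real method should report capacity or length is a design choice I am leaving open here.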
Additional context
Here is one ticket that describes one non-trivial source of memory usage, #5828, so the indices should be included in the measurement.