API to get memory usage for parquet ArrowWriter #5851

Closed
alamb opened this issue Jun 7, 2024 · 9 comments · Fixed by #5967
Labels: enhancement, parquet

Comments

@alamb
Contributor

alamb commented Jun 7, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When writing parquet files, depending on the writer settings and the data being written, we have observed the ArrowWriter consuming large amounts of memory (10s of GB) -- see #5828

The memory usage of parquet writers also often comes up in the context of proposals for new parquet formats

There is already a discussion about how to limit memory when writing here https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-limiting

However, there is currently no way to measure the actual memory in use (a measurement we could use to abort the write, for example).

Describe the solution you'd like

I would like some way to get visibility into the current memory usage of the internal buffering in the parquet writer.

Describe alternatives you've considered
I propose adding a function to ArrowWriter modeled on Array::get_array_memory_size

impl ArrowWriter {
  /// Returns an estimate of how much memory the
  /// writer is currently using in its internal buffers.
  fn memory_size(&self) -> usize { ... }
  ...
}
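
To make the proposal concrete, here is a minimal, self-contained sketch of what such an estimate could look like and how a caller might use it to abort a write. `ToyWriter`, `write_with_limit`, and the byte limit are hypothetical names invented for illustration; this is not the parquet crate's `ArrowWriter`, and the real accounting would need to cover more than plain byte buffers.

```rust
/// Hypothetical stand-in for a buffering writer (NOT parquet's ArrowWriter).
pub struct ToyWriter {
    /// One byte buffer per column, filled until the data is flushed.
    column_buffers: Vec<Vec<u8>>,
}

impl ToyWriter {
    pub fn new(num_columns: usize) -> Self {
        Self { column_buffers: vec![Vec::new(); num_columns] }
    }

    /// Append raw bytes to one column's buffer.
    pub fn write(&mut self, column: usize, bytes: &[u8]) {
        self.column_buffers[column].extend_from_slice(bytes);
    }

    /// Estimate of the memory held by the internal buffers.
    /// Uses allocated capacity rather than bytes written, since
    /// capacity is what the allocator has actually handed out.
    pub fn memory_size(&self) -> usize {
        self.column_buffers.iter().map(|b| b.capacity()).sum()
    }

    /// Drop all buffered data and release the allocations.
    pub fn flush(&mut self) {
        for buffer in &mut self.column_buffers {
            *buffer = Vec::new();
        }
    }
}

/// Write chunks into column 0, aborting as soon as the memory
/// estimate exceeds `limit` bytes (assumed caller-chosen budget).
pub fn write_with_limit(
    writer: &mut ToyWriter,
    chunks: &[&[u8]],
    limit: usize,
) -> Result<(), String> {
    for chunk in chunks {
        writer.write(0, chunk);
        let used = writer.memory_size();
        if used > limit {
            return Err(format!(
                "aborting write: {} bytes buffered exceeds limit of {}",
                used, limit
            ));
        }
    }
    Ok(())
}
```

A caller could equally respond to a high estimate by flushing early instead of aborting; the sketch only shows the abort case mentioned above.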

Additional context
Here is one ticket that describes one non-trivial source of memory usage: #5828. The indices should therefore be included.

alamb added the parquet and enhancement labels Jun 7, 2024
alamb changed the title from "API to get memory usage for Arr" to "API to get memory usage for parquet ArrowWriter" Jun 7, 2024
@alamb
Contributor Author

alamb commented Jun 7, 2024

BTW there is already https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#method.in_progress_size but that only accounts for the actual parquet data in progress, not any internal buffering structures in the writer itself
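
The gap between the two measurements can be illustrated with a toy example. The `ToyColumnWriter` type below is hypothetical and does not mirror the parquet crate's internals; it only assumes that a writer holds some auxiliary structure (here, a dictionary of distinct values) in addition to the encoded output, and the per-entry size formula is a rough guess rather than an exact accounting.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical column writer, for illustration only.
pub struct ToyColumnWriter {
    /// Encoded bytes that would end up in the file.
    encoded: Vec<u8>,
    /// Auxiliary structure that occupies memory while writing
    /// but is not part of the encoded data counted above.
    dictionary: HashMap<String, u32>,
}

impl ToyColumnWriter {
    pub fn new() -> Self {
        Self { encoded: Vec::new(), dictionary: HashMap::new() }
    }

    /// Dictionary-encode a value: store its 4-byte id in `encoded`.
    pub fn write(&mut self, value: &str) {
        let next_id = self.dictionary.len() as u32;
        let id = *self.dictionary.entry(value.to_string()).or_insert(next_id);
        self.encoded.extend_from_slice(&id.to_le_bytes());
    }

    /// Size of the encoded data only.
    pub fn in_progress_size(&self) -> usize {
        self.encoded.len()
    }

    /// Rough estimate of everything held in memory: the encoded
    /// buffer's capacity plus an approximation of the dictionary
    /// (ignores the hash table's own bookkeeping overhead).
    pub fn memory_size(&self) -> usize {
        let dictionary_bytes: usize = self
            .dictionary
            .keys()
            .map(|key| key.capacity() + size_of::<String>() + size_of::<u32>())
            .sum();
        self.encoded.capacity() + dictionary_bytes
    }
}
```

With many distinct values the dictionary term dominates, which is the kind of usage an encoded-size measurement alone would not reveal.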

@Rachelint
Contributor

can I take it?

@alamb
Contributor Author

alamb commented Jun 23, 2024

That would be amazing -- thank you @Rachelint

@alamb
Contributor Author

alamb commented Jun 24, 2024

I'll be on the lookout for a PR -- please ping me when you are ready for feedback

@Rachelint
Contributor

> I'll be on the lookout for a PR -- please ping me when you are ready for feedback

ok!

@alamb
Contributor Author

alamb commented Jun 28, 2024

@wiedld made a PR for this feature: #5967 as well

@Rachelint
Contributor

> @wiedld made a PR for this feature: #5967 as well

OK, I had planned to code it over the weekend, so I have no code yet.

@alamb
Contributor Author

alamb commented Jun 29, 2024

> > @wiedld made a PR for this feature: #5967 as well
>
> OK, I had planned to code it over the weekend, so I have no code yet.

Thanks @Rachelint -- maybe you could help review #5967 if you are still interested

@wiedld
Contributor

wiedld commented Jul 1, 2024

Draft PR is up and undergoing code review. Please assign to me.

3 participants