
Conversation

@alamb (Contributor) commented Nov 20, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

While testing the upgrade to arrow 57.1.0 in DataFusion, I noticed that the reported size of the memory used by metadata grew by almost 33%.

[screenshot: reported metadata memory sizes before and after the upgrade]

What changes are included in this PR?

Fix accounting for Arc, remove some other unneeded code

Are these changes tested?

Yes, covered by existing tests (some of which were updated)

Are there any user-facing changes?

More accurate memory accounting

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 20, 2025
    // Do not count the size of the Arc as that is accounted for by the
    // caller (the object that contains the Arc).
    self.as_ref().heap_size()
Contributor

With this change:

    let v = vec![0i32; 20];
    println!("filled vec heap size {}", v.heap_size());
    println!("size of vec {}", std::mem::size_of::<Vec<i32>>());

    let av = Arc::new(v);
    println!("arc<vec> heap size {}", av.heap_size());

which prints:

    filled vec heap size 80
    size of vec 24
    arc<vec> heap size 80

The Vec is now on the heap, but that is not accounted for.

Contributor Author

yeah, sorry, I pushed the correct fix for that

    impl HeapSize for FileDecryptor {
        fn heap_size(&self) -> usize {
            self.decryption_properties.heap_size()
                + (Arc::clone(&self.footer_decryptor) as Arc<dyn HeapSize>).heap_size()
Contributor Author

Drive-by cleanup -- this was added in #8671 and the Arc::clone doesn't seem to be needed.

@adamreeve (Contributor) commented Nov 20, 2025

This was needed to cast to Arc<dyn HeapSize> in order to use the impl HeapSize for Arc<dyn HeapSize>. Otherwise the other implementation (impl<T: HeapSize> HeapSize for Arc<T>) isn't used; the code with your change just derefs to the contents of the Arc and calls its heap_size method, and the heap size of the Arc (i.e. the size of its contents plus its reference counts) is silently skipped.
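
A minimal sketch of that resolution difference (the impls and types here are illustrative stand-ins, not the crate's actual definitions; the cast relies on trait upcasting, stable since Rust 1.86):

    use std::sync::Arc;

    trait HeapSize {
        fn heap_size(&self) -> usize;
    }

    // Supertrait relationship so Arc<dyn BlockDecryptor> can be upcast
    // to Arc<dyn HeapSize>.
    trait BlockDecryptor: HeapSize {}

    // Impl for the unsized trait-object case; size_of_val measures the
    // concrete value behind the fat pointer.
    impl HeapSize for Arc<dyn HeapSize> {
        fn heap_size(&self) -> usize {
            std::mem::size_of_val(self.as_ref()) + self.as_ref().heap_size()
        }
    }

    struct RingGcmBlockDecryptor {
        key: Vec<u8>,
    }

    impl HeapSize for RingGcmBlockDecryptor {
        fn heap_size(&self) -> usize {
            self.key.capacity()
        }
    }

    impl BlockDecryptor for RingGcmBlockDecryptor {}

    fn main() {
        let footer: Arc<dyn BlockDecryptor> =
            Arc::new(RingGcmBlockDecryptor { key: vec![0u8; 16] });

        // Auto-deref: resolves to RingGcmBlockDecryptor::heap_size through
        // the vtable, counting only what the decryptor owns (16 here).
        println!("{}", footer.heap_size());

        // With the cast, the impl for Arc<dyn HeapSize> applies, so the
        // value stored inside the Arc's allocation is counted as well
        // (24 for the Vec header + 16 for its buffer = 40).
        println!("{}", (Arc::clone(&footer) as Arc<dyn HeapSize>).heap_size());
    }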

Contributor Author

👀

Contributor Author

I think what is supposed to happen is:

The size of the Arc itself (e.g. the pointer and ref count) is accounted for in the size of the FileDecryptor struct. In this case, it is a field on ParquetMetaData and is accounted for here:

    std::mem::size_of::<Self>()

The size of the FooterDecryptor (what is pointed to by the Arc) is accounted for by the impl<T: HeapSize> HeapSize for Arc<T> -- specifically by the std::mem::size_of::<T>() term.
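
In miniature, that convention looks like this (hypothetical stand-in types, not the crate's):

    use std::sync::Arc;

    struct Inner;                      // stand-in for the FooterDecryptor
    struct Outer { ptr: Arc<Inner> }   // stand-in for FileDecryptor

    fn main() {
        let o = Outer { ptr: Arc::new(Inner) };
        // size_of::<Outer>() already covers the Arc handle (one pointer),
        // so the heap side only adds what the Arc points at.
        let total = std::mem::size_of::<Outer>() + std::mem::size_of_val(o.ptr.as_ref());
        println!("{total}"); // 8 + 0 on a 64-bit target: Inner is zero-sized
    }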

Contributor

I think the issue is that dyn HeapSize doesn't implement HeapSize, but even if it did, the existing HeapSize impl for Arc wouldn't work because you can't use std::mem::size_of::<T>() on an unsized dyn object; you need std::mem::size_of_val instead.
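
For reference, the Sized restriction in isolation (using dyn Debug in place of the crate's trait):

    use std::fmt::Debug;

    fn main() {
        // std::mem::size_of::<dyn Debug>() fails to compile because
        // size_of::<T>() requires T: Sized. For a trait object you must
        // measure through a reference with size_of_val:
        let x: Box<dyn Debug> = Box::new([0u64; 4]);
        println!("{}", std::mem::size_of_val(x.as_ref())); // 32: the concrete pointee
    }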

Contributor Author

So I am still not sure about this

Contributor Author

I guess what I am confused about is why we have a dyn HeapSize in the first place -- we know the exact type of the decryptor: https://github.com/apache/arrow-rs/blob/main/parquet/src/encryption/decrypt.rs#L563

Contributor

The decryptor referenced here is actually an Arc<dyn BlockDecryptor>:

    footer_decryptor: Arc<dyn BlockDecryptor>,

We only have one implementation of that trait at the moment, though; the original plan was to also support an AES-GCM-CTR decryptor (#7258).


      #[cfg(not(feature = "encryption"))]
    - let base_expected_size = 2766;
    + let base_expected_size = 2670;
Contributor Author

it is somewhat smaller, but still pretty big

    // Do not count the size of the Arc itself that is accounted for by the
    // caller (the object that contains the Arc).
    std::mem::size_of::<T>() + self.as_ref().heap_size()
Contributor

The strong and weak counts are stored on the heap so aren't included in std::mem::size_of::<T>().

    // Do not count the size of the Arc itself that is accounted for by the
    // caller (the object that contains the Arc).
    std::mem::size_of::<T>() + self.as_ref().heap_size()
Contributor

From the Box::new call inside Arc::new:

  let x: Box<_> = Box::new(ArcInner {
      strong: atomic::AtomicUsize::new(1),
      weak: atomic::AtomicUsize::new(1),
      data,
  });

strong and weak are also on the heap, along with the T data.

size_of::<Arc<_>>() returns 8 (a single pointer on 64-bit), which is what I believe is accounted for in the size of the containers.
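
A quick check of those numbers (assuming a 64-bit target):

    use std::sync::Arc;

    fn main() {
        // The Arc handle itself is a single pointer: 8 bytes.
        assert_eq!(std::mem::size_of::<Arc<Vec<i32>>>(), 8);

        // The heap block it points at is the ArcInner shown above:
        // strong (8) + weak (8) + the Vec header (24) = 40 bytes, plus the
        // Vec's 80-byte element buffer in a separate heap allocation.
        let av: Arc<Vec<i32>> = Arc::new(vec![0i32; 20]);
        assert_eq!(av.len(), 20);
    }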

Contributor Author

👍

@alamb (Contributor Author) left a comment

Thank you @adamreeve and @etseidl for humoring my false alarm.

I re-reviewed the code from the stdlib and I agree that what is currently on main is an accurate representation.


    // Do not count the size of the Arc itself that is accounted for by the
    // caller (the object that contains the Arc).
    std::mem::size_of::<T>() + self.as_ref().heap_size()
Contributor Author

After reviewing the code, I agree with the original code's assessment of this structure's heap usage.

The atomic counts are actually stored on the heap as well:
https://doc.rust-lang.org/src/alloc/sync.rs.html#265
https://doc.rust-lang.org/src/alloc/sync.rs.html#371-378

@adamreeve (Contributor)

> Thank you @adamreeve and @etseidl for humoring my false alarm.
>
> I re-reviewed the code from the stdlib and I agree that what is currently on main is an accurate representation.

No problem 😄. It would be nice to have an automated way to verify this and catch if any changes cause the memory usage to become inaccurate. I did manual testing with heaptrack to verify my changes but I'm not sure how easy it would be to automate that process.

@alamb (Contributor Author) commented Nov 21, 2025

> No problem 😄. It would be nice to have an automated way to verify this and catch if any changes cause the memory usage to become inaccurate. I did manual testing with heaptrack to verify my changes but I'm not sure how easy it would be to automate that process.

I 100% agree with this. Maybe we could shim the system allocator or something and count allocations 🤔 But as you say, that is likely to be pretty brittle.
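
For what it's worth, a minimal sketch of that shim idea (illustrative only, not the implementation in the PR linked below):

    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Wrap the system allocator and keep a running total of live bytes,
    // which a test could compare against what heap_size() reports.
    struct CountingAllocator;

    static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

    unsafe impl GlobalAlloc for CountingAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
            unsafe { System.alloc(layout) }
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
            unsafe { System.dealloc(ptr, layout) }
        }
    }

    #[global_allocator]
    static GLOBAL: CountingAllocator = CountingAllocator;

    fn main() {
        let before = ALLOCATED.load(Ordering::Relaxed);
        let v = vec![0i32; 20];
        let after = ALLOCATED.load(Ordering::Relaxed);
        println!("vec allocated {} bytes", after - before); // 80
        drop(v);
    }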

@adamreeve (Contributor)

I've opened a draft PR that overrides the global allocator to track allocations and this seems to be working well (#8925). I don't have a lot of time at the moment to keep working on this but will try to add some more test cases including encrypted files soon.


Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Parquet metadata size accounting overcounts Arc<T>
