Fix Parquet metadata heap size accounting #8898
Conversation
parquet/src/file/metadata/memory.rs
Outdated
```rust
// Do not count the size of the Arc as that is accounted for by the
// caller (the object that contains the Arc).
self.as_ref().heap_size()
```
With this change:

```rust
let v = vec![0i32; 20];
println!("filled vec heap size {}", v.heap_size());
println!("size of vec {}", std::mem::size_of::<Vec<i32>>());
let av = Arc::new(v);
println!("arc<vec> heap size {}", av.heap_size());
```

prints:

```
filled vec heap size 80
size of vec 24
arc<vec> heap size 80
```

The `Vec` is now on the heap, but that is not accounted for.
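To make the double-counting question concrete, here is a minimal self-contained sketch of a `HeapSize`-style trait (the trait and impls are simplified stand-ins, not the actual parquet code) showing why the `Arc<T>` impl needs to add `size_of::<T>()`: once the value is in an `Arc`, its own header moves to the heap.

```rust
use std::sync::Arc;

// Simplified stand-in for parquet's HeapSize trait.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for Vec<i32> {
    fn heap_size(&self) -> usize {
        // Only the heap buffer owned by the Vec; the 24-byte Vec header
        // is counted by whoever contains the Vec.
        self.capacity() * std::mem::size_of::<i32>()
    }
}

impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // The Arc handle itself is counted by the containing struct, but
        // the inner T now lives on the heap and must be counted here.
        std::mem::size_of::<T>() + self.as_ref().heap_size()
    }
}

fn main() {
    let v = vec![0i32; 20];
    assert_eq!(v.heap_size(), 80); // 20 * 4 bytes of data
    let av = Arc::new(v);
    // Vec header (24 bytes on 64-bit) now on the heap + 80 bytes of data.
    assert_eq!(av.heap_size(), std::mem::size_of::<Vec<i32>>() + 80);
}
```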
yeah, sorry, I pushed the correct fix for that
```rust
impl HeapSize for FileDecryptor {
    fn heap_size(&self) -> usize {
        self.decryption_properties.heap_size()
            + (Arc::clone(&self.footer_decryptor) as Arc<dyn HeapSize>).heap_size()
```
Drive-by cleanup -- this was added in #8671 and the new Arc::clone doesn't seem to be needed.
This was needed to be able to cast to `Arc<dyn HeapSize>` in order to use the `impl HeapSize for Arc<dyn HeapSize>`. Otherwise the other implementation (`impl<T: HeapSize> HeapSize for Arc<T>`) isn't used: the code with your change just derefs the contents of the Arc and calls its `heap_size` method, so the heap size of the Arc itself (i.e. the size of its contents plus its reference counts) is silently skipped.
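A standalone sketch of that method-resolution point (hypothetical names, not the parquet code; the trait-object impl uses `size_of_val`, as suggested elsewhere in this thread): without the cast, auto-deref picks the inner type's `heap_size` and the size of the pointee itself is dropped.

```rust
use std::sync::Arc;

trait HeapSize {
    fn heap_size(&self) -> usize;
}

// Hypothetical decryptor holding a 16-byte key inline.
struct Decryptor([u8; 16]);

impl HeapSize for Decryptor {
    fn heap_size(&self) -> usize {
        0 // owns no heap allocations of its own
    }
}

// Impl on the trait object: only reachable when the receiver's type is
// Arc<dyn HeapSize>, hence the explicit cast at the call site.
impl HeapSize for Arc<dyn HeapSize> {
    fn heap_size(&self) -> usize {
        // size_of_val reads the pointee's size from the vtable at runtime.
        std::mem::size_of_val(self.as_ref()) + self.as_ref().heap_size()
    }
}

fn main() {
    let d = Arc::new(Decryptor([0; 16]));
    // Auto-deref calls Decryptor::heap_size: the 16 bytes behind the Arc
    // are silently skipped.
    assert_eq!(d.heap_size(), 0);
    // The cast selects the Arc<dyn HeapSize> impl, which counts the pointee.
    assert_eq!((Arc::clone(&d) as Arc<dyn HeapSize>).heap_size(), 16);
}
```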
👀
I think what is supposed to happen is:
The size of Arc itself (e.g. the pointer and ref count) is accounted for in the size of the FileDecryptor struct. In this case, it is a field on ParquetMetadata and this is accounted for here:
arrow-rs/parquet/src/file/metadata/mod.rs, line 296 (at 123e6cd):

```rust
std::mem::size_of::<Self>()
```

The size of the FooterDecryptor (what is pointed to by the Arc) is accounted for by the `impl<T: HeapSize> HeapSize for Arc<T>` -- specifically the `std::mem::size_of::<T>()`.
I think the issue is that `dyn HeapSize` doesn't implement `HeapSize`, but even if it did, the existing `HeapSize` impl for `Arc` wouldn't work, because you can't use `std::mem::size_of::<T>()` on an unsized `dyn` object and need to use `std::mem::size_of_val` instead.
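A quick standalone illustration of that distinction: `size_of::<T>()` requires `T: Sized` and is a compile error for a bare `dyn Trait`, while `size_of_val` reads the concrete size from the vtable at runtime.

```rust
use std::mem;

fn main() {
    // `mem::size_of::<dyn std::fmt::Debug>()` would not compile, since
    // dyn Debug is unsized. size_of_val works on a reference instead.
    let boxed: Box<dyn std::fmt::Debug> = Box::new(0u64);
    assert_eq!(mem::size_of_val(&*boxed), mem::size_of::<u64>());
}
```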
So I am still not sure about this
I guess what I am confused about is why we have a `dyn HeapSize` in the first place -- we know the exact type of the decryptor: https://github.com/apache/arrow-rs/blob/main/parquet/src/encryption/decrypt.rs#L563
The decryptor referenced here is actually an Arc<dyn BlockDecryptor>:
arrow-rs/parquet/src/encryption/decrypt.rs, line 565 (at ed9efe7):

```rust
footer_decryptor: Arc<dyn BlockDecryptor>,
```

We only have one implementation of that trait at the moment though; the original plan was to also support an AES-GCM-CTR decryptor (#7258).
```diff
  #[cfg(not(feature = "encryption"))]
- let base_expected_size = 2766;
+ let base_expected_size = 2670;
```
it is somewhat smaller, but still pretty big
```rust
// Do not count the size of the Arc itself that is accounted for by the
// caller (the object that contains the Arc).
std::mem::size_of::<T>() + self.as_ref().heap_size()
```
The strong and weak counts are stored on the heap so aren't included in std::mem::size_of::<T>().
```rust
// Do not count the size of the Arc itself that is accounted for by the
// caller (the object that contains the Arc).
std::mem::size_of::<T>() + self.as_ref().heap_size()
```
From the `Box::new` in the stdlib's `Arc::new`:

```rust
let x: Box<_> = Box::new(ArcInner {
    strong: atomic::AtomicUsize::new(1),
    weak: atomic::AtomicUsize::new(1),
    data,
});
```

`strong` and `weak` are also on the heap, along with the `T` data.
`size_of::<Arc<_>>` returns 8, which is what I believe is accounted for in the size of the containers.
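On a typical 64-bit target this can be checked directly (a standalone sketch): the `Arc` handle itself is one pointer for sized pointees, because both reference counts live in the heap-allocated `ArcInner`; an `Arc` to a trait object is a fat pointer.

```rust
use std::mem::size_of;
use std::sync::Arc;

fn main() {
    // Sized pointee: Arc is a single pointer; the counts are in ArcInner.
    assert_eq!(size_of::<Arc<Vec<i32>>>(), size_of::<usize>());
    // Trait-object pointee: fat pointer (data pointer + vtable pointer).
    assert_eq!(size_of::<Arc<dyn std::fmt::Debug>>(), 2 * size_of::<usize>());
}
```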
👍
alamb left a comment
Thank you @adamreeve and @etseidl for humoring my false alarm.
I re-reviewed the code from the stdlib and I agree that what is currently on main is an accurate representation.
```rust
// Do not count the size of the Arc itself that is accounted for by the
// caller (the object that contains the Arc).
std::mem::size_of::<T>() + self.as_ref().heap_size()
```
After reviewing the code, I agree with the original code's assessment of this structure's heap usage.
The atomic counts are actually stored on the heap as well:
https://doc.rust-lang.org/src/alloc/sync.rs.html#265
https://doc.rust-lang.org/src/alloc/sync.rs.html#371-378
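Based on that `ArcInner` layout, the heap allocation behind an `Arc<T>` can be approximated as the two atomic counts plus the `T` itself (a hypothetical helper for illustration; allocator rounding and padding are ignored):

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

// Approximate size of the ArcInner allocation backing an Arc<T>:
// strong count + weak count + the data itself.
fn arc_inner_size<T>() -> usize {
    2 * size_of::<AtomicUsize>() + size_of::<T>()
}

fn main() {
    // On a 64-bit target: 8 (strong) + 8 (weak) + 24 (Vec header) = 40.
    assert_eq!(arc_inner_size::<Vec<i32>>(), 40);
}
```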
No problem 😄. It would be nice to have an automated way to verify this and catch if any changes cause the memory usage to become inaccurate. I did manual testing with
I 100% agree with this. We could maybe shim the system allocator or something and count allocations 🤔 But as you say that is likely to be pretty brittle.
I've opened a draft PR that overrides the global allocator to track allocations and this seems to be working well (#8925). I don't have a lot of time at the moment to keep working on this but will try to add some more test cases including encrypted files soon.
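Not the actual #8925 implementation, but a minimal sketch of the global-allocator approach it describes: wrap the system allocator, keep a running count of live bytes, and compare the count before and after an allocation you want to measure.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts bytes currently handed out by the system allocator.
struct TrackingAllocator;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let v = vec![0u8; 1024];
    // The Vec's buffer should be visible in the live-byte count.
    assert!(ALLOCATED.load(Ordering::Relaxed) - before >= 1024);
    drop(v);
}
```

In a test this lets you assert that a computed `memory_size()` roughly matches real allocations, instead of hand-maintaining expected byte counts.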
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Closes `Arc<T>` #8897.
Rationale for this change
While testing the upgrade of `arrow`/`parquet` to 57.1.0 in DataFusion (datafusion#18820), I noticed that the reported size of the memory used by metadata grew by almost 33%.
What changes are included in this PR?
Fix accounting for `Arc`, remove some other unneeded code.
Are these changes tested?
Yes, covered by existing tests (some of which were updated).
Are there any user-facing changes?
More accurate memory accounting