[Parquet] Account for FileDecryptor in ParquetMetaData heap size calculation #8671
Conversation
// The retriever is a user-defined type we don't control,
// so we can't determine the heap size.
As discussed in #8472, we could potentially add a new trait method to allow a key retriever to provide a heap size later.
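To make that concrete, here is a hypothetical sketch of such a trait method. This is not part of this PR: the method name, the default implementation, and the error type are made up, and the retrieve_key signature is only assumed to resemble the parquet crate's trait.

```rust
// Hypothetical sketch only: a defaulted trait method letting user
// implementations report their own heap footprint.
pub trait KeyRetriever: Send + Sync {
    /// Existing method (signature assumed): look up a key from key metadata.
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;

    /// Hypothetical addition: estimated bytes of heap memory owned by this
    /// retriever. Defaults to zero because the library cannot see into
    /// user-defined types.
    fn retriever_heap_size(&self) -> usize {
        0
    }
}
```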
Thanks for running this to ground @adamreeve! I think we can punt on the retriever for now. We just need to decide what to do with hash map. 🤔
I've updated this to be more accurate and tried to match the actual HashMap implementation more closely without replicating all the details exactly. E.g. it doesn't account for some alignment calculations, and the group size is architecture dependent, so this might be an overestimate. The calculation of the number of buckets could maybe be simplified further, but I felt like small hash maps would be quite common so I didn't want to overestimate those too much. This does feel a bit too complex, but changing the memory characteristics of the standard HashMap type seems like something that shouldn't happen often, so maybe this is OK...
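For reference, a rough sketch of the kind of estimate being discussed (not this PR's exact code) is below. It assumes the hashbrown layout behind std's HashMap: roughly a 7/8 maximum load factor, a power-of-two bucket count, and one control byte per bucket, and it ignores the group-size and alignment details mentioned above.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Rough estimate of the table allocation behind a std HashMap, under the
/// assumptions described above. Heap memory owned by the keys and values
/// themselves would be added separately.
fn estimated_hashmap_alloc<K, V>(map: &HashMap<K, V>) -> usize {
    if map.capacity() == 0 {
        // No capacity means no table allocation yet.
        return 0;
    }
    // hashbrown keeps the table at most ~7/8 full and uses a power-of-two
    // number of buckets (small tables use a few fixed sizes).
    let buckets = ((map.capacity() * 8) / 7).next_power_of_two().max(4);
    // Each bucket holds a (K, V) slot plus one control byte; alignment
    // padding and the trailing control-byte group are ignored here.
    buckets * (size_of::<(K, V)>() + 1)
}
```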
Changing this back to draft as I realised the handling of … The implementation of …
Hmm, this opens quite the can of worms. Now I'm looking at … And what about …
The nested heap allocations within the T held by the Arc are already double counted, and this behaviour is documented in arrow-rs/parquet/src/file/metadata/mod.rs (lines 284 to 286 at 0c8ab49).
So I think it's at least consistent that the item held directly in the Arc should also be counted twice. But yeah this could possibly be a bit smarter. Maybe this could all be refactored to track which items have been accounted for with pointer equality so things aren't counted twice? But that would be more complicated and require more time and memory to compute the heap size.
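A hypothetical sketch of that pointer-equality idea (explicitly not what this PR does) could thread a set of visited allocations through the traversal and count each Arc target at most once, at the cost of the extra bookkeeping mentioned above:

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Sketch only: count an Arc's allocation the first time it is seen and
/// return 0 for later clones pointing at the same allocation.
fn arc_alloc_size_once<T>(
    arc: &Arc<T>,
    seen: &mut HashSet<usize>,
    inner_heap_size: impl Fn(&T) -> usize,
) -> usize {
    let addr = Arc::as_ptr(arc) as usize;
    if !seen.insert(addr) {
        // Another clone of this Arc was already counted.
        return 0;
    }
    // The shared allocation holds the strong and weak counts plus T itself;
    // T may in turn own further heap allocations.
    2 * size_of::<usize>() + size_of::<T>() + inner_heap_size(arc)
}
```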
No, this works correctly and …
I was going to comment that the … (arrow-rs/parquet/src/file/metadata/memory.rs, lines 101 to 103 and lines 115 to 116 at d519bb8). So applying a similar solution to prevent duplicate accounting of the …
After that latest change to not count the type size within …
Thanks again @adamreeve! This is a big (and thorough) improvement.
- let bigger_expected_size = 2674;
+ let bigger_expected_size = 3192;
So sad to see this increase so much. Truth hurts 😢
We can start with being truthful and then move on to being slimmer
// Don't include the heap size of primitive_type, this is already
// accounted for via SchemaDescriptor::schema
self.path.heap_size()
🚀
Thanks @adamreeve and @etseidl -- this looks great to me too
Which issue does this PR close?
… encryption is enabled #8472.

Rationale for this change
Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata.
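For example (a sketch with a hypothetical cache in mind), a byte-budgeted metadata cache can use ParquetMetaData::memory_size() as each entry's weight; if the decryptor isn't counted, such a cache can exceed its intended budget.

```rust
use parquet::file::metadata::ParquetMetaData;

/// Weight function for a hypothetical byte-budgeted metadata cache.
/// memory_size() covers the struct plus its estimated heap allocations,
/// so accounting for the FileDecryptor keeps the budget honest for
/// encrypted files.
fn cache_weight(metadata: &ParquetMetaData) -> usize {
    metadata.memory_size()
}
```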
What changes are included in this PR?
- Account for FileDecryptor in ParquetMetaData
- Fix the HeapSize implementation for Arc<T> so the size of T is included, as well as the reference counts that are stored on the heap
- Fix ColumnDescriptor being included twice

Not included
- The heap size of a user-defined KeyRetriever

Are these changes tested?
Yes, there's a new unit test added that computes the heap size with a decryptor.
I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the ParquetMetaData into a vector. heaptrack reported 1.136 GB of heap memory allocated in this test program. Prior to this change, the summed metadata memory size was reported as 879.2 MB, and afterwards it was 1.136 GB.

Are there any user-facing changes?
No
This was co-authored by @etseidl