
Conversation

@adamreeve (Contributor) commented Oct 21, 2025

Which issue does this PR close?

Closes the issue "ParquetMetaData memory size is not reported accurately when encryption is enabled".

Rationale for this change

Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata.

What changes are included in this PR?

  • Accounts for heap allocations related to the FileDecryptor in ParquetMetaData
  • Fixes the HeapSize implementation for Arc<T> so that the size of T is included, as well as the reference counts stored on the heap (see the sketch after this list)
  • Fixes the heap size of the Type pointers within ColumnDescriptor being counted twice
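For reference, here is a minimal sketch of the corrected Arc<T> accounting (illustrative only; this simplified HeapSize trait stands in for the one in the parquet crate):

use std::mem::size_of;
use std::sync::Arc;

/// Simplified stand-in for the parquet crate's HeapSize trait.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // An Arc's heap allocation holds the strong and weak reference
        // counts (one usize each) alongside the value itself, plus any
        // heap memory the value owns transitively.
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}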

Not included

  • Accounting for any heap allocations in a user-provided KeyRetriever

Are these changes tested?

Yes, there's a new unit test added that computes the heap size with a decryptor.

I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the ParquetMetaData into a vector. heaptrack reported 1.136 GB of heap memory allocated in this test program. Prior to this change, the summed metadata size was reported as 879.2 MB; afterwards it was 1.136 GB.

Are there any user-facing changes?

No

This was co-authored by @etseidl

@github-actions bot added the parquet label (Changes to the parquet crate) Oct 21, 2025
Comment on lines +305 to +306
// The retriever is a user-defined type we don't control,
// so we can't determine the heap size.
@adamreeve (Contributor, Author):

As discussed in #8472, we could potentially add a new trait method to allow a key retriever to provide a heap size later.

@adamreeve adamreeve requested a review from etseidl October 21, 2025 02:47
@etseidl (Contributor) left a comment

Thanks for running this to ground @adamreeve! I think we can punt on the retriever for now. We just need to decide what to do with hash map. 🤔

@adamreeve (Contributor, Author) commented

We just need to decide what to do with hash map

I've updated this to be more accurate and tried to match the actual hashmap implementation more closely without replicating all the details exactly. For example, it doesn't account for some alignment calculations, and the group size is architecture-dependent, so this might be an overestimate.

The calculation of the number of buckets could maybe be simplified further, but I felt like small hash maps would be quite common so I didn't want to overestimate this too much.

This does feel a bit too complex, but then changing the memory characteristics of the standard HashMap type seems like something that shouldn't happen often so maybe this is OK...
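For illustration, a rough sketch of this kind of estimate (hypothetical, not the exact code in this PR), assuming a hashbrown-style table with a 7/8 maximum load factor and one control byte per bucket:

use std::collections::HashMap;
use std::mem::size_of;

// Estimate the heap footprint of a HashMap's table allocation. Alignment
// padding and the architecture-dependent SIMD group width are ignored, as
// is any heap memory owned by the keys and values themselves.
fn estimate_table_heap_size<K, V>(map: &HashMap<K, V>) -> usize {
    let capacity = map.capacity();
    if capacity == 0 {
        return 0; // an empty map hasn't allocated a table
    }
    // The table is kept at most 7/8 full, so round the capacity up to the
    // next power-of-two bucket count that can hold it.
    let buckets = (capacity * 8 / 7).next_power_of_two();
    buckets * (size_of::<(K, V)>() + 1)
}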

@adamreeve (Contributor, Author) commented

Changing this back to draft as I realised the handling of FileDecryptor::footer_decryptor isn't correct and I'm not sure yet exactly how to handle this.

The implementation of HeapSize for Arc<T> looks wrong: it should match the implementation for Box, where the size of the contained item is included. But even if that's fixed, the Arc impl isn't used for an Arc<dyn BlockDecryptor>; instead the Arc is dereferenced and only the HeapSize implementation of the contained type is used.
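To illustrate the second problem, a hypothetical sketch (BlockDecryptor and the decryptor type here are simplified stand-ins, not the crate's actual definitions):

use std::mem::size_of;
use std::sync::Arc;

trait HeapSize {
    fn heap_size(&self) -> usize;
}

trait BlockDecryptor: HeapSize {}

// T is implicitly Sized, so this impl never applies to the unsized
// Arc<dyn BlockDecryptor>.
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}

struct FakeDecryptor {
    key: Vec<u8>,
}

impl HeapSize for FakeDecryptor {
    fn heap_size(&self) -> usize {
        self.key.capacity()
    }
}

impl BlockDecryptor for FakeDecryptor {}

fn main() {
    let decryptor: Arc<dyn BlockDecryptor> = Arc::new(FakeDecryptor { key: vec![0; 16] });
    // Method resolution auto-derefs to the inner dyn BlockDecryptor, so only
    // the value's own heap_size runs; the Arc allocation (value plus ref
    // counts) is silently left out of the total.
    assert_eq!(decryptor.heap_size(), 16);
}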

@adamreeve adamreeve marked this pull request as draft October 22, 2025 06:38
@etseidl (Contributor) commented Oct 22, 2025

Hmm, this opens quite the can of worms. Now I'm looking at HeapSize for the schema, and we may be overcounting there. SchemaDescriptor is already counting the heap for the tree of Type pointers, but then each ColumnDescriptor is also counting the same objects. Perhaps the impl for ColumnDescriptor should be more like self.path.heap_size() + 2 * std::mem::size_of::<usize>() 🤷

And what about Vec<Arc<T>>? Does sizeof for Arc include the pointers and ref counts as well?

@adamreeve (Contributor, Author) commented Oct 23, 2025

The nested heap allocations within the T held by the Arc are already double counted, and this behaviour is documented here:

/// 3. Includes memory from shared pointers (e.g. [`SchemaDescPtr`]). This
/// means `memory_size` will over estimate the memory size if such pointers
/// are shared.

So I think it's at least consistent that the item held directly in the Arc should also be counted twice. But yeah this could possibly be a bit smarter. Maybe this could all be refactored to track which items have been accounted for with pointer equality so things aren't counted twice? But that would be more complicated and require more time and memory to compute the heap size.
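As a hypothetical sketch of that pointer-equality idea (not something this PR implements), deduplication could thread a set of visited pointers through the traversal:

use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

trait HeapSizeDedup {
    fn heap_size_dedup(&self, seen: &mut HashSet<*const ()>) -> usize;
}

impl<T: HeapSizeDedup> HeapSizeDedup for Arc<T> {
    fn heap_size_dedup(&self, seen: &mut HashSet<*const ()>) -> usize {
        // Count each shared allocation only the first time we see it.
        if !seen.insert(Arc::as_ptr(self) as *const ()) {
            return 0; // already counted via another Arc to the same value
        }
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size_dedup(seen)
    }
}

The trade-off is the extra bookkeeping on every call, which is exactly the time and memory cost mentioned above.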

And what about Vec<Arc<T>>? Does sizeof for Arc include the pointers and ref counts as well?

No, this works correctly and size_of::<Arc<T>>() only includes the size of one pointer to the heap allocated memory. The ref counts are only accounted for once in the HeapSize impl for Arc<T>.

@adamreeve adamreeve marked this pull request as ready for review October 24, 2025 01:08
@adamreeve (Contributor, Author) commented

SchemaDescriptor is already counting the heap for the tree of Type pointers, but then each ColumnDescriptor is also counting the same objects. Perhaps the impl for ColumnDescriptor should be more like self.path.heap_size() + 2 * std::mem::size_of::<usize>() 🤷

I was going to comment that the ColumnDescriptors themselves are also referenced from file_metadata.schema_descr.leaves as well as row_groups[rg].schema_descr and row_groups[rg].columns[c].column_desc. But then I saw that this is already accounted for:

// don't count schema_descr here because it is already
// counted in FileMetaData
self.columns.heap_size() + self.sorting_columns.heap_size()

// don't count column_descr here because it is already counted in
// FileMetaData

So applying a similar solution to prevent duplicate accounting of the Type pointers probably makes sense. It's expanding the scope of this PR a little, but it's a pretty small change so I think it's fine to add here. I think the impl should only be self.path.heap_size() though; the sizes of the pointers will already be accounted for in size_of::<ColumnDescriptor>.

@adamreeve (Contributor, Author) commented

After that latest change to not count the type size within ColumnDescriptor, my test program that loads metadata into a vector and reports the heap size using ParquetMetaData::memory_size() agrees exactly with the total heap allocations reported by heaptrack, after accounting for a constant overhead reported by heaptrack that is independent of the number of copies of the metadata loaded.

@etseidl (Contributor) left a comment

Thanks again @adamreeve! This is a big (and thorough) improvement.

Comment on lines -1910 to +1915
- let bigger_expected_size = 2674;
+ let bigger_expected_size = 3192;
Contributor:

So sad to see this increase so much. Truth hurts 😢

Contributor:

We can start with being truthful and then move on to being slimmer

Comment on lines +848 to +850
// Don't include the heap size of primitive_type, this is already
// accounted for via SchemaDescriptor::schema
self.path.heap_size()
Contributor:

🚀

@alamb (Contributor) left a comment

Thanks @adamreeve and @etseidl -- this looks great to me too


@alamb alamb merged commit 06c49db into apache:main Oct 27, 2025
20 checks passed
@adamreeve adamreeve deleted the decryptor-heap-size branch October 27, 2025 20:08
