
Conversation

@adamreeve (Contributor) commented Oct 21, 2025

Which issue does this PR close?

Closes the issue "ParquetMetaData memory size is not reported accurately when encryption is enabled".

Rationale for this change

Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata.

What changes are included in this PR?

  • Accounts for heap allocations related to the FileDecryptor in ParquetMetaData
  • Fixes the HeapSize implementation for Arc<T> so that the size of T is included, as well as the reference counts stored on the heap (see the sketch after this list)
  • Fixes the heap size of the Type pointers within ColumnDescriptor being counted twice
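For reference, here is a minimal sketch of the corrected Arc<T> accounting (illustrative only; this simplified HeapSize trait stands in for the one in the parquet crate):

use std::mem::size_of;
use std::sync::Arc;

/// Simplified stand-in for the parquet crate's HeapSize trait.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // An Arc's heap allocation holds the strong and weak reference
        // counts (one usize each) alongside the value itself, plus any
        // heap memory the value owns transitively.
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}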

Not included

  • Accounting for any heap allocations in a user-provided KeyRetriever

Are these changes tested?

Yes, there's a new unit test added that computes the heap size with a decryptor.

I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the ParquetMetaData into a vector. heaptrack reported 1.136 GB of heap memory allocated in this test program. Prior to this change, the summed metadata size was reported as 879.2 MB; afterwards it was 1.136 GB.

Are there any user-facing changes?

No

This was co-authored by @etseidl

@github-actions bot added the parquet label (Changes to the parquet crate) Oct 21, 2025
Comment on lines +305 to +306
// The retriever is a user-defined type we don't control,
// so we can't determine the heap size.
@adamreeve (Contributor, Author):

As discussed in #8472, we could potentially add a new trait method to allow a key retriever to provide a heap size later.

@adamreeve adamreeve requested a review from etseidl October 21, 2025 02:47
@etseidl (Contributor) left a comment

Thanks for running this to ground @adamreeve! I think we can punt on the retriever for now. We just need to decide what to do with hash map. 🤔

@adamreeve (Contributor, Author) commented

We just need to decide what to do with hash map

I've updated this to be more accurate and tried to match the actual hashmap implementation more closely without replicating all the details exactly. For example, it doesn't account for some alignment calculations, and the group size is architecture-dependent, so this might be an overestimate.

The calculation of the number of buckets could maybe be simplified further, but I felt like small hash maps would be quite common so I didn't want to overestimate this too much.

This does feel a bit too complex, but then changing the memory characteristics of the standard HashMap type seems like something that shouldn't happen often so maybe this is OK...
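For illustration, a rough sketch of this kind of estimate (hypothetical, not the exact code in this PR), assuming a hashbrown-style table with a 7/8 maximum load factor and one control byte per bucket:

use std::collections::HashMap;
use std::mem::size_of;

// Estimate the heap footprint of a HashMap's table allocation. Alignment
// padding and the architecture-dependent SIMD group width are ignored, as
// is any heap memory owned by the keys and values themselves.
fn estimate_table_heap_size<K, V>(map: &HashMap<K, V>) -> usize {
    let capacity = map.capacity();
    if capacity == 0 {
        return 0; // an empty map hasn't allocated a table
    }
    // The table is kept at most 7/8 full, so round the capacity up to the
    // next power-of-two bucket count that can hold it.
    let buckets = (capacity * 8 / 7).next_power_of_two();
    buckets * (size_of::<(K, V)>() + 1)
}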

@adamreeve (Contributor, Author) commented

Changing this back to draft as I realised the handling of FileDecryptor::footer_decryptor isn't correct and I'm not sure yet exactly how to handle this.

The implementation of HeapSize for Arc<T> looks wrong: it should match the implementation for Box, where the size of the contained item is included. But even if that's fixed, the Arc impl isn't used for an Arc<dyn BlockDecryptor>; instead the Arc is dereferenced and only the HeapSize implementation of the contained type is used.
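To illustrate the second problem, a hypothetical sketch (BlockDecryptor and the decryptor type here are simplified stand-ins, not the crate's actual definitions):

use std::mem::size_of;
use std::sync::Arc;

trait HeapSize {
    fn heap_size(&self) -> usize;
}

trait BlockDecryptor: HeapSize {}

// T is implicitly Sized, so this impl never applies to the unsized
// Arc<dyn BlockDecryptor>.
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}

struct FakeDecryptor {
    key: Vec<u8>,
}

impl HeapSize for FakeDecryptor {
    fn heap_size(&self) -> usize {
        self.key.capacity()
    }
}

impl BlockDecryptor for FakeDecryptor {}

fn main() {
    let decryptor: Arc<dyn BlockDecryptor> = Arc::new(FakeDecryptor { key: vec![0; 16] });
    // Method resolution auto-derefs to the inner dyn BlockDecryptor, so only
    // the value's own heap_size runs; the Arc allocation (value plus ref
    // counts) is silently left out of the total.
    assert_eq!(decryptor.heap_size(), 16);
}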

@adamreeve adamreeve marked this pull request as draft October 22, 2025 06:38
@etseidl (Contributor) commented Oct 22, 2025

Hmm, this opens quite the can of worms. Now I'm looking at HeapSize for the schema, and we may be overcounting there. SchemaDescriptor is already counting the heap for the tree of Type pointers, but then each ColumnDescriptor is also counting the same objects. Perhaps the impl for ColumnDescriptor should be more like self.path.heap_size() + 2 * std::mem::size_of::<usize>() 🤷

And what about Vec<Arc<T>>? Does sizeof for Arc include the pointers and ref counts as well?

@adamreeve (Contributor, Author) commented Oct 23, 2025

The nested heap allocations within the T held by the Arc are already double counted, and this behaviour is documented here:

/// 3. Includes memory from shared pointers (e.g. [`SchemaDescPtr`]). This
/// means `memory_size` will over estimate the memory size if such pointers
/// are shared.

So I think it's at least consistent that the item held directly in the Arc should also be counted twice. But yeah this could possibly be a bit smarter. Maybe this could all be refactored to track which items have been accounted for with pointer equality so things aren't counted twice? But that would be more complicated and require more time and memory to compute the heap size.
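As a hypothetical sketch of that pointer-equality idea (not something this PR implements), deduplication could thread a set of visited pointers through the traversal:

use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

trait HeapSizeDedup {
    fn heap_size_dedup(&self, seen: &mut HashSet<*const ()>) -> usize;
}

impl<T: HeapSizeDedup> HeapSizeDedup for Arc<T> {
    fn heap_size_dedup(&self, seen: &mut HashSet<*const ()>) -> usize {
        // Count each shared allocation only the first time we see it.
        if !seen.insert(Arc::as_ptr(self) as *const ()) {
            return 0; // already counted via another Arc to the same value
        }
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size_dedup(seen)
    }
}

The trade-off is the extra bookkeeping on every call, which is exactly the time and memory cost mentioned above.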

And what about Vec<Arc<T>>? Does sizeof for Arc include the pointers and ref counts as well?

No, this works correctly and size_of::<Arc<T>>() only includes the size of one pointer to the heap allocated memory. The ref counts are only accounted for once in the HeapSize impl for Arc<T>.

@adamreeve adamreeve marked this pull request as ready for review October 24, 2025 01:08
@adamreeve (Contributor, Author) commented

SchemaDescriptor is already counting the heap for the tree of Type pointers, but then each ColumnDescriptor is also counting the same objects. Perhaps the impl for ColumnDescriptor should be more like self.path.heap_size() + 2 * std::mem::size_of::<usize>() 🤷

I was going to comment that the ColumnDescriptors themselves are also referenced from file_metadata.schema_descr.leaves as well as row_groups[rg].schema_descr and row_groups[rg].columns[c].column_desc. But then I saw that this is already accounted for:

// don't count schema_descr here because it is already
// counted in FileMetaData
self.columns.heap_size() + self.sorting_columns.heap_size()

// don't count column_descr here because it is already counted in
// FileMetaData

So applying a similar solution to prevent duplicate accounting of the Type pointers probably makes sense. It's expanding the scope of this PR a little, but it's a pretty small change so I think it's fine to add here. I think the impl should only be self.path.heap_size() though; the sizes of the pointers will already be accounted for in size_of::<ColumnDescriptor>.

@adamreeve (Contributor, Author) commented

After that latest change to not count the type size within ColumnDescriptor, my test program that loads metadata into a vector and reports the heap size using ParquetMetaData::memory_size() agrees exactly with the total heap allocations reported by heaptrack, after accounting for a constant overhead reported by heaptrack that is independent of the number of copies of the metadata loaded.

@etseidl (Contributor) left a comment

Thanks again @adamreeve! This is a big (and thorough) improvement.

Comment on lines -1910 to +1915
- let bigger_expected_size = 2674;
+ let bigger_expected_size = 3192;
Contributor:

So sad to see this increase so much. Truth hurts 😢

Contributor:

We can start with being truthful and then move on to being slimmer

Comment on lines +848 to +850
// Don't include the heap size of primitive_type, this is already
// accounted for via SchemaDescriptor::schema
self.path.heap_size()
Contributor:

🚀

@alamb (Contributor) left a comment

Thanks @adamreeve and @etseidl -- this looks great to me too


@alamb alamb merged commit 06c49db into apache:main Oct 27, 2025
20 checks passed
@adamreeve adamreeve deleted the decryptor-heap-size branch October 27, 2025 20:08
