Skip to content

Conversation

@adamreeve
Copy link
Contributor

Which issue does this PR close?

What changes are included in this PR?

Adds a new test program that overrides the global allocator in order to track allocations, and compare the measured allocation size with the computed heap size of the Parquet metadata.

Are these changes tested?

Yes, this only adds a test

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 26, 2025
Comment on lines +82 to +83
// Calculated heap size doesn't match exactly, possibly due to extra overhead not accounted
// for in the HeapSize implementation for parquet::data_type::ByteArray.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't managed to track down exactly where the difference comes from so this is a bit of a guess. The confusing part is the file with encryption has stats for a ByteArray column too so I'm not sure why it does give the exact same heap size. Maybe there's something different about how the stats from encrypted metadata work or maybe the difference is from somewhere else.

But the computed size is still very close to the actual heap allocation size so I think this is good enough.

@adamreeve adamreeve marked this pull request as ready for review December 4, 2025 03:28
@adamreeve adamreeve requested a review from rok December 4, 2025 03:31
@alamb
Copy link
Contributor

alamb commented Dec 10, 2025

Interestingly, this PR fails on my mac (OSX 26.1)

Darwin Andrews-MacBook-Pro-3.local 25.1.0 Darwin Kernel Version 25.1.0: Mon Oct 20 19:30:01 PDT 2025; root:xnu-12377.41.6~2/RELEASE_ARM64_T6031 arm64
---- test_metadata_heap_memory stdout ----

thread 'test_metadata_heap_memory' (15130761) panicked at parquet/tests/metadata_memory.rs:138:9:
assertion `left == right` failed: Calculated heap size 10534 doesn't match the allocated size 10526 for file /Users/andrewlamb/Software/arrow-rs/arrow/../parquet-testing/data/encrypt_columns_plaintext_footer.parquet.encrypted
  left: 10534
 right: 10526
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test_metadata_heap_memory

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

error: test failed, to rerun pass `-p parquet --test metadata_memory`

@adamreeve
Copy link
Contributor Author

Huh, interesting. I think trying to get this to always match exactly and track down all the causes of differences is not worth the effort, and I should just add a small tolerance for all the test cases.

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct TrackingAllocator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very cool idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Add metadata heap size test that tracks allocations

2 participants