
Conversation


@shehabgamin shehabgamin commented Aug 3, 2025

Which issue does this PR close?

Rationale for this change

Faster queries and improved usability for downstream crates.

What changes are included in this PR?

  1. Make Parquet reader public so downstream crates can use it.
  2. Use cached metadata for ListingTable statistics for faster queries.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Aug 3, 2025
@shehabgamin shehabgamin changed the title Parquet metadata feat: Use Cached Metadata for ListingTable Statistics Aug 3, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 3, 2025

shehabgamin commented Aug 3, 2025

EDIT:
I had collect_statistics turned off when running these benchmarks, so disregard the results. Striking out the results below to avoid confusion.

Not seeing any benefit when testing with S3 data cc @alamb

Tested on Derived TPC-H (100 GB) querying S3 from EC2.

EC2 Tab 1:

env RUSTFLAGS="-C target-cpu=native" cargo build -r -p sail-cli --bins --target-dir target/parquet-metadata-cache

env \
  RUST_LOG=info \
  SAIL_PARQUET__FILE_METADATA_CACHE=true \
  target/parquet-metadata-cache/release/sail spark server

EC2 Tab 2:

python python/pysail/examples/spark/tpch.py \
  --data-path s3://BUCKET-PATH HERE \
  --query-path python/pysail/data/tpch/queries \
  --query-all \
  --num-runs 3

Run 1 Total time for all queries: 180.1174988746643 seconds.
Run 2 Total time for all queries: 184.3733410835266 seconds.
Run 3 Total time for all queries: 176.67709589004517 seconds.


shehabgamin commented Aug 3, 2025

EDIT:
I had collect_statistics turned off when running these benchmarks, so disregard the results. Striking out the results below to avoid confusion.

Retesting after c20b142 and lakehq/sail@5580d8b.

Will update this message when testing is done.

UPDATE:
We can try with "10G" set and also with nothing set and compare.

env \
  RUST_LOG=info \
  SAIL_PARQUET__FILE_METADATA_CACHE=true \
  SAIL_PARQUET__FILE_METADATA_CACHE_LIMIT="10G" \
  target/parquet-metadata-cache/release/sail spark server

python python/pysail/examples/spark/tpch.py \
  --data-path s3://PATH \
  --query-path python/pysail/data/tpch/queries \
  --query-all \
  --num-runs 5

Run 1 Total time for all queries: 203.74799275398254 seconds.
Run 2 Total time for all queries: 184.0539104938507 seconds.
Run 3 Total time for all queries: 178.30218935012817 seconds.
Run 4 Total time for all queries: 172.49501848220825 seconds.
Run 5 Total time for all queries: 174.7171494960785 seconds.

env \
  RUST_LOG=info \
  SAIL_PARQUET__FILE_METADATA_CACHE=true \
  target/parquet-metadata-cache/release/sail spark server

python python/pysail/examples/spark/tpch.py \
  --data-path s3://PATH \
  --query-path python/pysail/data/tpch/queries \
  --query-all \
  --num-runs 5

Run 1 Total time for all queries: 182.70136904716492 seconds.
Run 2 Total time for all queries: 192.44108653068542 seconds.
Run 3 Total time for all queries: 175.50793719291687 seconds.
Run 4 Total time for all queries: 175.66266465187073 seconds.
Run 5 Total time for all queries: 175.70708632469177 seconds.


alamb commented Aug 4, 2025

Very exciting -- I have planned time this week to help review and get these various caching things completed

@shehabgamin shehabgamin marked this pull request as ready for review August 4, 2025 22:19
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Aug 4, 2025
@alamb alamb left a comment

Thanks @shehabgamin -- the code in this PR looks great. I think we just need to add some tests to make sure we don't break this feature in the future

Maybe we can add a test based on the existing ones that shows a second retrieval of statistics doesn't refetch the same footer 🤔
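The suggested test could be sketched with a request-counting mock store. Everything below (CountingStore, MetadataCache, fetch_footer) is a hypothetical stand-in, not DataFusion's actual API; the point is just the assertion shape — the second retrieval must not increase the store's request count:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical object store that counts how many footer reads it serves.
struct CountingStore {
    requests: AtomicUsize,
}

impl CountingStore {
    fn fetch_footer(&self, _path: &str) -> Vec<u8> {
        self.requests.fetch_add(1, Ordering::SeqCst);
        vec![0u8; 8] // stand-in for Parquet footer bytes
    }
    fn request_count(&self) -> usize {
        self.requests.load(Ordering::SeqCst)
    }
}

/// Minimal metadata cache keyed by file path.
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<Vec<u8>>>>,
}

impl MetadataCache {
    fn get_or_fetch(&self, store: &CountingStore, path: &str) -> Arc<Vec<u8>> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(path.to_string())
            .or_insert_with(|| Arc::new(store.fetch_footer(path)))
            .clone()
    }
}

fn main() {
    let store = CountingStore { requests: AtomicUsize::new(0) };
    let cache = MetadataCache { entries: Mutex::new(HashMap::new()) };

    cache.get_or_fetch(&store, "data/part-0.parquet"); // first read hits the store
    cache.get_or_fetch(&store, "data/part-0.parquet"); // second read is served from cache

    assert_eq!(store.request_count(), 1); // footer fetched exactly once
}
```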

@shehabgamin

> Thanks @shehabgamin -- the code in this PR looks great. I think we just need to add some tests to make sure we don't break this feature in the future
>
> Maybe we can add a test based on the existing ones that shows a second retrieval of statistics doesn't refetch the same footer 🤔

For sure! Will knock that out when I have some downtime today or tonight

@shehabgamin

> Thanks @shehabgamin -- the code in this PR looks great. I think we just need to add some tests to make sure we don't break this feature in the future
>
> Maybe we can add a test based on the existing ones that shows a second retrieval of statistics doesn't refetch the same footer 🤔
>
> For sure! Will knock that out when I have some downtime today or tonight

@alamb Done!

@xudong963 xudong963 self-requested a review August 6, 2025 05:30
@alamb alamb left a comment

Thanks @shehabgamin -- this code and tests look great to me

I also tried it locally and it is 👨‍🍳 👌. Note that the count(*), which is based on statistics, returns immediately:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./target/release/datafusion-cli
DataFusion CLI v49.0.0
> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.298 seconds.

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 0.273 seconds.

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 0.241 seconds.

For comparison, with released DataFusion 49.0.0:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ~/Software/datafusion-cli/datafusion-cli-49.0.0
DataFusion CLI v49.0.0
> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.313 seconds.

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 1.380 seconds.

However, I tried merging up from main locally and this PR didn't compile. It seems to have a logical conflict with #17062

The good news is that since we have removed that field, I think this PR can get significantly simpler. Sorry I can't push the changes directly to this PR, as I don't have write access to the lakehq fork

error[E0609]: no field `cache_metadata` on type `datafusion_common::config::ParquetOptions`
   --> datafusion/core/src/datasource/file_format/parquet.rs:204:37
    |
204 |             format.options().global.cache_metadata,
    |                                     ^^^^^^^^^^^^^^ unknown field

.expect("error reading metadata with hint");
assert_eq!(store.request_count(), 4);

// Increase by 2 because `cache_metadata` is false
Contributor

👍

 mod opener;
 mod page_filter;
-mod reader;
+pub mod reader;
Contributor

why does this need to be made pub? I reverted the change and things still seem to compile just fine

Contributor Author

So downstream crates can use it. The reader module looked like it could be generally useful for downstream crates, which is why I made the entire reader public.

Currently in Sail we build our own cache for each cache type in CacheManagerConfig. I was unable to access CachedParquetMetaData to do something like the following unless I made CachedParquetMetaData public (full code):

if let Some(parquet_metadata) =
    value.1.as_any().downcast_ref::<CachedParquetMetaData>()
{
    parquet_metadata
        .parquet_metadata()
        .memory_size()
        .min(u32::MAX as usize)
} 
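The downcast pattern in that snippet can be illustrated with a self-contained sketch. The CacheEntry trait and the struct below are simplified stand-ins for DataFusion's actual trait object and CachedParquetMetaData; only the as_any/downcast_ref shape matches the real code:

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for the cached-entry trait object; the real crate exposes
/// its own trait with an `as_any` accessor.
trait CacheEntry: Any {
    fn as_any(&self) -> &dyn Any;
}

/// Simplified stand-in for DataFusion's CachedParquetMetaData.
struct CachedParquetMetaData {
    memory_size: usize,
}

impl CacheEntry for CachedParquetMetaData {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Compute a bounded size for a cache entry, mirroring the snippet above:
/// downcast the trait object, then clamp the reported size to u32::MAX.
fn entry_size(entry: &Arc<dyn CacheEntry>) -> usize {
    if let Some(md) = entry.as_any().downcast_ref::<CachedParquetMetaData>() {
        md.memory_size.min(u32::MAX as usize)
    } else {
        0
    }
}

fn main() {
    let entry: Arc<dyn CacheEntry> = Arc::new(CachedParquetMetaData { memory_size: 1024 });
    assert_eq!(entry_size(&entry), 1024);
}
```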

Contributor

I see -- my concern is that this change might be easy to accidentally undo / break in the future. Maybe to make it more deliberate, you could leave mod reader private and then pub use both CachedParquetMetaData and CachedParquetFileReader 🤔
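A minimal sketch of that re-export approach, assuming the two types in question are CachedParquetMetaData and CachedParquetFileReader (the empty structs here are placeholders for the real ones):

```rust
// Keep the module private but deliberately re-export the two public types,
// so accidentally reverting a `pub mod reader` can't silently hide them
// from downstream crates.
mod reader {
    pub struct CachedParquetMetaData;
    pub struct CachedParquetFileReader;
}

// Explicit re-exports: the public surface is these two names, not the module.
pub use reader::{CachedParquetFileReader, CachedParquetMetaData};

fn main() {
    // Downstream code refers to the re-exported paths directly.
    let _md = CachedParquetMetaData;
    let _rd = CachedParquetFileReader;
}
```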

Contributor Author

Works for me!

 /// from the [`FileMetadataCache`], if available, otherwise reads it directly from the file and then
 /// updates the cache.
-pub(crate) struct CachedParquetFileReader {
+pub struct CachedParquetFileReader {
Contributor

Likewise, I locally reverted these changes to visibility and everything seems to have compiled just fine


     self
 }

+pub fn with_cache_metadata(mut self, cache_metadata: bool) -> Self {
Contributor

I think this option was removed in #17062 so it is no longer needed


alamb commented Aug 7, 2025

@nuno-faria and @jonathanc-n -- do you have time to review this PR as well?

@jonathanc-n

Some changes might be needed after the removal of cache_metadata in #17062. They should be small; we just no longer need the if check on cache_metadata.

@shehabgamin shehabgamin requested a review from alamb August 8, 2025 09:11
Comment on lines 996 to 1008
let metadata = Arc::new(
    reader
        .load_and_finish(fetch, file_size)
        .await
        .map_err(DataFusionError::from)?,
);

if let Some(cache) = file_metadata_cache {
    cache.put(
        meta,
        Arc::new(CachedParquetMetaData::new(Arc::clone(&metadata))),
    );
}
Contributor

I think there is an issue with the fetch_parquet_metadata function. When this function is initially called to retrieve the schema (in fetch_schema), it will read the metadata and update the cache. When the CachedParquetFileReader later tries to get the metadata, it finds it in the cache. However, the cached metadata does not contain the page index, as fetch_parquet_metadata does not retrieve it, meaning the page index will have to be read in every query.

So fetch_parquet_metadata needs to retrieve the entire metadata for the caching to be effective. On the other hand, after this we will have two different places where the entire metadata is read and cached (CachedParquetFileReader and fetch_parquet_metadata), so creating a utility function retrieve_full_parquet_metadata -> Arc<ParquetMetaData> might be useful to avoid duplicate modifications in the future.
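A rough sketch of what such a shared helper could look like. A plain HashMap stands in for the real FileMetadataCache, and the ParquetMetaData type here is a dummy carrying only a has_page_index flag; the point is the shape: check the cache, otherwise read the *entire* metadata (footer plus page index) once and cache it:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Dummy stand-in for parquet's ParquetMetaData, with the page index loaded.
struct ParquetMetaData {
    has_page_index: bool,
}

type Cache = Mutex<HashMap<String, Arc<ParquetMetaData>>>;

/// Sketch of the proposed retrieve_full_parquet_metadata helper: every
/// cached entry carries the complete metadata, so later readers never
/// have to go back to object storage for the page index.
fn retrieve_full_parquet_metadata(cache: &Cache, path: &str) -> Arc<ParquetMetaData> {
    // Fast path: already cached.
    {
        let entries = cache.lock().unwrap();
        if let Some(found) = entries.get(path) {
            return Arc::clone(found);
        }
    }
    // In real code this would read footer + page index from the object
    // store (outside the lock, since the read is slow).
    let full = Arc::new(ParquetMetaData { has_page_index: true });
    cache.lock().unwrap().insert(path.to_string(), Arc::clone(&full));
    full
}

fn main() {
    let cache: Cache = Mutex::new(HashMap::new());
    let md = retrieve_full_parquet_metadata(&cache, "part-0.parquet");
    assert!(md.has_page_index); // cached entries always carry the page index
    let again = retrieve_full_parquet_metadata(&cache, "part-0.parquet");
    assert!(Arc::ptr_eq(&md, &again)); // second call returns the cached Arc
}
```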

Contributor Author

Ahh nice catch, will do!

@shehabgamin shehabgamin requested a review from nuno-faria August 9, 2025 04:24
@nuno-faria nuno-faria left a comment

Thanks @shehabgamin, overall LGTM.

-pub async fn fetch_parquet_metadata(
-    store: &dyn ObjectStore,
-    meta: &ObjectMeta,
+pub async fn fetch_parquet_metadata<F: MetadataFetch>(
Contributor

I wonder now if this function should go to a separate module (a utils.rs or similar?). This is because the file_format refers to reader and reader refers to file_format. cc @alamb

Contributor

I agree -- I am thinking the code that handles fetching metadata and schema is also getting enough options that it can probably be its own file / structure too. Maybe something like

pub struct ParquetMetadataFetcher {
...
}

So we could use it like

let fetcher = ParquetMetadataFetcher::new(object_store, path) 
  .with_hint(...);

fetcher.fetch_metadata().await?

🤔

Contributor

The more I looked the more it seemed like what would be helpful would be a whole new module statistics.rs or something

Contributor

I am hacking up a prototype of what this might look like

Contributor

I bashed out a PR that pulled the metadata handling out into its own module and I think it looks quite a bit nicer. Thank you for the suggestion @nuno-faria


alamb commented Aug 11, 2025

> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.539 seconds.

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 0.164 seconds.

The 0.16 seconds is pretty sweet!

On main, the same query takes 1.3sec

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 1.317 seconds.

@alamb alamb left a comment

I took another look at this - and it looks great to me. Thank you @shehabgamin and @nuno-faria and @jonathanc-n

-pub async fn fetch_parquet_metadata(
-    store: &dyn ObjectStore,
-    meta: &ObjectMeta,
+pub async fn fetch_parquet_metadata<F: MetadataFetch>(
Contributor

I agree -- I am thinking the code that handles fetching metadata and schema is also getting enough options that it can probably be its own file / structure too. Maybe something like

pub struct ParquetMetadataFetcher {
...
}

So we could use it like

let fetcher = ParquetMetadataFetcher::new(object_store, path) 
  .with_hint(...);

fetcher.fetch_metadata().await?

🤔

file: &ObjectMeta,
metadata_size_hint: Option<usize>,
coerce_int96: Option<TimeUnit>,
file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
Contributor

I re-reviewed these changes and, since I think we now always have a file_metadata_cache, we could probably make this non-optional. However, many tests would need to change

-pub async fn fetch_parquet_metadata(
-    store: &dyn ObjectStore,
-    meta: &ObjectMeta,
+pub async fn fetch_parquet_metadata<F: MetadataFetch>(
Contributor

The more I looked the more it seemed like what would be helpful would be a whole new module statistics.rs or something

@alamb alamb merged commit 14fb4a3 into apache:main Aug 11, 2025
27 checks passed
@alamb alamb deleted the parquet-metadata branch August 11, 2025 17:03

Labels

core Core DataFusion crate datasource Changes to the datasource crate

Development

Successfully merging this pull request may close these issues.

[Parquet Metadata Cache] Use the cached metadata for ListingTable statistics

4 participants