@jonathanc-n (Contributor) commented Aug 7, 2025

Which issue does this PR close?

Rationale for this change

Removes the cache_metadata config and opts to always cache Parquet metadata

What changes are included in this PR?

Removal of all cache_metadata uses

@github-actions bot added the documentation, core, sqllogictest, common, proto, and datasource labels Aug 7, 2025

// Without filter will not read pageIndex.
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
// Same amount of bytes are scanned when defaulting to cache parquet metadata
Contributor Author

@nuno-faria this change seems to come from the CachedParquetFileReaderFactory you added. I changed the test to reflect the correct behaviour now that the factory is the default. Does it look correct?

Contributor

That's right, when caching by default the page index is always read, even if the query does not require it. FYI @alamb
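
To illustrate the point above, here is a minimal toy model in Rust. It is only a sketch: `ToyMetadataCache`, `read_uncached`, `read_cached`, and the byte sizes are hypothetical and are not DataFusion's actual types or numbers. It assumes a cache miss always loads the footer together with the page index so the cached entry can serve any later query, which is why the metadata bytes read no longer depend on whether the query has a filter.

```rust
use std::collections::HashSet;

// Hypothetical sizes, in bytes, of the two metadata regions of one file.
const FOOTER_BYTES: usize = 1_000;
const PAGE_INDEX_BYTES: usize = 400;

/// Toy stand-in for a per-file metadata cache (not DataFusion's real type).
#[derive(Default)]
struct ToyMetadataCache {
    // Paths whose full metadata (footer + page index) is already cached.
    entries: HashSet<String>,
}

/// Uncached read: fetch the page index only when the query needs it.
fn read_uncached(needs_page_index: bool) -> usize {
    if needs_page_index {
        FOOTER_BYTES + PAGE_INDEX_BYTES
    } else {
        FOOTER_BYTES
    }
}

/// Cached read: a miss always fetches footer and page index, regardless of
/// what the current query needs; a hit reads no metadata bytes at all.
fn read_cached(cache: &mut ToyMetadataCache, path: &str, _needs_page_index: bool) -> usize {
    if cache.entries.contains(path) {
        return 0;
    }
    cache.entries.insert(path.to_string());
    FOOTER_BYTES + PAGE_INDEX_BYTES
}

fn main() {
    // Without caching, a filtered query reads more metadata than an unfiltered one.
    assert!(read_uncached(true) > read_uncached(false));

    // With caching, two cold reads cost the same with or without a filter.
    let with_filter = read_cached(&mut ToyMetadataCache::default(), "a.parquet", true);
    let without_filter = read_cached(&mut ToyMetadataCache::default(), "a.parquet", false);
    assert_eq!(with_filter, without_filter);

    // A second read of the same file through one cache is a hit.
    let mut cache = ToyMetadataCache::default();
    read_cached(&mut cache, "a.parquet", false);
    assert_eq!(read_cached(&mut cache, "a.parquet", true), 0);
    println!("cold cached read: {with_filter} bytes in both cases");
}
```

In this model the cost of reading the page index up front is paid once per file, and later queries that do need it get it from the cache.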

Contributor

I think that is best practice anyways and probably what people want

Comment on lines 451 to 452
if let Some(metadata_cache) =
state.runtime_env().cache_manager.get_file_metadata_cache()
Contributor

After #17031 is merged the metadata_cache won't be an Option anymore, so this can be simplified.
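
A rough sketch of the kind of simplification meant here, using hypothetical stand-in types rather than the real DataFusion ones. Only the method name `get_file_metadata_cache` comes from the snippet under review; `ToyMetadataCache`, `CacheManagerBefore`, `CacheManagerAfter`, the return types, and the two `reader_*` functions are assumptions made for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the file metadata cache.
struct ToyMetadataCache;

/// Before: the cache is optional, so every caller has to branch on it.
struct CacheManagerBefore {
    file_metadata_cache: Option<Arc<ToyMetadataCache>>,
}

impl CacheManagerBefore {
    fn get_file_metadata_cache(&self) -> Option<Arc<ToyMetadataCache>> {
        self.file_metadata_cache.clone()
    }
}

/// After: the cache always exists, so the getter returns it directly.
struct CacheManagerAfter {
    file_metadata_cache: Arc<ToyMetadataCache>,
}

impl CacheManagerAfter {
    fn get_file_metadata_cache(&self) -> Arc<ToyMetadataCache> {
        Arc::clone(&self.file_metadata_cache)
    }
}

/// Caller written against the optional cache: two code paths.
fn reader_before(manager: &CacheManagerBefore) -> &'static str {
    if let Some(_cache) = manager.get_file_metadata_cache() {
        "cached reader"
    } else {
        "plain reader"
    }
}

/// Caller written against the always-present cache: one code path.
fn reader_after(manager: &CacheManagerAfter) -> &'static str {
    let _cache = manager.get_file_metadata_cache();
    "cached reader"
}

fn main() {
    let before = CacheManagerBefore { file_metadata_cache: None };
    let after = CacheManagerAfter { file_metadata_cache: Arc::new(ToyMetadataCache) };
    println!("before: {}", reader_before(&before));
    println!("after: {}", reader_after(&after));
}
```

With a non-optional cache the `if let` wrapper and its fallback branch disappear, which is the simplification this comment points at.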

Contributor Author

I'll update after #17031 gets merged

Contributor

> I'll update after #17031 gets merged

I just merged it

Contributor Author

Should be good to go!

@nuno-faria (Contributor)

Thanks @jonathanc-n, LGTM. Now we are able to get the performance of metadata caching right out of the box.

@alamb (Contributor) left a comment

Thanks @nuno-faria and @jonathanc-n

For anyone else reviewing, this is not an API change because the config parameter that was removed was added in #16971, which we haven't released yet

@alamb merged commit 60ac1cc into apache:main Aug 7, 2025
28 checks passed
@jonathanc-n deleted the remove-cache_metadata-config branch August 8, 2025 00:37
Development

Successfully merging this pull request may close these issues.

[Parquet Metadata Cache] remove datafusion.execution.parquet.cache_metadata config