feat: Use Cached Metadata for ListingTable Statistics #17022

Conversation
Very exciting -- I have planned time this week to help review and get these various caching things completed
Thanks @shehabgamin -- the code in this PR looks great. I think we just need to add some tests to make sure we don't break this feature in the future
Maybe we can add a test based on the existing ones that shows a second retrieval of statistics doesn't refetch the same footer 🤔
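For concreteness, a rough sketch of what such a test could look like, modeled on the request-counting object store that the existing metadata tests use; the helper names and signatures here are assumptions, not the final code:

// Hypothetical regression test: a second statistics read must not issue
// any new object-store requests, because the footer is already cached.
let store = request_counting_store(); // assumed helper wrapping an ObjectStore

let stats = fetch_statistics(store.as_ref(), schema.clone(), &meta, None)
    .await
    .expect("first statistics read fetches the footer");
let requests_after_first = store.request_count();

let stats_again = fetch_statistics(store.as_ref(), schema.clone(), &meta, None)
    .await
    .expect("second statistics read is served from the cache");

// No additional requests were made, and the two answers agree
assert_eq!(store.request_count(), requests_after_first);
assert_eq!(stats.num_rows, stats_again.num_rows);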
For sure! Will knock that out when I have some downtime today/tonight
@alamb Done!
Thanks @shehabgamin -- this code and tests look great to me
I also tried it locally and it is 👨‍🍳👌. Note that queries answered from statistics (like the min/max ones below) return immediately:

This PR:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./target/release/datafusion-cli
DataFusion CLI v49.0.0
> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.298 seconds.
> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 0.273 seconds.
> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 0.241 seconds.
DataFusion 49.0.0:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ~/Software/datafusion-cli/datafusion-cli-49.0.0
DataFusion CLI v49.0.0
> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.313 seconds.
> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 1.380 seconds.
However, when I tried merging up from main locally, this PR didn't compile. It seems to have a logical conflict with #17062.
The good news is that since we have removed that field, I think this PR can get significantly simpler. Sorry I can't push the changes directly to this PR, but I don't have write access to the lakehq fork:
error[E0609]: no field `cache_metadata` on type `datafusion_common::config::ParquetOptions`
--> datafusion/core/src/datasource/file_format/parquet.rs:204:37
|
204 | format.options().global.cache_metadata,
| ^^^^^^^^^^^^^^ unknown field
.expect("error reading metadata with hint"); | ||
assert_eq!(store.request_count(), 4); | ||
|
||
// Increase by 2 because `cache_metadata` is false |
👍
 mod opener;
 mod page_filter;
-mod reader;
+pub mod reader;
why does this need to be made pub? I reverted the change and things still seem to compile just fine
So downstream crates can use it. The reader file looked like it could generally be useful for downstream crates, which is why I made the entire reader public.
Currently in Sail we build our own cache for each cache type in `CacheManagerConfig`. I was unable to access `CachedParquetMetaData` to do something like the following unless I made `CachedParquetMetaData` public (full code):
// Report the memory footprint of a cached entry when it holds Parquet metadata
if let Some(parquet_metadata) =
    value.1.as_any().downcast_ref::<CachedParquetMetaData>()
{
    parquet_metadata
        .parquet_metadata()
        .memory_size()
        .min(u32::MAX as usize)
}
I see -- my concern is that this change might be easy to accidentally undo / break in the future. Maybe to make it more deliberate, you could leave `mod reader` as-is and then `pub use` both `CachedParquetMetaData` and `CachedParquetFileReader` 🤔
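For reference, a minimal sketch of that arrangement, assuming both types live in reader.rs:

// Keep the module private but deliberately re-export the two public types
mod reader;
pub use reader::{CachedParquetFileReader, CachedParquetMetaData};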
Works for me!
 /// from the [`FileMetadataCache`], if available, otherwise reads it directly from the file and then
 /// updates the cache.
-pub(crate) struct CachedParquetFileReader {
+pub struct CachedParquetFileReader {
Likewise, I locally reverted these changes to visibility and everything seems to have compiled just fine
Same reason as here:
https://github.com/apache/datafusion/pull/17022/files#r2262208618
     self
 }

+pub fn with_cache_metadata(mut self, cache_metadata: bool) -> Self {
I think this option was removed in #17062 so it is no longer needed
@nuno-faria and @jonathanc-n -- do you have time to review this PR as well?
Some changes might be needed after the removal of
let metadata = Arc::new(
    reader
        .load_and_finish(fetch, file_size)
        .await
        .map_err(DataFusionError::from)?,
);

if let Some(cache) = file_metadata_cache {
    cache.put(
        meta,
        Arc::new(CachedParquetMetaData::new(Arc::clone(&metadata))),
    );
}
I think there is an issue with the `fetch_parquet_metadata` function. When it is initially called to retrieve the schema (in `fetch_schema`), it reads the metadata and updates the cache. When the `CachedParquetFileReader` later tries to get the metadata, it finds it in the cache. However, the cached metadata does not contain the page index, since the page index is not retrieved by `fetch_parquet_metadata`, meaning it will have to be read in every query.
So `fetch_parquet_metadata` needs to retrieve the entire metadata for the caching to be effective. On the other hand, after that we will have two different places where the entire metadata is read and cached (`CachedParquetFileReader` and `fetch_parquet_metadata`), so creating a utility function `retrieve_full_parquet_metadata -> Arc<ParquetMetaData>` might be useful to avoid duplicate modifications in the future.
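To illustrate, a minimal sketch of that utility, reusing the `ParquetMetaDataReader` call from the diff above; the function name, the parameter list, and the `with_page_indexes` usage are assumptions about how it could look, not the final code:

// Hypothetical helper: read the complete metadata (including the page
// index) once, and populate the shared cache so that both fetch_schema
// and CachedParquetFileReader observe the same full entry.
pub(crate) async fn retrieve_full_parquet_metadata<F: MetadataFetch>(
    fetch: F,
    file_size: usize,
    meta: &ObjectMeta,
    file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
) -> Result<Arc<ParquetMetaData>> {
    let metadata = Arc::new(
        ParquetMetaDataReader::new()
            // Read the page index up front so cached entries are complete
            .with_page_indexes(true)
            .load_and_finish(fetch, file_size)
            .await
            .map_err(DataFusionError::from)?,
    );

    if let Some(cache) = file_metadata_cache {
        cache.put(
            meta,
            Arc::new(CachedParquetMetaData::new(Arc::clone(&metadata))),
        );
    }

    Ok(metadata)
}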
Ahh nice catch, will do!
Thanks @shehabgamin, overall LGTM.
-pub async fn fetch_parquet_metadata(
-    store: &dyn ObjectStore,
-    meta: &ObjectMeta,
+pub async fn fetch_parquet_metadata<F: MetadataFetch>(
I wonder now if this function should go in a separate module (a `utils.rs` or similar?). This is because `file_format` refers to `reader` and `reader` refers to `file_format`. cc @alamb
I agree -- I am thinking the code that handles fetching metadata and schema is also getting enough options that it can probably be its own file / structure too. Maybe something like
pub struct ParquetMetadataFetcher {
...
}
So we could use it like
let fetcher = ParquetMetadataFetcher::new(object_store, path)
.with_hint(...);
fetcher.fetch_metadata().await?
🤔
The more I looked, the more it seemed like what would be helpful would be a whole new module, `statistics.rs` or something.
I am hacking up a prototype of what this might look like
I bashed out a PR that pulled the metadata handling out into its own module and I think it looks quite a bit nicer. Thank you for the suggestion @nuno-faria
The 0.16 seconds is pretty sweet! On main, the same query takes 1.3 seconds:

> select min("WatchID"), max("WatchID") from hits;
+---------------------+---------------------+
| min(hits.WatchID)   | max(hits.WatchID)   |
+---------------------+---------------------+
| 4611686071420045196 | 9223372033328793741 |
+---------------------+---------------------+
1 row(s) fetched.
Elapsed 1.317 seconds.
I took another look at this - and it looks great to me. Thank you @shehabgamin and @nuno-faria and @jonathanc-n
 file: &ObjectMeta,
 metadata_size_hint: Option<usize>,
 coerce_int96: Option<TimeUnit>,
+file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
I re-reviewed these changes, and since I think we always now have a file_metadata_cache, we could probably make it non-optional. However, there are many tests that would need to change.
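Concretely, the suggestion amounts to a one-line sketch against the parameter list above (hypothetical; call sites and tests would then pass a cache explicitly instead of None):

 file: &ObjectMeta,
 metadata_size_hint: Option<usize>,
 coerce_int96: Option<TimeUnit>,
-file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
+file_metadata_cache: Arc<dyn FileMetadataCache>,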
Which issue does this PR close?

Rationale for this change

Faster queries and downstream usability.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

No.