@phillipleblanc phillipleblanc commented Mar 12, 2025

Which issue does this PR close?

Rationale for this change

The ListingTableProvider in DataFusion provides an implementation of a TableProvider that organizes a collection of (potentially hive partitioned) files in an object store into a single table.

Hive partition columns are injected into the listing table schema even though they don't exist in the physical Parquet files. In the same way, this PR adds the ability to ask the ListingTable to inject metadata columns whose values come from the ObjectMeta provided by the object store crate. Consumers opt in to exactly the metadata columns they want.

Note: This is related to the ongoing work in #13975 / #14057 / #14362 -- these new metadata columns could be marked as proper system/metadata columns as defined in those PRs - but I don't see that as a prerequisite for this change. Since this would be an opt-in from the consumer, automatic filtering out on a SELECT * doesn't seem required. We could consider automatically enabling these if we decide on proper support for system columns.

What changes are included in this PR?

I've added a new API on the ListingOptions struct that is passed to a ListingTableConfig which is passed to ListingTable::try_new.

    /// Set metadata columns on [`ListingOptions`] and returns self.
    ///
    /// "metadata columns" are columns that are computed from the `ObjectMeta` of the files from object store.
    ///
    /// Available metadata columns:
    /// - `location`: The full path to the object
    /// - `last_modified`: The last modified time
    /// - `size`: The size in bytes of the object
    ///
    /// For example, given the following files in object store:
    ///
    /// ```text
    /// /mnt/nyctaxi/tripdata01.parquet
    /// /mnt/nyctaxi/tripdata02.parquet
    /// /mnt/nyctaxi/tripdata03.parquet
    /// ```
    ///
    /// If the `last_modified` field in the `ObjectMeta` for `tripdata01.parquet` is `2024-01-01 12:00:00`,
    /// then the table schema will include a column named `last_modified` with the value `2024-01-01 12:00:00`
    /// for all rows read from `tripdata01.parquet`.
    ///
    /// | <other columns> | last_modified         |
    /// |-----------------|-----------------------|
    /// | ...             | 2024-01-01 12:00:00   |
    /// | ...             | 2024-01-02 15:30:00   |
    /// | ...             | 2024-01-03 09:15:00   |
    ///
    /// # Example
    /// ```
    /// # use std::sync::Arc;
    /// # use datafusion::datasource::{listing::ListingOptions, file_format::parquet::ParquetFormat};
    ///
    /// let listing_options = ListingOptions::new(Arc::new(
    ///     ParquetFormat::default()
    ///   ))
    ///   .with_metadata_cols(vec![MetadataColumn::LastModified]);
    ///
    /// assert_eq!(listing_options.metadata_cols, vec![MetadataColumn::LastModified]);
    /// ```
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }

That controls whether the ListingTableProvider will add the metadata columns to the schema, similar to how partition columns are added.
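As a rough illustration of that schema extension, here is a minimal, standard-library-only sketch. It is not the PR's implementation: `table_column_names` is a hypothetical helper, columns are modeled as plain names rather than Arrow fields, and the ordering (file columns, then partition columns, then metadata columns) is an assumption. Only the enum variants and the column names `location`, `last_modified`, and `size` come from this PR.

```rust
/// Mirrors the `MetadataColumn` enum introduced in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

impl MetadataColumn {
    /// Column name as listed in the PR description.
    pub fn name(self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::LastModified => "last_modified",
            MetadataColumn::Size => "size",
        }
    }
}

/// Hypothetical helper: builds the table's column list by appending the
/// partition columns and then the requested metadata columns to the columns
/// read from the files. The ordering is an assumption made for this sketch.
pub fn table_column_names(
    file_cols: &[&str],
    partition_cols: &[&str],
    metadata_cols: &[MetadataColumn],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols.iter())
        .map(|c| c.to_string())
        .chain(metadata_cols.iter().map(|m| m.name().to_string()))
        .collect()
}
```

With an empty `metadata_cols` slice the result is just the file and partition columns, which matches the opt-in behavior described above.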

The definition for MetadataColumn is a simple enum:

/// A metadata column that can be used to filter files
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    /// The location of the file in object store
    Location,
    /// The last modified timestamp of the file
    LastModified,
    /// The size of the file in bytes
    Size,
}
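
To make the link between the enum and `ObjectMeta` concrete, the sketch below shows how each variant could select one value per file. It assumes a simplified `FileMeta` struct standing in for `object_store::ObjectMeta` (timestamps reduced to epoch seconds) and a hypothetical `MetadataValue` type; neither appears in the PR, and the real code would produce Arrow scalars or arrays instead.

```rust
/// Simplified stand-in for `object_store::ObjectMeta` (assumption: only the
/// three fields exposed by this PR, with the timestamp as epoch seconds).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileMeta {
    pub location: String,
    pub last_modified: i64,
    pub size: u64,
}

/// Mirrors the `MetadataColumn` enum introduced in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

/// Hypothetical value type for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MetadataValue {
    Utf8(String),
    TimestampSeconds(i64),
    UInt64(u64),
}

impl MetadataColumn {
    /// Returns the value this column takes for every row read from `meta`'s file.
    pub fn value(self, meta: &FileMeta) -> MetadataValue {
        match self {
            MetadataColumn::Location => MetadataValue::Utf8(meta.location.clone()),
            MetadataColumn::LastModified => MetadataValue::TimestampSeconds(meta.last_modified),
            MetadataColumn::Size => MetadataValue::UInt64(meta.size),
        }
    }
}
```

Because the value depends only on the file, it is constant across all rows of that file, just like a hive partition value.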

Filters applied directly to metadata columns can be used to prune files that don't need to be read. For example, SELECT * FROM my_listing_table WHERE last_modified > '2025-03-10' will only scan files that were modified after '2025-03-10'.
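
The pruning idea can be sketched without any DataFusion types: evaluate the predicate against each file's metadata before opening the file. This is only an illustration under stated assumptions. `FileMeta` is a simplified stand-in for `object_store::ObjectMeta`, `prune_by_last_modified` is a hypothetical helper, and timestamps are modeled as epoch seconds; the PR itself works on filter expressions rather than a hard-coded comparison.

```rust
/// Simplified stand-in for `object_store::ObjectMeta` (assumption: timestamp
/// modeled as epoch seconds).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileMeta {
    pub location: String,
    pub last_modified: i64,
    pub size: u64,
}

/// Hypothetical helper: keeps only the files whose `last_modified` is strictly
/// greater than `cutoff`, i.e. the files a `last_modified > cutoff` filter
/// could still match. Files that fail the check never need to be scanned.
pub fn prune_by_last_modified(files: &[FileMeta], cutoff: i64) -> Vec<&FileMeta> {
    files.iter().filter(|f| f.last_modified > cutoff).collect()
}
```

The same pattern would apply to `size` or `location` predicates, since all three values are known from the object store listing alone.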

Are these changes tested?

Yes, I've added tests in several places, including new tests for functions I changed that previously had no test coverage.

Are there any user-facing changes?

The main change is the new with_metadata_cols API on the ListingOptions struct. This is not a breaking change: by default no metadata columns are added unless with_metadata_cols is explicitly called.

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Mar 12, 2025
Comment on lines +464 to +467
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }
This is the main API change that consumers would use to enable these columns on the Listing Table. They aren't added by default.

@phillipleblanc phillipleblanc changed the title Support metadata columns (location, size, last_modified) in ListingTableProvider Support metadata columns (location, size, last_modified) in ListingTableProvider Mar 12, 2025
@github-actions github-actions bot added the catalog Related to the catalog crate label Apr 8, 2025
phillipleblanc commented Apr 8, 2025

Looks like the security audit failure is due to #15571

Fixed

phillipleblanc added a commit to spiceai/datafusion that referenced this pull request Apr 8, 2025
… ListingTableProvider (#74)

* Initial work on metadata columns

* Metadata filtering working

* Working on plumbing to file scan config

* wip

* All wired up

* Working!

* Use MetadataColumn enum

* Add integration tests for metadata selection + pushdown filtering

UPSTREAM NOTE: This PR was submitted upstream: apache#15181
phillipleblanc added a commit to spiceai/datafusion that referenced this pull request Apr 17, 2025
… ListingTableProvider (#74)

* Initial work on metadata columns

* Metadata filtering working

* Working on plumbing to file scan config

* wip

* All wired up

* Working!

* Use MetadataColumn enum

* Add integration tests for metadata selection + pushdown filtering

UPSTREAM NOTE: This PR was submitted upstream: apache#15181
phillipleblanc added a commit to spiceai/datafusion that referenced this pull request Apr 25, 2025
… ListingTableProvider (#74)

* Initial work on metadata columns

* Metadata filtering working

* Working on plumbing to file scan config

* wip

* All wired up

* Working!

* Use MetadataColumn enum

* Add integration tests for metadata selection + pushdown filtering

UPSTREAM NOTE: This PR was submitted upstream: apache#15181
@alamb alamb mentioned this pull request May 1, 2025
@phillipleblanc

Closing in favor of the approach outlined in #15173

phillipleblanc added a commit to spiceai/datafusion that referenced this pull request May 7, 2025
… ListingTableProvider (#74)

* Initial work on metadata columns

* Metadata filtering working

* Working on plumbing to file scan config

* wip

* All wired up

* Working!

* Use MetadataColumn enum

* Add integration tests for metadata selection + pushdown filtering

UPSTREAM NOTE: This PR was submitted upstream: apache#15181
sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 22, 2025
… ListingTableProvider (#74)

* Initial work on metadata columns

* Metadata filtering working

* Working on plumbing to file scan config

* wip

* All wired up

* Working!

* Use MetadataColumn enum

* Add integration tests for metadata selection + pushdown filtering

UPSTREAM NOTE: This PR was submitted upstream: apache#15181

# Conflicts:
#	datafusion/core/src/datasource/listing/table.rs
#	datafusion/core/tests/sql/path_partition.rs
#	datafusion/datasource/src/file_scan_config.rs
#	datafusion/datasource/src/mod.rs
sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 26, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
kczimm pushed a commit to spiceai/datafusion that referenced this pull request Aug 19, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
kczimm pushed a commit to spiceai/datafusion that referenced this pull request Aug 21, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 9, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 12, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
peasee pushed a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
… ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181
peasee added a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
* fix: Ensure only tables or aliases that exist are projected (#52)
fix: More dangling references (#54)

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.

* Support for metadata columns (`location`, `size`, `last_modified`)  in ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181

* Infer placeholder datatype for `Expr::InSubquery` (#80)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Infer placeholder datatype after `LIMIT` clause as `DataType::Int64` (#81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Do not double alias Exprs

UPSTREAM NOTE: This was attempted to be fixed with
apache#15008 but was closed

This is the tracking issue on DataFusion:
apache#14895
Do not double alias Exprs

* Add prefix to location metadata column (#82)

UPSTREAM NOTE: This will not be upstreamed as is.

* Infer placeholder types for CASE expressions (#87)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Fix `Expr::infer_placeholder_types` inference to not fail (#89)

UPSTREAM NOTE: This has not been submitted upstream yet.

* cherry-pick parquet patch (#94)

* Fix array types coercion: preserve child element nullability for list types (#96)

UPSTREAM NOTE: This was submitted upstream and should be available in DF50

apache#17306

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* do not enforce type guarantees on all Expr traversed in infer_placeholder_types (#97)

* Use UDTF function args in `LogicalPlan::TableScan` name (#98)

* use UDTF function args in LogicalPlan::TableScan name

* update test snapshots

* Implement timestamp_cast_dtype for SqliteDialect (#99)

* Use text for sqlite timestamp

* Add test

* Custom timestamp format for DuckDB (#102)

* Revert "cherry-pick parquet patch (#94)"

This reverts commit d780cc2.

* Support ExprNamed arguments to Scalar UDFs (#104)

* support ExprNamed until 17379 ships

* add same exprnamed lifting to udtf

* resolve projection against `ListingTable` table_schema incl. partition columns (#106)

* fix: Ensure ListingTable partitions are pruned when filters are not used (#108)

* fix: Prune partitions when no filters are defined

* fix: Backport for DF49:

* review: Address comments

* FileScanConfig: Preserve schema metadata across serde boundary (#107)

* FileScanConfig: preserve schema metadata across serde boundary

* add test

* Merge conflict fixes

UPSTREAM NOTE: this should not be upstreamed. This contains conflict fixes from various cherry-picks and differences in v50.

* update arrow-rs fork

UPSTREAM NOTE: this should not be upstreamed

---------

Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Kevin Zimmerman <4733573+kczimm@users.noreply.github.com>
Co-authored-by: sgrebnov <sergei.grebnov@gmail.com>
Co-authored-by: jeadie <jack@spice.ai>
Co-authored-by: Jack Eadie <jack.eadie0@gmail.com>
Co-authored-by: Viktor Yershov <krinart@gmail.com>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: David Stancu <david@spice.ai>