Skip to content

Conversation

@paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Oct 8, 2025

Which issue does this PR close?

Rationale for this change

Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding.

What changes are included in this PR?

The DataType in the Placeholder struct was replaced with a FieldRef along with anything that stored the "DataType" of a parameter.

Strictly speaking one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens up the door for things like handling UUID in SQL with full parameter binding support.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, one new function was added to extract the placeholder fields from a plan.

This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones).

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions proto Related to proto crate physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate common Related to common crate labels Oct 8, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 10, 2025
Copy link
Member Author

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timsaucer @alamb I'm happy to add tests for all these components but wanted to make sure this is vaugely headed in the right direction before I do so!

Comment on lines -1373 to +1145
pub data_type: Option<DataType>,
pub field: Option<FieldRef>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change. We can change this to a less severely breaking option (e.g., just add metadata: FieldMetadata to the struct)...I started with the most breaking version to identify its use in as many places as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable change, although to your point it will be a breaking API change and may cause issues for downstream users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the api-change label to this PR

Comment on lines 590 to 592
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change would also enable supporting UUIDs and other SQL types that map to extension types

Comment on lines +989 to +994
Ok(Expr::Cast(Cast::new(
Box::new(expr),
dt.data_type().clone(),
)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place where metadata is dropped that I didn't update (casts)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made #18060 for this one

Comment on lines +54 to +168
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
pub struct FieldMetadata {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this from the expr crate so I could us it in ParamValues

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @paleolimbot -- I think this is definitely the right direction. I had some small API suggestions, but overall looks good

The largest open question in my mind is what you have highlighted for customizing behavior for different extension types (e.g. comparing two fields for "equality" and printing them, and casting them, etc.)

@findepi brought up the same thing many months ago when discussing adding new types in

One idea is to create a TypeRegistry similar to a FunctionRegistry and some sort of ExtensionType trait that encapsulates these behaviors.

The challenge would then be to thread the registry to all places that need it. Though that is likely largely an API design / plumbing exercise

If you think that is an idea worth exploring

impl ParamValues {
/// Verify parameter list length and type
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
pub fn verify(&self, expect: &[FieldRef]) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that would be nice to help people upgrade could be to add a new function and deprecate this one -- perhaps something like suggested in https://datafusion.apache.org/contributor-guide/api-health.html#api-health-policy

    #[deprecated]
    pub fn verify(&self, expect: &[DataType]) -> Result<()> {
      // make dummy Fields
      let expect = ...;
      self.verify_fields(&expect)
     }

    // new function that has the new signature
    pub fn verify_fields(&self, expect: &[FieldRef]) -> Result<()> {
    ...
    }

}

if let Some(expected_metadata) = maybe_metadata {
// Probably too strict of a comparison (this is an example of where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree straight up comparing strings is probably not ideal

If we wanted to introduce type equality, I thing the bigger question is how to thread it through (you would have to have some way to register your types / methods to check equality and ensure that somehow ended up here 🤔 )

@tobixdev
Copy link
Contributor

tobixdev commented Oct 13, 2025

One idea is to create a TypeRegistry similar to a FunctionRegistry and some sort of ExtensionType trait that encapsulates these behaviors.

I've stumbled across this PR and remembered that I implemented such a thing in #15106 for defining custom sort orders for certain types. I now have a bit more time on my hand and I'll try to revive that effort as we would be still interested in that.

@paleolimbot
Copy link
Member Author

Thank you for the review! I will clean this up today/tomorrow and ensure we can test all the spots.

I now have a bit more time on my hand and I'll try to revive that effort as we would be still interested in that.

I'd love that! Perhaps the first step from the non-sorting angle might be to just consolidate all the places we print, cast, or test equality (which are currently scattered around the code base).

A vague thought based mostly on this PR, SedonaDB's SedonaType, and the one linked (likely needs much refinement):

pub fn resolve_df_data_type(session: Option<SessionSomething>, storage_type: &DataType, metadata: impl MetadataViewOfSomeSort ) -> Box<dyn DFDataType> {
  // sanity check to avoid hitting the lock on any registry whatsoever when not needed)

  // If the session is specified, use a session registry

  // otherwise, use a static or compile-time populated registry until such a time as the session
  // is made available everywhere and/or DataType::Extension is a thing
}

pub trait DFDataType: Clone {
  fn storage_type(&self) -> DataType;
  // e.g., the sanity check on the return field of a UDxF
  fn type_identical_to(&self, &dyn DFDataType) -> Result<bool>;
  // for when this has to be a storage field (it might already be one)
  fn storage_field(&self, name: Option<&str>, nullability: Option<bool>) -> FieldRef;
  fn add_to_existing_storage_field(&self
  // for when this has to be just metadata (e.g., the Expr::Literal)
  fn storage_field_metadata(&self) -> Option<FieldMetadata>;
  // to avoid dumping a Field debug string in user-facing errors
  fn display_type(&self) -> String;
  // for any hope of ever being able to deal with shredded/unshredded variants in UNION ALL
  // slash listing tables. This is useful for some geoarrow types too (wkb/wkb_view -> wkb)
  fn common_castable_type(&self, &dyn DFDataType) -> Result<Box<dyn DFDataType>> { /* self if types are identical or error */ }
  // usually printing the storage is OK but no ideal (e.g., variants should be jsonish)
  fn format_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
  // Or something more advanced like the dynamic comparator in the linked PR
  fn sort_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
  // It's a shame to bleed execution in here but casting shows up a lot
  fn cast_storage(&self, storage: ArrayRef, to: Box<dyn DFDataType>);
}

// All the structures that are currently communicating that type information

impl DFDataType for FieldMetadata {
  // implement strict equality based on all metadata (or just extension name/extension metadata?)
  // printing based on metadata
  // everything else based on datatype
  // ...basically everything datafusion does right now
}

impl DFDataType for FieldRef {
  // defer to fieldmetadata
}

impl DFDataType for DataType {
 // defer to DataTypes
}

@tobixdev
Copy link
Contributor

Btw, if you want another set of eyes you can ping me once its ready for another review.

@paleolimbot
Copy link
Member Author

Thank you...I'd love that! I just need to finish up a few tests...I didn't quite get there today but I'm confident I can get this ready for eyes tomorrow.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @paleolimbot, I think this is a step in the right direction

While it does contain some breaking API changes I don't really see how that can be avoided as we need to be working Field through the codebase.

I got half way through the review and then saw new commits so I'll go back and review them now too

Comment on lines +124 to +127
pub fn format_type_and_metadata(
data_type: &DataType,
metadata: Option<&std::collections::HashMap<String, String>>,
) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started a ticket to discuss how such an API might look like:

///
/// Use [`ParamValues::verify_fields`] to ensure field metadata is considered when
/// computing type equality.
#[deprecated]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add a note saying "use varify_fields" as part of the notice as described here: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines

Something like

Suggested change
#[deprecated]
#[deprecated(since = "51.0.0", note = "Use verifiy_fields instead")]

Comment on lines -1373 to +1145
pub data_type: Option<DataType>,
pub field: Option<FieldRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable change, although to your point it will be a breaking API change and may cause issues for downstream users.

}

/// Create a new Placeholder expression from a Field
pub fn new_with_metadata(id: String, field: Option<FieldRef>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be more consistent if it was named new_with_field rather than new_with_metadata


impl Placeholder {
/// Create a new Placeholder expression
pub fn new(id: String, data_type: Option<DataType>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably deprecate this one too perhaps -- and direct users to the one below

.map(Field::try_from)
.collect::<Result<_, _>>()?;

// If the fields are empty this may have been generated by an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a precedent that I know of -- see https://docs.rs/datafusion-proto/latest/datafusion_proto/#version-compatibility

It would be cool to have a way to test this, even if we don't guarantee it. But given there is no precedent I think we can file a follow on ticket to track this feature

// Arrays may be multi-dimensional.
let inner_data_type = self.convert_data_type(inner_sql_type)?;
Ok(DataType::new_list(inner_data_type, true))
let inner_data_type = self.convert_data_type_to_field(inner_sql_type)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me for a while -- I missed inner_data_type is now a Field

Perhaps this would be easier to understand if the datatype was named inner_field

Ok(Field::new(
"",
DataType::List(
inner_data_type.as_ref().clone().with_name("item").into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can potentially avoid a copy using Arc::unwrap_or_clone here:

Suggested change
inner_data_type.as_ref().clone().with_name("item").into(),
Arc::unwrap_or_clone(inner_data_type).with_name("item").into(),

]

// arrow-rs which causes issues. We use the low-level
// constructor here to preserve extension metadata from the
// child type.
Ok(Field::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to start wrapping this pattern into helper methods / extension traits .

I started playing around and was able to move a lot of the mechanics around and simpify them:

@alamb
Copy link
Contributor

alamb commented Oct 22, 2025

In my opinion, there are several things we could improve in this PR, such as the conversion of DataType <> Field but we can do them as follow ons as well (given this one is already pretty big).

Thank you @paleolimbot and @tobixdev for helping blaze the trail.

I suggest we

  1. Address as many comments as feasible on this PR
  2. File any follow on issues to track improvements
  3. Merge it in and keep iterating.

The only thing I would really like to see changed is renaming LiteralValue to something less confusing, but even that is fine to do as a follow on PR I think

@paleolimbot
Copy link
Member Author

Thank you for the reviews!

The only thing I would really like to see changed is renaming LiteralValue to something less confusing

I went with ScalarAndMetadata from your suggestion list. Maybe LogicalValue (i.e., ScalarValue with logical type information) would work in some future world.

It would be nice to start wrapping this pattern into helper methods / extension traits .

Thanks! I think that is a great solution for this PR (I love all things that consolidate these assumptions/operations). Probably also a DFField or LogicalType (== arrow type + metadata + nullability) would help with this...we're using a FieldRef in a very non-Field-y way and that will always be awkward. I'll put a note on one of the UDT threads (definitely a future consideration).

protobuf compatibility follow-on ticket

I'm happy to file one if that is worth tracking (I didn't know that it was explicitly unsupported when I was doing the protobuf changes).

@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 24, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @paleoimbot and @tobixdev

While I am sure we will improve the APIs more as we go forward I think this is a significant improvement over what we currently have so I'll merge it and let's keep on improving in follow on PRs

pub fn check_metadata_with_storage_equal(
actual: (
&DataType,
Option<&std::collections::HashMap<String, String>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is pretty hard to grok -- think we can wrap it up a bit more in subsequent PRs

Comment on lines -1373 to +1145
pub data_type: Option<DataType>,
pub field: Option<FieldRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the api-change label to this PR

.map(Field::try_from)
.collect::<Result<_, _>>()?;

// If the fields are empty this may have been generated by an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb added this pull request to the merge queue Oct 24, 2025
Merged via the queue into apache:main with commit 82b1307 Oct 24, 2025
32 checks passed
@paleolimbot paleolimbot deleted the params-with-metadata branch October 24, 2025 16:09
@alamb
Copy link
Contributor

alamb commented Oct 24, 2025

github-merge-queue bot pushed a commit that referenced this pull request Oct 26, 2025
## Which issue does this PR close?

- Follow on to #17986 from
@paleolimbot


## Rationale for this change

As we thread Field through more of the DataFusion APs, making it easy to
convert back and forth with Field will be increasingly important. We
added some helper methods in
#17986, but I think they could
be better documented (I wrote them so this is not a dig at @paleolimbot
!)

Lets add some more documentation and examples so it is clearer what this
code is doing.

## What changes are included in this PR?

1. Add more Documentation and examples so it is clearer what this code
is doing.


## Are these changes tested?

By CI

## Are there any user-facing changes?
More docs, no functional changes
github-merge-queue bot pushed a commit that referenced this pull request Oct 26, 2025
## Which issue does this PR close?

- Follow on to #17986 from
@paleolimbot

## Rationale for this change

As we thread Field through more of the DataFusion APs, making it easy to
convert back and forth
will be increasingly important. We added `ScalarAndMetadata` and I think
it is a good idea to add
some helper methods to make it easy to create `ScalarAndMetadata`.

## What changes are included in this PR?

Add some From impls that make conversions easier 

## Are these changes tested?

By CI
## Are there any user-facing changes?

SOme new APIs
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

- Closes apache#17862

## Rationale for this change

Most logical plan expressions now propagate metadata; however,
parameters with extension types or other field metadata cannot
participate in placeholder/parameter binding.

## What changes are included in this PR?

The DataType in the Placeholder struct was replaced with a FieldRef
along with anything that stored the "DataType" of a parameter.

Strictly speaking one could bind parameters with an extension type by
copy/pasting the placeholder replacer, which I figured out towards the
end of this change. I still think this change makes sense and opens up
the door for things like handling UUID in SQL with full parameter
binding support.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes, one new function was added to extract the placeholder fields from a
plan.

This is a breaking change for code that specifically interacts with the
pub fields of the modified structs (ParamValues, Placeholder, and
Prepare are the main ones).

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
)

## Which issue does this PR close?

- Follow on to apache#17986 from
@paleolimbot


## Rationale for this change

As we thread Field through more of the DataFusion APs, making it easy to
convert back and forth with Field will be increasingly important. We
added some helper methods in
apache#17986, but I think they could
be better documented (I wrote them so this is not a dig at @paleolimbot
!)

Lets add some more documentation and examples so it is clearer what this
code is doing.

## What changes are included in this PR?

1. Add more Documentation and examples so it is clearer what this code
is doing.


## Are these changes tested?

By CI

## Are there any user-facing changes?
More docs, no functional changes
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

- Follow on to apache#17986 from
@paleolimbot

## Rationale for this change

As we thread Field through more of the DataFusion APs, making it easy to
convert back and forth
will be increasingly important. We added `ScalarAndMetadata` and I think
it is a good idea to add
some helper methods to make it easy to create `ScalarAndMetadata`.

## What changes are included in this PR?

Add some From impls that make conversions easier 

## Are these changes tested?

By CI
## Are there any user-facing changes?

SOme new APIs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates proto Related to proto crate sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical plan placeholders can't contain metadata

3 participants