-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Enable placeholders with extension types #17986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
97a8408 to
04cebe1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timsaucer @alamb I'm happy to add tests for all these components but wanted to make sure this is vaugely headed in the right direction before I do so!
| pub data_type: Option<DataType>, | ||
| pub field: Option<FieldRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change. We can change this to a less severely breaking option (e.g., just add metadata: FieldMetadata to the struct)...I started with the most breaking version to identify its use in as many places as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a reasonable change, although to your point it will be a breaking API change and may cause issues for downstream users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the api-change label to this PR
datafusion/sql/src/planner.rs
Outdated
| pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> { | ||
| pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change would also enable supporting UUIDs and other SQL types that map to extension types
| Ok(Expr::Cast(Cast::new( | ||
| Box::new(expr), | ||
| dt.data_type().clone(), | ||
| ))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another place where metadata is dropped that I didn't update (casts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made #18060 for this one
| #[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)] | ||
| pub struct FieldMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this from the expr crate so I could us it in ParamValues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @paleolimbot -- I think this is definitely the right direction. I had some small API suggestions, but overall looks good
The largest open question in my mind is what you have highlighted for customizing behavior for different extension types (e.g. comparing two fields for "equality" and printing them, and casting them, etc.)
@findepi brought up the same thing many months ago when discussing adding new types in
One idea is to create a TypeRegistry similar to a FunctionRegistry and some sort of ExtensionType trait that encapsulates these behaviors.
The challenge would then be to thread the registry to all places that need it. Though that is likely largely an API design / plumbing exercise
If you think that is an idea worth exploring
datafusion/common/src/param_value.rs
Outdated
| impl ParamValues { | ||
| /// Verify parameter list length and type | ||
| pub fn verify(&self, expect: &[DataType]) -> Result<()> { | ||
| pub fn verify(&self, expect: &[FieldRef]) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing that would be nice to help people upgrade could be to add a new function and deprecate this one -- perhaps something like suggested in https://datafusion.apache.org/contributor-guide/api-health.html#api-health-policy
#[deprecated]
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
// make dummy Fields
let expect = ...;
self.verify_fields(&expect)
}
// new function that has the new signature
pub fn verify_fields(&self, expect: &[FieldRef]) -> Result<()> {
...
}
datafusion/common/src/param_value.rs
Outdated
| } | ||
|
|
||
| if let Some(expected_metadata) = maybe_metadata { | ||
| // Probably too strict of a comparison (this is an example of where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree straight up comparing strings is probably not ideal
If we wanted to introduce type equality, I thing the bigger question is how to thread it through (you would have to have some way to register your types / methods to check equality and ensure that somehow ended up here 🤔 )
I've stumbled across this PR and remembered that I implemented such a thing in #15106 for defining custom sort orders for certain types. I now have a bit more time on my hand and I'll try to revive that effort as we would be still interested in that. |
|
Thank you for the review! I will clean this up today/tomorrow and ensure we can test all the spots.
I'd love that! Perhaps the first step from the non-sorting angle might be to just consolidate all the places we print, cast, or test equality (which are currently scattered around the code base). A vague thought based mostly on this PR, SedonaDB's pub fn resolve_df_data_type(session: Option<SessionSomething>, storage_type: &DataType, metadata: impl MetadataViewOfSomeSort ) -> Box<dyn DFDataType> {
// sanity check to avoid hitting the lock on any registry whatsoever when not needed)
// If the session is specified, use a session registry
// otherwise, use a static or compile-time populated registry until such a time as the session
// is made available everywhere and/or DataType::Extension is a thing
}
pub trait DFDataType: Clone {
fn storage_type(&self) -> DataType;
// e.g., the sanity check on the return field of a UDxF
fn type_identical_to(&self, &dyn DFDataType) -> Result<bool>;
// for when this has to be a storage field (it might already be one)
fn storage_field(&self, name: Option<&str>, nullability: Option<bool>) -> FieldRef;
fn add_to_existing_storage_field(&self
// for when this has to be just metadata (e.g., the Expr::Literal)
fn storage_field_metadata(&self) -> Option<FieldMetadata>;
// to avoid dumping a Field debug string in user-facing errors
fn display_type(&self) -> String;
// for any hope of ever being able to deal with shredded/unshredded variants in UNION ALL
// slash listing tables. This is useful for some geoarrow types too (wkb/wkb_view -> wkb)
fn common_castable_type(&self, &dyn DFDataType) -> Result<Box<dyn DFDataType>> { /* self if types are identical or error */ }
// usually printing the storage is OK but no ideal (e.g., variants should be jsonish)
fn format_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
// Or something more advanced like the dynamic comparator in the linked PR
fn sort_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
// It's a shame to bleed execution in here but casting shows up a lot
fn cast_storage(&self, storage: ArrayRef, to: Box<dyn DFDataType>);
}
// All the structures that are currently communicating that type information
impl DFDataType for FieldMetadata {
// implement strict equality based on all metadata (or just extension name/extension metadata?)
// printing based on metadata
// everything else based on datatype
// ...basically everything datafusion does right now
}
impl DFDataType for FieldRef {
// defer to fieldmetadata
}
impl DFDataType for DataType {
// defer to DataTypes
} |
|
Btw, if you want another set of eyes you can ping me once its ready for another review. |
|
Thank you...I'd love that! I just need to finish up a few tests...I didn't quite get there today but I'm confident I can get this ready for eyes tomorrow. |
9b6c182 to
27e8e42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @paleolimbot, I think this is a step in the right direction
While it does contain some breaking API changes I don't really see how that can be avoided as we need to be working Field through the codebase.
I got half way through the review and then saw new commits so I'll go back and review them now too
| pub fn format_type_and_metadata( | ||
| data_type: &DataType, | ||
| metadata: Option<&std::collections::HashMap<String, String>>, | ||
| ) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started a ticket to discuss how such an API might look like:
datafusion/common/src/param_value.rs
Outdated
| /// | ||
| /// Use [`ParamValues::verify_fields`] to ensure field metadata is considered when | ||
| /// computing type equality. | ||
| #[deprecated] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add a note saying "use varify_fields" as part of the notice as described here: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
Something like
| #[deprecated] | |
| #[deprecated(since = "51.0.0", note = "Use verifiy_fields instead")] |
| pub data_type: Option<DataType>, | ||
| pub field: Option<FieldRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a reasonable change, although to your point it will be a breaking API change and may cause issues for downstream users.
datafusion/expr/src/expr.rs
Outdated
| } | ||
|
|
||
| /// Create a new Placeholder expression from a Field | ||
| pub fn new_with_metadata(id: String, field: Option<FieldRef>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more consistent if it was named new_with_field rather than new_with_metadata
|
|
||
| impl Placeholder { | ||
| /// Create a new Placeholder expression | ||
| pub fn new(id: String, data_type: Option<DataType>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably deprecate this one too perhaps -- and direct users to the one below
| .map(Field::try_from) | ||
| .collect::<Result<_, _>>()?; | ||
|
|
||
| // If the fields are empty this may have been generated by an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a precedent that I know of -- see https://docs.rs/datafusion-proto/latest/datafusion_proto/#version-compatibility
It would be cool to have a way to test this, even if we don't guarantee it. But given there is no precedent I think we can file a follow on ticket to track this feature
datafusion/sql/src/planner.rs
Outdated
| // Arrays may be multi-dimensional. | ||
| let inner_data_type = self.convert_data_type(inner_sql_type)?; | ||
| Ok(DataType::new_list(inner_data_type, true)) | ||
| let inner_data_type = self.convert_data_type_to_field(inner_sql_type)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confused me for a while -- I missed inner_data_type is now a Field
Perhaps this would be easier to understand if the datatype was named inner_field
datafusion/sql/src/planner.rs
Outdated
| Ok(Field::new( | ||
| "", | ||
| DataType::List( | ||
| inner_data_type.as_ref().clone().with_name("item").into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can potentially avoid a copy using Arc::unwrap_or_clone here:
| inner_data_type.as_ref().clone().with_name("item").into(), | |
| Arc::unwrap_or_clone(inner_data_type).with_name("item").into(), |
]
datafusion/sql/src/planner.rs
Outdated
| // arrow-rs which causes issues. We use the low-level | ||
| // constructor here to preserve extension metadata from the | ||
| // child type. | ||
| Ok(Field::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to start wrapping this pattern into helper methods / extension traits .
I started playing around and was able to move a lot of the mechanics around and simpify them:
|
In my opinion, there are several things we could improve in this PR, such as the conversion of DataType <> Field but we can do them as follow ons as well (given this one is already pretty big). Thank you @paleolimbot and @tobixdev for helping blaze the trail. I suggest we
The only thing I would really like to see changed is renaming |
Add DataTypeExt and FieldExt to assist converting to/from Fields and …
|
Thank you for the reviews!
I went with
Thanks! I think that is a great solution for this PR (I love all things that consolidate these assumptions/operations). Probably also a
I'm happy to file one if that is worth tracking (I didn't know that it was explicitly unsupported when I was doing the protobuf changes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @paleoimbot and @tobixdev
While I am sure we will improve the APIs more as we go forward I think this is a significant improvement over what we currently have so I'll merge it and let's keep on improving in follow on PRs
| pub fn check_metadata_with_storage_equal( | ||
| actual: ( | ||
| &DataType, | ||
| Option<&std::collections::HashMap<String, String>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is pretty hard to grok -- think we can wrap it up a bit more in subsequent PRs
| pub data_type: Option<DataType>, | ||
| pub field: Option<FieldRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the api-change label to this PR
| .map(Field::try_from) | ||
| .collect::<Result<_, _>>()?; | ||
|
|
||
| // If the fields are empty this may have been generated by an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
I also made a few small follow on PRs: |
## Which issue does this PR close? - Follow on to #17986 from @paleolimbot ## Rationale for this change As we thread Field through more of the DataFusion APs, making it easy to convert back and forth with Field will be increasingly important. We added some helper methods in #17986, but I think they could be better documented (I wrote them so this is not a dig at @paleolimbot !) Lets add some more documentation and examples so it is clearer what this code is doing. ## What changes are included in this PR? 1. Add more Documentation and examples so it is clearer what this code is doing. ## Are these changes tested? By CI ## Are there any user-facing changes? More docs, no functional changes
## Which issue does this PR close? - Follow on to #17986 from @paleolimbot ## Rationale for this change As we thread Field through more of the DataFusion APs, making it easy to convert back and forth will be increasingly important. We added `ScalarAndMetadata` and I think it is a good idea to add some helper methods to make it easy to create `ScalarAndMetadata`. ## What changes are included in this PR? Add some From impls that make conversions easier ## Are these changes tested? By CI ## Are there any user-facing changes? SOme new APIs
## Which issue does this PR close? - Closes apache#17862 ## Rationale for this change Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding. ## What changes are included in this PR? The DataType in the Placeholder struct was replaced with a FieldRef along with anything that stored the "DataType" of a parameter. Strictly speaking one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens up the door for things like handling UUID in SQL with full parameter binding support. ## Are these changes tested? Yes ## Are there any user-facing changes? Yes, one new function was added to extract the placeholder fields from a plan. This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones). --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
) ## Which issue does this PR close? - Follow on to apache#17986 from @paleolimbot ## Rationale for this change As we thread Field through more of the DataFusion APs, making it easy to convert back and forth with Field will be increasingly important. We added some helper methods in apache#17986, but I think they could be better documented (I wrote them so this is not a dig at @paleolimbot !) Lets add some more documentation and examples so it is clearer what this code is doing. ## What changes are included in this PR? 1. Add more Documentation and examples so it is clearer what this code is doing. ## Are these changes tested? By CI ## Are there any user-facing changes? More docs, no functional changes
## Which issue does this PR close? - Follow on to apache#17986 from @paleolimbot ## Rationale for this change As we thread Field through more of the DataFusion APs, making it easy to convert back and forth will be increasingly important. We added `ScalarAndMetadata` and I think it is a good idea to add some helper methods to make it easy to create `ScalarAndMetadata`. ## What changes are included in this PR? Add some From impls that make conversions easier ## Are these changes tested? By CI ## Are there any user-facing changes? SOme new APIs
Which issue does this PR close?
Rationale for this change
Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding.
What changes are included in this PR?
The DataType in the Placeholder struct was replaced with a FieldRef along with anything that stored the "DataType" of a parameter.
Strictly speaking one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens up the door for things like handling UUID in SQL with full parameter binding support.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, one new function was added to extract the placeholder fields from a plan.
This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones).