-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat: add metadata to physical literal expressions #16053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
821bffd
to
dce4360
Compare
…metadata, and if they do so do not simplify
@kylebarron @paleolimbot I tested this latest push against the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! We have a set of queries we're running in testing but can't port them over to dev DataFusion yet for versioning reasons. We're fully intending to get there but haven't quite yet!
#[derive(Debug, PartialEq, Eq)] | ||
pub struct Literal { | ||
value: ScalarValue, | ||
metadata: Option<HashMap<String, String>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting - this puts the responsibility for carrying metadata on the Literal
instead of the ScalarValue
. I wonder if there are other places (e.g., the Statistics
) where we'll need a scalar paired with metadata (I was wondering if ScalarValue::WithMetadata(Box<ScalarValue>, HashMap<String, String>>)
might work as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I hadn't thought about that. That's a very interesting way to solve one of the problems I was coming up against. I will try a proof of concept of that on a separate branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on that! That's an interesting idea! When i experimented with logical scalars, I did not go down that path as I thought that it is very easy to miss handling metadata that only adds little information to a scalar.
To elaborate a bit on that: Let's assume I have a DataType::Utf8
field and a len()
function that returns the length of a value. For strings, that could be the number of unicode characters.
Now, given the following "implementation" of len
:
match value {
ScalarValue::Utf8(value) => count_chars(value),
_ => panic!("bye bye")
}
Now I'd like to call this function with ScalarValue::WithMetadata(my_string, LANGUAGE_INFO)
. This call will panic, as the actual scalar is hidden behind the metadata wrapper. Initially I though this is a deal-breaker, as many functions just want to work on the physical data. After all, that's how it is with arrays, as the Metadata can be simply ignored.
However, in a scenario where the metadata indicates an extension type, it may be a good thing that Metadata cannot be ignored that easily. For example, the len()
function should probably return the number of keys/entries in the top-level JSON object/array. Carrying the metadata in a separate ScalarValue
variant then may be a good thing! Especially if there is a function that unwraps the metadata so that some functions can operate on the scalar without metadata.
Hope this does make some sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two compelling arguments for keeping the metadata on the Literal
rather than the ScalarValue
- Does not require a change to the protobuf definitions. I haven't tested this thoroughly yet, but I think if we put the metadata on the scalar value we would need more breaking changes on the conversion between protobuf and internal types, basically needing to pass the schema around. This isn't a deal-breaker but I could see pain points for some users.
- The match statements like @tobixdev points out. I've also seen a growing usage of things like
let ScalarValue::Utf8(my_string) = val else { return exec_error!("invalid scalar value") }
which would not survive a case where they really do have a valid value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping with the literial is OK as long as we are also prepared to make any other breaking changes required to pipe metadata through to ensure that it is not dropped/it is possible to implement extension logic in other places without rewriting large chunks of DataFusion code. A few places I've noted in passing are the Statistics, the physical and logical literals, casting logic, and some conversions like protobuf and to/from Python via ffi. Perhaps those places should be using a Literal
instead of a ScalarValue
?
This is possibly a good illustration of the difference between "metadata" and extension types...an extension type scalar should not be blindly/implicitly cast to its storage type; however, for an integer that happened to have some field metadata attached to it, it would be surprising if that metadata caused an error somewhere in the engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slightly deeper cut in this PR: #16170
In addition to the work in the current PR, that above one adds the metadata directly to the Literal
variant of the logical plan rather than only the physical plan. I think that one is a better version long term, but it is more of a breaking change. If we're doing these other breaking changes for metadata in user defined functions in DF48, then maybe we can add it in as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good! You have a much better mental model of the internals and I trust you. Optimizing for ensuring existing code runs (i.e., ensuring somebody's integer that happened to have field metadata is treated the same as in DataFusion 47) is probably better as long as there's a future plan for some of the other things we might need to do to get extension-type like behaviour 🙂
Closing in favor of #16170 |
Which issue does this PR close?
Rationale for this change
Now that we have metadata processing on scalar UDFs, we need to support scalar values that have metadata associated with them. This will allow users to send scalar values to these functions and also evaluate the metadata. This also impacts how extension types would be handled by these UDFs.
What changes are included in this PR?
This PR adds in the option to pass metadata from an alias expression on a literal value through to the physical plan. It requires some rework of the expression simplifier code where we now check aliases to see if they have metadata instead of just simplifying all aliases.
It is unclear to me if instead of relying on
Alias
to add metadata if we need to add this directly into theExpr::Literal
struct instead. That would be a more impactful change to downstream projects, but I am willing to add that work if people believe it would make for a better implementation.Are these changes tested?
Unit test added, and will also be testing with geoarrow-rs project where an example of this problem was discovered.
Are there any user-facing changes?
None