Use non-comparison coercion for Coalesce or even avoid implicit casting for Coalesce #10261
Comments
If possible, I think we should follow the model of existing implementations here (rather than invent DataFusion-specific semantics). Specifically, if we are going to change the semantics of coalesce, I think we should follow either Postgres's or Spark's behavior -- I haven't done the research to know how close or far DataFusion's behavior is compared to those systems.
It seems Postgres has casting for some of the types too.
Plan:
datafusion/datafusion/expr/src/type_coercion/functions.rs Lines 82 to 261 in ed14682
https://www.postgresql.org/docs/current/typeconv-union-case.html
BTW this is a really nicely written ticket
Is your feature request related to a problem or challenge?
What does Coalesce do
The coalesce function implicitly coerces types with Signature::VariadicEqual, which uses comparison_coercion internally. The coercion considers all the columns, not only those we need, which gives us unexpected casting results.
Coerce types after first non-null values are known
We can see the following example,
Since the arguments are coerced to Utf8, we get 3 and 2 as Utf8.
Ideally, if we take the first non-null value, we should expect to get Int, not Utf8.
Another example: a dict is cast to Int64.
We need coercion because different columns may have different types. Coercing them gives us a single final type.
We get (1, Int8) and (3, Int32) for the respective rows, and finally cast them to Int32.
I suggest that we apply coercion after we collect those first non-null values for each row.
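To make the suggestion concrete, here is a minimal sketch in plain Rust. It does not use DataFusion's API: `Ty`, `Column`, `widen`, `eager_type`, and `lazy_type` are hypothetical names, and the widening rule is a toy assumption (a larger integer wins, and anything mixed with Utf8 becomes Utf8). It only contrasts resolving the type over every argument with resolving it over the arguments that actually supply a first non-null value.

```rust
// Hypothetical, simplified model for illustration; not DataFusion's types.
// Variant order matters: the derived `Ord` is used as the toy widening order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Ty {
    Int8,
    Int32,
    Utf8,
}

struct Column {
    ty: Ty,
    // `true` marks a non-null slot; payloads are irrelevant for type resolution.
    valid: Vec<bool>,
}

/// Toy widening rule (an assumption): pick the "larger" of the two types.
fn widen(a: Ty, b: Ty) -> Ty {
    a.max(b)
}

/// Eager resolution: unify the declared types of all arguments.
fn eager_type(cols: &[Column]) -> Option<Ty> {
    cols.iter().map(|c| c.ty).reduce(widen)
}

/// Lazy resolution: for each row, find the first non-null argument,
/// then unify only the types of the arguments that were selected.
fn lazy_type(cols: &[Column], num_rows: usize) -> Option<Ty> {
    (0..num_rows)
        .filter_map(|row| cols.iter().find(|c| c.valid[row]).map(|c| c.ty))
        .reduce(widen)
}

fn main() {
    let cols = vec![
        Column { ty: Ty::Int8, valid: vec![true, false] },
        Column { ty: Ty::Int32, valid: vec![false, true] },
        // Never selected: an earlier argument is non-null in every row.
        Column { ty: Ty::Utf8, valid: vec![true, true] },
    ];
    println!("eager: {:?}", eager_type(&cols));
    println!("lazy:  {:?}", lazy_type(&cols, 2));
}
```

One consequence of this sketch is worth keeping in mind: the lazily resolved type depends on the data, so it cannot be determined from the schema alone.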
Use non-comparison coercion
Comparison coercion (fn comparison_coercion) vs non-comparison coercion (fn coerced_from)
These two kinds of logic are quite different. Comparison coercion is for comparison. For example, comparing dict with dict returns the value type, since the dict key is not important. Comparing i64 with u64 falls back to i64 because we don't have i128, even though this can be lossy for large U64 values; most cases, like U64(1) and I64(1), are comparable, so we do not block comparison over those edge cases. There might be more such cases.
Given the difference between these two coercions, I think non-comparison coercion is more suitable for the Coalesce function.
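The i64/u64 edge case can be illustrated with plain Rust integers. This is only a sketch of the arithmetic, not of DataFusion's cast kernels: `eq_as_i64` and `eq_as_i128` are hypothetical helpers, and Rust's native `i128` stands in for the wider type that the coercion rules do not have available.

```rust
use std::convert::TryFrom;

/// Compare by bringing the unsigned side down to i64.
/// Returns `None` when the u64 value has no i64 representation.
fn eq_as_i64(a: i64, b: u64) -> Option<bool> {
    i64::try_from(b).ok().map(|b| a == b)
}

/// Lossless alternative: widen both sides to i128 before comparing.
fn eq_as_i128(a: i64, b: u64) -> bool {
    i128::from(a) == i128::from(b)
}

fn main() {
    // The common case: small values are comparable after the cast.
    println!("{:?}", eq_as_i64(1, 1));
    // The edge case: u64::MAX does not fit in i64.
    println!("{:?}", eq_as_i64(-1, u64::MAX));
    // A wrapping cast would silently turn u64::MAX into -1.
    println!("{}", u64::MAX as i64);
    // With a wider type, the two values are correctly reported as different.
    println!("{}", eq_as_i128(-1, u64::MAX));
}
```

Accepting this edge case is a reasonable trade-off for a comparison, but for a function that returns one of its arguments, the same cast changes the value the user gets back.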
Btw, I think make_array should switch to non-comparison coercion too.
make_array should do comparison coercion, following what DuckDB does.
I suggest we switch VariadicEqual to non-comparison coercion or introduce another signature VariadicEqualNonCompare if comparison coercion is needed somewhere.
I think Int64 is a big surprise that we should avoid.
Maybe disable implicit coercion for Coalesce function?
I'm not sure why the coalesce function introduces implicit coercion, but I found that Postgres and DuckDB do not do implicit casting for Coalesce. Maybe we should follow them?
DuckDB errors
Postgres Error
Describe the solution you'd like
Describe alternatives you've considered
No response
Additional context
No response
Related issue that has coercion issue from Coalesce #10221
Part of the idea #10241