Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect column cast when coalescing array type #9098

Open
vigimite opened this issue Feb 1, 2024 · 5 comments
Open

Incorrect column cast when coalescing array type #9098

vigimite opened this issue Feb 1, 2024 · 5 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@vigimite
Copy link

vigimite commented Feb 1, 2024

Describe the bug

When coalescing between two arrays, datafusion will infer the type to be Utf8.

To Reproduce

❯ select arrow_typeof(coalesce([1], [2]));
+-------------------------------------------------------------------+
| arrow_typeof(coalesce(make_array(Int64(1)),make_array(Int64(2)))) |
+-------------------------------------------------------------------+
| Utf8                                                              |
+-------------------------------------------------------------------+
1 row in set. Query took 0.001 seconds.

❯ select arrow_typeof(coalesce(NULL, [2]));
+---------------------------------------------------+
| arrow_typeof(coalesce(NULL,make_array(Int64(2)))) |
+---------------------------------------------------+
| Utf8                                              |
+---------------------------------------------------+
1 row in set. Query took 0.001 seconds.

❯ select arrow_typeof(coalesce([1], NULL));
+---------------------------------------------------+
| arrow_typeof(coalesce(make_array(Int64(1)),NULL)) |
+---------------------------------------------------+
| Utf8                                              |
+---------------------------------------------------+
1 row in set. Query took 0.001 seconds.

Expected behavior

Postgres handles this correctly for example:

select pg_typeof(coalesce(NULL, array[1,2,3]));

-- pg_typeof
-- integer[]

Additional context

I didn't test this extensively but I think the same is true for struct types (and by extension maybe all "complex" type?)

If someone comes accross this, I am currently working around this by replacing the coalesce with a case statement:

when(col("c1").is_null(), col("c2"))
    .otherwise(col("c1"))
    .unwrap()
    .alias("c1"),

This preserves the type of the column.

@vigimite vigimite added the bug Something isn't working label Feb 1, 2024
@Omega359
Copy link
Contributor

Omega359 commented Feb 1, 2024

There doesn't seem to be any tests that cover this case that I can find.

@Omega359
Copy link
Contributor

Omega359 commented Feb 1, 2024

Possibly related - I found this in the arrow_typeof.slt

#TODO: arrow-rs doesn't support it yet
#quer
y ?
#select arrow_cast('1', 'FixedSizeList(1, Int64)');
#----
#[1]

Enabling that results in an error converting utf8 which I suspect may be related to what is happening here

External error: query failed: DataFusion error: Error during planning: Cannot automatically convert Utf8 to FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 1)
[SQL] select arrow_cast('1', 'FixedSizeList(1, Int64)');
at test_files/arrow_typeof.slt:397

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

@Weijun-H just merged #8902 and I think FixedSizeList is now supported in arrow_cast that @Omega359 mentions in #9098 (comment)

I would not at all be surprised if the type coercion logic doesn't work for coaelsce

I just double checked and even on latest main I get the same results reported in this ticket

@alamb alamb added the help wanted Extra attention is needed label Feb 1, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 2, 2024

The reason that we got UTF8 is that the signature of coalesce is Variadic that contains UTF8 (one of SUPPORTED_COALESCE_TYPES) and both null and list(i64) can be coerced into UTF8. Therefore they are utf8 at the end.

We should change the signature of Coalesce into VariadicEqual

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 2, 2024

While I tried out, I failed in one of the query in joins.slt

query TIR
select col1, col2, coalesce(sum_col3, 0) as sum_col3
from (select distinct col2 from tbl) AS q1
cross join (select distinct col1 from tbl) AS q2
left outer join (SELECT col1, col2, sum(col3) as sum_col3 FROM tbl GROUP BY col1, col2) AS q3
USING(col2, col1)
ORDER BY col1, col2

which the error is that

External error: query failed: DataFusion error: Error during planning: No function matches the given name and argument types 'coalesce(UInt64, Int64)'. You might need to add explicit type casts.
	Candidate functions:
	coalesce(CoercibleT, .., CoercibleT)

This error occurs because when we have type i64 and u64 in comparison_coercion, it returns i64. But coerced_from fails to convert u64 to i64.

We can get the same error with select make_array(1, arrow_cast(2, 'UInt64'));, which has the signature VariadicEqual.

One solution is that we introduce i128 like duckdb, then we can convert both u64 and i64 to i128.
The other solution is that we coerced the value to either f64 or i64 or u64 that may lost value but most of the cases are fine.

We can also change the edge cases like query in joins.slt to other smaller type for this issue, then fight with larger types later on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants