Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRPOS doesn't support (Dictionary(Int32, Utf8) as input type #12670

Closed
goldmedal opened this issue Sep 29, 2024 · 11 comments · Fixed by #12712
Closed

STRPOS doesn't support (Dictionary(Int32, Utf8) as input type #12670

goldmedal opened this issue Sep 29, 2024 · 11 comments · Fixed by #12712
Labels
bug Something isn't working

Comments

@goldmedal
Copy link
Contributor

goldmedal commented Sep 29, 2024

Describe the bug

While working on #12415, I found STRPOS can't accept a dictionary string as an argument.

To Reproduce

The following SQL can reproduce this bug

> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.068 seconds.

> create table test_basic_operator as
select
    arrow_cast(column1, 'Dictionary(Int32, Utf8)') as ascii_1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as ascii_2,
    arrow_cast(column3, 'Dictionary(Int32, Utf8)') as unicode_1,
    arrow_cast(column4, 'Dictionary(Int32, Utf8)') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.043 seconds.

> SELECT
  STRPOS(ascii_1, 'e'),
  STRPOS(ascii_1, 'ang'),
  STRPOS(ascii_1, NULL),
  STRPOS(unicode_1, 'и'),
  STRPOS(unicode_1, 'ион'),
  STRPOS(unicode_1, NULL)
FROM test_basic_operator;
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos

Expected behavior

This SQL should work.

Additional context

Actually, it may be a bug in recent commits. I am really sure it worked until 26c8004 (#12619). After rebasing to the latest main branch, the test case added by #12618 is broken. The bug maybe caused by 524e56d..c21d025

@goldmedal goldmedal added the bug Something isn't working label Sep 29, 2024
@alamb
Copy link
Contributor

alamb commented Sep 29, 2024

This is a good example of the kind of bug that @findepi 's proposal will hopefully help with:

Specifically once we get a standardized way to make specialized function bodies maybe this bug would be fixed by switching to that standardized way

@findepi
Copy link
Member

findepi commented Sep 30, 2024

@goldmedal can you please check the following queries?

orignal query from this isuse

> select strpos(arrow_cast('123', '(Dictionary(Int32, Utf8)'), '123');
Error during planning: Unsupported type '(Dictionary(Int32, Utf8)'. [...]

a query without ( before Dictionary in the cast definition:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123');
+-----------------------------------------------------------------------------+
| strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) |
+-----------------------------------------------------------------------------+
| 1                                                                           |
+-----------------------------------------------------------------------------+
1 row(s) fetched.

@findepi
Copy link
Member

findepi commented Sep 30, 2024

It seems true that strpos doesn't natively handle dictionaries as input type. It seems they are handled here by requesting a coercion

(DataType::Dictionary(_, value_type), DataType::LargeUtf8 | DataType::Utf8View | DataType::Utf8) => match **value_type {
DataType::LargeUtf8 | DataType::Utf8View | DataType::Utf8 | DataType::Null | DataType::Binary => Ok(vec![*value_type.clone(), second.to_owned()]),

but i don't see this reflected in the EXPLAIN plan, so not sure how it works.

BTW strpos use of function-driven coercion rules is non-orthodox and seems not exactly in line with generally recommended approach (see @alamb 's #12308 (comment)), which may lead to inconsistencies between this function and others.
It seems that function implementors have too much control over planning/coercions.
cc @alamb @jayzhan211 @notfilippo

@goldmedal
Copy link
Contributor Author

a query without ( before Dictionary in the cast definition:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123');
+-----------------------------------------------------------------------------+
| strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) |
+-----------------------------------------------------------------------------+
| 1                                                                           |
+-----------------------------------------------------------------------------+
1 row(s) fetched.

Oh, you're right. The scalar value works well. I actually found this bug in another case.
I'll update the reproducing steps. Thanks!

> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.068 seconds.

> create table test_basic_operator as
select
    arrow_cast(column1, 'Dictionary(Int32, Utf8)') as ascii_1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as ascii_2,
    arrow_cast(column3, 'Dictionary(Int32, Utf8)') as unicode_1,
    arrow_cast(column4, 'Dictionary(Int32, Utf8)') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.043 seconds.

> SELECT
  STRPOS(ascii_1, 'e'),
  STRPOS(ascii_1, 'ang'),
  STRPOS(ascii_1, NULL),
  STRPOS(unicode_1, 'и'),
  STRPOS(unicode_1, 'ион'),
  STRPOS(unicode_1, NULL)
FROM test_basic_operator;
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos

@findepi
Copy link
Member

findepi commented Sep 30, 2024

It seems to have something to do with NULL handling. Your last query runs fine on current main if you remove STRPOS(..., NULL) cases. And a query with a NULL second arg fails even in single-row / VALUES case:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), NULL);
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos

@Omega359
Copy link
Contributor

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :

       Signature::one_of(
                vec![
                    Exact(vec![Utf8View, Utf8View]),
                    Exact(vec![Utf8View, Utf8]),
                    Exact(vec![Utf8View, LargeUtf8]),
                    Exact(vec![Utf8, Utf8View]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, LargeUtf8]),
                    Exact(vec![LargeUtf8, Utf8View]),
                    Exact(vec![LargeUtf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                ],
                Volatility::Immutable,
            )

should be enough?

@alamb
Copy link
Contributor

alamb commented Sep 30, 2024

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as

I agree that seems reasonable to me. Maybe the code just needs to be cleaned up

@jayzhan211
Copy link
Contributor

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :

       Signature::one_of(
                vec![
                    Exact(vec![Utf8View, Utf8View]),
                    Exact(vec![Utf8View, Utf8]),
                    Exact(vec![Utf8View, LargeUtf8]),
                    Exact(vec![Utf8, Utf8View]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, LargeUtf8]),
                    Exact(vec![LargeUtf8, Utf8View]),
                    Exact(vec![LargeUtf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                ],
                Volatility::Immutable,
            )

should be enough?

Because we need to handle Nulls as well

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

Because we need to handle Nulls as well

I feel like we need to teach the coercion logic that it can substitute Null for any data type passed in

So given a signature like

                    Exact(vec![Utf8View, Utf8View]),

I would expect the coercion logic to be able to handle inputs like the following (by casting Null to Utf8View)

(Null, Null)
(Null, Utf8View) 
(Utf8View, Null)
(Utf8View, Utf8View)

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

Filed #12698 to track trying to improve this

@findepi
Copy link
Member

findepi commented Oct 2, 2024

it seems the source of the problem is the Signature::user_defined, where the engine cannot do coercions, because function took responsibility for it. Two options are equivalent, but only "engine's responsibility" can provide consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants