Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upper (and other string functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. #5471

Closed
alamb opened this issue Mar 3, 2023 · 20 comments · Fixed by #7262
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Mar 3, 2023

Describe the bug
Running upper(col) where col is a dictionary results in an internal error:

Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce
Requires the arrow_cast function in #5166

DataFusion CLI v19.0.0select upper('foo');
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO                |
+--------------------+
1 row in set. Query took 0.004 seconds.
❯ select upper(arrow_cast('foo', 'Dictionary(Int32, Utf8)'));
Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior
The test should pass and produce FOO

Additional context
Reported by @sanderson at influxdata/docs-v2#4773 (comment)

@tustvold
Copy link
Contributor

tustvold commented Mar 3, 2023

https://docs.rs/arrow-array/latest/arrow_array/array/struct.DictionaryArray.html#method.with_values might be useful here

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2023

I strongly suspect the bug/limitiation is in type coercion somewhere. Perhaps https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/function.rs#L55-L73

@alamb alamb self-assigned this Mar 9, 2023
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2023

I started looking into this -- in general the string expressions need some love. Roughly:

  • Improve the tests (so I have some place to add the reproducer for dictionaries as well as LargeStringArray)
  • Update the code to handle dictionaries,

@alamb
Copy link
Contributor Author

alamb commented Apr 27, 2023

length / character_length has the same issue:

❯ select length("labels.chart") from 'incoming_lines.count' limit 1;
Internal error: The "character_length" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

@alamb alamb changed the title upper (and maybe other functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. upper (and other dictionary functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. Apr 28, 2023
@alamb alamb changed the title upper (and other dictionary functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. upper (and other string functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. Apr 28, 2023
@alamb
Copy link
Contributor Author

alamb commented May 18, 2023

The code for these functions is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/string_expressions.rs

As an initial step, I recommend creating some sqllogictests reproducing the issue. sqllogictest is explained here: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests

Perhaps by extending what is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt

@alamb alamb removed their assignment May 19, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 27, 2023

BTW a workaround is to cast explicitly to varchar:

select upper(col::varchar)

@appletreeisyellow
Copy link
Contributor

I'm going to put up a PR soon

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Aug 10, 2023

These are all the functions that generate the same error as upper. I'll include them in tests

$ grep utf8_to_str_type

datafusion/expr/src/built_in_function.rs
600:                utf8_to_str_type(&input_expr_types[0], "btrim")
628:                utf8_to_str_type(&input_expr_types[0], "initcap")
630:            BuiltinScalarFunction::Left => utf8_to_str_type(&input_expr_types[0], "left"),
632:                utf8_to_str_type(&input_expr_types[0], "lower")
634:            BuiltinScalarFunction::Lpad => utf8_to_str_type(&input_expr_types[0], "lpad"),
636:                utf8_to_str_type(&input_expr_types[0], "ltrim")
638:            BuiltinScalarFunction::MD5 => utf8_to_str_type(&input_expr_types[0], "md5"),
651:                utf8_to_str_type(&input_expr_types[0], "regex_replace")
654:                utf8_to_str_type(&input_expr_types[0], "repeat")
657:                utf8_to_str_type(&input_expr_types[0], "replace")
660:                utf8_to_str_type(&input_expr_types[0], "reverse")
663:                utf8_to_str_type(&input_expr_types[0], "right")
665:            BuiltinScalarFunction::Rpad => utf8_to_str_type(&input_expr_types[0], "rpad"),
667:                utf8_to_str_type(&input_expr_types[0], "rtrimp")
711:                utf8_to_str_type(&input_expr_types[0], "split_part")
718:                utf8_to_str_type(&input_expr_types[0], "substr")
740:                utf8_to_str_type(&input_expr_types[0], "translate")
742:            BuiltinScalarFunction::Trim => utf8_to_str_type(&input_expr_types[0], "trim"),
744:                utf8_to_str_type(&input_expr_types[0], "upper")
$ grep utf8_to_int_type

datafusion/expr/src/built_in_function.rs
597:                utf8_to_int_type(&input_expr_types[0], "bit_length")
603:                utf8_to_int_type(&input_expr_types[0], "character_length")
645:                utf8_to_int_type(&input_expr_types[0], "octet_length")
715:                utf8_to_int_type(&input_expr_types[0], "strpos")

@tustvold
Copy link
Contributor

I presume this is what you intend to do, but I would recommend implementing this by evaluating the function on the dictionary values using the existing kernels, and then passing the result from this into the take kernel with the dictionary keys as the second argument. This should not only be less code, but will avoid a codegen explosion from parameterising generic code on both dictionary key types and value types

@alamb
Copy link
Contributor Author

alamb commented Aug 11, 2023

I just tested this out after #7262 was merged and the functions still don't work for arrays it seems:

create table foo(x varchar) as values ('foo'), ('bar');
create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
select upper(x) from foo_dict;

Results in an internal error:

DataFusion CLI v28.0.0
❯ create table foo(x varchar) as values ('foo'), ('bar');
create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
select upper(x) from foo_dict;

0 rows in set. Query took 0.034 seconds.

0 rows in set. Query took 0.039 seconds.

Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

So I think #7262 made things better but there is still something not working

@alamb alamb reopened this Aug 11, 2023
@alamb
Copy link
Contributor Author

alamb commented Aug 12, 2023

I presume this is what you intend to do, but I would recommend implementing this by evaluating the function on the dictionary values using the existing kernels, and then passing the result from this into the take kernel with the dictionary keys as the second argument.

@tustvold Since this is such a common operation (write a function that applies a function to a StringArray / LargeStringArray / DictionaryArray) I was wondering if we could make a function like https://docs.rs/arrow/latest/arrow/compute/fn.unary.html to do the dispatch and apply

Maybe something like

// apply f to each element of array, which writes the result to a temporary string
// resulting in the same type of array out
fn unary_str<F>(apply: &dyn Array, f: F) -> Result<ArrayRef> 
where:
  F: FnMut(&str) -> Cow<str>
{
...
}

Then writing upper would look like

unary_str(&input_array, |in| {
  // this cold be substantially fancier and avoid allocations, etc if the output 
  // was already upper case, etc 
  Cow::from(in.to_uppercase())
}

@tustvold
Copy link
Contributor

tustvold commented Aug 12, 2023

You could but it seems odd to me that you would special case strings, it feels like you should be able to wrap an arbitrary unary function and produce a new function that can handle dictionaries. This is a completely general transformation that need not know anything about the values or even the scalar function in question?

This would also give you a single place to choose between preserving the dictionary, i.e. using Dictionary::with_values or hydrating it using the take kernel.

Provided you are able to pass the input schema to create_physical_fun this should be relatively straightforward

@alamb
Copy link
Contributor Author

alamb commented Aug 12, 2023

You could but it seems odd to me that you would special case strings, it feels like you should be able to wrap an arbitrary unary function and produce a new function that can handle dictionaries.

What I am struggling with is that unary only works for ArrowPrimitiveTypes (aka not StringArray / LargeStringArray)

I do agree that under the covers the handling of Non DictionaryArray and DictionaryArray could be general

Maybe we could also add a function like the following to handle it 🤔

/// Applies an array --> array transform function 
/// that will apply the function to dictionary values as well
fn apply<F>(input: &dyn Array, f: F) -> Result<ArrayRef> 
where
  F: FnMut(&dyn Array) -> Result<ArrayRef> 
{
  // if input is dictionary, apply f to values()
  // otherwise, apply f to input
}

Maybe it doesn't even need to be generic so we could avoid additional code bloat 🤔 Box<dyn FnMut(dyn Array) -...

@tustvold
Copy link
Contributor

tustvold commented Aug 12, 2023

So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays. I'm not sure if this is currently specialized to the array type, but it definitely could be.

It should therefore be trivial to make this same function recurse for the dictionary values type, and then return a type-erased function that operates on dictionaries by using the type-erased values function and manipulating its output.

This would be simpler, require minimal additional codegen, and naturally generalise to all unary functions and argument types.

TLDR you shouldn't need to write any additional code specialized on anything other than the dictionary arrays themselves

@alamb
Copy link
Contributor Author

alamb commented Aug 13, 2023

So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays

Do you mean something in DataFusion?

Perhaps here: https://github.com/apache/arrow-datafusion/blob/00627785718d9d98998021bf44585f32c33af3ea/datafusion/physical-expr/src/functions.rs#L340

Are you suggesting we handle evaluating scalar functions on DictionaryArrays in the generic function evaluate function rather than doing something special for string functions?

@appletreeisyellow
Copy link
Contributor

I was trying to implement something you both described above, and before I add any code, I tried to run sqllogictests and datafusion-cli. Then I found something weird:

  • When I ran cargo test -p datafusion --test sqllogictests -- functions, all the tests passed
  • Then I ran cargo build. When I copied the queries from the test file to datafusion-cli, I got the same error message Internal error: The "upper" function can only accept strings..

https://github.com/apache/arrow-datafusion/blob/6ad79165f6554a66aa5ed4c5d432401c2c162f69/datafusion/core/tests/sqllogictests/test_files/functions.slt#L616

  • Why the same query passes in sqllogictests, but it gets an error in datafusion-cli? Am I missing anything?

@alamb
Copy link
Contributor Author

alamb commented Aug 16, 2023

Why the same query passes in sqllogictests, but it gets an error in datafusion-cli? Am I missing anything

Maybe you are still using an old version of datafusion-cli 🤔 What exact commands are you using to do "cargo build. When I copied the queries from the test file to datafusion-cli"

I think it should be something like this:

cd datafusion-cli
cargo build
./target/debug/datafusion-cli 

If you want to install datafusion-cli in your path you need to do something like

cd datafusion-cli
cargo install 
# now you can run datafusion-cli anywhere
datafusion-cli

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Aug 16, 2023

Ah! I just ran datafusion-cli directly without cd into the datafusion-cli dir and I didn't use ./target/debug/datafusion-cli, so I was using the old version of datafusion cli 🤦‍♀️

This is what I tired:

  1. Pull the latest arrow-datafusion repo
  2. Ran the following:
$ cd datafusion-cli
$ cargo build
$ ./target/debug/datafusion-cli 
DataFusion CLI v29.0.0
❯ SELECT upper('foo');
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO                |
+--------------------+
1 row in set. Query took 0.035 seconds.

❯ select upper(arrow_cast('foo', 'Dictionary(Int32, Utf8)'));
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO                |
+--------------------+
1 row in set. Query took 0.002 seconds.

❯ create table foo(x varchar) as values ('foo'), ('bar');
0 rows in set. Query took 0.016 seconds.

❯ create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
0 rows in set. Query took 0.017 seconds.

❯ select upper(x) from foo_dict;
+-------------------+
| upper(foo_dict.x) |
+-------------------+
| FOO               |
| BAR               |
+-------------------+
2 rows in set. Query took 0.004 seconds.

I cannot re-produce the error 🤔 Definitely DataFusion CLI v29.0.0 doesn't have the error, but some point in v28.0.0 the error still exist. I tested that the error exists in v28.0.0 before this PR, and no error afterwards

@alamb Can you verify for me?

@alamb
Copy link
Contributor Author

alamb commented Aug 16, 2023

🤔 I got the same results as you did @appletreeisyellow

(arrow_dev) alamb@MacBook-Pro-8:~/Downloads/arrow-datafusion/datafusion-cli$ ./target/debug/datafusion-cli
DataFusion CLI v29.0.0
❯ create table foo(x varchar) as values ('foo'), ('bar');
0 rows in set. Query took 0.006 seconds.

❯ create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
0 rows in set. Query took 0.005 seconds.

❯ select upper(x) from foo_dict;
+-------------------+
| upper(foo_dict.x) |
+-------------------+
| FOO               |
| BAR               |
+-------------------+
2 rows in set. Query took 0.004 seconds.

And in fact the plan looks reasonable (though not performant) by automatically casting the argument to a varchar

❯ explain select upper(x) from foo_dict;
+---------------+-----------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                          |
+---------------+-----------------------------------------------------------------------------------------------+
| logical_plan  | Projection: upper(CAST(foo_dict.x AS Utf8))                                                   |
|               |   TableScan: foo_dict projection=[x]                                                          |
| physical_plan | ProjectionExec: expr=[upper(CAST(x@0 AS Utf8)) as upper(foo_dict.x)]                          |
|               |   MemoryExec: partitions=16, partition_sizes=[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
|               |                                                                                               |
+---------------+-----------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

Thus I am not sure why it was failing in IOx in https://github.com/influxdata/influxdb_iox/pull/8479 🤔 I think more investigation is needed

@alamb
Copy link
Contributor Author

alamb commented Aug 17, 2023

This query now works in DataFusion so closing this ticket. Sorry for the confusion

@alamb alamb closed this as completed Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants