upper (and other string functions) don't support String Dictionary types: Internal error: The "upper" function can only accept strings. #5471
Comments
I strongly suspect the bug/limitation is in type coercion somewhere, perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/function.rs#L55-L73 |
I started looking into this -- in general the string expressions need some love. Roughly:
|
length / character_length has the same issue:
|
The code for these functions is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/string_expressions.rs. As an initial step, I recommend creating some sqllogictests that reproduce the issue, perhaps by extending what is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt. sqllogictest is explained here: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests |
BTW a workaround is to cast explicitly to varchar: select upper(col::varchar) |
I'm going to put up a PR soon |
These are all the functions that generate the same error as
|
I presume this is what you intend to do, but I would recommend implementing this by evaluating the function on the dictionary values using the existing kernels, and then passing the result from this into the take kernel with the dictionary keys as the second argument. This should not only be less code, but will avoid a codegen explosion from parameterising generic code on both dictionary key types and value types |
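The approach suggested above (evaluate the function once per dictionary value, then expand the result through the keys with a take-style gather) can be sketched without any Arrow dependency. This is only an illustration: `DictColumn`, `take`, and `eval_on_dictionary` are hypothetical names standing in for the real arrow-rs dictionary array and take kernel, and plain `Vec`s replace Arrow buffers.

```rust
/// Toy stand-in for a dictionary-encoded string column (an assumption for
/// this sketch, not the arrow-rs `DictionaryArray`): `keys[i]` indexes into
/// `values`, and `None` marks a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

/// Toy equivalent of a take kernel: gather `values[k]` for every key,
/// propagating nulls.
fn take(values: &[String], keys: &[Option<usize>]) -> Vec<Option<String>> {
    keys.iter()
        .map(|k| k.map(|i| values[i].clone()))
        .collect()
}

/// Evaluate `f` once per distinct dictionary value, then expand the
/// transformed values through the keys.
fn eval_on_dictionary<F>(col: &DictColumn, f: F) -> Vec<Option<String>>
where
    F: Fn(&str) -> String,
{
    let transformed: Vec<String> = col.values.iter().map(|v| f(v.as_str())).collect();
    take(&transformed, &col.keys)
}
```

The function is called once per distinct value rather than once per row, and nothing here is generic over key or value types beyond the single closure parameter.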
I just tested this out after #7262 was merged and the functions still don't work for arrays it seems:
create table foo(x varchar) as values ('foo'), ('bar');
create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
select upper(x) from foo_dict;
Results in an internal error:
So I think #7262 made things better but there is still something not working |
@tustvold Since this is such a common operation (write a function that applies a function to a Maybe something like
// apply f to each element of array, which writes the result to a temporary string
// resulting in the same type of array out
fn unary_str<F>(array: &dyn Array, f: F) -> Result<ArrayRef>
where
    F: FnMut(&str) -> Cow<str>,
{
    ...
}
Then writing
unary_str(&input_array, |s| {
    // this could be substantially fancier and avoid allocations, etc. if the output
    // was already upper case, etc.
    Cow::from(s.to_uppercase())
}) |
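A self-contained sketch of the `unary_str` idea, under loud assumptions: `Vec<Option<String>>` is a hypothetical stand-in for an Arrow string array, and `upper` here is ASCII-only so that the "already upper case" check stays trivially correct. It is meant to show how `Cow` lets the callback skip an allocation, not how DataFusion implements `upper`.

```rust
use std::borrow::Cow;

/// Apply `f` to every non-null string and return a column of the same length.
/// `Vec<Option<String>>` is an assumed stand-in for an Arrow string array.
fn unary_str<F>(input: &[Option<String>], mut f: F) -> Vec<Option<String>>
where
    F: for<'a> FnMut(&'a str) -> Cow<'a, str>,
{
    input
        .iter()
        .map(|v| v.as_deref().map(|s| f(s).into_owned()))
        .collect()
}

/// ASCII-only upper-casing (an assumption made to keep the check simple).
/// Borrows the input when it contains no ASCII lowercase bytes, so the
/// callback itself allocates only when something actually changes.
fn upper(s: &str) -> Cow<'_, str> {
    if s.bytes().any(|b| b.is_ascii_lowercase()) {
        Cow::Owned(s.to_ascii_uppercase())
    } else {
        Cow::Borrowed(s)
    }
}
```

In this toy version `into_owned` still copies borrowed strings into the output vector; a real implementation would append the `&str` directly to a string builder instead.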
You could but it seems odd to me that you would special case strings, it feels like you should be able to wrap an arbitrary unary function and produce a new function that can handle dictionaries. This is a completely general transformation that need not know anything about the values or even the scalar function in question? This would also give you a single place to choose between preserving the dictionary, i.e. using Provided you are able to pass the input schema to create_physical_fun this should be relatively straightforward |
What I am struggling with is that I do agree that under the covers the handling of Non Maybe we could also add a function like the following to handle it 🤔
/// Applies an array --> array transform function
/// that will apply the function to dictionary values as well
fn apply<F>(input: &dyn Array, f: F) -> Result<ArrayRef>
where
    F: FnMut(&dyn Array) -> Result<ArrayRef>
{
    // if input is dictionary, apply f to values()
    // otherwise, apply f to input
}
Maybe it doesn't even need to be generic so we could avoid additional code bloat 🤔 |
So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays. I'm not sure if this is currently specialized to the array type, but it definitely could be. It should therefore be trivial to make this same function recurse for the dictionary values type, and then return a type-erased function that operates on dictionaries by using the type-erased values function and manipulating its output. This would be simpler, require minimal additional codegen, and naturally generalise to all unary functions and argument types. TLDR you shouldn't need to write any additional code specialized on anything other than the dictionary arrays themselves |
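One way to picture the recursion described above is a wrapper that takes a type-erased function written only for plain columns and returns a type-erased function that also accepts dictionaries. Everything in this sketch is hypothetical: `Column`, `ColumnFn`, `dictionary_aware`, and `upper_plain` are illustrative names, and the real DataFusion code operates on `ColumnarValue`, which is not modelled here.

```rust
/// Hypothetical, simplified column type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Utf8(Vec<Option<String>>),
    Dictionary {
        keys: Vec<Option<usize>>,
        values: Box<Column>,
    },
}

/// A type-erased unary function over columns.
type ColumnFn = Box<dyn Fn(&Column) -> Column>;

/// Wrap a function that only understands plain columns so that it also
/// accepts dictionaries.
fn dictionary_aware(inner: ColumnFn) -> ColumnFn {
    Box::new(move |col: &Column| apply(&inner, col))
}

/// Recurse into dictionary values and keep the keys; call `inner` directly
/// on anything that is not a dictionary.
fn apply(inner: &ColumnFn, col: &Column) -> Column {
    match col {
        Column::Dictionary { keys, values } => Column::Dictionary {
            keys: keys.clone(),
            values: Box::new(apply(inner, values)),
        },
        plain => inner(plain),
    }
}

/// Example inner function: upper-case a plain string column.
fn upper_plain() -> ColumnFn {
    Box::new(|col: &Column| match col {
        Column::Utf8(v) => {
            Column::Utf8(v.iter().map(|s| s.as_ref().map(|s| s.to_uppercase())).collect())
        }
        other => panic!("upper_plain only accepts plain string columns, got {:?}", other),
    })
}
```

This version preserves the dictionary encoding. The transformed values may then contain duplicates (for example if both "foo" and "FOO" were present), so a single wrapper like this is also a natural place to decide between keeping the dictionary and expanding it.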
Do you mean something in DataFusion? Are you suggesting we handle evaluating scalar functions on DictionaryArrays in the generic function evaluate function rather than doing something special for string functions? |
I was trying to implement something you both described above, and before I add any code, I tried to run sqllogictests and
|
Maybe you are still using an old version of I think it should be something like this:
cd datafusion-cli
cargo build
./target/debug/datafusion-cli
If you want to install
cd datafusion-cli
cargo install --path .
# now you can run datafusion-cli anywhere
datafusion-cli |
Ah! I just ran This is what I tried:
$ cd datafusion-cli
$ cargo build
$ ./target/debug/datafusion-cli
DataFusion CLI v29.0.0
❯ SELECT upper('foo');
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO |
+--------------------+
1 row in set. Query took 0.035 seconds.
❯ select upper(arrow_cast('foo', 'Dictionary(Int32, Utf8)'));
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO |
+--------------------+
1 row in set. Query took 0.002 seconds.
❯ create table foo(x varchar) as values ('foo'), ('bar');
0 rows in set. Query took 0.016 seconds.
❯ create table foo_dict as select arrow_cast(x, 'Dictionary(Int32, Utf8)') as x from foo;
0 rows in set. Query took 0.017 seconds.
❯ select upper(x) from foo_dict;
+-------------------+
| upper(foo_dict.x) |
+-------------------+
| FOO |
| BAR |
+-------------------+
2 rows in set. Query took 0.004 seconds.
I cannot reproduce the error 🤔 Definitely @alamb Can you verify for me? |
🤔 I got the same results as you did @appletreeisyellow
And in fact the plan looks reasonable (though not performant) by automatically casting the argument to a varchar
Thus I am not sure why it was failing in IOx in https://github.com/influxdata/influxdb_iox/pull/8479 🤔 I think more investigation is needed |
This query now works in DataFusion so closing this ticket. Sorry for the confusion |
Describe the bug
Running upper(col) where col is a dictionary results in an internal error:
To Reproduce
Requires the arrow_cast function in #5166
Expected behavior
The test should pass and produce
FOO
Additional context
Reported by @sanderson at influxdata/docs-v2#4773 (comment)