-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Add array concatenation support to concat function #18137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add array concatenation support to concat function #18137
Conversation
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to ask this upfront, but how much of this code is LLM generated? Do you have a full understanding of what it does? I find a lot of this code quite baffling and not written in a Rust-like way.
For example in coerce_types, the comments are too verbose are state what is happening (a lot of the time providing no benefit as the code is straightforward enough in what it does) but there are no comments explaining why choices were made. There are also odd choices like defaulting to Int32 type if all inner list types are null.
Not to mention the CI checks aren't passing.
Thanks for the honest review, and sorry this should have been a Draft PR. I was trying out some ideas around concat and list coercion related to issue #18020 and I did use some AI help for boilerplate while experimenting, but I do understand the code and take responsibility for it. I agree the comments read like explanations of what rather than why, the Int32 fallback for all-null inner list types was a quick experiment. I will convert this to Draft now, remove the noisy and misleading comments (including the one that says it delegates to array_concat_inner), avoid duplicating coerce_types logic in return_type since inputs are already coerced, switch to ScalarFunctionArgs::number_rows instead of inferring num_rows, refactor toward idiomatic Rust, and then ask for another review once everything is cleaned up and passing. Thanks again for the direct feedback. |
0ccd138 to
05fe9fd
Compare
|
Hey @comphead , can you please review? |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still working through this PR to understand it entirely, but some initial thoughts:
- We should prefer adding the tests as SLTs and reserve Rust tests for when its difficult to do the test in SLTs
- Why are we removing details that was present in the existing code? I'm seeing comments be removed for no apparently reason, or simplified to lose details. Was this PR LLM-assisted? If so, to what degree?
| None => plan_err!( | ||
| "Concat function does not support scalar type {}", | ||
| scalar | ||
| )?, | ||
| None => { | ||
| // For non-string types, convert to string representation | ||
| if scalar.is_null() { | ||
| // Skip null values | ||
| } else { | ||
| result.push_str(&format!("{scalar}")); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't quite understand why this change was necessary?
| # test variable length arguments | ||
| query TTTBI rowsort | ||
| select specific_name, data_type, parameter_mode, is_variadic, rid from information_schema.parameters where specific_name = 'concat'; | ||
| ---- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should be fixed so it has an expected result, not just an empty return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still hasn't been addressed
|
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @EeshanBembi and @Jefffrey for review
I'll check it out during the weekend
Addresses all reviewer comments from PR apache#18137: - Use ScalarFunctionArgs.number_rows instead of inferring from arrays - Avoid scalar-to-array conversion in concat_arrays_single_row - Handle concat([null], [null]) properly - return empty array not error - Remove unused _num_rows parameter from build_list_array_result - Add validation for mixed List/String inputs in coerce_types - Restore original detailed comments that were removed - Restore original detailed error messages - Fix information_schema.slt test to have expected result
|
I was actually thinking we need to delegate the execution to array_concat if input is arrays rather than implementing it again |
@EeshanBembi did you look into the feasibility of this suggestion? |
fec4aae to
6194796
Compare
|
Apologies for the messy git history, I had some rebase issues when syncing with main. I've cleaned it up and now both concat functions delegate to the same shared implementation rather than duplicating the logic. |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies it took so long to review again.
I feel this is indeed the general direction we should take, especially since DuckDB seems to have this behaviour (concat works for strings and lists, and they also have a list_concat that only works for lists).
Given we're trying to reuse some code between the concat functions, most of my comments are around this organization. I feel theres quite a few unnecessary changes made (e.g. functions becoming pub); would recommend taking extra care in the refactors being made here.
| Expr::Literal(scalar_val, _) => { | ||
| // Convert non-string, non-array literals to their string representation | ||
| // Skip array literals - they should be handled at runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this handling code now? Wouldn't coerce_types ensure we have the right types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The literal handling is needed because simplification happens before coercion. I've added a comment to clarify:
// This is needed during simplification phase which happens before coercion
This ensures numeric literals like concat('hello', 42) work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly sure type coercion happens before simplification
e.g.
> explain verbose select 1;
+------------------------------------------------------------+--------------------------+
| plan_type | plan |
+------------------------------------------------------------+--------------------------+
| initial_logical_plan | Projection: Int64(1) |
| | EmptyRelation: rows=1 |
| logical_plan after resolve_grouping_function | SAME TEXT AS ABOVE |
| logical_plan after type_coercion | SAME TEXT AS ABOVE |
| analyzed_logical_plan | SAME TEXT AS ABOVE |
| logical_plan after optimize_unions | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after replace_distinct_aggregate | SAME TEXT AS ABOVE || # test variable length arguments | ||
| query TTTBI rowsort | ||
| select specific_name, data_type, parameter_mode, is_variadic, rid from information_schema.parameters where specific_name = 'concat'; | ||
| ---- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still hasn't been addressed
| # Test array concatenation with empty arrays - Arrow limitation with Null vs Int64 types | ||
| statement error Arrow error: Invalid argument error: It is not possible to concatenate arrays of different data types | ||
| SELECT concat([], [1, 2]); | ||
|
|
||
| statement error Arrow error: Invalid argument error: It is not possible to concatenate arrays of different data types | ||
| SELECT concat([1, 2], []); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a weird bug to have; perhaps can address in a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! I'll create a follow-up issue to track this separately rather than expanding the scope here.
|
Actually, the test is working correctly, concat with variadic signature doesn't expose parameters in information_schema, which is why the result is empty. This matches the expected behavior for variadic functions. The test query returns no rows as expected. |
- Move concat_arrays from datafusion-common to datafusion-functions/src/utils.rs - Remove unnecessary pub declarations and thin wrapper functions - Restore find_or_first logic and missing align_array_dimensions test - Add documentation for string type precedence and PostgreSQL compatibility - Use ColumnarValue::values_to_arrays() instead of manual conversion - Simplify return_type/invoke logic to only check first argument after coercion - Fix empty argument handling to require at least one argument Implements array concatenation: concat([1,2], [3,4]) → [1,2,3,4] Supports various data types and multi-dimensional arrays
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting a weird bug where this query seems to fail on this branch:
> select arrow_typeof(concat(a, b, c)) from values (arrow_cast('a', 'Utf8'), arrow_cast('b', 'Utf8View'), arrow_cast('c', 'LargeUtf8')) as t(a,b,c);
Optimizer rule 'simplify_expressions' failed
caused by
Error during planning: concat requires at least one argumentActually, the test is working correctly, concat with variadic signature doesn't expose parameters in information_schema, which is why the result is empty. This matches the expected behavior for variadic functions. The test query returns no rows as expected.
That was not my point. This test was based on the assumption that the function used to be variadic hence the test was constructed around this. This assumption is now invalid. Look at what the test comment says:
# test variable length arguments
This test was checking that a UDF with variadic signature returns certain output; now that concat is no longer variadic, it is confusing to simply hand wave the empty result away as correct behaviour when the point wasn't checkng that concat itself is variadic, but that variadic UDFs return certain information in the query.
| .collect(); | ||
| let arrays = arrays?; | ||
|
|
||
| // Check if all arrays are null - concat errors in this case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this matches PostgreSQL's behavior.
Based on what? Do you have an example query that shows this? Because I tested this against postgres 18 but cannot replicate it:
postgres=# select array_cat(null::integer[], null::integer[]);
array_cat
-----------
(1 row)
postgres=# select array_cat(null, null);
array_cat
-----------
(1 row)
postgres=# select concat(null, null);
concat
--------
(1 row)
postgres=# select concat(null::integer[], null::integer[]);
concat
--------
(1 row)| return plan_err!("concat requires at least one argument"); | ||
| } | ||
|
|
||
| // Convert ColumnarValue arguments to ArrayRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please remove these LLM comments that add no value.
| } | ||
|
|
||
| // After coercion, all arguments have the same type category, so check only the first | ||
| let is_array = match &args[0] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnarValue has a datatype method
https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html#method.data_type
| } | ||
|
|
||
| let data_types: Vec<DataType> = args.iter().map(|col| col.data_type()).collect(); | ||
| let return_datatype = self.get_string_type_precedence(&data_types); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can retrieve the return type from ScalarFunctionArgs
| None => plan_err!( | ||
| "Concat function does not support scalar type {}", | ||
| scalar | ||
| )?, | ||
| None => { | ||
| // For non-string types, convert to string representation | ||
| if scalar.is_null() { | ||
| // Skip null values | ||
| } else { | ||
| result.push_str(&format!("{scalar}")); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't quite understand why this change was necessary?
| Expr::Literal(scalar_val, _) => { | ||
| // Convert non-string, non-array literals to their string representation | ||
| // Skip array literals - they should be handled at runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly sure type coercion happens before simplification
e.g.
> explain verbose select 1;
+------------------------------------------------------------+--------------------------+
| plan_type | plan |
+------------------------------------------------------------+--------------------------+
| initial_logical_plan | Projection: Int64(1) |
| | EmptyRelation: rows=1 |
| logical_plan after resolve_grouping_function | SAME TEXT AS ABOVE |
| logical_plan after type_coercion | SAME TEXT AS ABOVE |
| analyzed_logical_plan | SAME TEXT AS ABOVE |
| logical_plan after optimize_unions | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after replace_distinct_aggregate | SAME TEXT AS ABOVE || } | ||
|
|
||
| #[test] | ||
| fn test_concat_with_integers() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how these tests are related to the original goal of adding array concat support to existing string concat?
| } | ||
|
|
||
| #[test] | ||
| fn test_array_concatenation_comprehensive() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move these tests to SLTs please
| return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(None))); | ||
| // First check if we're dealing with array types by delegating to ConcatFunc | ||
| let concat_func = ConcatFunc::new(); | ||
| let return_type = concat_func.return_type( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be getting this information from ScalarFunctionArgs
| )?; | ||
|
|
||
| // Return appropriate null value based on return type | ||
| return Ok(ColumnarValue::Scalar(match return_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this using ScalarValue::try_new_null
https://docs.rs/datafusion/latest/datafusion/common/enum.ScalarValue.html#method.try_new_null
Fixes #18020
Summary
Enables
concatfunction to concatenate arrays likearray_concatwhilepreserving all existing string concatenation behavior.
Before:
After:
Implementation
compute functions
Test Coverage
Approach Benefits
Function-level implementation vs planner replacement: