Skip to content

Conversation

@2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Oct 20, 2025

Which issue does this PR close?

Rationale for this change

See the above issue and its comment #18070 (comment)

What changes are included in this PR?

In nested loop join, when the join column includes List(Utf8View), use take() instead of to_array_of_size() to avoid deep copying the utf8 buffers inside Utf8View array.

This is the quick fix, avoiding deep copy inside to_array_of_size() is a bit tricky.
Here is ListArray's physical layout: https://arrow.apache.org/rust/arrow/array/struct.GenericListArray.html
If multiple elements is pointing to the same list range, the underlying payload can't be reused.So the potential fix in to_array_of_size can only avoids copying the inner-inner utf8view array buffers, but can't avoid copying the inner array (i.e. views are still copied), and deep copying for other primitive types also can't be avoided. Seems this can be better solved when ListView type is ready 🤔

Benchmark

I tried query 1 in #18070, but only used 3 randomly sampled places parquet file.

49.0.0: 4s
50.0.0: stuck > 1 minute
PR: 4s

Now the performance are similar, I suspect the most time is spend evaluating the expensive array_has so the optimization in #16996 can't help much.

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that with this fix the reproducer from @ianthetechie from this issue is fixed:

However, am not sure this code has test coverage. I checked via

nice cargo llvm-cov --html test --test sqllogictests
nice cargo llvm-cov --html test -p datafusion -p datafusion-physical-plan

And both show no coverage
Screenshot 2025-10-20 at 10 01 37 AM

I will look into adding some coverage

Now the performance are similar, I suspect the most time is spend evaluating the expensive array_has so the optimization in #16996 can't help much.

Yes, I looked at the array_has implementation and it is doing a lot of work. I will file a follow on ticket

Also, it seems to me that the fix / improvement for ScalarValue::to_array_of_size() is more general than just NLJ, so I will also file a ticket about that as well

scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
// Avoid using `ScalarValue::to_array_of_size()` for `List(Utf8View)` to avoid
// deep copies for buffers inside `Utf8View` array. See below for details.
// https://github.com/apache/datafusion/issues/18159
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1674 to +1683
DataType::List(field) | DataType::LargeList(field)
if field.data_type() == &DataType::Utf8View =>
{
let indices_iter = std::iter::repeat_n(
build_side_index as u64,
filtered_probe_batch.num_rows(),
);
let indices_array = UInt64Array::from_iter_values(indices_iter);
take(original_left_array.as_ref(), &indices_array, None)?
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach could be ported into ScalarValue::to_array_of_size itself rather than special cased here -- which would improve performance in potentially other places

fn list_to_array_of_size(arr: &dyn Array, size: usize) -> Result<ArrayRef> {
let arrays = repeat_n(arr, size).collect::<Vec<_>>();
let ret = match !arrays.is_empty() {
true => arrow::compute::concat(arrays.as_slice())?,
false => arr.slice(0, 0),
};
Ok(ret)
}

That being said, I think this is a nice point fix that we can safely backport to the datafusion 50 branch, so I think we should merge this PR / backport it and I will file a follow on PR to further improve the code

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

THANK YOU very much for this fix and diagnosis @2010YOUY01

@2010YOUY01
Copy link
Contributor Author

THANK YOU very much for this fix and diagnosis @2010YOUY01

Thank you for the review. The feedback makes sense to me, but I can only address it tomorrow. If you're waiting on this patch to be included in the release, feel free to push changes directly to the PR.

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

Thank you @2010YOUY01

I just pushed a test that adds coverage for this case

I verified it is covered using

$ cargo llvm-cov test --html --profile=ci --test sqllogictests -- join_lists

@@ -0,0 +1,63 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new test added here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01

@alamb alamb added this pull request to the merge queue Oct 20, 2025
Merged via the queue into apache:main with commit 7f75e58 Oct 20, 2025
32 checks passed
alamb added a commit to alamb/datafusion that referenced this pull request Oct 20, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#18070

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
See the above issue and its comment
apache#18070 (comment)

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
In nested loop join, when the join column includes `List(Utf8View)`, use
`take()` instead of `to_array_of_size()` to avoid deep copying the utf8
buffers inside `Utf8View` array.

This is the quick fix, avoiding deep copy inside `to_array_of_size()` is
a bit tricky.
Here is `ListArray`'s physical layout:
https://arrow.apache.org/rust/arrow/array/struct.GenericListArray.html
If multiple elements is pointing to the same list range, the underlying
payload can't be reused.So the potential fix in `to_array_of_size` can
only avoids copying the inner-inner utf8view array buffers, but can't
avoid copying the inner array (i.e. views are still copied), and deep
copying for other primitive types also can't be avoided. Seems this can
be better solved when `ListView` type is ready 🤔

### Benchmark
I tried query 1 in apache#18070,
but only used 3 randomly sampled `places` parquet file.

49.0.0: 4s
50.0.0: stuck > 1 minute
PR: 4s

Now the performance are similar, I suspect the most time is spend
evaluating the expensive `array_has` so the optimization in
apache#16996 can't help much.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->
Existing tests
## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
No
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

Filed a ticket to track making array_has faster:

alamb added a commit that referenced this pull request Oct 20, 2025
… (#18179)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123. -->

- Related to #18070
- Part of #18072

## Rationale for this change

Fix performance regression in Datafusion 50

## What changes are included in this PR?

Backport #18161 to `branch-50`

## Are these changes tested?

Yes
## Are there any user-facing changes?

Fix performance regression

Co-authored-by: Yongting You <2010youy01@gmail.com>
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#18070

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
See the above issue and its comment
apache#18070 (comment)

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
In nested loop join, when the join column includes `List(Utf8View)`, use
`take()` instead of `to_array_of_size()` to avoid deep copying the utf8
buffers inside `Utf8View` array.

This is the quick fix, avoiding deep copy inside `to_array_of_size()` is
a bit tricky.
Here is `ListArray`'s physical layout:
https://arrow.apache.org/rust/arrow/array/struct.GenericListArray.html
If multiple elements is pointing to the same list range, the underlying
payload can't be reused.So the potential fix in `to_array_of_size` can
only avoids copying the inner-inner utf8view array buffers, but can't
avoid copying the inner array (i.e. views are still copied), and deep
copying for other primitive types also can't be avoided. Seems this can
be better solved when `ListView` type is ready 🤔

### Benchmark
I tried query 1 in apache#18070,
but only used 3 randomly sampled `places` parquet file.

49.0.0: 4s
50.0.0: stuck > 1 minute
PR: 4s

Now the performance are similar, I suspect the most time is spend
evaluating the expensive `array_has` so the optimization in
apache#16996 can't help much.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->
Existing tests
## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
No
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set)

2 participants