Skip to content

Conversation

@kumarUjjawal
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Sort-merge joins assumed both inputs had the same width when applying filter results. With different column counts (like right-anti joins) we built batches that no longer matched the output schema, so Arrow panicked during concat_batches.

What changes are included in this PR?

  • Fix the null-row and filtered-row builders in SortMergeJoinStream to slice the streamed vs buffered columns using the correct widths, keeping every batch consistent with self.schema.
  • Merge the semi/anti projection branch so it always projects the streamed-side columns expected in the output.

Are these changes tested?

Manual testing and ran previous tests.

Screenshot 2025-11-18 at 5 04 20 PM

Are there any user-facing changes?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 18, 2025
@kumarUjjawal
Copy link
Contributor Author

hmm, why is the test failing?

@rluvaton
Copy link
Member

Looks like this include fixes created by, let's wait for it to be merged

Also, once merged can you update the fuzz tests for join:
similar to what he added here but add the for loop to the other tests as well:
test_right_semi_join_1k_filtered, test_right_anti_join_1k_filtered, test_right_anti_join_1k_binary, test_right_anti_join_1k_binary_filtered
https://github.com/apache/datafusion/pull/18772/files#diff-f8330244c576c142337679ecbeeb99e9016c2c7c54d7bae492144374e138a0b7

@rluvaton
Copy link
Member

also, when fixing a bug, we require tests to be added that failed on main and pass on this pr

@kumarUjjawal
Copy link
Contributor Author

also, when fixing a bug, we require tests to be added that failed on main and pass on this pr

Sure will do.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 19, 2025
.columns()
.iter()
.skip(right_columns_length)
.take(left_columns_length)
Copy link
Member

@rluvaton rluvaton Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add it as we can lose columns or hide bugs if we the rest of the columns are not left

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, should I Drop the .take(left_columns_length) and add a sanity check like assert_eq_or_internal_err!(..., left_columns_length, "...")) before extending right_columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@rluvaton rluvaton added this pull request to the merge queue Dec 4, 2025
Merged via the queue into apache:main with commit 434a23b Dec 4, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Panic when having right anti join with filter in sort_merge_join and different number of columns between each side

2 participants