Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583

Open
viirya opened this issue May 20, 2024 · 1 comment · May be fixed by #10584
Open

DataFusion HashJoin LeftAnti doesn't support null aware anti join #10583

viirya opened this issue May 20, 2024 · 1 comment · May be fixed by #10584
Assignees
Labels
bug Something isn't working

Comments

@viirya
Copy link
Member

viirya commented May 20, 2024

Describe the bug

During working on apache/datafusion-comet#437, a few Spark join tests are failed when delegating to DataFusion HashJoin.

It is because that DataFusion HashJoin LeftAnti Join returns incorrect results when it is a null aware anti join.

To Reproduce

Added a test to join.slt:

statement ok
CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
(1, 1),
(2, 2),
(3, 3),
(4, null),
(null, 0);

query II
SELECT * FROM test_table t1 WHERE (c1 NOT IN (SELECT c2 FROM test_table)) = true
----
4 NULL
NULL 0

Expected behavior

Above query should return empty relation.

Additional context

No response

@viirya viirya added the bug Something isn't working label May 20, 2024
@viirya viirya linked a pull request May 20, 2024 that will close this issue
@viirya viirya self-assigned this May 20, 2024
@viirya viirya changed the title HashJoin LeftAnti Join handles nulls incorrectly DataFusion HashJoin LeftAnti doesn't support Null aware anti join May 21, 2024
@viirya viirya changed the title DataFusion HashJoin LeftAnti doesn't support Null aware anti join DataFusion HashJoin LeftAnti doesn't support null aware anti join May 21, 2024
@korowa
Copy link
Contributor

korowa commented Aug 10, 2024

I'll just add one more example here in order to show that the problem is related not only to non-correlated subqueries

CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
(1, 1),
(2, 2),
(3, 3),
(4, null),
(4, 7),
(null, 0);

SELECT * FROM test_table t1 WHERE c1 NOT IN (SELECT c2 FROM test_table t2 where t1.c1 = t2.c1);
+----+----+
| c1 | c2 |
+----+----+
| 4  |    |
| 4  | 7  |
|    | 0  |
+----+----+

while PG returns

CREATE TABLE
INSERT 0 6
 c1 | c2 
----+----
    |  0
(1 row)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants