-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
null safe equal support in joins #3069
Comments
Hi @advancedxy, thanks for making an issue here! @universalmind303 is building out SQL support and @kevinzwang has done lots of work on the semi joins. Would love to hear their thoughts |
What do you think about doing an outer join + filter instead? Would this work: SELECT DISTINCT a1, a2
FROM (t1 LEFT OUTER JOIN t2 ON a1 = b1 AND a2 = b2)
WHERE (a1 = b1 OR (a1 IS NULL and b1 IS NULL)) AND (a2 = b2 OR (a2 IS NULL and b2 IS NULL)) |
left outer join + filter might work for normal cases, but it doesn't handle null input on the left side correctly. Consider the following case: -- select * from t1
+----+
| a1|
+----+
| 1|
|NULL|
+----+
-- select * from t2
+---+
| b1|
+---+
| 1|
| 2|
+---+
-- the output of intersect should be
-- select a1 from t1 intersect select b1 from t2
+---+
| a1|
+---+
| 1|
+---+
-- the output of left outer + filter
-- select distinct t1.a1 from (t1 left outer join t2 on a1=b1) where (a1=b1 or (a1 is null and b1 is null)
+----+
| a1|
+----+
| 1|
|NULL|
+----+ Besides the correctness issue, the left outer join + filter might have performance issue considering:
And BTW, I think null safe equal join is applicable to all type of joins, not just for semi joins. It's just the intersect operator could be implemented by semi join with null safe equal. |
Thanks for pinging them and the background. I will reach to them instead when having SQL or join related issues/PRs. |
I did another digging on this: maybe we don't need to pass the SELECT DISTINCT a1, a2 FROM t1 LEFT SEMI join t2 on a1 <> b1 and a2 <> b2
-- is effectively the same as the following supposing a1/b1 is in string type, a2/ b2 is in int type
SELECT DISTINCT a1, a2 FROM t1 LEFT SEMI join t2 on nvl(a1, '') = nvl(b1, '') and is_null(a1) = is_null(b1) and nvl(a2, 0) = nvl(b2, 0) and is_null(a2) = is_null(b2) So I think we only need null safe equal parameter in the DataFrame and logical plan level, which would make the whole change less invasive. Anyway, I think I can add intersect operation first by writing null safe equal in an expanded manner. We can discuss a bit more here about how to support null safe equals in joins generally. |
I think a null safe equality does make sense. We could make it an expression as well, similar to pyspark's eqNullSafe. @advancedxy thanks for bringing this up, and feel free to take a stab at implementation yourself first! |
Good point. I may make a PR about intersect support first to demonstrate how null safe equal could be rewrote, then I can submit PRs about null safe equal in join and then the |
Thank you @advancedxy ! |
This commit consists of the following parts: 1. Make the logical plan join's on condition null safe aware 2. Support translating null safe equal joins into physical plans 3. Make sure the optimization rules: eliminate cross join and push down filter don't break 4. Modifications to physical join ops to support null safe equal: hash and broadcast joins are supported. SMJ is not supported yet. 5. Glue code in Python side to make the whole pipeline work 6. Some UTs in the python side Fixes: #3069 TODOs(in follow-up PRs): - [ ] rewrite null safe equal for SMJ so that SMJ could support null safe equals as well - [ ] SQL supports null safe equal, a.k.a SpaceShip(a<=>b) - [ ] Python's DataFrame API supports null safe equal join
Is your feature request related to a problem?
I'm planning to add
intersect
andintersect all
operator into Daft and get some feature parity with Spark. During the evaluation and code browsing, I noticed that Daft doesn't support null safe equal in joins, it would be great that we can support null safe equal in joins.P.S: an intersect operator could be rewritten as a semi join with null safe equal.
Describe the solution you'd like
is_equal
insrc/daft-table/src/ops/joins/hash_join.rs#hash_semi_anti_join
and other related code, makes the execution null safe awareDescribe alternatives you've considered
Another possible solution is to make join operators to relies on join conditions rather the left_on and right_on keys directly, in that way, null safe equal could be easily added as a new expression.
However, that would require massive code change in daft's current code base and I don't think it's worthy for just null safe equal support.
Additional Context
I can submit a PR to add null safe equal support if we all agree it's the right time to add.
Would you like to implement a fix?
No
The text was updated successfully, but these errors were encountered: