Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

null safe equal support in joins #3069

Closed
advancedxy opened this issue Oct 18, 2024 · 9 comments · Fixed by #3161
Closed

null safe equal support in joins #3069

advancedxy opened this issue Oct 18, 2024 · 9 comments · Fixed by #3161
Labels
enhancement New feature or request needs triage

Comments

@advancedxy
Copy link
Contributor

Is your feature request related to a problem?

I'm planning to add intersect and intersect all operator into Daft and get some feature parity with Spark. During the evaluation and code browsing, I noticed that Daft doesn't support null safe equal in joins, it would be great that we can support null safe equal in joins.

P.S: an intersect operator could be rewritten as a semi join with null safe equal.

SELECT a1, a2 FROM t1 INTERSECT SELECT b1, b2 FROM t2
-- is effective the same as
SELECT DISTINCT a1, a2 FROM t1 LEFT SEMI join t2 on a1 <> b1 and a2 <> b2

Describe the solution you'd like

  1. add one more parameter to indicate whether the equality condition are null safe or not in joins' DataFrame API and various join ops
  2. passing the null safe equality property down to the native execution level
  3. build is_equal in src/daft-table/src/ops/joins/hash_join.rs#hash_semi_anti_join and other related code, makes the execution null safe aware

Describe alternatives you've considered

Another possible solution is to make join operators to relies on join conditions rather the left_on and right_on keys directly, in that way, null safe equal could be easily added as a new expression.
However, that would require massive code change in daft's current code base and I don't think it's worthy for just null safe equal support.

Additional Context

I can submit a PR to add null safe equal support if we all agree it's the right time to add.

Would you like to implement a fix?

No

@advancedxy advancedxy added enhancement New feature or request needs triage labels Oct 18, 2024
@advancedxy
Copy link
Contributor Author

@jaychia @colin-ho it would be much appreciated if you can have some comments on this feature.

@samster25
Copy link
Member

Hi @advancedxy, thanks for making an issue here! @universalmind303 is building out SQL support and @kevinzwang has done lots of work on the semi joins. Would love to hear their thoughts

@kevinzwang
Copy link
Member

What do you think about doing an outer join + filter instead? Would this work:

SELECT DISTINCT a1, a2 
  FROM (t1 LEFT OUTER JOIN t2 ON a1 = b1 AND a2 = b2) 
  WHERE (a1 = b1 OR (a1 IS NULL and b1 IS NULL)) AND (a2 = b2 OR (a2 IS NULL and b2 IS NULL))

@advancedxy
Copy link
Contributor Author

What do you think about doing an outer join + filter instead? Would this work:

SELECT DISTINCT a1, a2 
  FROM (t1 LEFT OUTER JOIN t2 ON a1 = b1 AND a2 = b2) 
  WHERE (a1 = b1 OR (a1 IS NULL and b1 IS NULL)) AND (a2 = b2 OR (a2 IS NULL and b2 IS NULL))

left outer join + filter might work for normal cases, but it doesn't handle null input on the left side correctly. Consider the following case:

-- select * from t1
+----+
|  a1|
+----+
|   1|
|NULL|
+----+

-- select * from t2
+---+
| b1|
+---+
|  1|
|  2|
+---+

-- the output of intersect should be
-- select a1 from t1 intersect select b1 from t2
+---+
| a1|
+---+
|  1|
+---+

-- the output of left outer + filter
-- select distinct t1.a1 from (t1 left outer join t2 on a1=b1) where (a1=b1 or (a1 is null and b1 is null)
+----+
|  a1|
+----+
|   1|
|NULL|
+----+

Besides the correctness issue, the left outer join + filter might have performance issue considering:

  1. for large table/Dataframe that has thousands of columns, the filter expressions will be exploded, which might bring some evaluation overhead
  2. the number of output of left out join might be multiplied if there are many matches in the right, which brings more work to the distinct operator

And BTW, I think null safe equal join is applicable to all type of joins, not just for semi joins. It's just the intersect operator could be implemented by semi join with null safe equal.

@advancedxy
Copy link
Contributor Author

Hi @advancedxy, thanks for making an issue here! @universalmind303 is building out SQL support and @kevinzwang has done lots of work on the semi joins. Would love to hear their thoughts

Thanks for pinging them and the background. I will reach to them instead when having SQL or join related issues/PRs.

@advancedxy
Copy link
Contributor Author

What do you think about doing an outer join + filter instead? Would this work:

SELECT DISTINCT a1, a2 
  FROM (t1 LEFT OUTER JOIN t2 ON a1 = b1 AND a2 = b2) 
  WHERE (a1 = b1 OR (a1 IS NULL and b1 IS NULL)) AND (a2 = b2 OR (a2 IS NULL and b2 IS NULL))

I did another digging on this: maybe we don't need to pass the null_safe_equal parameter down to the physical operator side, the null equal safe a1 <> b1 is effective the same as nvl(a1, default_value) = nvl(b1, default_value) and isnull(a1) = isnull(b1) where the default value is the zero value for the corresponding data type.

SELECT DISTINCT a1, a2 FROM t1 LEFT SEMI join t2 on a1 <> b1 and a2 <> b2
-- is effectively the same as the following supposing a1/b1 is in string type, a2/ b2 is in int type
SELECT DISTINCT a1, a2 FROM t1 LEFT SEMI join t2 on nvl(a1, '') = nvl(b1, '') and is_null(a1) = is_null(b1) and nvl(a2, 0) = nvl(b2, 0) and is_null(a2) = is_null(b2)

So I think we only need null safe equal parameter in the DataFrame and logical plan level, which would make the whole change less invasive.

Anyway, I think I can add intersect operation first by writing null safe equal in an expanded manner. We can discuss a bit more here about how to support null safe equals in joins generally.

@kevinzwang
Copy link
Member

I think a null safe equality does make sense. We could make it an expression as well, similar to pyspark's eqNullSafe. @advancedxy thanks for bringing this up, and feel free to take a stab at implementation yourself first!

@advancedxy
Copy link
Contributor Author

similar to pyspark's eqNullSafe.

Good point. I may make a PR about intersect support first to demonstrate how null safe equal could be rewrote, then I can submit PRs about null safe equal in join and then the eqNullSafe expression.

@kevinzwang
Copy link
Member

Thank you @advancedxy !

universalmind303 pushed a commit that referenced this issue Nov 5, 2024
This commit consists of the following parts:
1. Make the logical plan join's on condition null safe aware
2. Support translating null safe equal joins into physical plans
3. Make sure the optimization rules: eliminate cross join and push down
filter don't break
4. Modifications to physical join ops to support null safe equal: hash
and broadcast joins are supported. SMJ is not supported yet.
5. Glue code in Python side to make the whole pipeline work
6. Some UTs in the python side

Fixes: #3069 

TODOs(in follow-up PRs):
- [ ] rewrite null safe equal for SMJ so that SMJ could support null
safe equals as well
- [ ] SQL supports null safe equal, a.k.a SpaceShip(a<=>b)
- [ ] Python's DataFrame API supports null safe equal join
universalmind303 pushed a commit that referenced this issue Nov 12, 2024
This commit leverages null safe equal support in joins(see #3069 and
#3161) to support intersect API.

Partially fixes #3122.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants