-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add NestedLoopJoinRel definition #561
feat: add NestedLoopJoinRel definition #561
Conversation
proto/substrait/algebra.proto
Outdated
@@ -433,6 +433,7 @@ message Rel { | |||
CrossRel cross = 12; | |||
//Physical relations | |||
HashJoinRel hash_join = 13; | |||
NestedLoopJoinRel nested_loop_join = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is the third defined join relation perhaps it should come after MergeJoinRel (here in Rel and as defined later)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I don't know why I put it in the middle..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
7b861d4
to
ae534eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to also add this to the markdown? It can go on the physical relations page next to the hash equijoin.
Rel left = 2; | ||
Rel right = 3; | ||
Expression expression = 4; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you add post_join_filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the existing NLJ definition, it doesn't require one. I think that makes sense because anything that goes in a post_join_filter could also just be included in the expression IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I believe you are correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this same point would apply to the logical operator as well 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, interesting... 😅
JOIN_TYPE_LEFT_ANTI = 7; | ||
JOIN_TYPE_RIGHT_ANTI = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I historically don't like anti joins because they are not well defined.
For example, a semi join can be implemented as:
for l_row in left:
for r_row in right:
result = expression(l_row, r_row):
if result == True:
yield l_row
However, when we look at the anti join:
for l_row in left:
for r_row in right:
result = expression(l_row, r_row)
if ???:
yield l_row
Should ???
be result == False
or should it be result != True
? Specifically the question is around how we handle the case where result
is null
.
Do you want to take a stab at either picking one of these and clearly documenting this or adding a new proto field allowing the user to choose between these two behaviors?
That being said, I won't hold this issue up over this, so if you'd rather punt on the question then that is fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very illustrative point, thank you. I am in favor of allowing the user to define NULL behavior for the anti-join. However, I would prefer to push this to a separate PR where JoinType is refactored out for all of NLJ/Hash/SM operators (I assume this should be applied to all three). I can create a separate issue for that if you think it's appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #563
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @westonpace! Thanks for the review 😁
The NLJ operator is already defined in the markdown: https://substrait.io/relations/physical_relations/#nlj-operator
Is there anything you'd change about it?
Rel left = 2; | ||
Rel right = 3; | ||
Expression expression = 4; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the existing NLJ definition, it doesn't require one. I think that makes sense because anything that goes in a post_join_filter could also just be included in the expression IIUC.
JOIN_TYPE_LEFT_ANTI = 7; | ||
JOIN_TYPE_RIGHT_ANTI = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very illustrative point, thank you. I am in favor of allowing the user to define NULL behavior for the anti-join. However, I would prefer to push this to a separate PR where JoinType is refactored out for all of NLJ/Hash/SM operators (I assume this should be applied to all three). I can create a separate issue for that if you think it's appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last suggestion and then I'm satisfied.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2e716db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looks good to me!
This PR has 3 SMC votes. Will merge it. |
This PR adds the NestedLoopJoinRel proto definition.
Closes #456