Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add NestedLoopJoinRel definition #561

Merged
merged 5 commits into from
Oct 20, 2023

Conversation

danepitkin
Copy link
Contributor

@danepitkin danepitkin commented Oct 2, 2023

This PR adds the NestedLoopJoinRel proto definition.

Closes #456

@@ -433,6 +433,7 @@ message Rel {
CrossRel cross = 12;
//Physical relations
HashJoinRel hash_join = 13;
NestedLoopJoinRel nested_loop_join = 18;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is the third defined join relation perhaps it should come after MergeJoinRel (here in Rel and as defined later)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I don't know why I put it in the middle..

EpsilonPrime
EpsilonPrime previously approved these changes Oct 11, 2023
Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to also add this to the markdown? It can go on the physical relations page next to the hash equijoin.

Rel left = 2;
Rel right = 3;
Expression expression = 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you add post_join_filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the existing NLJ definition, it doesn't require one. I think that makes sense because anything that goes in a post_join_filter could also just be included in the expression IIUC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I believe you are correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this same point would apply to the logical operator as well 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, interesting... 😅

Comment on lines +627 to +628
JOIN_TYPE_LEFT_ANTI = 7;
JOIN_TYPE_RIGHT_ANTI = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I historically don't like anti joins because they are not well defined.

For example, a semi join can be implemented as:

for l_row in left:
  for r_row in right:
    result = expression(l_row, r_row):
    if result == True:
      yield l_row

However, when we look at the anti join:

for l_row in left:
  for r_row in right:
    result = expression(l_row, r_row)
    if ???:
      yield l_row

Should ??? be result == False or should it be result != True? Specifically the question is around how we handle the case where result is null.

Do you want to take a stab at either picking one of these and clearly documenting this or adding a new proto field allowing the user to choose between these two behaviors?

That being said, I won't hold this issue up over this, so if you'd rather punt on the question then that is fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very illustrative point, thank you. I am in favor of allowing the user to define NULL behavior for the anti-join. However, I would prefer to push this to a separate PR where JoinType is refactored out for all of NLJ/Hash/SM operators (I assume this should be applied to all three). I can create a separate issue for that if you think it's appropriate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #563

Copy link
Contributor Author

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @westonpace! Thanks for the review 😁

The NLJ operator is already defined in the markdown: https://substrait.io/relations/physical_relations/#nlj-operator

Is there anything you'd change about it?

Rel left = 2;
Rel right = 3;
Expression expression = 4;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the existing NLJ definition, it doesn't require one. I think that makes sense because anything that goes in a post_join_filter could also just be included in the expression IIUC.

Comment on lines +627 to +628
JOIN_TYPE_LEFT_ANTI = 7;
JOIN_TYPE_RIGHT_ANTI = 8;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very illustrative point, thank you. I am in favor of allowing the user to define NULL behavior for the anti-join. However, I would prefer to push this to a separate PR where JoinType is refactored out for all of NLJ/Hash/SM operators (I assume this should be applied to all three). I can create a separate issue for that if you think it's appropriate?

westonpace
westonpace previously approved these changes Oct 13, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last suggestion and then I'm satisfied.

proto/substrait/algebra.proto Show resolved Hide resolved
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@danepitkin danepitkin dismissed stale reviews from westonpace and EpsilonPrime via 2e716db October 13, 2023 18:02
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks good to me!

@vbarua
Copy link
Member

vbarua commented Oct 20, 2023

This PR has 3 SMC votes. Will merge it.

@vbarua vbarua merged commit cf32750 into substrait-io:main Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a proto definition for nested loop join
4 participants