Full outer join is not supported #202
For implementing FULL OUTER join via broadcast nested loop join, I am currently transforming "T1 full outer join T2" into "T1 left outer join T2 UNION T2 left anti-semi-join T1". I'm in the process of modifying the implementation for the Join plan in strategies.scala (around line 143) that is currently used to match non-equi joins. Does this sound like the correct approach? Thanks
This is an interesting approach in terms of reusing the existing operators! However, the performance won't be good: the first join will broadcast T1 to every partition, and the second join will broadcast T2 to every partition. If either of these tables is big, then the broadcast cost will be high. Also, the two joins plus a union aren't enough; you will need an additional projection on top of them. What about supplementing the C++ side to include the FULL OUTER join functionality? Spark has a corresponding implementation that you can reference.
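For reference, the rewrite discussed above can be sketched in plain C++17 over in-memory tables. This is only an illustration of the relational identity, not Opaque's operator code: `Row`, `Joined`, `left_outer`, `left_anti`, and `full_outer` are hypothetical names, rows are reduced to a single `int`, and the union is assumed to keep duplicates (UNION ALL). The last loop is the extra projection: the anti-join yields only T2 columns, so the T1 side has to be padded with NULL before the two results can be combined.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical in-memory types; real Opaque rows are encrypted buffers.
using Row = int;
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;
using Pred = std::function<bool(Row, Row)>;

// T1 LEFT OUTER JOIN T2: every T1 row appears at least once.
std::vector<Joined> left_outer(const std::vector<Row>& t1,
                               const std::vector<Row>& t2,
                               const Pred& cond) {
    std::vector<Joined> out;
    for (Row l : t1) {
        bool matched = false;
        for (Row r : t2) {
            if (cond(l, r)) {
                out.emplace_back(l, r);
                matched = true;
            }
        }
        if (!matched) out.emplace_back(l, std::nullopt);
    }
    return out;
}

// T2 LEFT ANTI JOIN T1: T2 rows that have no partner in T1.
std::vector<Row> left_anti(const std::vector<Row>& t2,
                           const std::vector<Row>& t1,
                           const Pred& cond) {
    std::vector<Row> out;
    for (Row r : t2) {
        bool matched = false;
        for (Row l : t1) {
            if (cond(l, r)) { matched = true; break; }
        }
        if (!matched) out.push_back(r);
    }
    return out;
}

// Full outer join as (left outer) UNION ALL (projected anti-join).
std::vector<Joined> full_outer(const std::vector<Row>& t1,
                               const std::vector<Row>& t2,
                               const Pred& cond) {
    std::vector<Joined> out = left_outer(t1, t2, cond);
    // Projection: pad the missing T1 side with NULL so the schemas line up.
    for (Row r : left_anti(t2, t1, cond)) out.emplace_back(std::nullopt, r);
    return out;
}
```

Both helper joins scan the other table in full, which mirrors the cost concern above: each side ends up being broadcast once.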
Are you referring to the code starting on line 286 here? https://github.com/apache/spark/blob/12abfe79173f6ab00b3341f3b31cad5aa26aa6e4/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L286
Yes, that should be the implementation.
I've implemented a preliminary version of FullOuter Join using BNLJ and I am now attempting to test my code but I'm facing issues with the current test suite: I tried changing
Any suggestions here? Also, is there an easier way to run just the Join suite test cases instead of having to run all of the tests? Thanks.
@saharshagrawal you can run the specific test you want using
@saharshagrawal the fix should be in master
Thanks! I'll check it out!
Looks like the fix works; I'm working on implementing the full outer join with the Sort Merge Join now to test the equi-join test cases.
Yes, tackling sort merge join first is good!
The comment here states that there is a dummy row added with the desired schema:
However, this does not appear to be true. When running the full outer join test suite, the top-level while loop in the current SortMergeJoin implementation does not detect any of the rows as dummy rows. Without the dummy row, I cannot infer the schema of the secondary table, which I need in order to determine how many nulls to append to a row when it has no match among the primary table group rows. For now, I worked around this by assuming that the primary and secondary tables have the same schema (since that is what the test suite assumes), and I've gotten most of the full outer join test cases to pass (still debugging one of them). Once I have some clarification on this, I should be able to finalize the SortMergeJoin implementation. Thanks!
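To make the null-padding problem concrete, here is a minimal C++17 sketch of a sort-merge full outer join over in-memory rows. It is not Opaque's SortMergeJoin code: `Value`, `Row`, `concat`, `nulls`, and `sort_merge_full_outer` are hypothetical names, the join key is assumed to be a non-null `int` in column 0, and both inputs are assumed to be already sorted by that key. The column counts are passed in explicitly, which is the piece of information a dummy row carrying the schema would otherwise supply.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using Value = std::optional<int>;
using Row = std::vector<Value>;  // column 0 is the join key (assumed non-null)

// Concatenate the columns of two rows.
Row concat(const Row& a, const Row& b) {
    Row r(a);
    r.insert(r.end(), b.begin(), b.end());
    return r;
}

// A row of n NULL columns, used to pad the side that has no match.
Row nulls(std::size_t n) { return Row(n, Value{}); }

// Inputs must be sorted ascending by key. Widths are explicit because an
// unmatched row gives no way to infer the other table's schema.
std::vector<Row> sort_merge_full_outer(const std::vector<Row>& left,
                                       std::size_t left_width,
                                       const std::vector<Row>& right,
                                       std::size_t right_width) {
    std::vector<Row> out;
    std::size_t i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
        int lk = *left[i][0];
        int rk = *right[j][0];
        if (lk < rk) {
            out.push_back(concat(left[i++], nulls(right_width)));
        } else if (rk < lk) {
            out.push_back(concat(nulls(left_width), right[j++]));
        } else {
            // Equal keys: emit the cross product of the two key groups.
            std::size_t i_end = i, j_end = j;
            while (i_end < left.size() && *left[i_end][0] == lk) ++i_end;
            while (j_end < right.size() && *right[j_end][0] == lk) ++j_end;
            for (std::size_t a = i; a < i_end; ++a)
                for (std::size_t b = j; b < j_end; ++b)
                    out.push_back(concat(left[a], right[b]));
            i = i_end;
            j = j_end;
        }
    }
    // Leftovers on either side have no partner and are padded with NULLs.
    for (; i < left.size(); ++i) out.push_back(concat(left[i], nulls(right_width)));
    for (; j < right.size(); ++j) out.push_back(concat(nulls(left_width), right[j]));
    return out;
}
```

With tables of different widths, assuming both schemas are identical would pad with the wrong number of nulls, which is why the workaround only holds for the current test suite.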
"Do I need to manually add in a dummy row somewhere in the Scala code before the join is invoked?" Yes, please look at strategies.scala for code that enables adding dummy rows.
Got it, thanks
#215 Made a PR
I'm in the process of addressing some of @octaviansima's comments on the PR, but I also had some questions for the BNLJ implementation: Previously @wzheng said:
By this, did you mean that I could make changes exclusively to the C++ code without also adding any heavy logic on the Scala side? If so, this seems difficult, since I think that the C++ BroadcastedNestedLoopJoin.cpp code has access to only a portion of the streamed table (since it is partitioned). Some higher-level coordination via the Scala code would therefore seem necessary, either to somehow intelligently combine the full outer join outputs from each partition or to keep track of a BitSet in the Scala code, following the Spark implementation.
Thanks for the PR! I didn't mean that only C++ code would be required; in this case it would also require some Scala code to shuffle the bit maps. I think supporting equi-join is good enough for now, and we can support a more general full outer join later.
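The bitmap coordination mentioned above could look roughly like the following C++17 sketch. It is a simplified, single-process model, not Opaque's or Spark's actual code: `join_partition` and `full_outer_bnlj` are hypothetical names, rows are plain `int`s, and the "shuffle" is modeled as a loop that ORs the per-partition bitmaps together. Each partition sees only its slice of the streamed table, so it can decide which streamed rows are unmatched but not which broadcast rows are; that decision needs the merged bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical in-memory types; real Opaque rows are encrypted buffers.
using Row = int;
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;
using Pred = std::function<bool(Row, Row)>;

// Per-partition step: join one streamed partition against the full
// broadcast table and record which broadcast rows found a match.
std::vector<Joined> join_partition(const std::vector<Row>& streamed_part,
                                   const std::vector<Row>& broadcast,
                                   const Pred& cond,
                                   std::vector<bool>& matched_broadcast) {
    matched_broadcast.assign(broadcast.size(), false);
    std::vector<Joined> out;
    for (Row s : streamed_part) {
        bool found = false;
        for (std::size_t i = 0; i < broadcast.size(); ++i) {
            if (cond(s, broadcast[i])) {
                out.emplace_back(s, broadcast[i]);
                matched_broadcast[i] = true;
                found = true;
            }
        }
        if (!found) out.emplace_back(s, std::nullopt);
    }
    return out;
}

// Coordination step: OR the per-partition bitmaps, then emit each
// broadcast row that matched in no partition exactly once.
std::vector<Joined> full_outer_bnlj(const std::vector<std::vector<Row>>& partitions,
                                    const std::vector<Row>& broadcast,
                                    const Pred& cond) {
    std::vector<Joined> out;
    std::vector<bool> merged(broadcast.size(), false);
    for (const auto& part : partitions) {
        std::vector<bool> local;
        std::vector<Joined> rows = join_partition(part, broadcast, cond, local);
        out.insert(out.end(), rows.begin(), rows.end());
        for (std::size_t i = 0; i < local.size(); ++i)
            if (local[i]) merged[i] = true;
    }
    for (std::size_t i = 0; i < broadcast.size(); ++i)
        if (!merged[i]) out.emplace_back(std::nullopt, broadcast[i]);
    return out;
}
```

In this model the first function is the part that could live entirely on the C++ side, while the merge and final emission correspond to the coordination that would need Scala support.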
Okay, thanks for the clarification. I pushed some more changes addressing the comments on the PR. Unfortunately, something went wrong with various file dependencies on the VM I was using, and I wasn't able to build and test the code with my changes before pushing (reinstalling Opaque also didn't help, so I'm not sure what happened). Hopefully it will still compile and pass the GitHub build, but I will troubleshoot later today.
#215 should be ready for review as of last Friday
Closed by #215
Opaque currently implements multiple join types via broadcast nested loop join (#159), but FULL OUTER join is not yet supported.