
Full outer join is not supported #202

Closed
wzheng opened this issue Apr 6, 2021 · 19 comments

Comments

@wzheng
Collaborator

wzheng commented Apr 6, 2021

Opaque currently implements multiple join types via broadcast nested loop join (#159), but FULL OUTER join is not yet supported.

@saharshagrawal
Contributor

For implementing FULL OUTER join via broadcast nested loop join, I am currently transforming "T1 full outer join T2" into "T1 left outer join T2 UNION T2 left anti-semi-join T1".

I'm in the process of modifying the implementation for the Join plan in strategies.scala (around line 143) that is currently used to match non-equi joins.

Does this sound like the correct approach? Thanks
@wzheng @octaviansima
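
[Editor's note] The rewrite proposed above can be sketched on toy in-memory tables. This is a hedged illustration of the relational identity only, not Opaque's planner or enclave code; the `Row`/`Table` types and the `fullOuter` function are invented for the example:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy row: (join key, payload). All names here are illustrative only.
struct Row { int key; std::string val; };
using Table = std::vector<Row>;
// One output row of the full outer join: either side may be absent (NULL).
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// T1 FULL OUTER JOIN T2 ==
//   (T1 LEFT OUTER JOIN T2)
//   UNION ALL
//   (T2 LEFT ANTI JOIN T1, null-padded on the T1 side)
std::vector<Joined> fullOuter(const Table& t1, const Table& t2) {
  std::vector<Joined> out;
  // Left outer join: every T1 row, paired with each matching T2 row or NULL.
  for (const Row& l : t1) {
    bool matched = false;
    for (const Row& r : t2) {
      if (l.key == r.key) { out.push_back({l, r}); matched = true; }
    }
    if (!matched) out.push_back({l, std::nullopt});
  }
  // Anti join: T2 rows with no match in T1, padded with NULLs on the left.
  for (const Row& r : t2) {
    bool matched = false;
    for (const Row& l : t1) {
      if (l.key == r.key) { matched = true; break; }
    }
    if (!matched) out.push_back({std::nullopt, r});
  }
  return out;
}
```

Note that the anti-join branch emits columns in (T1, T2) order here; if the union were expressed at the SQL level instead, the second branch would naturally produce (T2, T1) order, which is why a projection on top of the union is needed (as pointed out in the reply below).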

@wzheng
Collaborator Author

wzheng commented Apr 18, 2021

This is an interesting approach in terms of reusing the existing operators! However, the performance won't be good -- the first join will broadcast T1 to every partition, and the second join will broadcast T2 to every partition. If either of these tables is big, then the broadcast cost will be high. Also, the two joins + a union isn't enough -- you will need an additional projection on top of it.

What about supplementing the C++ side to include the FULL OUTER join functionality? Spark has a corresponding implementation that you can reference.

@wzheng
Collaborator Author

wzheng commented Apr 18, 2021

Yes, that should be the implementation.

@saharshagrawal
Contributor

I've implemented a preliminary version of FULL OUTER join using BNLJ and am now attempting to test my code, but I'm running into issues with the current test suite:

I tried changing ignore("full outer join") to test("full outer join") on line 175 here but I'm running into the following error message when running build/sbt test:

  • full outer join *** FAILED ***
    [info] org.apache.spark.sql.AnalysisException: cannot resolve 'left.N' given input columns: [L, L, N, N];

Any suggestions here? Also, is there any easier way to run just the Join suite test cases instead of having to run all of the tests? Thanks.

@octaviansima
Collaborator

@saharshagrawal you can run the specific test you want using build/sbt 'test:testOnly *SinglePartitionJoinSuite -- -t "full outer join"', or the entire join suite with build/sbt 'test:testOnly *SinglePartitionJoinSuite'. Looking into that error now -- it's likely an issue with the test since I'm getting the same with no implementation. Will ping back when it's ready.

@octaviansima
Collaborator

@saharshagrawal the fix should be in master

@saharshagrawal
Contributor

Thanks! I'll check it out!

@saharshagrawal
Contributor

Looks like the fix works. I'm now working on implementing full outer join with sort-merge join so I can test the equi-join test cases.

@wzheng
Collaborator Author

wzheng commented Apr 20, 2021

Yes, tackling sort merge join first is good!

@saharshagrawal
Contributor

The comment here states that there is a dummy row added with the desired schema:

  // A "dummy" row with the desired schema is added for each partition,
  // so last_foreign_row.get() is guaranteed to not be null.

However, this does not appear to be true. When running the full outer join test suite, the top-level while loop in the current SortMergeJoin implementation does not detect that any of the rows are dummy rows; current->is_dummy() returns false for all values of current considered. Is this an issue with the inputs being provided by the testing suite, or do I need to manually add in a dummy row somewhere in the Scala code before the join is invoked?

Without the dummy row, I cannot infer the schema of the secondary table in order to determine how many nulls to append to the row in the case of no match with the primary table group rows.

For now, I was able to work around this by assuming that the primary and secondary tables have the same schema (since that is what the test suite assumes), and I've gotten most of the full outer join test cases to pass (still debugging one of them). Once I have some clarification on this, I should be able to finalize the SortMergeJoin implementation. Thanks!
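
[Editor's note] As a point of reference while the dummy-row question above is resolved, the core of a full outer sort-merge join over two key-sorted inputs can be sketched as below. This is a simplified toy illustration, not the enclave SortMergeJoin code: it null-pads whichever side has no match rather than relying on a dummy row to carry the schema, and it assumes at most one row per key on each side (no group expansion):

```cpp
#include <optional>
#include <utility>
#include <vector>

struct Row { int key; int val; };  // toy row, illustrative only
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// Full outer sort-merge join: both inputs must be sorted by key.
// Simplification: at most one row per key on each side.
std::vector<Joined> sortMergeFullOuter(const std::vector<Row>& a,
                                       const std::vector<Row>& b) {
  std::vector<Joined> out;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i].key == b[j].key) {
      out.push_back({a[i++], b[j++]});            // keys match: emit both sides
    } else if (a[i].key < b[j].key) {
      out.push_back({a[i++], std::nullopt});      // left-only row, null-pad right
    } else {
      out.push_back({std::nullopt, b[j++]});      // right-only row, null-pad left
    }
  }
  while (i < a.size()) out.push_back({a[i++], std::nullopt});  // drain left
  while (j < b.size()) out.push_back({std::nullopt, b[j++]});  // drain right
  return out;
}
```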

@octaviansima
Collaborator

"Do I need to manually add in a dummy row somewhere in the Scala code before the join is invoked?"

Yes, please look at strategies.scala for code that enables adding dummy rows.

@saharshagrawal
Contributor

Got it, thanks

@saharshagrawal
Contributor

Made a PR: #215

@saharshagrawal
Contributor

I'm in the process of addressing some of @octaviansima's comments on the PR, but I also had some questions for the BNLJ implementation:

Previously @wzheng said:

What about supplementing the C++ side to include the FULL OUTER join functionality? Spark has a corresponding implementation that you can reference.

By this did you mean that I could exclusively make changes to the C++ code, without needing to add any heavy logic on the Scala side? If so, this seems difficult: the C++ BroadcastedNestedLoopJoin.cpp code has access to only a portion of the streamed table (since it is partitioned), so some higher-level coordination via the Scala code seems necessary, either to intelligently combine the full outer join outputs from each partition, or to keep track of a BitSet in the Scala code, following the Spark implementation.
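
[Editor's note] The BitSet idea described above can be sketched roughly as follows. This is a toy single-process illustration with invented names, not Spark's or Opaque's actual code; in the real setting each partition's bitset would be shuffled back and OR-ed together on the Scala side before the final pass:

```cpp
#include <optional>
#include <utility>
#include <vector>

struct Row { int key; int val; };  // toy row, illustrative only
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// One partition of the streamed side joins against the full broadcast side,
// recording which broadcast rows it matched in `matched`.
void joinPartition(const std::vector<Row>& streamedPart,
                   const std::vector<Row>& broadcast,
                   std::vector<bool>& matched,   // size == broadcast.size()
                   std::vector<Joined>& out) {
  for (const Row& s : streamedPart) {
    bool any = false;
    for (size_t i = 0; i < broadcast.size(); ++i) {
      if (s.key == broadcast[i].key) {
        out.push_back({s, broadcast[i]});
        matched[i] = true;
        any = true;
      }
    }
    // Streamed row with no broadcast match: null-pad the broadcast side.
    if (!any) out.push_back({s, std::nullopt});
  }
}

// After all partitions run, OR their bitsets and emit the broadcast rows
// that no partition matched, null-padded on the streamed side.
std::vector<Joined> finishFullOuter(const std::vector<std::vector<bool>>& bitsets,
                                    const std::vector<Row>& broadcast) {
  std::vector<Joined> out;
  for (size_t i = 0; i < broadcast.size(); ++i) {
    bool seen = false;
    for (const auto& bs : bitsets) seen = seen || bs[i];
    if (!seen) out.push_back({std::nullopt, broadcast[i]});
  }
  return out;
}
```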

@wzheng
Collaborator Author

wzheng commented Apr 22, 2021

Thanks for the PR! I didn't mean that only C++ code would be required; in this case it would also require some Scala code to shuffle the bit maps. I think supporting equi-join is good enough for now; we can support a more general full outer join later.

@saharshagrawal
Contributor

Okay, thanks for the clarification.

I pushed some more changes addressing the comments on the PR. Unfortunately, something went wrong with various file dependencies on the VM I was using, and I wasn't able to build and test the code with my changes before pushing (reinstalling Opaque also didn't help, so I'm not sure what happened). Hopefully it will still compile and pass the GitHub build, but I will troubleshoot later today.

@saharshagrawal
Contributor

#215 should be ready for review as of last Friday.

@octaviansima
Collaborator

Closed by #215
