
Full outer join is not supported #202

Closed
wzheng opened this issue Apr 6, 2021 · 19 comments

Comments

@wzheng
Collaborator

wzheng commented Apr 6, 2021

Opaque currently implements multiple join types via broadcast nested loop join (#159), but FULL OUTER join is not yet supported.

@saharshagrawal
Contributor

For implementing FULL OUTER join via broadcast nested loop join, I am currently transforming "T1 full outer join T2" into "T1 left outer join T2 UNION T2 left anti-semi-join T1".

I'm in the process of modifying the implementation for the Join plan in strategies.scala (around line 143) that is currently used to match non-equi joins.

Does this sound like the correct approach? Thanks
@wzheng @octaviansima
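
[Editor's note] The rewrite proposed above can be sketched on toy in-memory tables. This is a hedged illustration of the relational identity only, not Opaque's planner or enclave code; the `Row`/`Table` types and the `fullOuter` function are invented for the example:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy row: (join key, payload). All names here are illustrative only.
struct Row { int key; std::string val; };
using Table = std::vector<Row>;
// One output row of the full outer join: either side may be absent (NULL).
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// T1 FULL OUTER JOIN T2 ==
//   (T1 LEFT OUTER JOIN T2)
//   UNION ALL
//   (T2 LEFT ANTI JOIN T1, null-padded on the T1 side)
std::vector<Joined> fullOuter(const Table& t1, const Table& t2) {
  std::vector<Joined> out;
  // Left outer join: every T1 row, paired with each matching T2 row or NULL.
  for (const Row& l : t1) {
    bool matched = false;
    for (const Row& r : t2) {
      if (l.key == r.key) { out.push_back({l, r}); matched = true; }
    }
    if (!matched) out.push_back({l, std::nullopt});
  }
  // Anti join: T2 rows with no match in T1, padded with NULLs on the left.
  for (const Row& r : t2) {
    bool matched = false;
    for (const Row& l : t1) {
      if (l.key == r.key) { matched = true; break; }
    }
    if (!matched) out.push_back({std::nullopt, r});
  }
  return out;
}
```

Note that the anti-join branch emits columns in (T1, T2) order here; if the union were expressed at the SQL level instead, the second branch would naturally produce (T2, T1) order, which is why a projection on top of the union is needed (as pointed out in the reply below).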

@wzheng
Collaborator Author

wzheng commented Apr 18, 2021

This is an interesting approach in terms of reusing the existing operators! However, the performance won't be good -- the first join will broadcast T1 to every partition, and the second join will broadcast T2 to every partition. If either of these tables is big, then the broadcast cost will be high. Also, the two joins + a union isn't enough -- you will need an additional projection on top of it.

What about supplementing the C++ side to include the FULL OUTER join functionality? Spark has a corresponding implementation that you can reference.

@wzheng
Collaborator Author

wzheng commented Apr 18, 2021

Yes, that should be the implementation.

@saharshagrawal
Contributor

I've implemented a preliminary version of FULL OUTER join using BNLJ and am now attempting to test my code, but I'm running into issues with the current test suite:

I tried changing ignore("full outer join") to test("full outer join") on line 175 here but I'm running into the following error message when running build/sbt test:

  • full outer join *** FAILED ***
    [info] org.apache.spark.sql.AnalysisException: cannot resolve 'left.N' given input columns: [L, L, N, N];

Any suggestions here? Also, is there any easier way to run just the Join suite test cases instead of having to run all of the tests? Thanks.

@octaviansima
Collaborator

@saharshagrawal you can run the specific test you want using build/sbt 'test:testOnly *SinglePartitionJoinSuite -- -t "full outer join"', or the entire join suite with build/sbt 'test:testOnly *SinglePartitionJoinSuite'. Looking into that error now -- it's likely an issue with the test since I'm getting the same with no implementation. Will ping back when it's ready.

@octaviansima
Collaborator

@saharshagrawal the fix should be in master

@saharshagrawal
Contributor

Thanks! I'll check it out!

@saharshagrawal
Contributor

Looks like the fix works. I'm now working on implementing full outer join with sort-merge join so I can test the equi-join test cases.

@wzheng
Collaborator Author

wzheng commented Apr 20, 2021

Yes, tackling sort merge join first is good!

@saharshagrawal
Contributor

The comment here states that there is a dummy row added with the desired schema:

  // A "dummy" row with the desired schema is added for each partition,
  // so last_foreign_row.get() is guaranteed to not be null.

However, this does not appear to be true. When running the full outer join test suite, the top-level while loop in the current SortMergeJoin implementation does not detect that any of the rows are dummy rows; current->is_dummy() returns false for all values of current considered. Is this an issue with the inputs being provided by the testing suite, or do I need to manually add in a dummy row somewhere in the Scala code before the join is invoked?

Without the dummy row, I cannot infer the schema of the secondary table in order to determine how many nulls to append to the row in the case of no match with the primary table group rows.

For now, I was able to work around this by assuming that the primary and secondary tables have the same schema (since that is what the test suite assumes), and I've gotten most of the full outer join test cases to pass (still debugging one of them). Once I have some clarification on this, I should be able to finalize the SortMergeJoin implementation. Thanks!
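
[Editor's note] As a point of reference while the dummy-row question above is resolved, the core of a full outer sort-merge join over two key-sorted inputs can be sketched as below. This is a simplified toy illustration, not the enclave SortMergeJoin code: it null-pads whichever side has no match rather than relying on a dummy row to carry the schema, and it assumes at most one row per key on each side (no group expansion):

```cpp
#include <optional>
#include <utility>
#include <vector>

struct Row { int key; int val; };  // toy row, illustrative only
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// Full outer sort-merge join: both inputs must be sorted by key.
// Simplification: at most one row per key on each side.
std::vector<Joined> sortMergeFullOuter(const std::vector<Row>& a,
                                       const std::vector<Row>& b) {
  std::vector<Joined> out;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i].key == b[j].key) {
      out.push_back({a[i++], b[j++]});            // keys match: emit both sides
    } else if (a[i].key < b[j].key) {
      out.push_back({a[i++], std::nullopt});      // left-only row, null-pad right
    } else {
      out.push_back({std::nullopt, b[j++]});      // right-only row, null-pad left
    }
  }
  while (i < a.size()) out.push_back({a[i++], std::nullopt});  // drain left
  while (j < b.size()) out.push_back({std::nullopt, b[j++]});  // drain right
  return out;
}
```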

@octaviansima
Collaborator

"Do I need to manually add in a dummy row somewhere in the Scala code before the join is invoked?"

Yes, please look at strategies.scala for code that enables adding dummy rows.

@saharshagrawal
Contributor

Got it, thanks

@saharshagrawal
Contributor

Made a PR: #215

@saharshagrawal
Contributor

I'm in the process of addressing some of @octaviansima's comments on the PR, but I also had some questions for the BNLJ implementation:

Previously @wzheng said:

What about supplementing the C++ side to include the FULL OUTER join functionality? Spark has a corresponding implementation that you can reference.

By this did you mean that I could exclusively make changes to the C++ code, without needing to add any heavy logic on the Scala side? If so, this seems difficult: the C++ BroadcastedNestedLoopJoin.cpp code has access to only a portion of the streamed table (since it is partitioned), so some higher-level coordination via the Scala code seems necessary, either to intelligently combine the full outer join outputs from each partition, or to keep track of a BitSet in the Scala code, following the Spark implementation.
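
[Editor's note] The BitSet idea described above can be sketched roughly as follows. This is a toy single-process illustration with invented names, not Spark's or Opaque's actual code; in the real setting each partition's bitset would be shuffled back and OR-ed together on the Scala side before the final pass:

```cpp
#include <optional>
#include <utility>
#include <vector>

struct Row { int key; int val; };  // toy row, illustrative only
using Joined = std::pair<std::optional<Row>, std::optional<Row>>;

// One partition of the streamed side joins against the full broadcast side,
// recording which broadcast rows it matched in `matched`.
void joinPartition(const std::vector<Row>& streamedPart,
                   const std::vector<Row>& broadcast,
                   std::vector<bool>& matched,   // size == broadcast.size()
                   std::vector<Joined>& out) {
  for (const Row& s : streamedPart) {
    bool any = false;
    for (size_t i = 0; i < broadcast.size(); ++i) {
      if (s.key == broadcast[i].key) {
        out.push_back({s, broadcast[i]});
        matched[i] = true;
        any = true;
      }
    }
    // Streamed row with no broadcast match: null-pad the broadcast side.
    if (!any) out.push_back({s, std::nullopt});
  }
}

// After all partitions run, OR their bitsets and emit the broadcast rows
// that no partition matched, null-padded on the streamed side.
std::vector<Joined> finishFullOuter(const std::vector<std::vector<bool>>& bitsets,
                                    const std::vector<Row>& broadcast) {
  std::vector<Joined> out;
  for (size_t i = 0; i < broadcast.size(); ++i) {
    bool seen = false;
    for (const auto& bs : bitsets) seen = seen || bs[i];
    if (!seen) out.push_back({std::nullopt, broadcast[i]});
  }
  return out;
}
```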

@wzheng
Collaborator Author

wzheng commented Apr 22, 2021

Thanks for the PR! I didn't mean that only C++ code would be required; in this case it would also require some Scala code to shuffle the bit maps. I think supporting equi-join is good enough for now; we can support a more general full outer join later.

@saharshagrawal
Contributor

Okay, thanks for the clarification.

I pushed some more changes addressing the comments on the PR. Unfortunately, something went wrong with various file dependencies on the VM I was using, and I wasn't able to build and test the code with my changes before pushing (reinstalling Opaque also didn't help, so I'm not sure what happened). Hopefully it will still compile and pass the GitHub build, but I will troubleshoot later today.

@saharshagrawal
Contributor

#215 should be ready for review as of last Friday.

@octaviansima
Collaborator

Closed by #215
