Broadcast Nested Loop Join - Left Anti and Left Semi #159

octaviansima · 2021-02-18T19:45:52Z

This PR is the first of two parts towards making TPC-H 16 work: the other will be implementing is_distinct for aggregate operations.

BroadcastNestedLoopJoin is Spark's "catch all" for non-equi joins. It works by first picking a side to broadcast, then iterating through every possible row combination and checking the non-equi condition against the pair.
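The pairwise check described above can be sketched on plain in-memory collections. This is only an illustration, not the enclave implementation from this PR: `BnljSketch`, `leftSemi`, and `leftAnti` are hypothetical names, rows are ordinary Scala values rather than encrypted blocks, and `exists` stops at the first match instead of literally visiting every pair.

```scala
object BnljSketch {
  // Left semi: keep each streamed row that matches at least one broadcast row.
  def leftSemi[A, B](streamed: Seq[A], broadcast: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    streamed.filter(s => broadcast.exists(b => cond(s, b)))

  // Left anti: keep each streamed row that matches no broadcast row.
  def leftAnti[A, B](streamed: Seq[A], broadcast: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    streamed.filterNot(s => broadcast.exists(b => cond(s, b)))
}
```

For example, with the non-equi condition `_ < _`, `BnljSketch.leftSemi(Seq(1, 5, 10), Seq(3, 7))(_ < _)` keeps `1` and `5`, while `leftAnti` with the same arguments keeps only `10`.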

wzheng

Partial review.

wzheng · 2021-02-19T20:28:45Z

src/main/scala/edu/berkeley/cs/rise/opaque/execution/operators.scala

+
+    val leftRDD = left.asInstanceOf[OpaqueOperatorExec].executeBlocked()
+    val rightRDD = right.asInstanceOf[OpaqueOperatorExec].executeBlocked()
+


Can you add something like `val (streamed, broadcast) = buildSide match {case BuildRight => (leftRDD, rightRDD)}, xxx` to simplify the `defaultJoin` implementation?

This is what Spark does, but I found the existing implementation much easier and shorter. The issue with `(streamed, broadcast)` is that I would have to convert `LeftAnti` to `RightAnti`, etc. for the C++ code; otherwise the result will contain the wrong columns.

I see, I didn't realize that you actually swap `block.bytes` and `broadcast` based on whether it's `BuildRight` or `BuildLeft`. Does this mean you're always assuming that the left side is the outer side? The following code is a bit confusing right now because:

1. I think there is no `RightSemi` and `RightAnti` in Spark according to joinTypes.scala, so if it's an existence join, then only `BuildRight` will be called.

2. For outer joins, however, both `LeftOuter` and `RightOuter` are defined. The build side is assumed to be the opposite of the outer side since you're re-implementing those helper functions, but in that case the broadcast side should always be the "inner side"?

Correct: the left side always corresponds to `outer_rows` and the right side always corresponds to `inner_rows` in C++. I think the strongest case for keeping the code as is comes from strategies.scala: under the current scheme, if we find a significant performance boost from, say, building left when the join is `LeftSemi` (for instance because the left side is much smaller), then no changes to operators.scala or the C++ code will be required. If we decide to model this after the Spark code, then switching `outer_rows` and `inner_rows` will mean implementing `RightSemi` in C++ even though it is not officially supported by Spark.

Edit: after discussion, outer_rows will be streamed and inner_rows will be broadcast.
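The `(streamed, broadcast)` destructuring suggested earlier in this thread might look roughly like the sketch below. The `BuildSide` hierarchy here is a local stand-in for Spark's, and `pickSides` is a hypothetical helper; in the PR the two inputs would be the `leftRDD` and `rightRDD` values from the diff.

```scala
object BuildSideSketch {
  // Local stand-ins for Spark's BuildSide / BuildLeft / BuildRight.
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Returns (streamed, broadcast): the build side is the one that gets broadcast.
  def pickSides[T](left: T, right: T, buildSide: BuildSide): (T, T) =
    buildSide match {
      case BuildRight => (left, right)
      case BuildLeft  => (right, left)
    }
}
```

A caller could then write `val (streamed, broadcast) = BuildSideSketch.pickSides(leftRDD, rightRDD, buildSide)` and keep a single code path for the join itself.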

wzheng · 2021-02-19T20:31:31Z

src/main/scala/edu/berkeley/cs/rise/opaque/execution/operators.scala

+      case x =>
+        throw new OpaqueException(s"$x JoinType is not yet supported")


Nit: I think you can put case _ => throw new OpaqueException(s"$joinType is not yet supported")
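A small self-contained sketch of the wildcard form suggested in this nit. The `JoinType` hierarchy, the `OpaqueException` class, and the `joinTypeTag` function below are simplified stand-ins defined only for this example, not the definitions used in Spark or Opaque.

```scala
object JoinTypeMatchSketch {
  // Simplified stand-ins for Spark's join types.
  sealed trait JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType
  case object Inner extends JoinType

  // Stand-in for the exception class referenced in the diff.
  final class OpaqueException(message: String) extends Exception(message)

  def joinTypeTag(joinType: JoinType): String = joinType match {
    case LeftSemi => "left_semi"
    case LeftAnti => "left_anti"
    // The wildcard avoids binding an extra name; the scrutinee is already in scope.
    case _ => throw new OpaqueException(s"$joinType is not yet supported")
  }
}
```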

wzheng · 2021-02-19T20:33:12Z

src/main/scala/edu/berkeley/cs/rise/opaque/execution/operators.scala

+    buildSide match {
+      // Broadcast right
+      case BuildRight => {
+        val broadcast = Utils.ensureCached(rightRDD.map(block => block.bytes)).collect.flatten


Why do you need to call Utils.ensureCached() here?

wzheng · 2021-02-19T20:58:17Z

src/main/scala/edu/berkeley/cs/rise/opaque/strategies.scala

+
+  private def getBroadcastSideBNLJ(joinType: JoinType): BuildSide = {
+    joinType match {
+      case LeftSemiOrAnti(joinType) => BuildRight


Should it also be set to BuildRight if it's a LeftOuter join?


wzheng · 2021-02-19T21:09:55Z

src/main/scala/edu/berkeley/cs/rise/opaque/strategies.scala

+      // For perf reasons, `BroadcastNestedLoopJoinExec` prefers to broadcast left side if
+      // it's a right join, and broadcast right side if it's a left join.
+      // TODO: revisit it. If left side is much smaller than the right side, it may be better
+      // to broadcast the left side even if it's a left join.


Nit: this seems to be a comment copied from Spark; maybe you can just summarize their comment?
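The preference described in the quoted comment (broadcast the right side for left joins, the left side for right joins) could be summarized in code roughly as follows. This is a hedged sketch: only the `LeftSemi`/`LeftAnti` case is confirmed by the diff, the `LeftOuter` case reflects the open question raised in review, and all types and the name `preferredBroadcastSide` are local stand-ins for illustration.

```scala
object BroadcastSideSketch {
  // Simplified stand-ins for Spark's join types and build sides.
  sealed trait JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object Inner extends JoinType

  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Left-style joins broadcast the right side; right joins broadcast the left side.
  // Other join types are left undecided here (None) rather than guessed.
  def preferredBroadcastSide(joinType: JoinType): Option[BuildSide] = joinType match {
    case LeftSemi | LeftAnti | LeftOuter => Some(BuildRight)
    case RightOuter                      => Some(BuildLeft)
    case _                               => None
  }
}
```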

wzheng · 2021-02-19T21:16:52Z

src/flatbuffers/operators.fbs

+    // In the case of non-equi joins, we pass in a condition
+    // as an expression and evaluate that on each pair of rows.
+    condition:Expr;


Equijoin also supports condition, which we currently handle using an extra filter operation. We might want to merge that into the join code at some point instead of using an extra operator. Doesn't have to be in this PR, but might be good to put a TODO comment here?
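To illustrate the two approaches mentioned here, the sketch below compares an inner equi-join followed by a separate filter with an inner equi-join that evaluates the extra condition while joining. It works on plain keyed sequences; `joinThenFilter` and `joinWithCondition` are hypothetical names, and the equivalence shown is only claimed for inner joins.

```scala
object JoinConditionSketch {
  // Equi-join on the key, then apply the extra condition as a separate filter step.
  def joinThenFilter[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])(
      cond: (A, B) => Boolean): Seq[(K, A, B)] = {
    val joined = left.flatMap { case (lk, a) =>
      right.collect { case (rk, b) if lk == rk => (lk, a, b) }
    }
    joined.filter { case (_, a, b) => cond(a, b) }
  }

  // Equi-join on the key with the extra condition evaluated inside the join.
  def joinWithCondition[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])(
      cond: (A, B) => Boolean): Seq[(K, A, B)] =
    left.flatMap { case (lk, a) =>
      right.collect { case (rk, b) if lk == rk && cond(a, b) => (lk, a, b) }
    }
}
```

Both functions return the same rows for an inner join; the merged form simply avoids materializing pairs that the filter would discard.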

wzheng · 2021-02-23T00:41:39Z

src/enclave/Enclave/ExpressionEvaluation.h

+    builder.Clear();
+    bool row1_equals_row2;
+
+    /* Check equality for equi joins */


This should only be called if is_equi_join is true?

Yes, but the code never actually goes into the for loop because row1_evaluators is empty. I'll add a comment.
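The point about the empty evaluator list can be sketched in Scala: a key-equality check that loops over paired evaluators is vacuously true when there are none, so running it for a non-equi join is harmless. This mirrors the reasoning only, not the C++ in ExpressionEvaluation.h; `keysEqual` and its parameters are hypothetical, and it assumes the equality flag starts out as "equal".

```scala
object VacuousEqualitySketch {
  // Compares two rows key by key using paired evaluators.
  // With no evaluators (the non-equi join case) forall has nothing to check and returns true.
  def keysEqual[R](row1: R, row2: R, evals1: Seq[R => Any], evals2: Seq[R => Any]): Boolean =
    evals1.zip(evals2).forall { case (f, g) => f(row1) == g(row2) }
}
```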

src/enclave/Enclave/BroadcastNestedLoopJoin.cpp

set up class thing cleanup added test cases for non-equi left anti join rename to serializeEquiJoinExpression added isEncrypted condition set up keys JoinExpr now has condition rename serialization does not throw compile error for BNLJ split up added condition in ExpressionEvaluation.h zipPartitions cpp put in place typo added func to header two loops in place update tests condition fixed scala loop interchange rows added tags ensure cached == match working comparison decoupling in ExpressionEvalulation save compiles and condition works is printing fix swap outer/inner o_i_match show() has the same result tests pass test cleanup added test cases for different condition BuildLeft works optional keys in scala started C++ passes the operator tests comments, cleanup attemping to do it the ~right~ way comments to distinguish between primary/secondary, operator tests pass cleanup comments, about to begin implementation for distinct agg ops is_distinct added test case serializing with isDistinct is_distinct in ExpressionEvaluation.h removed unused code from join implementation remove RowWriter/Reader in condition evaluation (join) easier test serialization done correct checking in Scala set is set up spaghetti but it finally works function for clearing values condition_eval isntead of condition goto comment started impl of multiple partitions fix added rangepartitionexec that runs partitioning cleanup serialization properly comments, generalization for > 1 distinct function comments about to refactor into logical.Aggregation the new case has distinct in result expressions need to match on distinct removed new case (doesn't make difference?) 
works remove traces of distinct more cleanup address comments rename equi join split Join.cpp into two files Update App.cpp fixed swap issues one more swap stream/broadcast concatEncryptedBlocks, remove import iostream comment for for loop added comments explaining constraints with broadcast side comments left semi done, existence serializes remove existence serialization fixed

wzheng

Looks great, thanks!


* Support for multiple branched CaseWhen * Interval (#116) * add date_add, interval sql still running into issues * Add Interval SQL support * uncomment out the other tests * resolve comments * change interval equality Co-authored-by: Eric Feng <fengeric11@berkeley.edu> * Remove partition ID argument from enclaves * Fix comments * updates * Modifications to integrate crumb, log-mac, and all-outputs_mac, wip * Store log mac after each output buffer, add all-outputs-mac to each encryptedblocks wip * Add all_outputs_mac to all EncryptedBlocks once all log_macs have been generated * Almost builds * cpp builds * Use ubyte for all_outputs_mac * use Mac for all_outputs_mac * Hopefully this works for flatbuffers all_outputs_mac mutation, cpp builds * Scala builds now too, running into error with union * Stuff builds, error with all outputs mac serialization. this commit uses all_outputs_mac as Mac table * Fixed bug, basic encryption / show works * All single partition tests pass, multiple partiton passes until tpch-9 * All tests pass except tpch-9 and skew join * comment tpch back in * Check same number of ecalls per partition - exception for scanCollectLastPrimary(?) * First attempt at constructing executed DAG * Fix typos * Rework graph * Add log macs to graph nodes * Construct expected DAG and refactor JobNode. Refactor construction of executed DAG. 
* Implement 'paths to sink' for a DAG * add crumb for last ecall * Fix NULL handling for aggregation (#130) * Modify COUNT and SUM to correctly handle NULL values * Change average to support NULL values * Fix * Changing operator matching from logical to physical (#129) * WIP * Fix * Unapply change * Aggregation rewrite (#132) * updated build/sbt file (#135) * Travis update (#137) * update breeze (#138) * TPC-H test suite added (#136) * added tpch sql files * functions updated to save temp view * main function skeleton done * load and clear done * fix clear * performQuery done * import cleanup, use OPAQUE_HOME * TPC-H 9 refactored to use SQL rather than DF operations * removed : Unit, unused imports * added TestUtils.scala * moved all common initialization to TestUtils * update name * begin rewriting TPCH.scala to store persistent tables * invalid table name error * TPCH conversion to class started * compiles * added second case, cleared up names * added TPC-H 6 to check that persistent state has no issues * added functions for the last two tables * addressed most logic changes * DataFrame only loaded once * apply method in companion object * full test suite added * added testFunc parameter to testAgainstSpark * ignore #18 * Separate IN PR (#124) * finishing the in expression. adding more tests and null support. 
need confirmation on null behavior and also I wonder why integer field is sufficient for string * adding additional test * adding additional test * saving concat implementation and it's passing basic functionality tests * adding type aware comparison and better error message for IN operator * adding null checking for the concat operator and adding one additional test * cleaning up IN&Concat PR * deleting concat and preping the in branch for in pr * fixing null bahavior now it's only null when there's no match and there's null input * Build failed Co-authored-by: Ubuntu <chenyu@accvm.docqqnvnul2ujd1zaothcdqfqb.bx.internal.cloudapp.net> Co-authored-by: Wenting Zheng <wzheng@eecs.berkeley.edu> Co-authored-by: Wenting Zheng <wzheng13@gmail.com> * Merge new aggregate * Uncomment log_mac_lst clear * Clean up comments * Separate Concat PR (#125) Implementation of the CONCAT expression. Co-authored-by: Ubuntu <chenyu@accvm.docqqnvnul2ujd1zaothcdqfqb.bx.internal.cloudapp.net> Co-authored-by: Wenting Zheng <wzheng@eecs.berkeley.edu> * Clean up comments in other files * Update pathsEqual to be less conservative * Remove print statements from unit tests * Removed calls to toSet in TPC-H tests (#140) * removed calls to toSet * added calls to toSet back where queries are unordered * Documentation update (#148) * Cluster Remote Attestation Fix (#146) The existing code only had RA working when run locally. This PR adds a sleep for 5 seconds to make sure that all executors are spun up successfully before attestation begins. Closes #147 * upgrade to 3.0.1 (#144) * Update two TPC-H queries (#149) Tests for TPC-H 12 and 19 pass. 
* TPC-H 20 Fix (#142) * string to stringtype error * tpch 20 passes * cleanup * implemented changes * decimal.tofloat Co-authored-by: Wenting Zheng <wzheng@eecs.berkeley.edu> * Add expected operator DAG generation from executedPlan string * Rebase * Join update (#145) * Merge join update * Integrate new join * Add expected operator for sortexec * Merge comp-integrity with join update * Remove some print statements * Migrate from Travis CI to Github Actions (#156) * Upgrade to OE 0.12 (#153) * Update README.md * Support for scalar subquery (#157) This PR implements the scalar subquery expression, which is triggered whenever a subquery returns a scalar value. There were two main problems that needed to be solved. First, support for matching the scalar subquery expression is necessary. Spark implements this by wrapping a SparkPlan within the expression and calls executeCollect. Then it constructs a literal with that value. However, this is problematic for us because that value should not be decrypted by the driver and serialized into an expression, since it's an intermediate value. Therefore, the second issue to be addressed here is supporting an encrypted literal. This is implemented in this PR by serializing an encrypted ciphertext into a base64 encoded string, and wrapping a Decrypt expression on top of it. This expression is then evaluated in the enclave and returns a literal. Note that, in order to test our implementation, we also implement a Decrypt expression in Scala. However, this should never be evaluated on the driver side and serialized into a plaintext literal. This is because Decrypt is designated as a Nondeterministic expression, and therefore will always evaluate on the workers. 
* Add TPC-H Benchmarks (#139) * logic decoupling in TPCH.scala for easier benchmarking * added TPCHBenchmark.scala * Benchmark.scala rewrite * done adding all support TPC-H query benchmarks * changed commandline arguments that benchmark takes * TPCHBenchmark takes in parameters * fixed issue with spark conf * size error handling, --help flag * add Utils.force, break cluster mode * comment out logistic regression benchmark * ensureCached right before temp view created/replaced * upgrade to 3.0.1 * upgrade to 3.0.1 * 10 scale factor * persistData * almost done refactor * more cleanup * compiles * 9 passes * cleanup * collect instead of force, sf_none * remove sf_none * defaultParallelism * no removing trailing/leading whitespace * add sf_med * hdfs works in local case * cleanup, added new CLI argument * added newly supported tpch queries * function for running all supported tests * Construct expected DAG from dataframe physical plan * Refactor collect and add integrity checking helper function to OpaqueOperatorTest * Float expressions (#160) This PR adds float normalization expressions [implemented in Spark](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala#L170). TPC-H query 2 also passes. * Broadcast Nested Loop Join - Left Anti and Left Semi (#159) This PR is the first of two parts towards making TPC-H 16 work: the other will be implementing `is_distinct` for aggregate operations. `BroadcastNestedLoopJoin` is Spark's "catch all" for non-equi joins. It works by first picking a side to broadcast, then iterating through every possible row combination and checking the non-equi condition against the pair. 
* Remove addExpectedOperator from JobVerificationEngine, add comments * Implement expected DAG construction by doing graph manipulation on dataframe field instead of string parsing * Fix merge errors in the test cases Co-authored-by: Andrew Law <andrewlaw@sharkfin.local> Co-authored-by: Eric Feng <31462296+eric-feng-2011@users.noreply.github.com> Co-authored-by: Eric Feng <fengeric11@berkeley.edu> Co-authored-by: Chester Leung <chester.leung@berkeley.edu> Co-authored-by: Wenting Zheng <wzheng@eecs.berkeley.edu> Co-authored-by: octaviansima <34696537+octaviansima@users.noreply.github.com> Co-authored-by: Chenyu Shi <32005685+Chenyu-Shi@users.noreply.github.com> Co-authored-by: Ubuntu <chenyu@accvm.docqqnvnul2ujd1zaothcdqfqb.bx.internal.cloudapp.net> Co-authored-by: Wenting Zheng <wzheng13@gmail.com>


match remove RangePartitionExec inefficient implementation refined Add TPC-H Benchmarks (#139) * logic decoupling in TPCH.scala for easier benchmarking * added TPCHBenchmark.scala * Benchmark.scala rewrite * done adding all support TPC-H query benchmarks * changed commandline arguments that benchmark takes * TPCHBenchmark takes in parameters * fixed issue with spark conf * size error handling, --help flag * add Utils.force, break cluster mode * comment out logistic regression benchmark * ensureCached right before temp view created/replaced * upgrade to 3.0.1 * upgrade to 3.0.1 * 10 scale factor * persistData * almost done refactor * more cleanup * compiles * 9 passes * cleanup * collect instead of force, sf_none * remove sf_none * defaultParallelism * no removing trailing/leading whitespace * add sf_med * hdfs works in local case * cleanup, added new CLI argument * added newly supported tpch queries * function for running all supported tests complete instead of partial -> final removed traces of join cleanup * added test case for one distinct one non, reverted comment * removed C++ level implementation of is_distinct * PartialMerge in operators.scala * stage 1: grouping with distinct expressions * stage 2: WIP * saving, sorting by group expressions ++ name distinct expressions worked * stage 1 & 2 printing the expected results * removed extraneous call to sorted, #3 in place but not working * stage 3 has the final, correct result: refactoring the Aggregate code to not cast aggregate expressions to Partial, PartialMerge, etc will be needed * refactor done, C++ still printing the correct values * need to formalize None case in EncryptedAggregateExec.output, but stage 4 passes * distinct and indistinct passes (git add -u) * general cleanup, None case looks nicer * throw error with >1 distinct, add test case for global distinct * no need for global aggregation case * single partition passes all aggregate tests, multiple partition doesn't * works with global sort first 
* works with non-global sort first * cleanup * cleanup tests * removed iostream, other nit * added test case for 13 * None case in isPartial match done properly * added test cases for sumDistinct * case-specific namedDistinctExpressions working * distinct sum is done * removed comments * got rid of mode argument * tests include null values * partition followed by local sort instead of first global sort * Remove addExpectedOperator from JobVerificationEngine, add comments * Implement expected DAG construction by doing graph manipulation on dataframe field instead of string parsing * Fix merge errors in the test cases Co-authored-by: Andrew Law <andrewlaw@sharkfin.local> Co-authored-by: Eric Feng <31462296+eric-feng-2011@users.noreply.github.com> Co-authored-by: Eric Feng <fengeric11@berkeley.edu> Co-authored-by: Chester Leung <chester.leung@berkeley.edu> Co-authored-by: Wenting Zheng <wzheng@eecs.berkeley.edu> Co-authored-by: octaviansima <34696537+octaviansima@users.noreply.github.com> Co-authored-by: Chenyu Shi <32005685+Chenyu-Shi@users.noreply.github.com> Co-authored-by: Ubuntu <chenyu@accvm.docqqnvnul2ujd1zaothcdqfqb.bx.internal.cloudapp.net> Co-authored-by: Wenting Zheng <wzheng13@gmail.com>

octaviansima requested a review from wzheng February 18, 2021 20:07

octaviansima marked this pull request as ready for review February 18, 2021 20:10

wzheng reviewed Feb 19, 2021

wzheng reviewed Feb 23, 2021

octaviansima mentioned this pull request Feb 23, 2021

Left/Right Outer support for equi and non-equi joins #162

Merged

octaviansima force-pushed the bnlj-only branch from 5736938 to 0efa5ad February 23, 2021 19:54

octaviansima force-pushed the bnlj-only branch from adde31c to a06a639 February 23, 2021 20:41

octaviansima force-pushed the bnlj-only branch from 82c9f58 to 358d7a4 February 23, 2021 20:57

octaviansima changed the title from "Broadcast Nested Loop Join - Left Anti Implementation" to "Broadcast Nested Loop Join - Left Anti and Left Semi" Feb 23, 2021

Merge branch 'master' into bnlj-only

3dcc4f1

wzheng approved these changes Feb 24, 2021

wzheng merged commit 432eef8 into mc2-project:master Feb 24, 2021

octaviansima deleted the bnlj-only branch March 4, 2021 01:30

wzheng mentioned this pull request Apr 6, 2021

Full outer join is not supported #202

Closed

		val leftRDD = left.asInstanceOf[OpaqueOperatorExec].executeBlocked()
		val rightRDD = right.asInstanceOf[OpaqueOperatorExec].executeBlocked()

		case x =>
			throw new OpaqueException(s"$x JoinType is not yet supported")
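
As described in the PR summary, a broadcast nested loop join checks the non-equi condition against every (outer, inner) row pair. Below is a minimal plaintext sketch of that loop for Left Semi and Left Anti, assuming the left side is the outer side and the right side is the broadcast (inner) side, as discussed in the review. All names here (`NestedLoopJoinSketch`, `nestedLoopJoin`, `ExistenceJoin`) are hypothetical and exist only for illustration. This is not Opaque's enclave code, which operates on encrypted blocks in C++, and it ignores SQL NULL semantics.

```scala
// Hypothetical sketch on plain Scala collections; not Opaque's actual API.
object NestedLoopJoinSketch {
  // The two existence join types covered by this PR.
  sealed trait ExistenceJoin
  case object LeftSemi extends ExistenceJoin
  case object LeftAnti extends ExistenceJoin

  /** For each outer (left) row, scan the inner (broadcast) rows and test the condition.
    * Only outer rows are returned, so the output keeps the left side's columns.
    */
  def nestedLoopJoin[L, R](
      outer: Seq[L],
      inner: Seq[R],
      joinType: ExistenceJoin)(condition: (L, R) => Boolean): Seq[L] =
    outer.filter { l =>
      // exists stops at the first inner row that satisfies the condition
      val matched = inner.exists(r => condition(l, r))
      joinType match {
        case LeftSemi => matched  // keep rows with at least one match
        case LeftAnti => !matched // keep rows with no match at all
      }
    }

  def main(args: Array[String]): Unit = {
    val left = Seq(1, 5, 10)
    val right = Seq(4, 6)
    // Non-equi condition: left value strictly greater than right value.
    println(nestedLoopJoin(left, right, LeftSemi)(_ > _)) // List(5, 10)
    println(nestedLoopJoin(left, right, LeftAnti)(_ > _)) // List(1)
  }
}
```

Because only outer rows are emitted, swapping which side is broadcast in this sketch would require a matching "right" variant of each join type. That mirrors the concern raised in the review about keeping the left side as the outer side.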
