[SPARK-3861][SQL] Avoid rebuilding hash tables on each partition #2722

rxin · 2014-10-08T23:57:51Z

BroadcastHashJoin builds a new hash table for each partition. We can build it once per node and reuse the hash table.
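A minimal sketch of the idea in plain Scala, not the code in this PR: the rows, the partitions, the `builds` counter, and every name in `BroadcastSketch` are hypothetical stand-ins, and no Spark API is used. It only illustrates the difference between hashing the build side once per partition and hashing it once and reusing the table.

```scala
import java.util.{HashMap => JavaHashMap}

object BroadcastSketch {
  // Hypothetical stand-in: a row is a (key, value) pair.
  type Row = (Int, String)

  // Counts how many times the build side gets hashed.
  var builds = 0

  // Hash the (small) build side of the join by key.
  def buildHashTable(buildSide: Seq[Row]): JavaHashMap[Int, String] = {
    builds += 1
    val table = new JavaHashMap[Int, String]()
    buildSide.foreach { case (k, v) => table.put(k, v) }
    table
  }

  // Probe the hash table with one partition of the streamed side.
  def probe(table: JavaHashMap[Int, String], partition: Seq[Row]): Seq[(Int, String, String)] =
    partition.flatMap { case (k, v) =>
      Option(table.get(k)).map(matched => (k, v, matched))
    }

  val buildSide: Seq[Row] = Seq(1 -> "a", 2 -> "b")
  val partitions: Seq[Seq[Row]] = Seq(Seq(1 -> "x"), Seq(2 -> "y"), Seq(3 -> "z"))

  // Before: the hash table is rebuilt for every partition.
  val perPartition = partitions.map(p => probe(buildHashTable(buildSide), p))
  val buildsPerPartition = builds

  // After: the hash table is built once and reused by every partition.
  builds = 0
  val shared = buildHashTable(buildSide)
  val reused = partitions.map(p => probe(shared, p))
  val buildsShared = builds
}
```

Both strategies produce the same join output; only the number of times the build side is hashed differs.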

rxin · 2014-10-08T23:58:10Z

This is based on #2719. We should merge that one first.

…zation. If we write a filter that is always FALSE, like SELECT * from person WHERE FALSE; 200 tasks will run. I think 1 task is enough. Also, the current optimizer cannot optimize the case where NOT is duplicated, like SELECT * from person WHERE NOT ( NOT (age > 30)); The filter rule above should be simplified.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes apache#2692 from sarutak/SPARK-3831 and squashes the following commits:
- 25f3e20 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3831
- 23c750c [Kousuke Saruta] Improved unsupported predicate test case
- a11b9f3 [Kousuke Saruta] Modified NOT predicate test case in PartitionBatchPruningSuite
- 8ea872b [Kousuke Saruta] Fixed the number of tasks when the data of LocalRelation is empty.

SparkQA · 2014-10-09T00:04:35Z

QA tests have started for PR 2722 at commit 90b58c0.

This patch merges cleanly.

This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases. Since we already write schema information to Parquet metadata in the old style, we have to keep the old `DataType` parser to ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.

JoshRosen davies Please help review PySpark related changes, thanks!

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#2563 from liancheng/datatype-to-json and squashes the following commits:
- fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation
- 438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments
- 6b6387b [Cheng Lian] Removes debugging code
- 6a3ee3a [Cheng Lian] Addresses per review comments
- dc158b5 [Cheng Lian] Addresses PEP8 issues
- 99ab4ee [Cheng Lian] Adds compatibility test case for Parquet type conversion
- a983a6c [Cheng Lian] Adds PySpark support
- f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON

marmbrus Update README.md to be consistent with Spark 1.1

Author: Liquan Pei <liquanpei@gmail.com>

Closes apache#2706 from Ishiihara/SparkSQL-readme and squashes the following commits:
- 33b9d4b [Liquan Pei] keep README.md up to date

SparkQA · 2014-10-09T00:17:41Z

QA tests have finished for PR 2722 at commit 90b58c0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-09T00:17:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21495/

AmplabJenkins · 2014-10-09T00:27:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21497/

SparkQA · 2014-10-09T00:49:47Z

QA tests have started for PR 2722 at commit 4b9d0c9.

This patch merges cleanly.

Calling `BinaryArithmetic.dataType` throws an exception until the expression is resolved, but the type coercion rule `Division` does not seem to respect this.

Author: Cheng Hao <hao.cheng@intel.com>

Closes apache#2559 from chenghao-intel/type_coercion and squashes the following commits:
- 199a85d [Cheng Hao] Simplify the divide rule
- dc55218 [Cheng Hao] fix bug of type coercion in div

Takes partition keys into account when applying the `PreInsertionCasts` rule.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#2672 from liancheng/fix-pre-insert-casts and squashes the following commits:
- def1a1a [Cheng Lian] Makes PreInsertionCasts handle partitions properly

…inserting Hive values

Builds all wrappers at first according to object inspector types to avoid per row costs.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#2592 from liancheng/hive-value-wrapper and squashes the following commits:
- 9696559 [Cheng Lian] Passes all tests
- 4998666 [Cheng Lian] Prevents per row dynamic dispatching and pattern matching when inserting Hive values

Author: Reynold Xin <rxin@apache.org>

Closes apache#2719 from rxin/sql-join-break and squashes the following commits:
- 0c0082b [Reynold Xin] Fix line length.
- cbc664c [Reynold Xin] Rename join -> joins package.
- a070d44 [Reynold Xin] Fix line length in HashJoin
- a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.

SparkQA · 2014-10-09T01:35:29Z

QA tests have finished for PR 2722 at commit 4b9d0c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-09T01:35:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21500/

…sh-1

Conflicts:
- sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoin.scala

rxin · 2014-10-09T04:39:17Z

Replacing this with #2727.

SparkQA · 2014-10-09T04:39:36Z

QA tests have started for PR 2722 at commit 18eb214.

This patch merges cleanly.

SparkQA · 2014-10-09T05:23:21Z

QA tests have finished for PR 2722 at commit 18eb214.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class UniqueKeyHashedRelation(hashTable: JavaHashMap[Row, Row])

AmplabJenkins · 2014-10-09T05:23:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21519/

rxin added 5 commits October 8, 2014 15:22

[SPARK-3857] Create a join package for various join operators.

a39be8c

Fix line length in HashJoin

a070d44

Rename join -> joins package.

cbc664c

Fix line length.

0c0082b

[SPARK-3861] Avoid rebuilding hash tables on each partition

90b58c0

liancheng and others added 3 commits October 8, 2014 17:04

Added a test case.

e0ebdd1

[SQL][Doc] Keep Spark SQL README.md up to date

00b7791

UniqueKeyHashedRelation.get should return null if the value is null.

4b9d0c9
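The behavior named in this commit title can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: keys and values are plain `String`s instead of `Row`s, `get` returns a `Seq` rather than whatever buffer type the real class uses, and the `HashedRelation` trait shape is assumed. Only the class name and its `JavaHashMap` constructor argument come from the SparkQA report above.

```scala
import java.util.{HashMap => JavaHashMap}

object HashedRelationSketch {
  // Assumed interface: all build-side values matching a key, or null if there are none.
  trait HashedRelation {
    def get(key: String): Seq[String]
  }

  // Each key maps to at most one value, so the table stores the value directly.
  final class UniqueKeyHashedRelation(hashTable: JavaHashMap[String, String])
      extends HashedRelation {
    override def get(key: String): Seq[String] = {
      val v = hashTable.get(key)
      // Return null rather than a collection wrapping null when nothing is stored for the key.
      if (v eq null) null else Seq(v)
    }
  }
}
```

Without the null check, a lookup for an absent key would yield a one-element collection containing `null`, which a caller could mistake for a match.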

chenghao-intel and others added 4 commits October 8, 2014 17:52

rxin closed this Oct 9, 2014

rxin deleted the SPARK-3861-broadcast-hash branch October 9, 2014 04:39
