[SPARK-27829][SQL] In Dataset.joinWith() inner joins, don't nest data before shuffling #24693

JoshRosen · 2019-05-24T03:48:41Z

What changes were proposed in this pull request?

In order to support outer joins with null top-level objects, SPARK-15441 modified Dataset.joinWith to project both inputs into single-column structs prior to the join.

For inner joins, however, this step is unnecessary and actually harms performance: performing the nesting before the join increases the shuffled data size. As an optimization for inner joins only, we can move this nesting to occur after the join (effectively switching back to the pre-SPARK-15441 behavior; see #13425).
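The reasoning above can be illustrated with a toy model in plain Scala collections. This is only a sketch and is not Spark's implementation: `NestingSketch`, `Row`, `nest`, `nestAfter`, and the naive nested-loop joins are all hypothetical names introduced for illustration. A row is modeled as a vector of column values, a struct column is itself a row, and `null` stands in for SQL NULL.

```scala
// Toy model only: none of these names come from Spark.
object NestingSketch {
  // A row is a vector of column values; a struct column is itself a Row.
  type Row = Vector[Any]

  // Project all of a row's columns into a single struct column.
  def nest(row: Row): Row = Vector(row)

  // Naive nested-loop inner join: concatenate the columns of matching rows.
  def innerJoin(left: Seq[Row], right: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] =
    for (l <- left; r <- right if cond(l, r)) yield l ++ r

  // Naive left outer join: unmatched left rows are padded with `rightWidth` nulls.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row], rightWidth: Int)(
      cond: (Row, Row) => Boolean): Seq[Row] =
    left.flatMap { l =>
      val matches = right.filter(r => cond(l, r))
      if (matches.isEmpty) Seq(l ++ Vector.fill[Any](rightWidth)(null))
      else matches.map(r => l ++ r)
    }

  val left: Seq[Row]  = Seq(Vector(1L, 10L), Vector(2L, 20L))
  val right: Seq[Row] = Seq(Vector(1L, 100L))

  // Join condition on the first column of flat rows ...
  val flatCond: (Row, Row) => Boolean = (l, r) => l(0) == r(0)
  // ... and the same condition reaching through the struct wrapper.
  val nestedCond: (Row, Row) => Boolean =
    (l, r) => flatCond(l(0).asInstanceOf[Row], r(0).asInstanceOf[Row])

  // Re-nest a flat joined row (2 left columns followed by 2 right columns).
  def nestAfter(row: Row): Row = Vector(row.take(2), row.drop(2))

  // Inner join: nesting before or after the join gives the same rows.
  val innerBefore = innerJoin(left.map(nest), right.map(nest))(nestedCond)
  val innerAfter  = innerJoin(left, right)(flatCond).map(nestAfter)

  // Left outer join: nesting before keeps a null top-level object for the
  // unmatched side, while nesting after yields a struct whose fields are null.
  val outerBefore = leftOuterJoin(left.map(nest), right.map(nest), rightWidth = 1)(nestedCond)
  val outerAfter  = leftOuterJoin(left, right, rightWidth = 2)(flatCond).map(nestAfter)
}
```

In this model the two inner-join results are identical, so the nesting can safely be deferred until after the join. For the left outer join the unmatched right side is `null` when nesting happens first, but becomes a non-null struct of nulls when nesting happens afterwards, which is why the optimization is restricted to inner joins.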

How was this patch tested?

Existing tests, which I strengthened to also make assertions about the join result's nullability (since this guards against a bug I almost introduced during prototyping).

Here's a quick spark-shell experiment demonstrating the reduction in shuffle size:

// With --conf spark.shuffle.compress=false
sql("set spark.sql.autoBroadcastJoinThreshold=-1") // for easier shuffle measurements
case class Foo(a: Long, b: Long)
val left = spark.range(10000).map(x => Foo(x, x))
val right = spark.range(10000).map(x => Foo(x, x))
left.joinWith(right, left("a") === right("a"), "inner").rdd.count()
left.joinWith(right, left("a") === right("a"), "left").rdd.count()

With the inner join (which benefits from this PR's optimization) we shuffle 546.9 KiB. With the left outer join (whose plan is unchanged by this PR and therefore represents the previous behavior) we shuffle 859.4 KiB, roughly 57% more. Shuffle compression (which is enabled by default) narrows this gap: with compression, outer joins shuffle about 12% more than inner joins.

SparkQA · 2019-05-24T06:48:17Z

Test build #105750 has finished for PR 24693 at commit 33bb4af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-24T09:59:58Z

Test build #105753 has finished for PR 24693 at commit ec2bb32.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-24T10:07:38Z

Test build #105755 has finished for PR 24693 at commit ec4d785.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-05-28T14:34:54Z

@cloud-fan, could you take a look at this change since you worked on the PR which originally introduced the nesting here?

cloud-fan · 2019-05-29T08:12:37Z

thanks, merging to master!

joshrosen-stripe and others added 4 commits May 20, 2019 19:33

Add test for left outer; check schemas in joinWith tests.

0f2f67f

Speed up joinWith for inner-joins.

599d992

Fix incorrect merge conflict resolution

831052f

Comment re-word

33bb4af

JoshRosen added 3 commits May 24, 2019 00:02

Deduplicate test.

ec2bb32

Reword

59b4028

fixup

ec4d785

JoshRosen changed the title ~~[SPARK-27829][SQL] In Dataset.joinWith inner joins, don't nest data before shuffling~~ [SPARK-27829][SQL] In Dataset.joinWith() inner joins, don't nest data before shuffling May 24, 2019

cloud-fan closed this in 19aaf0f May 29, 2019

JoshRosen deleted the fast-join-with-for-inner-joins branch May 29, 2019 14:43
