[SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ordering #8074

JoshRosen · 2015-08-10T18:52:13Z

HashPartitioning compatibility is currently defined w.r.t the set of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make HashPartitioning.compatibleWith and HashPartitioning.guarantees sensitive to the expression ordering (i.e. do not perform set comparison).
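
To make the failure mode concrete, here is a minimal sketch in Python rather than Spark code. The `row_hash` combine step and the two `compatible_*` helpers are hypothetical stand-ins, not Spark's actual hash function or the real `compatibleWith` implementation. The sketch only illustrates that an order-sensitive hash can send the same row to different partitions when the key order changes, so a set-based check reports compatibility where a sequence-based check does not.

```python
def row_hash(values):
    """Order-sensitive hash of a sequence of ints (a 31*h + v style combine).

    This is an illustrative stand-in, not the hash Spark uses.
    """
    h = 0
    for v in values:
        h = (31 * h + v) & 0xFFFFFFFF
    return h


def partition_id(row, keys, num_partitions):
    """Pick a partition by hashing the row's key values in the given order."""
    return row_hash([row[k] for k in keys]) % num_partitions


def compatible_as_set(exprs_a, exprs_b):
    """Set comparison: ignores the order of the partitioning expressions."""
    return set(exprs_a) == set(exprs_b)


def compatible_as_sequence(exprs_a, exprs_b):
    """Sequence comparison: the order of the expressions must match."""
    return list(exprs_a) == list(exprs_b)


row = {"key1": 1, "key2": 2}
order_a = ["key1", "key2"]
order_b = ["key2", "key1"]

p_a = partition_id(row, order_a, 8)
p_b = partition_id(row, order_b, 8)

print(p_a, p_b)                                  # the two partition ids differ
print(compatible_as_set(order_a, order_b))       # True: the unsafe answer
print(compatible_as_sequence(order_a, order_b))  # False: the safe answer
```

With the set-based check, a planner would conclude that no shuffle is needed even though the same row lives in different partitions on the two sides.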

JoshRosen · 2015-08-10T18:52:18Z

/cc @rxin @yhuai

davies · 2015-08-10T19:24:59Z

@JoshRosen Could you add a SQL test (joining two DataFrames that are already partitioned on the same group of keys but in different orders)? Otherwise LGTM.

SparkQA · 2015-08-10T20:39:13Z

Test build #40307 has finished for PR 8074 at commit 0b4d7d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-08-10T21:23:16Z

LGTM. In the future, it would be good to generate hashcodes for a hash partitioning in an ordering-insensitive way.
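
One possible reading of an ordering-insensitive hash, sketched in Python under stated assumptions: scramble each key value independently, then combine the results with a commutative operation (addition here). The `mix` function and its constants are hypothetical choices for illustration, and nothing in this sketch reflects how Spark does or would implement it.

```python
MASK = 0xFFFFFFFF


def mix(v):
    """Scramble one 32-bit integer so nearby inputs spread out (hypothetical mixer)."""
    v &= MASK
    v = ((v ^ (v >> 16)) * 0x45D9F3B) & MASK
    v = ((v ^ (v >> 16)) * 0x45D9F3B) & MASK
    return v ^ (v >> 16)


def unordered_row_hash(values):
    """Combine per-value hashes with addition, which is commutative,
    so the result does not depend on the order of the values."""
    return sum(mix(v) for v in values) & MASK


def partition_id(row, keys, num_partitions):
    """Pick a partition from the row's key values, in whatever order keys are listed."""
    return unordered_row_hash([row[k] for k in keys]) % num_partitions


row = {"key1": 1, "key2": 2}
p_a = partition_id(row, ["key1", "key2"], 8)
p_b = partition_id(row, ["key2", "key1"], 8)
print(p_a == p_b)  # the same partition regardless of key order
```

The trade-off in this sketch is that rows whose key values are swapped, such as (1, 2) and (2, 1), always collide. That costs some distribution quality but not correctness, since hash partitioning only requires equal keys to land in the same partition.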

JoshRosen · 2015-08-10T21:44:37Z

@davies, I don't know whether it's actually straightforward to write an end-to-end DataFrame test case which is partitioned on the same keys in different orders, although it might be achievable by joining together two group-by results.

JoshRosen · 2015-08-11T05:50:56Z

I've had a hard time contriving an end-to-end test where this bug presents a problem, but nevertheless I think that we should merge this fix.

davies · 2015-08-11T05:55:36Z

LGTM

SparkQA · 2015-08-11T07:52:09Z

Test build #40400 has finished for PR 8074 at commit b61412f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-08-11T15:51:30Z

I am merging it to master and branch 1.5.

…ression ordering

HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits:

b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix
0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies().
dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785

(cherry picked from commit dfe347d)
Signed-off-by: Yin Huai <yhuai@databricks.com>

yhuai · 2015-08-11T15:54:31Z

I think you can use

SELECT ...
FROM (SELECT key1, key2 FROM t1 GROUP BY key1, key2) tmp1
JOIN (SELECT key1, key2 FROM t1 GROUP BY key2, key1) tmp2
ON (tmp1.key1 = tmp2.key1 AND tmp1.key2 = tmp2.key2)

to expose the problem.
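
As a rough illustration of what that query shape could expose, the following Python sketch simulates the scenario outside of Spark. The toy `row_hash`, the `partition` helper, and the partition-local `copartitioned_join` are all hypothetical; they stand in for two group-by outputs partitioned on (key1, key2) and (key2, key1) that are then joined without a reshuffle.

```python
def row_hash(values):
    """Order-sensitive toy hash (31*h + v); not Spark's actual hash."""
    h = 0
    for v in values:
        h = (31 * h + v) & 0xFFFFFFFF
    return h


def partition(rows, keys, num_partitions):
    """Distribute rows into partitions by hashing key values in the given order."""
    parts = [[] for _ in range(num_partitions)]
    for r in rows:
        parts[row_hash([r[k] for k in keys]) % num_partitions].append(r)
    return parts


def copartitioned_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side,
    which is what happens when the planner decides no shuffle is needed."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        for l in lp:
            for r in rp:
                if l["key1"] == r["key1"] and l["key2"] == r["key2"]:
                    out.append((l["key1"], l["key2"]))
    return out


rows = [{"key1": a, "key2": b} for a in range(4) for b in range(4)]
n = 8

tmp1 = partition(rows, ["key1", "key2"], n)            # GROUP BY key1, key2
tmp2_no_shuffle = partition(rows, ["key2", "key1"], n)  # GROUP BY key2, key1
tmp2_reshuffled = partition(rows, ["key1", "key2"], n)  # after a corrective shuffle

wrong = copartitioned_join(tmp1, tmp2_no_shuffle)
right = copartitioned_join(tmp1, tmp2_reshuffled)

print(len(right))               # every row finds its match
print(len(wrong) < len(right))  # matches are silently lost without the shuffle
```

In this toy model the join that skips the shuffle drops matching rows without raising any error, which is the kind of incorrect answer the PR description warns about.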

JoshRosen added 2 commits August 10, 2015 11:39

Add failing regression test for SPARK-9785

dc9c9d7

Update so that clusteringSet is only used in satisfies().

0b4d7d9

JoshRosen changed the title ~~[SPARK-9785] HashPartitioning compatibility should consider expression ordering~~ [SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ordering Aug 10, 2015

Demonstrate that I haven't cheated in my fix

b61412f

asfgit closed this in dfe347d Aug 11, 2015
