Skip to content

Conversation

@JoshRosen
Copy link
Contributor

HashPartitioning compatibility is currently defined w.r.t the set of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make HashPartitioning.compatibleWith and HashPartitioning.guarantees sensitive to the expression ordering (i.e. do not perform set comparison).

@JoshRosen
Copy link
Contributor Author

/cc @rxin @yhuai

@davies
Copy link
Contributor

davies commented Aug 10, 2015

@JoshRosen Could you add a sql test (join two DataFrame they are already partitioned on the same group of keys but different orders)? otherwise LGTM.

@JoshRosen JoshRosen changed the title [SPARK-9785] HashPartitioning compatibility should consider expression ordering [SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ordering Aug 10, 2015
@SparkQA
Copy link

SparkQA commented Aug 10, 2015

Test build #40307 has finished for PR 8074 at commit 0b4d7d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Aug 10, 2015

LGTM. In future, it will be good to generate hashcodes for a hash partitioning in a ordering insensitive way.

@JoshRosen
Copy link
Contributor Author

@davies, I don't know whether it's actually straightforward to write an end-to-end DataFrame test case which is partitioned on the same keys in different orders, although it might be achievable by joining together two group-by results.

@JoshRosen
Copy link
Contributor Author

I've had a hard time contriving an end-to-end test where this bug presents a problem, but nevertheless I think that we should merge this fix.

@davies
Copy link
Contributor

davies commented Aug 11, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40400 has finished for PR 8074 at commit b61412f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2015

I am merging it to master and branch 1.5.

asfgit pushed a commit that referenced this pull request Aug 11, 2015
…ression ordering

HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits:

b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix
0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies().
dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785

(cherry picked from commit dfe347d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in dfe347d Aug 11, 2015
@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2015

I think you can use

SELECT ...
FROM (SELECT key1, key2 FROM t1 GROUP BY key1, key2) tmp1
JOIN (SELECT key1, key2 FROM t1 GROUP BY key2, key1) tmp2
ON (tmp1.key1 = tmp2.key1 AND tmp1.key2 = tmp2.key2)

to expose the problem.

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…ression ordering

HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits:

b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix
0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies().
dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants