[SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ordering #8074
|
@JoshRosen Could you add a SQL test (joining two DataFrames that are already partitioned on the same group of keys, but in different orders)? Otherwise LGTM. |
|
Test build #40307 has finished for PR 8074 at commit
|
|
LGTM. In the future, it would be good to generate hash codes for a hash partitioning in an ordering-insensitive way. |
|
@davies, I don't know whether it's actually straightforward to write an end-to-end DataFrame test case which is partitioned on the same keys in different orders, although it might be achievable by joining together two group-by results. |
|
I've had a hard time contriving an end-to-end test where this bug presents a problem, but nevertheless I think that we should merge this fix. |
|
LGTM |
|
Test build #40400 has finished for PR 8074 at commit
|
|
I am merging it to master and branch 1.5. |
…ression ordering

HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.

The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits:

b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix
0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies().
dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785

(cherry picked from commit dfe347d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
|
I think you can use to expose the problem. |
HashPartitioning compatibility is currently defined w.r.t the set of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem.
The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison).
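To make the failure mode in the description concrete, here is a minimal toy model in Python. It is not Spark's implementation: `ToyHashPartitioning`, `row_hash`, `compatible_with_set`, and `compatible_with_ordered` are hypothetical names, and the `31 * h + x` combiner is only a stand-in for any order-sensitive row hash. The sketch shows that two partitionings over the same expressions in different orders can send the same row to different partitions, so a set-based compatibility check would wrongly allow a shuffle to be skipped.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Row = Dict[str, int]


def row_hash(row: Row, expressions: Tuple[str, ...]) -> int:
    """Stand-in order-sensitive hash (assumption: not Spark's actual hash)."""
    h = 0
    for name in expressions:
        # Each step depends on the running value, so expression order matters.
        h = (31 * h + row[name]) & 0xFFFFFFFF
    return h


@dataclass(frozen=True)
class ToyHashPartitioning:
    """Hypothetical, simplified model of a hash partitioning."""
    expressions: Tuple[str, ...]
    num_partitions: int

    def partition_of(self, row: Row) -> int:
        return row_hash(row, self.expressions) % self.num_partitions

    def compatible_with_set(self, other: "ToyHashPartitioning") -> bool:
        # Set comparison: ignores ordering (models the behavior described as buggy).
        return (self.num_partitions == other.num_partitions
                and set(self.expressions) == set(other.expressions))

    def compatible_with_ordered(self, other: "ToyHashPartitioning") -> bool:
        # Sequence comparison: sensitive to ordering (models the proposed fix).
        return (self.num_partitions == other.num_partitions
                and self.expressions == other.expressions)


left = ToyHashPartitioning(("a", "b"), 4)
right = ToyHashPartitioning(("b", "a"), 4)
row = {"a": 1, "b": 2}

print(left.partition_of(row), right.partition_of(row))  # prints "1 3"
print(left.compatible_with_set(right))                  # prints "True"
print(left.compatible_with_ordered(right))              # prints "False"
```

In this model the row lands in partition 1 under `(a, b)` and in partition 3 under `(b, a)`, yet the set-based check reports the two partitionings as compatible. Comparing the expression sequences directly rejects the pair, which is the behavior the fix asks for.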