[SPARK-35439][SQL][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions #32870

viirya · 2021-06-10T23:15:40Z

What changes were proposed in this pull request?

This is a followup of #32586. We introduced ExpressionContainmentOrdering to sort common expressions according to their parent-child relations. For unrelated expressions, the ordering previously returned -1, which is incorrect and can lead to a transitivity issue.
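To make the problem concrete, here is a minimal sketch in plain Scala. The names `Expr`, `Lit`, `Add`, `contains`, `oldOrdering` and `newOrdering` are hypothetical stand-ins, not the Catalyst classes or the actual patch, and structural equality is assumed in place of `semanticEquals`. It only models the behavior described above: the old comparison falls through to -1 for unrelated expressions, while the new one returns 0.

```scala
// Toy stand-ins for Catalyst expressions (hypothetical, not Spark classes).
sealed trait Expr { def children: Seq[Expr] }
final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
final case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}

object OrderingSketch {
  // True if `target` appears strictly below `parent` in the tree.
  def contains(parent: Expr, target: Expr): Boolean =
    parent.children.exists(c => c == target || contains(c, target))

  // Models the behavior before this PR: unrelated expressions fall through to -1.
  val oldOrdering: Ordering[Expr] = new Ordering[Expr] {
    def compare(x: Expr, y: Expr): Int =
      if (x == y) 0 else if (contains(x, y)) 1 else -1
  }

  // Models the behavior after this PR: unrelated expressions compare as 0.
  val newOrdering: Ordering[Expr] = new Ordering[Expr] {
    def compare(x: Expr, y: Expr): Int =
      if (x == y) 0
      else if (contains(x, y)) 1
      else if (contains(y, x)) -1
      else 0
  }

  // Two expressions where neither contains the other.
  val a: Expr = Add(Lit(1), Lit(2))
  val b: Expr = Add(Lit(2), Lit(3))
}
```

With `oldOrdering`, both `compare(a, b)` and `compare(b, a)` are negative, so each expression is reported as smaller than the other. With `newOrdering`, both calls return 0.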

Why are the changes needed?

To fix the possible transitivity issue of ExpressionContainmentOrdering.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

dongjoon-hyun · 2021-06-11T00:16:21Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+ * will be considered as e1 < e2 and e2 < e1 by this ordering. But for the usage here,
+ * the order of irrelevant expressions does not matter.
+ */
+class ExpressionContainmentOrdering extends Ordering[Expression] {


Just curious, is there a reason for this move?

Oh, as it is a nested class, I cannot instantiate it on its own; I have to write:

val equivalence = new EquivalentExpressions
val exprOrdering = new equivalence.ExpressionContainmentOrdering

I can revert to a nested class if you think this is an unnecessary change.

Never mind. New one also looks good~

dongjoon-hyun · 2021-06-11T00:17:06Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+ * we want the child expressions come first than parent expressions, so we can replace
+ * child expressions in parent expressions with subexpression evaluation. Note that
+ * this is not for general expression ordering. For example, two irrelevant expressions
+ * will be considered as e1 < e2 and e2 < e1 by this ordering. But for the usage here,


Shall we change this to 0 according to the new logic?

Right, I missed the doc. Fixed.

SparkQA · 2021-06-11T00:34:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44189/

SparkQA · 2021-06-11T01:12:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44189/

SparkQA · 2021-06-11T01:30:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44193/

SparkQA · 2021-06-11T02:05:30Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44193/

dongjoon-hyun · 2021-06-11T02:46:24Z

Could you rebase to the master branch? The linter failure was fixed on the master branch.

viirya · 2021-06-11T02:50:10Z

Rebased. Thanks!

dongjoon-hyun · 2021-06-11T03:12:24Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+ * child expressions in parent expressions with subexpression evaluation. Note that
+ * this is not for general expression ordering. For example, two irrelevant expressions
+ * will be considered as equal by this ordering. But for the usage here, the order of
+ * irrelevant expressions does not matter.


To be complete, could you add some description about the semantically-equal expressions?

Sure. Added.

dongjoon-hyun · 2021-06-11T03:43:05Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+      // `x` is child expression of `y`.
+      -1
+    } else {
+      // Irrelevant expressions


ditto. We should mention the semantically-equal expression here.

added. thanks.

dongjoon-hyun

+1, LGTM (only one minor comment, https://github.com/apache/spark/pull/32870/files#r649667440)

SparkQA · 2021-06-11T03:52:12Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44199/

SparkQA · 2021-06-11T03:53:40Z

Test build #139661 has finished for PR 32870 at commit 562ab33.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ExpressionContainmentOrdering extends Ordering[Expression]

SparkQA · 2021-06-11T04:30:41Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44202/

SparkQA · 2021-06-11T04:49:28Z

Test build #139683 has finished for PR 32870 at commit 019faba.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-11T05:06:52Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44202/

SparkQA · 2021-06-11T05:10:34Z

Test build #139665 has finished for PR 32870 at commit c27fbfc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-06-11T07:13:58Z

Thank you. Merged to master.

SparkQA · 2021-06-11T08:04:27Z

Test build #139673 has finished for PR 32870 at commit 1f8e7e2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Kimahriman · 2021-06-11T11:00:15Z

I think this could theoretically still cause some issues because it doesn't follow the last rule for comparators:

Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.

A simple example I found that doesn't sort correctly:

val add1 = Add(Literal(1), Literal(2))
val add2 = Add(Literal(2), Literal(3))
val addParent = Add(add1, Literal(4))
val exprs = Seq(addParent, add2, add1)
assert(exprs.sorted(exprOrdering) === Seq(add2, add1, addParent))

The result remains addParent, add2, add1. I think this is because compare(addParent, add2) == 0 and compare(add2, add1) == 0, so to the sort the list already looks sorted. I'm not sure whether in practice this can only lead to a suboptimal sort or could still cause a sorting exception like the one I saw previously.
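The contract violation in this example can be checked directly, without running a sort. The sketch below is plain Scala with hypothetical toy types (`Expr`, `Lit`, `Add`) and a hypothetical `ordering` that models the comparison after this PR, assuming structural equality in place of `semanticEquals`. It is not the Catalyst implementation.

```scala
// Toy stand-ins for Catalyst expressions (hypothetical, not Spark classes).
sealed trait Expr { def children: Seq[Expr] }
final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
final case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}

object ContractSketch {
  // True if `target` appears strictly below `parent` in the tree.
  def contains(parent: Expr, target: Expr): Boolean =
    parent.children.exists(c => c == target || contains(c, target))

  // Models the post-PR comparison: parent > child, unrelated == 0.
  val ordering: Ordering[Expr] = new Ordering[Expr] {
    def compare(x: Expr, y: Expr): Int =
      if (x == y) 0
      else if (contains(x, y)) 1
      else if (contains(y, x)) -1
      else 0
  }

  val add1 = Add(Lit(1), Lit(2))
  val add2 = Add(Lit(2), Lit(3))
  val addParent = Add(add1, Lit(4))

  val parentVsAdd2 = ordering.compare(addParent, add2) // unrelated
  val parentVsAdd1 = ordering.compare(addParent, add1) // addParent contains add1
  val add2VsAdd1 = ordering.compare(add2, add1)        // unrelated

  // Since compare(addParent, add2) == 0, the contract requires
  // sgn(compare(addParent, add1)) == sgn(compare(add2, add1)).
  val contractHolds: Boolean =
    math.signum(parentVsAdd1) == math.signum(add2VsAdd1)
}
```

Here `addParent` and `add2` compare as equal, yet they disagree about `add1`: one is greater than it and the other is equal to it, so `contractHolds` is false.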

viirya · 2021-06-11T17:34:32Z

I noticed that, but currently I don't have a better idea for how to sort the expressions. For irrelevant expressions, there seems to be no good rule to order them in a deterministic way. Right now I can only make it meet the transitivity contract so that it avoids the exception. As mentioned in its doc, this is not for general expression ordering but just for this specific usage. I think it should be rare for it to produce a suboptimal sort. I'll think about whether there is a better way to sort it.

Kimahriman · 2021-06-11T17:56:09Z

In my fork I just changed it to

.sortBy(_.head.collect({ case e => e }).size)

so it basically just sorts by the number of expressions in the tree (I'm not sure if there's an easier way to get that count than how I did it). I haven't done exhaustive testing on it, but I feel like it makes sense. I don't see how one expression could contain another without having more total expressions.
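The size-based idea can be sketched in plain Scala as follows. `Expr`, `Lit`, `Add` and `nodeCount` are hypothetical stand-ins, not Catalyst classes, and the sketch sorts single expressions rather than the groups of expressions (hence `_.head`) used in the fork. Because the sort key is an `Int`, the resulting comparison is a consistent ordering, and an expression that contains another always has a strictly larger count.

```scala
// Toy stand-ins for Catalyst expressions (hypothetical, not Spark classes).
sealed trait Expr { def children: Seq[Expr] }
final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
final case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}

object SizeSortSketch {
  // Number of nodes in the expression tree, including the root.
  def nodeCount(e: Expr): Int = 1 + e.children.map(nodeCount).sum

  val add1: Expr = Add(Lit(1), Lit(2))       // 3 nodes
  val add2: Expr = Add(Lit(2), Lit(3))       // 3 nodes
  val addParent: Expr = Add(add1, Lit(4))    // 5 nodes

  // sortBy is stable, so add2 stays ahead of add1 (equal counts),
  // and addParent moves behind its child add1.
  val sortedBySize: Seq[Expr] = Seq(addParent, add2, add1).sortBy(nodeCount)
}
```

For the three expressions from the earlier example, this yields add2, add1, addParent, which is the order the failing assertion expected.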

Fix comparator.

562ab33

github-actions bot added the SQL label Jun 10, 2021

maropu changed the title ~~[[SPARK-35439][SQL]][FOLLOWUO] ExpressionContainmentOrdering should not sort unrelated expressions~~ [[SPARK-35439][SQL]][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions Jun 10, 2021

dongjoon-hyun changed the title ~~[[SPARK-35439][SQL]][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions~~ [SPARK-35439][SQL][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions Jun 10, 2021

dongjoon-hyun reviewed Jun 11, 2021

Fix doc.

c27fbfc

Merge remote-tracking branch 'upstream/master' into SPARK-35439-followup

308cd91

dongjoon-hyun reviewed Jun 11, 2021

Add description about semantically-equal expressions.

1f8e7e2

dongjoon-hyun reviewed Jun 11, 2021

dongjoon-hyun approved these changes Jun 11, 2021

Put semantically-equal doc.

019faba

maropu approved these changes Jun 11, 2021

maropu closed this in c463472 Jun 11, 2021

viirya deleted the SPARK-35439-followup branch December 27, 2023 18:25

5 participants