Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Jun 10, 2021

What changes were proposed in this pull request?

This is a followup of #32586. We introduced ExpressionContainmentOrdering to sort common expressions according to their parent-child relations. For unrelated expressions, previously the ordering returns -1 which is not correct and can possibly lead to transitivity issue.

Why are the changes needed?

To fix the possible transitivity issue of ExpressionContainmentOrdering.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Jun 10, 2021
@maropu maropu changed the title [[SPARK-35439][SQL]][FOLLOWUO] ExpressionContainmentOrdering should not sort unrelated expressions [[SPARK-35439][SQL]][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions Jun 10, 2021
@dongjoon-hyun dongjoon-hyun changed the title [[SPARK-35439][SQL]][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions [SPARK-35439][SQL][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions Jun 10, 2021
* will be considered as e1 < e2 and e2 < e1 by this ordering. But for the usage here,
* the order of irrelevant expressions does not matter.
*/
class ExpressionContainmentOrdering extends Ordering[Expression] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is there a reason of this move?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, as it is a nested class, I cannot allocate it separately, but

val equivalence = new EquivalentExpressions
val exprOrdering = new equivalence.ExpressionContainmentOrdering

I can revert to nested class if you think it's unnecessary change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. New one also looks good~

* we want the child expressions come first than parent expressions, so we can replace
* child expressions in parent expressions with subexpression evaluation. Note that
* this is not for general expression ordering. For example, two irrelevant expressions
* will be considered as e1 < e2 and e2 < e1 by this ordering. But for the usage here,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we change this to 0 according to the new logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, missing the doc. Fixed.

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44189/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44189/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44193/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44193/

@dongjoon-hyun
Copy link
Member

Could you rebase to the master branch? The linter failure was fixed on the master branch.

@viirya
Copy link
Member Author

viirya commented Jun 11, 2021

Rebased. Thanks!

* child expressions in parent expressions with subexpression evaluation. Note that
* this is not for general expression ordering. For example, two irrelevant expressions
* will be considered as equal by this ordering. But for the usage here, the order of
* irrelevant expressions does not matter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be complete, could you add some description about the semantically-equal expressions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Added.

// `x` is child expression of `y`.
-1
} else {
// Irrelevant expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. We should mention the semantically-equal expression here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added. thanks.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44199/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Test build #139661 has finished for PR 32870 at commit 562ab33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExpressionContainmentOrdering extends Ordering[Expression]

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44202/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Test build #139683 has finished for PR 32870 at commit 019faba.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44202/

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Test build #139665 has finished for PR 32870 at commit c27fbfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in c463472 Jun 11, 2021
@maropu
Copy link
Member

maropu commented Jun 11, 2021

Thank you. Merged to master.

@SparkQA
Copy link

SparkQA commented Jun 11, 2021

Test build #139673 has finished for PR 32870 at commit 1f8e7e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Kimahriman
Copy link
Contributor

I think this could theoretically still cause some issues because it doesn't follow the last rule for comparators:

Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.

A simple example I found that doesn't sort correctly:

val add1 = Add(Literal(1), Literal(2))
val add2 = Add(Literal(2), Literal(3))
val addParent = Add(add1, Literal(4))
val exprs = Seq(addParent, add2, add1)
assert(exprs.sorted(exprOrdering) === Seq(add2, add1, addParent))

The result remains addParent, add2, add1. I think because compare(addParent, add2) == 0 and compare(add2, add1) == 0, essentially the list is already sorted. Whether in practice this only could lead to a suboptimal sort or could still cause a sorting exception like I saw previously I'm not sure.

@viirya
Copy link
Member Author

viirya commented Jun 11, 2021

I noticed that, but currently I have not better idea to sort the expressions better. For irrelevant expressions, seems no good rule to order them in deterministic way. Right now I just can make it meet transitivity contract so it can avoid the exception. As mentioned in its doc, this is not for general expression ordering but just for the specific usage. I think it seems to be rare to produce suboptimal sort. I'll think if there is better way to sort it.

@Kimahriman
Copy link
Contributor

In my fork I just changed it to

.sortBy(_.head.collect({ case e => e }).size)

so it basically just sorts by the number of expressions in the tree (not sure if there's an easier way to get that count than how I did it). Haven't done exhaustive testing on it but I feel like that makes sense to do. Not sure how one expression could contain another if it doesn't have more total expressions

@viirya viirya deleted the SPARK-35439-followup branch December 27, 2023 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants