Skip to content

Conversation

@ptkool
Copy link
Contributor

@ptkool ptkool commented Apr 20, 2017

What changes were proposed in this pull request?

This PR proposes adding a new query hint called NO_COLLAPSE that can be used to prevent adjacent projections from being collapsed.

How was this patch tested?

Test using unit tests, integration tests and manual tests.

@ptkool ptkool changed the title Add new query hint NO_COLLAPSE. [SPARK-20413] Add new query hint NO_COLLAPSE. Apr 20, 2017
@hvanhovell
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #75995 has finished for PR 17708 at commit 3f1e6a1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NoCollapseHint(child: LogicalPlan) extends UnaryNode

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptkool thinks for submitting the PR. I am not sure this is the best way to avoid projection collapse. The problem is that this approach will also inhibit other optimization from taking place.


@since(2.2)
def no_collapse(df):
"""Marks a DataFrame as small enough for use in broadcast joins."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc is incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

* @group normal_funcs
* @since 1.5.0
*/
* Marks a DataFrame as small enough for use in broadcast joins.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

}

/**
* Marks a DataFrame as small enough for use in broadcast joins.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the alignment is of by a space, it should be:

/**
 * Text...
 */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.


/**
* A hint for the optimizer that we should not merge two projections.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we want this in the LogicalPlan level and not on the expression level?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this approach is that most other optimizations won't work with this, for example predicate push down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally thought about putting it at the expression level, but ultimately decided it made more sense at the LogicalPlan node level, since the purpose was in fact to disrupt the optimizer. In some respects, it's meant to have the same effect as df.cache(), but without the caching. There may, in fact, be situations where predicate pushdown is not desired because the resulting condition would become complex and expensive to evaluate.

In Spark SQL, I think it also makes more sense to specify the hint at the derived table level, as opposed to a single expression. For instance,

SELECT SNO, PNO, C1 +1, C1 + 2
FROM ( SELECT /*+ NO_COLLAPSE */ SNO, PNO, QTY * 10 AS C1 FROM T ) T

This is similar to the NO_MERGE query hint in Oracle, which prevents the query from being flattened.


comparePlans(
parsePlan("SELECT a FROM (SELECT /*+ NO_COLLAPSE */ * FROM t) t1"),
SubqueryAlias("t1", Hint("NO_COLLAPSE", Seq.empty, table("t").select(star())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you testing here that is not covered by the other cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, nothing. I will remove it.

@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #76001 has finished for PR 17708 at commit 975cca5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ptkool ptkool force-pushed the no_collapse_query_hint branch from 1231585 to 3986247 Compare April 20, 2017 20:19
@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #76005 has finished for PR 17708 at commit 3986247.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Apr 23, 2017

Based on the JIRA description, it sounds like we should not simply merge two Projects to avoid calling the same UDF multiple times, instead of adding a new logical plan node.

@viirya
Copy link
Member

viirya commented Apr 24, 2017

I have the same question as Reynold asked in the mailing list. Doesn't common sub expression elimination already address this issue?

@gatorsmile
Copy link
Member

Any update? Maybe we can close this PR at first?

@ptkool
Copy link
Contributor Author

ptkool commented Jun 26, 2017

@gatorsmile I will run a few more tests to determine if subexpression elimination solves this issue.

@gatorsmile
Copy link
Member

We are closing the inactive PRs. After you run more test, please do reopen if you still hit this issue. Thanks!

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants