Skip to content

Conversation

@allisonwang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR modifies the optimizer rule OptimizeOneRowRelationSubquery to always collapse projects and inline non-volatile expressions.

Why are the changes needed?

SPARK-39699 made CollpaseProjects more conservative. This has impacted correlated subqueries that Spark used to be able to support. For example, Spark used to be able to execute this correlated subquery:

SELECT (
  SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
  FROM (SELECT MAP('a', 1, 'b', 2) rank)
) FROM t1

But after SPARK-39699, it will throw an exception Unexpected operator Join Inner because the projects inside the subquery can no longer be collapsed. We should always inline expressions if possible to support a broader range of correlated subqueries and avoid adding expensive domain joins.

Does this PR introduce any user-facing change?

Yes. It will allow Spark to execute more types of correlated subqueries.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Oct 14, 2022
@allisonwang-db allisonwang-db force-pushed the spark-40800-inline-expr-subquery branch from 3f0b1a3 to 2e641c9 Compare October 19, 2022 18:40
@allisonwang-db
Copy link
Contributor Author

cc @cloud-fan

@cloud-fan
Copy link
Contributor

The test failure seems unrelated, can you retrigger?

@LuciferYang
Copy link
Contributor

LuciferYang commented Oct 21, 2022

@allisonwang-db You can rebase the code. I pin the Java version to 8u345 at #38311 for workaround and GA can pass without waiting for #38317

@allisonwang-db allisonwang-db force-pushed the spark-40800-inline-expr-subquery branch from 2e641c9 to 9a55c95 Compare October 21, 2022 04:03
@allisonwang-db allisonwang-db force-pushed the spark-40800-inline-expr-subquery branch from 9a55c95 to 8122e23 Compare October 21, 2022 22:03
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 58490da Oct 24, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…nSubquery

### What changes were proposed in this pull request?

This PR modifies the optimizer rule `OptimizeOneRowRelationSubquery` to always collapse projects and inline non-volatile expressions.

### Why are the changes needed?

SPARK-39699 made `CollpaseProjects` more conservative. This has impacted correlated subqueries that Spark used to be able to support. For example, Spark used to be able to execute this correlated subquery:
```sql
SELECT (
  SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
  FROM (SELECT MAP('a', 1, 'b', 2) rank)
) FROM t1
```
But after SPARK-39699, it will throw an exception `Unexpected operator Join Inner` because the projects inside the subquery can no longer be collapsed. We should always inline expressions if possible to support a broader range of correlated subqueries and avoid adding expensive domain joins.

### Does this PR introduce _any_ user-facing change?

Yes. It will allow Spark to execute more types of correlated subqueries.

### How was this patch tested?

Unit test.

Closes apache#38260 from allisonwang-db/spark-40800-inline-expr-subquery.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants