[SPARK-49000][SQL][3.4] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates #47567

uros-db · 2024-08-01T07:41:37Z

What changes were proposed in this pull request?

Fix the RewriteDistinctAggregates rule so that it properly handles aggregation over DISTINCT literals. The physical plan for select count(distinct 1) from t that exhibits the problem:

-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

The problem occurs because an aggregate with no grouping keys emits exactly one row even when its input is empty. The HashAggregate(keys=[], functions=[], output=[]) node therefore yields one row to the partial_count node, which counts that row. This four-node structure is constructed by AggUtils.planAggregateWithOneDistinct.

To fix the problem, we add an Expand node, which forces non-empty grouping expressions in the HashAggregateExec nodes. With grouping expressions present, those nodes stream zero rows to the parent partial_count node when the input is empty, which yields the correct final result.
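The difference between the two plan shapes can be sketched with a toy model in plain Python. This is not Spark code: `aggregate`, `buggy_count_distinct_literal`, `fixed_count_distinct_literal`, and the column name `lit` are hypothetical names invented for illustration, the partial/final aggregate pairs and the Exchange are collapsed into single steps, and the column layout of Spark's real Expand node is not modeled.

```python
def aggregate(rows, keys, agg):
    """Toy hash aggregate over a list of dict rows.

    With no grouping keys the aggregate is global and always emits exactly
    one row, even for empty input. With grouping keys it emits one row per
    group, so empty input produces zero rows.
    """
    if not keys:
        return [agg(rows)]
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [{**dict(zip(keys, key)), **agg(group)} for key, group in groups.items()]


def count_rows(group):
    # Stands in for count(...): counts the rows that reach this node.
    return {"count": len(group)}


def no_functions(group):
    # Stands in for functions=[]: the node only deduplicates, it computes nothing.
    return {}


def buggy_count_distinct_literal(table):
    # Dedup step with keys=[]: emits one empty row even if the table is empty.
    deduped = aggregate(table, keys=[], agg=no_functions)
    return aggregate(deduped, keys=[], agg=count_rows)[0]["count"]


def fixed_count_distinct_literal(table):
    # Assumed stand-in for Expand: materialize the literal as a real column
    # so that the dedup step has a non-empty grouping key.
    expanded = [{"lit": 1} for _ in table]
    deduped = aggregate(expanded, keys=["lit"], agg=no_functions)
    return aggregate(deduped, keys=[], agg=count_rows)[0]["count"]


empty_table = []
print(buggy_count_distinct_literal(empty_table))   # 1 (wrong)
print(fixed_count_distinct_literal(empty_table))   # 0 (correct)
```

On a non-empty table both functions return 1, which is why the bug only shows up when t is empty.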

Why are the changes needed?

Aggregation over a DISTINCT literal gives wrong results. For example, when run on an empty table t:
select count(distinct 1) from t returns 1, while the correct result is 0.
For reference:
select count(1) from t returns 0, which is the correct and expected result.
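As an independent cross-check of the expected SQL semantics, the same two queries can be run against another engine. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a reference point; it does not exercise Spark, and the single column `a` is an assumption, since the schema of t is not given here.

```python
import sqlite3

# In-memory database with an empty table t (column "a" is an assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

# Both aggregates run over zero rows.
distinct_count = conn.execute("SELECT count(DISTINCT 1) FROM t").fetchone()[0]
plain_count = conn.execute("SELECT count(1) FROM t").fetchone()[0]

print(distinct_count, plain_count)
conn.close()
```

Both values are 0 in SQLite, matching the result this PR restores for count(distinct 1) in Spark.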

Does this PR introduce any user-facing change?

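Yes, this fixes a critical correctness bug in Spark: aggregates over DISTINCT literals on an empty table now return the correct result (for example, count(distinct 1) returns 0 instead of 1).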

How was this patch tested?

New end-to-end SQL tests for aggregates with DISTINCT literals.

Was this patch authored or co-authored using generative AI tooling?

No.

uros-db

backport to 3.4 ready, waiting for CI checks

dongjoon-hyun

Could you re-trigger the failed CI, @uros-db ?

yaooqinn · 2024-08-02T05:59:12Z

Merged to 3.4, thank you all

Closes #47567 from uros-db/SPARK-49000-3.4. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org>

Initial commit

54cab8c

github-actions bot added the SQL label Aug 1, 2024

uros-db mentioned this pull request Aug 1, 2024

[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates #47525

Closed

uros-db added 2 commits August 1, 2024 09:46

Update

6edcb0f

Remove collation

6573b96

uros-db commented Aug 1, 2024

nikolamand-db approved these changes Aug 1, 2024

dbatomic approved these changes Aug 1, 2024

dongjoon-hyun reviewed Aug 1, 2024

uros-db changed the title ~~[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates~~ [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates (backport to 3.4) Aug 1, 2024

Update comment

3466de5

uros-db requested a review from dongjoon-hyun August 1, 2024 20:33

yaooqinn approved these changes Aug 2, 2024

yaooqinn closed this Aug 2, 2024
