[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values #24436
Conversation
ok to test

Test build #104807 has finished for PR 24436 at commit
+1, LGTM. Merged to master/2.4.
cc @gatorsmile
[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values

## What changes were proposed in this pull request?

This PR is a follow-up of #24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name	key
data_type	int
comment	NULL
min	1
max	2
num_nulls	1
distinct_count	2
```

The distinct count should be distinct_count + 1 when the column contains a null value.

## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <bo.peng1019@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
```scala
// Previous fix (#24286): only handled the all-null case
val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
  1
// This PR: any column with nulls contributes one extra distinct value
val distinctValue: BigInt = if (columnStat.nullCount.get > 0) {
  distinctCount + 1
```
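The effect of the change can be sketched with a small model of the estimator (a hypothetical Python sketch, not the Spark code itself; the function name is illustrative):

```python
def estimated_groups(distinct_count: int, null_count: int) -> int:
    """Estimate the number of groups a GROUP BY on this column produces.

    Models the PR's adjustment: NULL forms its own group in aggregation,
    so it contributes one extra distinct value whenever null_count > 0.
    """
    return distinct_count + 1 if null_count > 0 else distinct_count

# For the example table (values 2, NULL, 1): distinct_count = 2, num_nulls = 1,
# so GROUP BY key actually yields 3 output rows, not 2.
print(estimated_groups(2, 1))  # -> 3
print(estimated_groups(2, 0))  # -> 2
```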
Hm, actually, do we need to count null as a distinct value? It's not counted as a distinct value in SQL (F.countDistinct or count(DISTINCT col)) or in Pandas (unique() by default), at least.
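For reference, standard SQL does exclude NULL from COUNT(DISTINCT ...). A quick check with Python's built-in sqlite3 (illustrative only, not Spark) on the example table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (key INTEGER)")
con.executemany("INSERT INTO test VALUES (?)", [(2,), (None,), (1,)])

# COUNT(DISTINCT key) skips the NULL row: only 1 and 2 are counted.
(distinct,) = con.execute("SELECT COUNT(DISTINCT key) FROM test").fetchone()
print(distinct)  # -> 2
```

So distinct_count = 2 is consistent with SQL semantics; the question is whether the *estimator* should still add one for the NULL group that GROUP BY produces.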
Looking into the current implementation, it seems we ignore nulls as distinct values:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala, line 270 in 239082d:

```scala
val numNonNulls = if (col.nullable) Count(col) else Count(one)
```

Lines 50 to 51 in b1857a4:

```scala
ColumnStat(distinctCount = Some(0), min = None, max = None, nullCount = Some(rowCount),
  avgLen = Some(dataType.defaultSize), maxLen = Some(dataType.defaultSize))
```
Reverting this. I don't think this is a correct fix.

I added a comment on the original PR. The revert looks wrong to me.
What changes were proposed in this pull request?
This PR is a follow-up of #24286. As @gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well.
The distinct count should be distinct_count + 1 when the column contains null values.
How was this patch tested?
Existing tests & new UT added.