[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values #24436
Conversation
ok to test

Test build #104807 has finished for PR 24436 at commit
+1, LGTM. Merged to master/2.4.
cc @gatorsmile
[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values

## What changes were proposed in this pull request?

This PR is a follow-up of #24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name	key
data_type	int
comment	NULL
min	1
max	2
num_nulls	1
distinct_count	2
```

The distinct count should be distinct_count + 1 when the column contains a null value.

## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <bo.peng1019@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
```scala
// Previous fix (#24286): only handled the all-null case
val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
  1
// This PR: any column with nulls contributes one extra distinct value
val distinctValue: BigInt = if (columnStat.nullCount.get > 0) {
  distinctCount + 1
```
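The effect of the change can be sketched with a small model of the estimator (a hypothetical Python sketch, not the Spark code itself; the function name is illustrative):

```python
def estimated_groups(distinct_count: int, null_count: int) -> int:
    """Estimate the number of groups a GROUP BY on this column produces.

    Models the PR's adjustment: NULL forms its own group in aggregation,
    so it contributes one extra distinct value whenever null_count > 0.
    """
    return distinct_count + 1 if null_count > 0 else distinct_count

# For the example table (values 2, NULL, 1): distinct_count = 2, num_nulls = 1,
# so GROUP BY key actually yields 3 output rows, not 2.
print(estimated_groups(2, 1))  # -> 3
print(estimated_groups(2, 0))  # -> 2
```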
Hm, actually, do we need to count null as a distinct value? It's not counted as a distinct value in SQL (F.countDistinct or count(DISTINCT col)) or in Pandas (unique() by default), at least.
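For reference, standard SQL does exclude NULL from COUNT(DISTINCT ...). A quick check with Python's built-in sqlite3 (illustrative only, not Spark) on the example table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (key INTEGER)")
con.executemany("INSERT INTO test VALUES (?)", [(2,), (None,), (1,)])

# COUNT(DISTINCT key) skips the NULL row: only 1 and 2 are counted.
(distinct,) = con.execute("SELECT COUNT(DISTINCT key) FROM test").fetchone()
print(distinct)  # -> 2
```

So distinct_count = 2 is consistent with SQL semantics; the question is whether the *estimator* should still add one for the NULL group that GROUP BY produces.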
Looking into the current implementation, it seems we ignore nulls as distinct values:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala, line 270 in 239082d:

```scala
val numNonNulls = if (col.nullable) Count(col) else Count(one)
```

Lines 50 to 51 in b1857a4:

```scala
ColumnStat(distinctCount = Some(0), min = None, max = None, nullCount = Some(rowCount),
  avgLen = Some(dataType.defaultSize), maxLen = Some(dataType.defaultSize))
```
Reverting this. I don't think this is a correct fix.

I added a comment on the original PR. The revert looks wrong to me.
What changes were proposed in this pull request?
This PR is a follow-up of #24286. As @gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well.
The distinct count should be distinct_count + 1 when the column contains null values.
How was this patch tested?
Existing tests & new UT added.