Conversation

@pengbo
Contributor

@pengbo pengbo commented Apr 22, 2019

What changes were proposed in this pull request?

This PR is a follow-up to #24286. As @gatorsmile pointed out, the estimation is also inaccurate for columns that contain null values.

> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2

For aggregate outputRows estimation, the number of distinct values should be distinct_count + 1 when the column contains null values, because the nulls form one additional group.
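
As a rough illustration of the gap, here is a sketch in plain Scala collections (not Spark code). The object and value names are made up for this example; the data mirrors the three rows of the `test` table above, with `None` standing in for NULL.

```scala
// Hypothetical stand-in for the `key` column of the `test` table: 2, NULL, 1.
object NullGroupSketch {
  val keys: Seq[Option[Int]] = Seq(Some(2), None, Some(1))

  // What the column statistics report as distinct_count: non-null values only.
  val distinctCount: Int = keys.flatten.distinct.size

  // What the column statistics report as num_nulls.
  val nullCount: Int = keys.count(_.isEmpty)

  // Number of groups a GROUP BY on this column produces: NULL is its own group.
  val groupCount: Int = keys.distinct.size

  def main(args: Array[String]): Unit = {
    println(s"distinct_count=$distinctCount num_nulls=$nullCount groups=$groupCount")
  }
}
```

Under these assumptions the group count exceeds `distinct_count` by exactly one whenever `num_nulls` is greater than zero, which is the adjustment this PR makes.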

How was this patch tested?

Existing tests, plus a new unit test.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104807 has finished for PR 24436 at commit 50ae5cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27539][SQL] Inaccurate aggregate outputRows estimation with column contains null value [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column contains null value Apr 23, 2019
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master/2.4.
cc @gatorsmile

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column contains null value [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null value Apr 23, 2019
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null value [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values Apr 23, 2019
dongjoon-hyun pushed a commit that referenced this pull request Apr 23, 2019
…h column containing null values

## What changes were proposed in this pull request?
This PR is a follow-up to #24286. As gatorsmile pointed out, the estimation is also inaccurate for columns that contain null values.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The distinct count should be distinct_count + 1 when the column contains null values.
## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <bo.peng1019@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
-  val distinctValue: BigInt = if (distinctCount == 0 && columnStat.nullCount.get > 0) {
-    1
+  val distinctValue: BigInt = if (columnStat.nullCount.get > 0) {
+    distinctCount + 1
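
To make the behavioral difference between the two variants of `distinctValue` shown above concrete, here is a minimal standalone sketch. `Stat` is a hypothetical stand-in for the two `ColumnStat` fields the diff reads, and the `else` branch (falling back to `distinctCount` unchanged) is an assumption, since the diff excerpt does not show it.

```scala
object DistinctValueSketch {
  // Hypothetical stand-in for the ColumnStat fields used in the diff.
  final case class Stat(distinctCount: Option[BigInt], nullCount: Option[BigInt])

  // Variant from #24286: only an all-null column is bumped from 0 to 1.
  def before(stat: Stat): BigInt = {
    val distinctCount = stat.distinctCount.get
    if (distinctCount == 0 && stat.nullCount.get > 0) 1
    else distinctCount // assumed fallback, not shown in the diff
  }

  // Variant from this PR: any column with nulls gets one extra group.
  def after(stat: Stat): BigInt = {
    val distinctCount = stat.distinctCount.get
    if (stat.nullCount.get > 0) distinctCount + 1
    else distinctCount // assumed fallback, not shown in the diff
  }

  def main(args: Array[String]): Unit = {
    // Statistics of the `key` column from the PR description.
    val key = Stat(distinctCount = Some(BigInt(2)), nullCount = Some(BigInt(1)))
    println(s"before=${before(key)} after=${after(key)}")
  }
}
```

For an all-null column (`distinctCount` of 0) both variants agree on 1; they differ only when a column mixes null and non-null values.
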
Member

Hm, actually, do we need to count null as a distinct value? It's not counted as one in SQL (F.countDistinct or count(DISTINCT col)) or in pandas (nunique() by default), at least.

Member

Looking at the current implementation, it seems we do not count null as a distinct value:

val numNonNulls = if (col.nullable) Count(col) else Count(one)

ColumnStat(distinctCount = Some(0), min = None, max = None, nullCount = Some(rowCount),
avgLen = Some(dataType.defaultSize), maxLen = Some(dataType.defaultSize))

@HyukjinKwon
Member

Reverting this. I don't think this is a correct fix.

@dongjoon-hyun
Member

dongjoon-hyun commented May 23, 2019

I added a comment on the original PR. The revert looks wrong to me.

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019