@cloud-fan
Contributor

What changes were proposed in this pull request?

AFAIK, multi-column count is not widely supported by mainstream databases (PostgreSQL, for example, doesn't support it), and as far as I can tell the SQL standard doesn't define it clearly.

Since Spark supports it, we should clearly document the current behavior and add tests to verify it.

How was this patch tested?

N/A
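
For readers unfamiliar with the behavior being documented, here is a minimal sketch in plain Python (not Spark code). It assumes the semantics that Spark's function documentation gives for `count(expr[, expr...])`: a row is counted only when every supplied expression is non-null. The function name and the sample rows are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple

# A row is a tuple of nullable column values; None stands in for SQL NULL.
Row = Tuple[Optional[int], ...]


def count_all_non_null(rows: Sequence[Row]) -> int:
    """Model of count(a, b): count rows where every expression is non-null."""
    return sum(1 for row in rows if all(value is not None for value in row))


# Hypothetical (a, b) rows; only (1, 1) and (2, 2) have both columns non-null.
rows = [(1, 1), (1, None), (None, 2), (None, None), (2, 2)]
print(count_all_non_null(rows))  # 2
```

Under this model the argument order does not matter, so `count(a, b)` and `count(b, a)` give the same result.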

@cloud-fan
Contributor Author

cc @gatorsmile @mgaido91 @viirya

@mgaido91
Contributor

This is indeed the behavior I'd expect. It's good to add tests to enforce it. Did you check any RDBMSs other than Postgres?

@cloud-fan
Contributor Author

I'm going to try Hive and Presto, but my local environment has some problems that I need to fix first. I'll work on it tomorrow.

@cloud-fan
Contributor Author

BTW, MySQL doesn't support count(a, b) but does support count(distinct a, b), and the result is the same as Spark's.

@viirya
Member

viirya commented Oct 15, 2018

Yeah, it is definitely good to add documentation and tests for the current behavior.

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97401 has finished for PR 22728 at commit 708d7fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97400 has finished for PR 22728 at commit 62b4b84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

FROM testData;

-- count with multiple expressions
SELECT count(a, b), count(b, a), count(testData.*) FROM testData;
Member

Please also include count(*)

SELECT count(a, b), count(b, a), count(testData.*) FROM testData;

-- distinct count with multiple expressions
SELECT count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT testData.*) FROM testData;
Member

Also include count(DISTINCT *)
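
As a rough illustration of what the distinct variant in this test is checking, here is a plain-Python sketch (not Spark code). It assumes the semantics Spark documents for `count(DISTINCT expr[, expr...])`: count the distinct combinations of the supplied expressions, skipping any row in which one of them is null. The function name and sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# A row is a tuple of nullable column values; None stands in for SQL NULL.
Row = Tuple[Optional[int], ...]


def count_distinct_non_null(rows: Sequence[Row]) -> int:
    """Model of count(DISTINCT a, b): distinct tuples with no null component."""
    return len({row for row in rows if all(value is not None for value in row)})


# Hypothetical (a, b) rows: (1, 1) appears twice and (1, None) is skipped.
rows = [(1, 1), (1, 1), (1, None), (2, 1), (1, 2)]
print(count_distinct_non_null(rows))  # 3
```

Swapping the column order only permutes each tuple, so in this model `count(DISTINCT a, b)` and `count(DISTINCT b, a)` agree.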

@gatorsmile
Member

Let us add one more case in the test suite.

SELECT count(1), count(NULL) FROM testData;
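
The two constants in this query exercise the edge cases of the same rule. Below is a small plain-Python sketch (not Spark code), assuming the usual SQL behavior that `count(expr)` counts the rows where `expr` is non-null; the function name and the row count are hypothetical.

```python
from typing import Optional, Sequence


def count_non_null(values: Sequence[Optional[int]]) -> int:
    """Model of count(expr): count the rows where the expression is non-null."""
    return sum(1 for value in values if value is not None)


n_rows = 4  # hypothetical number of rows in the table

# count(1): the literal 1 is non-null on every row, so every row is counted.
print(count_non_null([1] * n_rows))  # 4

# count(NULL): the expression is null on every row, so nothing is counted.
print(count_non_null([None] * n_rows))  # 0
```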

@gatorsmile
Member

LGTM except the above comments.

@viirya
Member

viirya commented Oct 16, 2018

LGTM

@SparkQA

SparkQA commented Oct 16, 2018

Test build #97420 has finished for PR 22728 at commit e3aaa90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-2.4.

asfgit pushed a commit that referenced this pull request Oct 16, 2018
…lumn count

## What changes were proposed in this pull request?

AFAIK, multi-column count is not widely supported by mainstream databases (PostgreSQL, for example, doesn't support it), and as far as I can tell the SQL standard doesn't define it clearly.

Since Spark supports it, we should clearly document the current behavior and add tests to verify it.

## How was this patch tested?

N/A

Closes #22728 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit e028fd3)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@asfgit asfgit closed this in e028fd3 Oct 16, 2018
@cloud-fan
Contributor Author

FYI, I tried both Hive and Presto; neither of them supports multi-column count.

@mgaido91
Contributor

Thanks for your work, @cloud-fan!

@HyukjinKwon
Member

(From #22773 (comment)) @gatorsmile and @cloud-fan, let's say this will break DESCRIBE FUNCTION EXTENDED. Should we update the migration guide as well?

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…lumn count

Closes apache#22728 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
6 participants